Textbook Computational Chemogenomics J B Brown Ebook All Chapter PDF
Methods in
Molecular Biology 1825
Computational
Chemogenomics
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Edited by
J.B. Brown
Life Science Informatics Research Unit, Laboratory of Molecular Biosciences,
Kyoto University Graduate School of Medicine, Kyoto, Japan
Cover Illustration: The cover image shows the inhibitory activity of compounds against aromatase, a critical
hormone-processing enzyme in many organisms. Each point represents one compound. Green and yellow points indicate
very weak or micromolar activity, red points represent strong activity, and purple points indicate single-digit
nanomolar activity or stronger. Compounds are positioned by relative distance using multi-dimensional scaling.
Activity cliffs can be seen where large changes in activity occur between closely spaced compounds, which are often analogs.
This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
This book provides a collection of techniques used in the emerging field of computational
chemogenomics. It covers practical processes to execute research and analyses in the field,
which is an integration of chemoinformatics, bioinformatics, computer science, statistics,
automated pattern recognition and modeling, database usage with data retrieval, and
systems integration. Clearly, mastering the field of computational chemogenomics requires
a considerable variety of knowledge and data processing skills, and this text aims to make the
interested reader acquainted with, and capable in, many of the practical skills used in the field.
The target audience is both those from experimental sciences who are novices to data
processing and modeling, and those with computationally oriented backgrounds wishing
to engage in this scientific area, which is continually growing and now expected to contribute
to industry, academic, and government research projects.
Historically, testing for chemical effects on biological processes, whether at the level of
organism response, organ response (e.g., organ toxicity), cellular response (e.g., apoptosis),
or individual target protein response in cell lines (e.g., inhibition), has required a large and
orchestrated effort; confirmation of chemical purity, preparation of chemicals at a span of
concentrations, application of those concentration-specific chemical stocks to the process or
target, and precise recording of the outcome have typically been executed and recorded
manually. At the same time, methods in genetic manipulation, gene sequence determination,
gene expression measurement, and protein expression measurement have similarly
required substantial investments in human resources and facilities.
The development of specialized equipment for automated high-content and high-
throughput screening as well as parallel automation developments in genetics and proteomics
made it possible to have chemical activity data for thousands of compounds instead of
hundreds, as well as to expand measurement of gene expression from a few genes to tens,
hundreds, or thousands. As a result, the technologies needed to systematically unlock the
interface between chemistry and biology on a large scale had arrived. Finally, in 2001,
worldwide efforts to create the first draft version of the full human genome were completed,
and with such in hand, the stage was set to integrate the technologies for chemistry-biology
interface exploration with our newfound knowledge about the genetic underpinnings of
human physiology.
Only months after the sequencing of the human genome, the idea of exploring the
protein products of a genome from a chemical perspective was proposed, and the term
chemogenomics was born. This term bears resemblance to two other chemically driven
scientific fields, and the reader should be aware of differences in terminology. First, scientists
are also often in need of knowing the effect of a chemical on an organism when that
organism contains a genetic defect such as a mutation or complete knockout, and this
field is known as chemical genetics. Second, scientists may want to understand the functional
impact of chemicals on coordinated processes occurring within cells encoded by genomes,
for example the multiprotein signaling response to toxic chemicals measured in a variety of
organisms. This field, chemical genomics, is more concerned with chemistry and genomics
at a systems science level, compared with the chemogenomic focus of chemical modulation
of individual proteins.
Abstract
Chemogenomics is a comparatively nascent branch dealing with the effects of drugs and chemicals on
molecular-level systems. With the emergence of this new epoch, the quantity of data sources is
increasing at an unprecedented rate. Despite a plethora of databases, the variation in bioactivity
measurement, bias toward specific protein studies, varied computational procedures, and redundant
information make data mining tedious, especially for newcomers in the field. In this chapter, we give an
overview of hands-on data collection and domains of applicability from some useful Web-based
chemogenomic resources that are accessible with nothing more than a Web browser. This overview can
assist users in acquiring chemogenomic datasets for their project at hand.
Key words Chemogenomic resources, World Wide Web, Ligand-target data, ChemProt, STITCH,
PubChem, ChEMBL, ChEBI, ChemSpider, PharmGKB
1 Introduction
Rasel Al Mahmud, Rifat Ara Najnin, and Ahsan Habib Polash contributed equally to the chapter.
J.B. Brown (ed.), Computational Chemogenomics, Methods in Molecular Biology, vol. 1825,
https://doi.org/10.1007/978-1-4939-8639-2_1, © Springer Science+Business Media, LLC, part of Springer Nature 2018
TopII inhibitor very often used in clinical practice, which can lead
to abortive catalysis of the enzyme and generates an increased level
of TopII–DNA complex (Fig. 1). This abnormal complex structure
is known as a TopII adduct, and persistence of this type of intermediate
creates a lesion in the genome, impairing the DNA repair
pathway as well as gene expression, which ultimately can lead to
cancer and other diseases. Therefore, since etoposide and other TopII
inhibitors stall DNA synthesis and the cell cycle, etoposide is a
well-known chemotherapeutic agent for cancer patients.
Given a brief history of the developments in bioactivity databases
and a practical molecular biology context in which the databases
can be utilized, we provide in this chapter resources and
protocols for mining of data in several prominent and progressive
online databases with a special emphasis on examples useful for
chemogenomic research (consider Note 1).
2 Materials
Table 1
Computational chemogenomic data sources reviewed
2.1 ChemProt
The conventional drug design paradigm, i.e., one drug selectively
interacts with one or two target molecules, has drastically changed
in recent times. Most drugs are now known to be involved in
multiple pathways with diverse interaction partners. To identify the
broad-spectrum interactome of drugs and targets, an integrative
tool which can analyze the whole set of interactions on a single
platform has become a necessity. ChemProt 3 [3] is such a tool: a
Web-based, disease-oriented chemical biology resource that can
display both chemical–protein and protein–protein interactions
on a single heatmap. By aggregating data from related
databases such as ChEMBL, DrugBank, BindingDB, STITCH,
PharmGKB, and IUPHAR, ChemProt can assist in the in silico
evaluation of small molecules (drugs, environmental chemicals,
and natural products) with the integration of molecular and cellular
level phenotypes. Moreover, it enables pharmacological space
navigation for small molecules based on a similarity ensemble approach
(SEA) [17] to relate protein pharmacology with respect to ligand
bioactivity profile. SEA organizes proteins by clustering them based
on their bioactivities with respect to a set of ligands, and can be
viewed in one sense as a chemical version of the well-known BLAST
approach for generating a score of protein homology.
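The idea of relating two proteins through their ligand sets can be illustrated with a minimal sketch. It assumes each ligand is represented by a fingerprint given as a set of "on" bit positions, and it scores a pair of targets by summing the pairwise Tanimoto similarities that reach a cutoff. The function names, the toy fingerprints, and the cutoff of 0.5 are illustrative assumptions, and the step that converts such a raw score into a BLAST-like significance value is deliberately left out.

```python
from itertools import product


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def raw_set_score(ligands_a, ligands_b, threshold=0.5):
    """Sum the pairwise Tanimoto similarities that reach the threshold.

    The threshold is an arbitrary illustrative value (assumption).
    """
    total = 0.0
    for fp_a, fp_b in product(ligands_a, ligands_b):
        similarity = tanimoto(fp_a, fp_b)
        if similarity >= threshold:
            total += similarity
    return total


# Toy ligand sets for three hypothetical targets (not real fingerprints).
target_x = [{1, 2, 3, 4}, {2, 3, 4, 5}]
target_y = [{1, 2, 3, 4}, {10, 11, 12}]
target_z = [{20, 21}, {22, 23}]

score_xy = raw_set_score(target_x, target_y)  # shared chemistry gives a high score
score_xz = raw_set_score(target_x, target_z)  # no shared chemistry gives zero
```

Targets whose ligand sets score highly against each other would be placed close together, which is the sense in which proteins are grouped by ligand chemistry rather than by sequence.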
2.2 STITCH
Interaction patterns of proteins and small molecules are a pivotal
point for understanding metabolism, signaling, and development
of drugs. Although a myriad of data on chemical–protein and
chemical–chemical interactions is stored in several databases, the
discrete nature, varied precision (see above regarding protein bias
and measurement consistency), and differing focus of those databases
make it cumbersome to assemble a full picture of all available
information. STITCH (stitch.embl.de) is a consolidated search tool
which aggregates high-throughput experimental data, manually curated
datasets, and the results of several prediction methods into a single
global network of protein–protein and protein–chemical interactions
(STITCH 4 and STITCH 5) (STITCH does not include chemical–chemical
interaction links).
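Aggregating several evidence sources into one network can be pictured with a small sketch. This is not STITCH's own pipeline; it only shows, under assumed names and toy data, how edge lists from experimental, curated, and predicted sources might be merged so that each interaction records which sources support it.

```python
from collections import defaultdict


def merge_networks(*sources):
    """Merge (source_name, edge_list) pairs into one undirected network.

    Returns a dict mapping frozenset({node_a, node_b}) to the set of
    source names that report that interaction.
    """
    network = defaultdict(set)
    for source_name, edges in sources:
        for node_a, node_b in edges:
            # frozenset makes (a, b) and (b, a) the same edge
            network[frozenset((node_a, node_b))].add(source_name)
    return dict(network)


# Toy edge lists; the entries are placeholders for illustration only.
experimental = ("experimental", [("etoposide", "TOP2A")])
curated = ("curated", [("etoposide", "TOP2A"), ("TOP2A", "TOP2B")])
predicted = ("predicted", [("TOP2A", "etoposide")])

network = merge_networks(experimental, curated, predicted)
```

Keying edges by an unordered pair means the same interaction reported in either direction by different sources collapses into a single link with pooled evidence.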
2.3 PubChem
PubChem is one of the prominent public databases, with a special
emphasis on providing the scientific research community with
information about chemical substances, their compound structures,
and their biological activities. This database commenced in 2004
as a public repository hosted by the National
Center for Biotechnology Information (NCBI), a research center
of the National Library of Medicine, which is part of the US
National Institutes of Health (NIH). Over more than a decade of
continued growth through deposition of data by researchers
worldwide in academia, industry, and government
agencies, the volume of the database has become massive. Thus, at
present PubChem comprises three component databases; though
2.3.1 PubChem BioAssay Database
The PubChem BioAssay database contains bioactivity screens of
small molecules and RNAi screening data. The bioactivities stored
in each bioassay are indexed by an assay ID (AID) serving as the
primary accession. At present it is a vital and highly comprehensive
information resource for biological screening results contributed by
the NIH Molecular Library Program, other public research
organizations, and industrial companies to aid in drug discovery and
chemical biology research. It is integrated with all other databases
at the NCBI including PubMed, Protein, Gene, and so forth for a
unified approach to data exploration and discovery. Several recent
developments of PubChem BioAssay include the expansion of the
sources of bioactivity data, resynchronization of the BioAssay record
page, addition of a new BioAssay classification browser (Fig. 2a), as
well as new features for its upload system to facilitate data sharing.
The database is equipped with many services to execute and display
analyses of bioactivity data from within a Web browser (Table 2).
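Because each bioassay is addressed by its AID, a record can also be requested programmatically. The sketch below builds a PUG-REST request for an assay description; the base path `https://pubchem.ncbi.nlm.nih.gov/rest/pug` and the `assay/aid/<AID>/description/JSON` pattern are taken from PubChem's PUG-REST documentation rather than from this chapter, so they should be checked against the current documentation. The helper names and the AID value are illustrative assumptions.

```python
from urllib.request import urlopen

# Base path per PubChem's PUG-REST documentation (not stated in this chapter).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_url(aid, operation="description", fmt="JSON"):
    """Build a PUG-REST URL for one bioassay record, addressed by its AID."""
    aid = int(aid)
    if aid <= 0:
        raise ValueError("AID must be a positive integer")
    return f"{PUG_REST}/assay/aid/{aid}/{operation}/{fmt}"


def fetch(url, timeout=30):
    """Download the resource at url and return it as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# AID 1000 is an arbitrary example value, not an assay discussed in the text.
example_url = assay_url(1000)
```

Calling `fetch(example_url)` would return the JSON text of the assay description when a network connection is available; the URL builder itself works offline.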
Fig. 2 (a) PubChem BioAssay classification browser. (b) Snapshots of “limit search” and “advanced search”
interfaces both in PubChem Substance and PubChem Compound databases
2.3.2 PubChem Substance and Compound Databases
The PubChem Substance database stores information
provided by a depositor; thus a PubChem Substance summary
page is based on the data submitted by an individual
depositor. A depositor may be a pharmaceutical company, an
academic laboratory, or a governmental research institute, to name a
few. The raw deposition of data is not subject to quality control or
review before public release. The data includes a chemical structure,
that is, the arrangement of atoms and bonds between atoms, and it
may include other packaging or delivery-related information, such
as the salt form of the substance that is used. In contrast, internally
reviewed chemical information is stored in PubChem Compound
to clarify substances in PubChem Substance. In addition, structures
are preclustered and cross-referenced by identity and similarity
groups in the PubChem Compound database. In this compound
database, a compound summary page displays data
organized by NCBI automated data processing, and in turn
serves as a hub of information for each unique chemical structure.
The primary identifiers for a substance and a compound are SID
and CID, respectively. A substance identifier (SID) is the
Table 2
Tool | URL | Description
BioAssay FTP | ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/ | FTP for all PubChem BioAssay records and related information
BioAssay data model | ftp://ftp.ncbi.nlm.nih.gov/pubchem/data_spec/ | Standard XML data specification for PubChem, BioAssay data
BioAssay classification | https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=classification | To browse BioAssay classification tree
Bioactivity data tool | https://pubchem.ncbi.nlm.nih.gov/assay/ | To retrieve a full data table from a single bioassay record
Structure–activity analysis (SAR) | https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=heat | To analyze and visualize structure–activity relationship with clustering tools and a heatmap-style display
Dose–response curve tool | https://pubchem.ncbi.nlm.nih.gov/assay/plot.cgi?Plottype=1 | To analyze bioassay test results and visualize dose–response curve
BioActivity summary - compound-centric | https://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=1 | To summarize and analyze bioactivity data for a set of records, presented from the compound point of view
BioActivity summary - assay-centric | https://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=2 | To summarize and analyze bioactivity data for a set of records, presented from the assay point of view
BioActivity summary - target-centric | https://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.cgi?tab=3 | To summarize and analyze bioactivity data for a set of records, presented from the target point of view
2.4.1 Data Content
The data content of this online resource grows continuously; release
22 published in August 2016 contains information that is extracted
from more than 65,000 scientific articles, along with 50 stored data
sets (Table 4). To be more specific, this resource at present organizes
1,686,695 distinct compounds of which 1,678,393 (99.5%) have
molecular structure stored and available. In addition, the newest
release represents more than 14 million activity values from
Table 3
Tool | URL | Description
Chemical structure search | https://pubchem.ncbi.nlm.nih.gov/search/search.cgi | Allows users to query the PubChem Compound database by chemical structure or chemical structure pattern.
Chemical structure sketcher | https://pubchem.ncbi.nlm.nih.gov/edit/ | A platform-independent 2D molecule drawer, compatible with major web browsers.
Standardization service | https://pubchem.ncbi.nlm.nih.gov/standardize/ | Validates and normalizes an input chemical structure in the same way as the PubChem standardization process.
Classification browser | https://pubchem.ncbi.nlm.nih.gov/classification/ | Allows users to browse PubChem data using a classification of interest, or search for records annotated with the desired classification/term.
Identifier exchange service | https://pubchem.ncbi.nlm.nih.gov/idexchange/ | Converts one type of identifier for a given set of chemical structures into a different type of identifier for identical or similar chemical structures.
Score matrix service | https://pubchem.ncbi.nlm.nih.gov/score_matrix/ | Computes matrices of 2D and 3D similarity scores for a given set of compounds.
Structure clustering | https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=clustering | Clusters compounds/substances based on their structural similarity using the single linkage algorithm.
Widgets | https://pubchem.ncbi.nlm.nih.gov/widget/docs/ | Provides a rapid way to display some commonly requested PubChem data views.
Web-based 3D viewer | https://pubchem.ncbi.nlm.nih.gov/vw3d/ | An interactive web-based viewer for 3D conformations of molecules, which visualizes 3D information available within PubChem.
Pc3D viewer | https://pubchem.ncbi.nlm.nih.gov/pc3d/ | An interactive 3D molecular viewer that can be downloaded and installed on local machines.
Structure download | https://pubchem.ncbi.nlm.nih.gov/pc_fetch/ | Downloads a set of substance or compound records in PubChem.
Power user gateway (PUG) | https://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html | Provides programmatic access to PubChem services via a single common gateway interface (CGI), called "pug.cgi".
PUG-REST | https://pubchem.ncbi.nlm.nih.gov/pug_rest/ | A representational state transfer (REST)-style web service access layer to PubChem.
PUG-SOAP | https://pubchem.ncbi.nlm.nih.gov/pug_soap/ | A web service access method that uses the simple object access protocol (SOAP).
PubChemRDF | https://pubchem.ncbi.nlm.nih.gov/rdf/ | The RDF-based resource compatible with semantic web standards and technologies.
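As an illustration of the PUG-REST entry in the table, the sketch below composes two request URLs: one that looks up computed properties for a compound by name, and one that lists the CIDs linked to a given SID. The URL patterns and the property names `MolecularFormula` and `MolecularWeight` follow PubChem's PUG-REST documentation as the editor understands it, not this chapter, and the function names and the SID value are hypothetical.

```python
from urllib.parse import quote

# Base path per PubChem's PUG-REST documentation (assumption, verify before use).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """URL requesting selected computed properties for a compound name."""
    encoded_name = quote(name, safe="")  # escape spaces and slashes in names
    return f"{PUG_REST}/compound/name/{encoded_name}/property/{','.join(properties)}/JSON"


def substance_to_cids_url(sid):
    """URL listing the compound IDs (CIDs) associated with one substance ID (SID)."""
    return f"{PUG_REST}/substance/sid/{int(sid)}/cids/JSON"


etoposide_url = compound_property_url("etoposide")
sid_url = substance_to_cids_url(12345)  # 12345 is a placeholder SID
```

Either URL can be pasted into a Web browser, which keeps the approach consistent with the browser-only access emphasized in this chapter.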
Table 4
Data sources included in the ChEMBL release 22
2.4.2 Data Access
ChEMBL is accessible from the European Bioinformatics Institute
(EMBL-EBI) home page under the service section on Tools and
Databases (see Table 1). The ChEMBL interface supports simple
browsing and keyword text searches (Fig. 3). It provides versatile
tools: the primary ChEMBL database provides bioactivity data to
facilitate drug discovery, SureChEMBL is dedicated to chemical
structures from patents, and UniChem is useful for integrating
chemical structures across a number of public sources.
In addition, the SARfari collections deal with system-level
views of kinases, GPCRs, and ADME biology, and DrugEBIlity
provides a way for users to prioritize drug targets. Thus,
these versatile tools make data access, exploration, retrieval, and
analysis more user friendly and systematic for compounds,
targets, or assays deposited in ChEMBL.
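A keyword text search can also be expressed as a URL. The sketch below assumes the ChEMBL web services expose a molecule search endpoint at `https://www.ebi.ac.uk/chembl/api/data/molecule/search` that accepts `q` and `format` query parameters; this endpoint is not described in the chapter and should be confirmed against the current ChEMBL web services documentation. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed ChEMBL web services endpoint; confirm against the current documentation.
CHEMBL_MOLECULE_SEARCH = "https://www.ebi.ac.uk/chembl/api/data/molecule/search"


def chembl_search_url(keyword, fmt="json"):
    """Build a keyword search URL for molecules, e.g. for 'etoposide'."""
    query = urlencode({"q": keyword, "format": fmt})
    return f"{CHEMBL_MOLECULE_SEARCH}?{query}"


search_url = chembl_search_url("etoposide")
```

`urlencode` takes care of escaping spaces and special characters, so multi-word keywords can be passed unchanged.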
2.5 ChEBI
2.5.1 Overview of Database
Chemical Entities of Biological Interest, also known as ChEBI [12],
is maintained by EMBL-EBI. This database manually annotates
small molecular entities, where a molecular entity is defined as any
constitutionally or isotopically distinct atom, molecule, ion, ion
pair, radical, radical ion, complex, conformer, etc. identifiable as a
separately distinguishable entity.
This database provides information on molecules based on
chemical structure and nomenclature. An ontology is used to describe
the relations among different molecules. For example, if A, B, and C
are three compounds, there might be the relations that A is a
conjugate acid of B, and B is a tautomer of C. For nomenclature
and terminology, ChEBI follows the guidelines of
the International Union of Pure and Applied Chemistry (IUPAC)
and the Nomenclature Committee of the International Union of
Biochemistry and Molecular Biology (NC-IUBMB).
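The A, B, C example can be written down as subject, relation, object triples. The following sketch is not ChEBI's data model; it assumes a plain dictionary-based graph and a small, hand-written table of reciprocal relations to show how one stated relation implies another.

```python
from collections import defaultdict

# Reciprocal relations (assumed for this illustration).
INVERSE = {
    "is conjugate acid of": "is conjugate base of",
    "is conjugate base of": "is conjugate acid of",
    "is tautomer of": "is tautomer of",  # symmetric relation
}


def build_graph(triples):
    """Store each (subject, relation, object) triple plus its reciprocal."""
    graph = defaultdict(set)
    for subject, relation, obj in triples:
        graph[subject].add((relation, obj))
        inverse = INVERSE.get(relation)
        if inverse is not None:
            graph[obj].add((inverse, subject))
    return dict(graph)


# The relations from the text: A is a conjugate acid of B, B is a tautomer of C.
triples = [
    ("A", "is conjugate acid of", "B"),
    ("B", "is tautomer of", "C"),
]
graph = build_graph(triples)
```

Looking up `graph["B"]` then yields both that B is a conjugate base of A and that B is a tautomer of C, even though only two relations were entered.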
Searching ChEBI
There are two types of search in ChEBI. One is the quick search,
where simply a keyword for a compound is provided as input, e.g.,
“etoposide.” This is the most convenient one. The other type of
2.6 ChemSpider
2.6.1 Overview of Database
ChemSpider [13] was initially developed with the goal of accumulating
and indexing the available sources of chemical structures and their
respective information in a single database.
Launched in 2007 to build a structure-oriented platform for
chemists, ChemSpider currently holds
more than 58 million unique chemical structures derived from
484 sources ranging from chemical vendors to commercial database
vendors and publishers, and members of the Open Notebook
Science community. Through interlinked connections, ChemSpider
can provide important data beyond chemical structure, including
interactive spectra, crystallographic data, patents, and so forth.
2.6.2 Database Access
To access the database a Web browser is needed, and visiting
the following link will take the user to the ChemSpider home page:
http://www.chemspider.com/.
3 Methods
3.1.1 Searching Data
A user can search ChemProt by typing a compound name in the
“compound” field, by either protein sequence or
UniProt identifier, by a common disease name, by a side effect, or
by ATC (Anatomical Therapeutic Chemical Classification System)
code (Fig. 5a). The results vary according to the
search option; for instance, if etoposide is searched as a query
compound, ChemProt automatically looks for similar compounds
in the database (based on SEA) and displays these data in
conjunction with etoposide. In Fig. 5b, the heatmap represents the
combined data for etoposide and protein interaction, where the
horizontal axis represents associated proteins and the vertical axis
represents bioactivity data. The color of the heatmap represents
the strength of interaction, i.e., blue and orange represent
weak and strong interaction, respectively (Fig. 5b). Please see Note 3
for generating a new heatmap based on substructures within a
query compound and target collection.
On the other hand, searching by side effect or ATC code will
return all chemicals in the database associated with such a side effect
or ATC code respectively. Similarly, searching for a disease will
Fig. 5 (a) Home page of ChemProt with etoposide as a query compound. (b) The etoposide–protein interaction
heatmap for disease-associated proteins. Here the horizontal axis represents associated proteins and the
vertical axis represents bioactivity data. Colors of the heatmap represent the strength of interaction, i.e., blue
and orange colors represent weak and strong interactions, respectively
3.1.2 Analyze the Heatmap Data
The heatmap in Fig. 5b displays the associations between etoposide
and disease-related proteins. To navigate the function-related
or pathway-related protein associations with the query compound,
the user selects the corresponding option from
the annotated protein bar.
By clicking on the “flag” logo next to a compound name, a user
can get access to the chemical structure of the compound and upon
selecting the specific structure from the structure list, detailed
chemical information for the queried compound will appear (see
Note 4). Here for the etoposide example, sets of chemical
information are found as shown in Fig. 6.
By clicking on the “fingerprint” logo in the vicinity of the
compound name, a chemical structure similarity profiling can be
performed, enabling the user to visualize and to navigate within
that chemical space.
A detailed bioactivity profile is available for each of the listed
compounds of ChemProt based on Ki, AC50, or IC50 value. For
the etoposide example, the bioactivity information available,
including the total number of associated proteins and interactions with
etoposide, is as shown in Fig. 7.
Step-3: Data Acquisition
From the “Download list” icon, a user can download all of the
available data in CSV format; this covers the sources of data,
ChemProt ID, chemical structure in SMILES notation, UniProt
name, SEA values, and other related information for the queried
compound as well as other similar compounds listed.
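Once the CSV file is downloaded, it can be read with standard tools. The sketch below uses Python's `csv` module on a small in-memory sample. The column names (`source`, `chemprot_id`, `smiles`, `uniprot_name`, `sea_value`) and every value in the sample are hypothetical placeholders; the actual header of a ChemProt download should be inspected first and the names adjusted accordingly.

```python
import csv
import io

# Hypothetical sample mimicking the kinds of fields described in the text.
SAMPLE_CSV = """source,chemprot_id,smiles,uniprot_name,sea_value
ChEMBL,CP0001,CCO,TOP2A_HUMAN,0.92
DrugBank,CP0001,CCO,TOP2B_HUMAN,0.41
ChEMBL,CP0002,CCN,TOP2A_HUMAN,0.77
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_by_sea(rows, min_value):
    """Keep rows whose (hypothetical) sea_value column is at least min_value."""
    return [row for row in rows if float(row["sea_value"]) >= min_value]


rows = load_rows(SAMPLE_CSV)
strong = filter_by_sea(rows, 0.5)
```

For a real download, `load_rows(open(path, newline="").read())` would replace the in-memory sample, with `path` pointing to the saved file.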