Biological Databases


Introduction to Biological Databases

1. Introduction

As biology has increasingly turned into a data-rich science, the need for storing and
communicating large datasets has grown tremendously. The obvious examples are the
nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray
crystallography and macromolecular NMR. A new field of science dealing with issues,
challenges and new possibilities created by these databases has emerged: bioinformatics.

Bioinformatics is the application of information technology to store, organize and analyze
the vast amount of biological data that is available in the form of sequences and
structures of proteins (the building blocks of organisms) and nucleic acids (the information
carriers). The biological information of nucleic acids is available as sequences, while the
data on proteins is available as both sequences and structures. A sequence is a
one-dimensional representation, whereas a structure holds the three-dimensional data for
that sequence.

Sequences and structures are only two of the several different types of data required in
the practice of modern molecular biology. Other important data types include metabolic
pathways and molecular interactions, mutations and polymorphisms in molecular sequences
and structures, organelle structures and tissue types, genetic maps, physicochemical
data, gene expression profiles, two-dimensional DNA chip images of mRNA expression, and
two-dimensional gel electrophoresis images of protein expression.

A biological database is a collection of data that is organized so that its contents can easily
be accessed, managed, and updated. There are two main functions of biological databases:

Make biological data available to scientists.

o As much as possible of a particular type of information should be available
in one single place (a book, a site, or a database). Published data may be
difficult to find or access, and collecting it from the literature is very
time-consuming. Moreover, not all data is actually published explicitly in an
article (genome sequences, for example).

Make biological data available in computer-readable form.

o Since analysis of biological data almost always involves computers, having
the data in computer-readable form (rather than printed on paper) is a
necessary first step.

Data Domains

Types of data generated by molecular biology research:
Nucleotide sequences (DNA and mRNA)
Protein sequences
3-D protein structures
Complete genomes and maps

Winter School on "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets"

Also now have:
Gene expression
Genetic variation (polymorphisms)

2. Biological Databases

When Sanger first developed a method to sequence proteins, there was a lot of
excitement in the field of molecular biology. Initial interest in bioinformatics was
propelled by the need to create databases of biological sequences.

Biological databases can be broadly classified into sequence and structure databases.
Sequence databases cover both nucleic acid sequences and protein sequences,
whereas structure databases chiefly concern proteins. The first database was created
within a short period after the insulin protein sequence was made available in 1956.
Incidentally, insulin was the first protein to be sequenced. The sequence of insulin consists
of just 51 residues (analogous to the letters in a sentence). Around the mid-1960s, the
first nucleic acid sequence, that of a yeast tRNA with 77 bases (the individual units of
nucleic acids), was determined. During this period, three-dimensional structures of
proteins were being studied, and the well-known Protein Data Bank was established in 1971
as the first protein structure database, with only 7 entries. It has now grown into
a large database with over 10,000 entries. While the initial databases of protein
sequences were maintained at individual laboratories, the development of a
consolidated formal database, the SWISS-PROT protein sequence database, was
initiated in 1986. It now has about 70,000 protein sequences from more than 5000
organisms, a small fraction of all known organisms. This huge variety of
divergent data resources is now available for study and research by both academic
institutions and industry. The data are made available as public domain information in the
larger interest of the research community through the Internet (www.ncbi.nlm.nih.gov) and
on CD-ROM (on request from www.rcsb.org). These databases are constantly updated with
additional entries.

Databases in general can be classified into primary, secondary and composite databases.
A primary database contains information on the sequence or structure alone. Examples
include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide
sequences, and the Protein Data Bank for protein structures.

A secondary database contains information derived from a primary database. A
secondary sequence database contains information such as the conserved sequences, signature
sequences and active site residues of protein families, arrived at by multiple sequence
alignment of a set of related proteins. A secondary structure database contains entries of
the PDB in an organized way. Its entries are classified according to their
structure, for example all-alpha proteins, all-beta proteins, etc. It also contains
information on conserved secondary structure motifs of a particular protein. Some of the
secondary databases created and hosted by researchers at their individual laboratories are
SCOP, developed at Cambridge University; CATH, developed at University College
London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.

A composite database amalgamates a variety of different primary database sources, which
obviates the need to search multiple resources. Different composite databases use different
primary databases and different criteria in their search algorithms. Various search options
have also been incorporated in the composite databases. The National Center for
Biotechnology Information (NCBI), which hosts these nucleotide and protein databases on
a large, highly available, redundant array of computer servers, provides free access to
anyone involved in research. It also links to OMIM (Online Mendelian
Inheritance in Man), which contains information about the genes and proteins involved in
genetic diseases.

2.1 Primary Nucleotide Sequence Repositories - GenBank, EMBL, DDBJ

These are the three chief databases that store and make available raw nucleic acid sequences.
GenBank is physically located in the USA and is accessible through the NCBI portal over
the Internet. EMBL (European Molecular Biology Laboratory) is in the UK and DDBJ (DNA
Data Bank of Japan) is in Japan. They have similar (but not identical) data formats and
exchange data on a daily basis. Here we describe one of the database formats, GenBank,
in detail. Access to GenBank, as to all databases at NCBI, is through the Entrez search
program. This front-end search interface allows a great variety of search options.



The accession number field contains a unique identification number. The
sequence and the other information may be retrieved from the database simply by
searching for a given accession number. Taking the field names in order, we have first of
all the word LOCUS. This is a GenBank title that names the sequence entry. Apart from
the entry name, the line also specifies the number of bases in the entry, the nucleic acid
type, a codeword PRI that indicates the sequence is from a primate, and the date on which
the entry was made. PRI is one of the 17 division codes that are used to classify the data.
The next line of the file contains the definition of the entry, giving the name of the
sequence. The unique accession number comes next, followed by a version number in case the
entry has gone through more than one version.
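The fields of the LOCUS line described above can be picked apart with a few lines of code. The following is only a minimal sketch: the function name parse_locus and the example line are invented for illustration, and the sketch assumes the fields are separated by whitespace with the division code and date as the last two items.

```python
def parse_locus(line):
    """Split a GenBank-style LOCUS line into the fields described in the text."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    bp_index = tokens.index("bp")            # the length is the token just before "bp"
    return {
        "name": tokens[1],                   # name of the sequence entry
        "length": int(tokens[bp_index - 1]), # number of bases
        "molecule": tokens[bp_index + 1],    # nucleic acid type, e.g. DNA or mRNA
        "division": tokens[-2],              # classification code, e.g. PRI for primate
        "date": tokens[-1],                  # date on which the entry was made
    }


# Hypothetical LOCUS line, invented for illustration (not a real database entry).
example = "LOCUS       EXAMPLE01    1200 bp    mRNA    PRI    01-JAN-2000"
record = parse_locus(example)
print(record)
```

Real entries can carry extra tokens on this line, so a production parser would follow the published column layout rather than whitespace splitting.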




The next item is a list of specially defined keywords that are used to index the entries. Next
comes a set of SOURCE records, which describe the organism from which the sequence was
extracted. The complete scientific classification is given. This is followed by publication
details.

In the beginning, sequences were extracted from the published literature and painstakingly
entered in the database. Each entry was therefore associated with a publication. The
features table includes coding regions, exons, introns, promoters, alternate splice patterns,
mutations, variations and a translation into protein sequence, if the sequence codes for
one. Each feature may be accompanied by a cross-reference to another database. After the
feature table, a single line gives the base count statistics for the sequence. Finally comes
the sequence itself. The sequence is written in lower case and, for ease of reading, each
line is divided into six columns of ten bases each. A number at the left of each line gives
the position of the first base on that line.
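The sequence layout described above (a position number followed by six columns of ten lower-case bases) is easy to read back into a single string. The sketch below assumes exactly that layout; the function name read_sequence and the example lines are made up for illustration.

```python
from collections import Counter


def read_sequence(lines):
    """Join sequence lines into one string, dropping the position number
    at the left and the spaces between the ten-base columns."""
    bases = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        # parts[0] is the position of the first base on this line.
        bases.extend(parts[1:])
    return "".join(bases)


# Invented sequence lines in the layout described above (not a real entry).
example_lines = [
    "        1 acgtacgtac ggttaaccgg ttaaccggtt acgtacgtac ggttaaccgg ttaaccggtt",
    "       61 acgtacgtac ggttaaccgg",
]
sequence = read_sequence(example_lines)
base_count = Counter(sequence)   # the same kind of statistic as the base count line
print(len(sequence), dict(base_count))
```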





The above description does not cover all the fields used in GenBank, but only the more
important ones.

2.2 Primary Protein Sequence Repositories

PIR-PSD, the Protein Information Resource Protein Sequence Database, at the NBRF
(National Biomedical Research Foundation, USA), and SWISS-PROT at the SIB (Swiss
Institute of Bioinformatics), Switzerland, are protein sequence databases.

The PIR-PSD is a collaborative endeavour between the PIR, the MIPS (Munich
Information Centre for Protein Sequences, Germany) and the JIPID (Japan International
Protein Information Database, Japan). The PIR-PSD is now a comprehensive,
non-redundant, expertly annotated database held in an object-relational DBMS. It is
available at http://pir.georgetown.edu/pirww. A unique characteristic of the PIR-PSD is its
classification of protein sequences based on the superfamily concept. Sequences in
PIR-PSD are also classified based on homology domains and sequence motifs. Homology
domains may correspond to evolutionary building blocks, while sequence motifs represent
functional sites or conserved regions. This classification approach allows a more complete
understanding of sequence-structure-function relationships.


The other well-known and extensively used protein database is SWISS-PROT
(http://www.expasy.ch/sprot). Like the PIR-PSD, this curated protein sequence
database also provides a high level of annotation. The data in each entry can be considered
separately as core data and annotation. The core data consists of the sequence, entered in
the common single-letter amino acid code, and the related references and bibliography. The
taxonomy of the organism from which the sequence was obtained also forms part of this
core information. The annotation contains information on the function or functions of the
protein; post-translational modifications such as phosphorylation, acetylation, etc.;
functional and structural domains and sites, such as calcium-binding regions, ATP-binding
sites, zinc fingers, etc.; known secondary structural features, for example alpha helix,
beta sheet, etc.; the quaternary structure of the protein; similarities to other proteins, if
any; and diseases associated with the protein. Sequence variants that arise because
different authors published different sequences for the same protein, or because of
mutations in different strains of an organism, are also described as part of the
annotation.
Line codes in the SWISS-PROT database:

Code        Expansion               Remarks
ID          Identification          Occurs at the beginning of the entry. Contains a
                                    unique name for the entry, plus information on the
                                    status of the entry. If it has been checked and
                                    conforms to SWISS-PROT standards, it is called
                                    STANDARD.
AC          Accession numbers       A stable way of identifying the entry. The name
                                    may change but not the AC. If the line has more
                                    than one number, the entry was constituted by
                                    merging other entries.
DT          Date                    There are three dates, corresponding to the creation
                                    date of the entry and the modification dates of the
                                    sequence and of the annotation respectively.
DE          Description             Lines that start with this identifier contain a
                                    general description of the sequence.
GN          Gene name               The name of the gene (or genes) that codes for the
                                    protein.
OS, OG, OC  Organism name,          The name and taxonomy of the organism, and
            organelle, organism     information regarding the organelle containing the
            classification          gene, e.g. mitochondria or chloroplast.
RN, RP, RC, Reference number,       Bibliographic reference to the sequence. This
RX, RA, RT, position, comments,     includes information (following the code RP) on
RL          cross-reference,        the extent of the work carried out by the authors.
            authors, title and
            location
CC          Comments                Free-text comments that provide any relevant
                                    information pertaining to the entry.
DR          Database                Cross-references to other databases where
            cross-reference         information regarding this entry is also found, for
                                    example structural information for the protein in
                                    the PDB.
KW          Keywords                A list of keywords that can be used in indexes.
                                    Search programs very often simply go through such
                                    indices to identify the required information.
FT          Features table          These lines describe regions or sites of interest in
                                    the sequence, e.g. post-translational modifications,
                                    binding sites, enzyme active sites and local
                                    secondary structures.
SQ          Sequence header         Indicates the beginning of the sequence data and
                                    gives a brief summary of its contents.
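Because every line of an entry begins with a two-letter code, an entry can be sorted into fields by looking only at the first two characters of each line. The sketch below illustrates the idea; the function name group_by_line_code and the entry fragment are invented for illustration, and the sketch assumes that a line starting with "//" ends the entry.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue                      # skip blank lines and the terminator
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


# Invented entry fragment in the line-code style (not a real database entry).
example_entry = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   51 AA.
AC   X00001; X00002;
DE   Hypothetical example protein.
GN   EXA1.
OS   Homo sapiens (Human).
KW   Hypothetical; Example.
//"""

fields = group_by_line_code(example_entry)
# More than one accession number means the entry was formed by merging entries.
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
print(accessions)
```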




Both PIR-PSD and SWISS-PROT have software that enables the user to search
through the database easily and obtain only the required information. SWISS-PROT has the
SRS, or Sequence Retrieval System, which also searches through the other relevant
databases on the site, such as TrEMBL.

TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is
released as a supplement to SWISS-PROT. It contains the translation of all coding
sequences present in the EMBL Nucleotide database, which have not been fully annotated.
Thus it may contain the sequence of proteins that are never expressed and never actually
identified in the organisms.
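Translating a coding sequence into a protein sequence is a purely mechanical step, which is why such entries can be generated by computer. The sketch below is not the procedure used to build TrEMBL; it simply assumes the standard genetic code, a reading frame that starts at the first base, and translation that stops at the first stop codon. The names CODON_TABLE and translate are chosen here for illustration.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(coding_sequence):
    """Translate a coding sequence codon by codon, stopping at the first stop codon.
    Codons containing unexpected characters are written as X."""
    seq = coding_sequence.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if amino_acid == "*":            # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate("atggccattgtaatgggccgctgaaagggtgcccgatag")
print(protein)
```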

2.3 Derived or Secondary databases of nucleotide sequences

Many of the secondary databases are simply sub-collections of sequences culled from one
or the other of the primary databases such as GenBank or EMBL. There is also usually a
great deal of value addition in terms of annotation, software, presentation of the
information and cross-references. There are other secondary databases that do not
present sequences at all, but only information gathered from sequence databases.

An example of the former type of database is FlyBase, or the Berkeley Drosophila
Genome Project (http://www.fruitfly.org). A consortium sequenced the entire genome of
the fruit fly D. melanogaster to a high degree of completeness and quality.

Another database that focuses on a single organism is ACeDB. More than a database, this
is a database management system that was originally developed for the C. elegans (a
nematode worm) genome project. It is a repository of not only the sequence, but also the
genetic map and phenotypic information about C. elegans.

The Comprehensive Microbial Resource maintained by TIGR (The Institute for Genomic
Research) at http://www.tigr.org allows access to a database called Omniome. Unlike the
databases above, which focus on one organism, Omniome contains all the completed
microbial genomes. It has not only the sequence and annotation
of each of the completed genomes, but also associated information about the organisms
(such as taxon and Gram stain pattern), the structure and composition of their DNA
molecules, and many other attributes of the protein sequences predicted from the DNA
sequences. The presence of all microbial genomes in a single database facilitates
meaningful multi-genome searches and analyses, for instance alignment of entire genomes
and comparison of the physical properties of proteins and genes from different genomes.

A database of the genomes of mitochondria and other such organelles is available at the
Organelle Genome Database at the University of Montreal, Canada, and is called
GOBASE (http://megasun.bch.umontreal.ca/gobase).

2.4 Derived or Secondary databases of amino acid sequences - Subcollections

A database focused on a particular protein family is GPCRDB
(http://rose.man.pozen.pl/aars/), which covers the G protein-coupled receptors. These are
transmembrane proteins used by cells to communicate with the outside world. They are
involved in vision, smell, hearing, taste and feeling. GPCRDB is in fact more than a
collection of sequences of the protein family. It includes additional data on multiple
sequence alignments, ligands and ligand-binding data, 3D models, mutation data, literature
references, disease patterns, cell lines, protocols, vectors, etc. It is a fully integrated
information system with data, and browsing and query tools.

MHCPep (http://wehih.wehi.edu.au/mhcpep/) is a database comprising over 13000
peptide sequences known to bind the Major Histocompatibility Complex of the immune
system. Each entry in the database contains not only the peptide sequence, which may be 8
to 10 amino acids long, but also information on the specific MHC molecules to
which it binds, the experimental method used to assay the peptide, the degree of activity
and the binding affinity observed, the source protein that, when broken down, gave rise to
this peptide along with others, the positions along the peptide where it anchors on the MHC
molecules, and references and cross-links to other information.

The CluSTr (Clusters of SWISS-PROT and TrEMBL proteins, at http://ebi.ac.uk.clustr)
database offers an automatic classification of the entries in the SWISS-PROT and
TrEMBL databases into groups of related proteins. The clustering is based on the analysis
of all pairwise comparisons between protein sequences.
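The idea of grouping proteins from all pairwise comparisons can be illustrated with a small single-linkage clustering sketch. This is not the CluSTr algorithm: the similarity score (Python's difflib ratio), the threshold of 0.7, the function names and the four example sequences are all assumptions made here for illustration. A real system would use alignment scores.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity score between 0 and 1 (not an alignment score)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster(sequences, threshold=0.7):
    """Single-linkage clustering: any pair scoring at or above the threshold
    ends up in the same group."""
    parent = {name: name for name in sequences}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Compare every pair of sequences and merge the groups of similar pairs.
    for a, b in combinations(sequences, 2):
        if similarity(sequences[a], sequences[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in sequences:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Invented sequences: P1/P2 and P3/P4 each differ by a single residue.
proteins = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFARQ",
    "P3": "GSSGSSGWWPLLCDDEEHHNNT",
    "P4": "GSSGSSGWWPLLCDDEEHHNQT",
}
clusters = cluster(proteins)
print(clusters)
```

The number of comparisons grows with the square of the number of sequences, which is why such classifications are computed automatically and in advance.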

Similar to CluSTr is COGs, the Clusters of Orthologous Groups database, which is
accessible at http://ncbi.nlm.nih.gov/COG. An orthologous group of proteins is one in
which the members are related to each other by evolutionary descent. Such orthology may
not be just from one protein to another, and then to another and so on down the line. It may
involve one-to-many and many-to-many evolutionary relationships, hence the term
groups. COGs is thus a database of phylogenetic relationships. The approximately 2500
groups have been divided into 17 broad categories. The utility of COGs, as of CluSTr, is
that it helps assign function to new protein sequences without going through tedious
biochemical discovery processes.

2.5 Derived or Secondary databases of amino acid sequences - Patterns and Signatures

A set of databases collects together patterns found in protein sequences rather than the
complete sequences. The patterns are identified with particular functional and/or structural
domains in the protein, for example an ATP-binding site or the recognition site of a
particular substrate. The patterns are usually obtained by first aligning a multitude of
sequences through multiple alignment techniques. This is followed by further processing
by different methods, depending on the particular database.

PROSITE is one such pattern database, accessible at
http://www.expasy.ch/prosite. The protein motifs and patterns are encoded as regular
expressions. The information corresponding to each entry in PROSITE takes two
forms: the pattern and the related descriptive text. The regular expression is placed in a
format reminiscent of the SWISS-PROT entries, with a two-letter identifier at the beginning
of each line specifying the type of information the line contains. The expression itself is
placed on the line identified by PA. The entry also contains references and links to all the
protein sequences that contain that pattern. The related descriptive text is placed in a
documentation file, with the accession number making the connection to the expression
data.
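Since the patterns are regular expressions, they can be converted to the syntax of a general-purpose regular expression engine and used to scan sequences. The sketch below assumes the usual PROSITE notation, which is not described in the text above: elements separated by "-", "x" for any residue, "[...]" for a set of allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repetition, and "<" or ">" for the two ends of the sequence. The function name prosite_to_regex and the test sequence are invented, and the pattern shown should be treated as illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unrecognised element: {element!r}")
        residues, low, high = match.groups()
        if residues == "x":
            core = "."                       # any residue
        elif residues.startswith("{") and residues.endswith("}"):
            core = "[^" + residues[1:-1] + "]"   # any residue except these
        else:
            core = residues                  # one residue or a [..] set
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# Illustrative zinc-finger-like pattern written in PROSITE notation.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
# Invented test sequence that contains one occurrence of the pattern.
hit = re.search(regex, "MKCAACAAALAAAAAAAAHAAAHKK")
print(regex, hit.span() if hit else None)
```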

In the PRINTS database (http://www.bioinfo.man.ac.uk/dbbrowser/PRINTS), the protein
sequence patterns are stored as fingerprints. A fingerprint is a set of motifs or patterns
rather than a single one. The information contained in a PRINTS entry may be divided
into three sections. In addition to the entry name, accession number and number of motifs,
the first section contains cross-links to other databases that have more information about
the characterized family. The second section provides a table showing how many of the
motifs that make up the fingerprint occur in how many of the sequences in that family. The
last section of the entry contains the actual fingerprint, stored as multiply aligned
sets of sequences, the alignment being made without gaps. There is therefore one set of
aligned sequences for each motif.
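The summary table in the second section can be pictured with the following sketch. It is a deliberately crude simplification: a motif is counted as present only if one of its aligned segments occurs verbatim in the sequence, whereas real fingerprint scanning scores approximate matches. The function names, the three motifs and the three family sequences are invented for illustration.

```python
from collections import Counter


def motifs_matched(sequence, fingerprint):
    """Count how many motifs of the fingerprint occur in the sequence.
    A motif counts as present if any of its aligned segments is found verbatim."""
    return sum(
        any(segment in sequence for segment in motif)
        for motif in fingerprint
    )


def summary_table(sequences, fingerprint):
    """Map 'number of motifs matched' to 'number of sequences'."""
    return Counter(motifs_matched(seq, fingerprint) for seq in sequences)


# Invented fingerprint: each motif is a small ungapped set of aligned segments.
fingerprint = [
    ["GKTL", "GKSL"],   # motif 1
    ["DEAD", "DEAH"],   # motif 2
    ["SATW", "TATW"],   # motif 3
]
# Invented family members.
family = [
    "MMGKTLAAADEADQQQSATWKK",   # contains all three motifs
    "MMGKSLAAADEAHQQQPPPPKK",   # contains motifs 1 and 2
    "MMAAAAAAADEADQQQTATWKK",   # contains motifs 2 and 3
]
table = summary_table(family, fingerprint)
print(dict(table))
```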

The ProDom protein domain database (http://www.toulouse.inrs.fr/prodom.html) is a
compilation of homologous domains that have been automatically identified by sequence
comparison and clustering methods using the program PSI-BLAST. No identification of
patterns is made. The focus here is on complete and self-contained structural
domains, and the search methods include signals for such features. A graphical user
interface allows easy interactive analysis of structural, and therefore functional, homology
relationships among protein sequences.

A database called Pfam contains profiles built using hidden Markov models (HMMs)
(http://www.sanger.ac.uk/Software/Pfam). HMMs build the model of the pattern as a series
of match, substitute, insert or delete states, with scores assigned for the alignment to go
from one state to another. Each family or pattern defined in Pfam consists of four
elements. The first is the annotation, which has information on the source used to make the
entry, the method used and some numbers that serve as figures of merit. The second is the
seed alignment that is used to bootstrap the rest of the sequences into the multiple
alignment and then the family. The third is the HMM profile. The fourth element is the
complete alignment of all the sequences identified in that family.

2.6 Structure Databases

Structure databases, like sequence databases, come in two varieties, primary and
secondary. Strictly speaking, there is only one database that stores primary structural data
of biological macromolecules, namely the PDB. In the context of this database, the term
macromolecule stretches to cover three orders of magnitude of molecular weight, from
1000 Daltons to 1000 kilodaltons. Small biological and organic molecules have their
structures stored in another primary structure database, the CSD, which is also widely used
in biological studies. This contains the three-dimensional structures of drugs, inhibitors and
fragments or monomers of macromolecules.

2.6.1 The primary structure databases - PDB and CSD

PDB stands for Protein Data Bank. In spite of the name, the PDB archives the
three-dimensional structures of not only proteins but also all biologically important
molecules, such as nucleic acid fragments, RNA molecules, large peptides such as the
antibiotic gramicidin, and complexes of proteins and nucleic acids. The database holds data
derived mainly from three sources. Structures determined by X-ray crystallography form the
large majority of the entries. These are followed by structures arrived at by NMR
experiments. There are also structures obtained by molecular modelling. The data in the PDB
is organized as flat files, one to a structure, which usually means that each file contains
one molecule or one molecular complex.
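In such a flat file, each atom occupies one line with its fields at fixed column positions, so the coordinates can be read by slicing each line. The sketch below assumes the column layout of ATOM records in the PDB file format (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-38, 39-46 and 47-54). The helper format_atom exists only to build an example line with the right columns; both function names and the atom values are invented for illustration.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build an example ATOM line with fields in the fixed columns (hypothetical helper)."""
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Read the main fields of an ATOM line by column position (0-based slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Invented atoms for illustration (not taken from a real structure).
lines = [
    format_atom(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    "REMARK lines and other record types are ignored",
]
atoms = [parse_atom(line) for line in lines if line.startswith("ATOM")]
print(atoms[1])
```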

The Cambridge Structural Database (CSD) was originally a project of the University of
Cambridge, set up to collect together the published three-dimensional structures of
small organic molecules. This excludes proteins and medium-sized nucleic acid fragments,
but small peptides such as neuropeptides, and monomers and dimers of nucleic acids, find
a place in the CSD. Currently the CSD holds crystal structure information for about 250,000
organic and metal-organic compounds. All these crystal structures have been obtained
using X-ray or neutron diffraction techniques. For each entry in the CSD there are three
distinct types of information stored. These are categorized as bibliographic information,
chemical connectivity information and the three-dimensional coordinates. The annotation
data field incorporates all of the bibliographic material for the particular entry and
summarizes the structural and experimental information for the crystal structure.

2.6.2 Derived or Secondary databases of biomolecular structures

Secondary structure databases include specialized sub-collections such as the NDB, and
collections of patterns and motifs such as SCOP, PALI, etc. NDB stands for Nucleic Acid
Database. It is a relational database of three-dimensional structures containing nucleic
acids. This encompasses DNA and RNA fragments, including those with unusual chemistry.
The structures are the same as those found in the PDB, and therefore the
NDB qualifies to be called a specialized sub-collection. However, unlike the PDB, the NDB
is much more than just a collection of files. The structures of
DNA have been classified into A, B and Z polymorphic forms, based on the information
specified by the authors. Other classes include RNA structures, unusual structures and
protein-nucleic acid complexes. These classes of structures are arranged in the form of an
ATLAS of Nucleic Acid Containing Structures, which can be browsed and searched to
obtain the structure or structures required. Each entry in the atlas has information on the
sequence, crystallisation conditions, references and details of the parameters and the figures
of merit used in structure solution. The entry has links not only to the coordinates but
also to automatically generated graphical views of the molecule. The NDB also has
archives of structural geometries calculated for all the structures or for a subset of them.
Finally, the database stores average geometrical parameters for nucleic acids, obtained
by statistical analysis of the structures. These parameters are widely used in computer
simulations of nucleic acids and their interactions. The NDB may be accessed at
http://ndbserve.rutgers.edu/NDB/.

The SCOP database (Structural Classification of Proteins:
http://scop.mrc-lmb.cam.ac.uk/scop/) is a manual classification of protein structures in a
hierarchical scheme with many levels. The principal levels are the family, the superfamily
and the fold. SCOP is a searchable and browsable database. In other words, one may either
enter SCOP at the top of the hierarchy and examine different folds and families as one
pleases, or supply a keyword or a phrase to search the database and retrieve the
corresponding entries. Once a structure, or a set of structures, has been selected, it may
be obtained or viewed as graphical images. Each entry also has other annotation regarding
function, etc., and links to other databases, including other structural classifications such
as CATH.

CATH stands for Class, Architecture, Topology and Homologous superfamily. The name
reflects the classification hierarchy used in the database. The structures chosen for
classification are a subset of the PDB, consisting of those that have been determined to a
high degree of accuracy.

Conclusion

The present challenge is to handle a huge volume of data, such as that generated by the
human genome project, to improve database design, to develop software for database access
and manipulation, and to devise data-entry procedures that compensate for the varied
computer procedures and systems used in different laboratories. There is no doubt that
bioinformatics tools for efficient research will have a significant impact on the biological
sciences and on the betterment of human lives.
