Bio in For Ma Tics
ABSTRACT:
Bioinformatics is the field that integrates biology and computing to accelerate the discovery
of new medical breakthroughs. A very wide-ranging area, bioinformatics includes
developing systems, databases, tools, and algorithms to collect, model, and analyze the
enormous amount of biological data available today.
Background
During the last decade, molecular biology has witnessed an information revolution as a
result both of the development of rapid DNA sequencing techniques and of the
corresponding progress in computer-based technologies, which are allowing us to cope
with this information deluge in increasingly efficient ways. The broad term that was
coined in the mid-1980s to encompass computer applications in biological sciences is
bioinformatics.
The term bioinformatics has been commandeered by several different disciplines to mean
rather different things. In its broadest sense, the term can be considered to mean
information technology applied to the management and analysis of biological data; this
has implications in diverse areas, ranging from artificial intelligence and robotics to
genome analysis. In the context of genome initiatives, the term was originally applied to
the computational manipulation and analysis of biological sequence data (DNA and/or
protein). However, in view of the recent rapid accumulation of available protein
structures, the term now tends also to be used to embrace the manipulation and analysis
of three-dimensional (3D) structural data.
At the beginning of 1998, more than 300,000 protein sequences had been deposited in
publicly available, non-redundant databases, and the number of partial sequences in
public (Boguski et al., 1994) and proprietary Expressed Sequence Tag (EST) databases
was estimated to run into the millions. By contrast, the number of unique 3D structures in the
Protein Data Bank (PDB) was still less than 1,500. Although structural information is far more
complex to derive, store and manipulate than sequence data, these figures
nevertheless highlight an enormous information deficit; this situation is likely to get
worse as the various genome projects around the world begin to bear fruit.
Figure: growth of non-redundant sequence data during the last decade and the corresponding
growth in the number of unique structures.
In the mid 1980s, the United States Department of Energy (DoE) initiated a number of
projects to construct detailed genetic and physical maps of the human genome, to
determine its complete nucleotide sequence, and to localize its estimated 100,000 genes.
Work on this scale required the development of new techniques and instrumentation for
detecting and analyzing DNA. By April 1998, however, the sequencing projects of only a
small number of relatively small genomes had been completed.
What is bioinformatics?
Bioinformatics is about:
In pursuing these goals, bioinformatics relies on information technology for the storage,
retrieval and analysis of data and for the simulation of biological processes.
Understanding bioinformatics requires some basic knowledge of biology,
especially genetics. So what is a genome? The complete set of instructions for making an
organism is called its genome. It contains the master blueprint for all cellular structures
and activities for the lifetime of the cell or organism. Found in every nucleus of a person's
many trillions of cells, the human genome consists of tightly coiled threads of
deoxyribonucleic acid (DNA) and associated protein molecules. If unwound and tied
together, the strands of DNA would stretch more than 5 feet but would be only 50
trillionths of an inch wide. For each organism, the components of these slender threads
encode all the information necessary for building and maintaining life, from simple
bacteria to remarkably complex human beings. Four different bases are present in DNA:
adenine (A), thymine (T), cytosine (C), and guanine (G). The particular sequence of
these bases specifies the exact genetic instructions required to create a particular
organism with its own unique traits.
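Because a DNA strand is simply a string over the four letters A, C, G and T, basic properties of a sequence can be computed with a few lines of code. The following is a minimal sketch, assuming the sequence is given as a plain Python string; the function names and the example fragment are illustrative only and do not come from any particular toolkit.

```python
from collections import Counter

# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def base_counts(seq):
    """Count how often each of the four bases occurs in a DNA string."""
    seq = seq.upper()
    unknown = set(seq) - set(COMPLEMENT)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    counts = Counter(seq)
    return {base: counts[base] for base in "ACGT"}


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read in the reverse direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


example = "ACGTTCCTCC"  # arbitrary illustrative fragment
print(base_counts(example))
print(reverse_complement(example))
```

For the fragment above the counts are one A, five C, one G and three T, and the opposite strand reads GGAGGAACGT.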
DNA STRUCTURE
On June 26th, 2000, President Clinton, leaders of the Human Genome Project (HGP) and
representatives of the biotechnology company Celera announced the completion of a
"working draft" reference DNA sequence of the human genome. Corporate and
government-led scientists have already compiled the three gigabytes of paired A's, C's,
T's and G's that spell out the human genetic code. But that is just the initial trickle of the
flood of information to be tapped from the human genome. Researchers are generating
gigantic databases containing the details of when and in which tissues of the body various
genes are turned on, the shapes of the proteins the genes encode, how the proteins interact
with one another and the role those interactions play in disease. Gene Myers, Jr., vice
president of informatics research at Celera Genomics, calls the data generated "a tsunami
of information."
This new discipline of bioinformatics seeks to make sense of this tsunami of information.
In so doing, it is destined to change the face of biomedicine. "For the next two to three
years, the amount of information will be phenomenal, and everyone will be overwhelmed
by it," Myers predicts. "The race and competition will be who can mine it best. There will
be such a wealth of riches."
The amount of information may be enormous, but divining the importance of the data
is the job of bioinformatics. The field got its start in the early 1980s with a database
called GenBank, which was originated by the U.S. Department of Energy to hold the
short stretches of DNA sequence that scientists were just beginning to obtain from a
range of organisms. In the early days of GenBank a roomful of technicians sat at
keyboards consisting of only the four letters A, C, T and G, tediously entering the DNA-
sequence information published in academic journals. As the years went on, newer
communication technologies enabled researchers to dial up GenBank and dump in their
sequence data directly, and the administration of GenBank was transferred to the U.S.
National Institutes of Health's National Center for Biotechnology Information (NCBI).
After the advent of the World Wide Web, researchers could access the data in GenBank
for free from around the globe. Once the Human Genome Project (HGP) officially got off
the ground in 1990, the volume of DNA-sequence data in GenBank began to grow
exponentially. With the introduction in the 1990s of high-throughput sequencing (an
approach using robotics, automated DNA-sequencing machines and computers) additions
to GenBank skyrocketed.
One of the most basic operations in bioinformatics involves searching for similarities, or
homologies, between a newly sequenced piece of DNA and previously sequenced DNA
segments from various organisms. Finding near-matches allows researchers to predict the
type of protein the new sequence encodes. This not only yields leads for drug targets
early in drug development but also weeds out many targets that would have turned out to
be dead ends.
A popular set of software programs for comparing DNA sequences is BLAST (for Basic
Local Alignment Search Tool), which first emerged in 1990. BLAST is part of a suite of
DNA- and protein-sequence search tools accessible in various customized versions from
many database providers or directly through NCBI. NCBI also offers Entrez, a so-called
meta-search tool that covers most of NCBI's databases, including those housing three-
dimensional protein structures, the complete genomes of organisms such as yeast, and
references to scientific journals that back up the database entries.
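To make the idea of a local similarity search concrete, the sketch below scores the best local alignment between two sequences with the classic Smith-Waterman dynamic-programming recurrence. This is not BLAST's own algorithm: BLAST is a faster heuristic that approximates this kind of search. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for ca in a:
        curr = [0]  # first column is always zero
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


query = "TTACGTTT"
print(local_alignment_score(query, "GGACGTGG"))  # shares the stretch ACGT
print(local_alignment_score(query, "CCCCCCCC"))  # shares only a single base
```

Scoring a new sequence against every entry of a database in this way and ranking the results is the exhaustive form of the search; heuristic tools trade some sensitivity for speed.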
Large pharmaceutical companies have also sought to leverage their genomics efforts with
in-house bioinformatics investments. Many have established entire departments to
integrate and service computer software and facilitate database access across multiple
departments, including new product development, formulation, toxicology and clinical
testing. The old model of drug development often compartmentalized these functions,
isolating data that might have been useful to other researchers. Bioinformatics allows
researchers across a company to see the same thing while still manipulating the data
individually.
In addition to making drug discovery more efficient, in-house bioinformatics can also
save drug companies money in software support. GlaxoWellcome is replacing the individual
packages used by various investigators and departments to access and manipulate
databases with a single software platform. GlaxoWellcome estimates that this will save
approximately $800,000 in staffing support over a three- to five-year period.
But with all this variety comes the potential for miscommunication. Getting various
databases to interoperate is becoming more and more important. An obvious solution
would be annotation, which is tagging data with names that are cross-referenced across
databases and naming systems. This has worked to a degree in companies like Roche
Bioscience. But this problem becomes more acute as the understanding of the biology
and the ability to conduct computational analysis become more sophisticated.
Systematic improvements will help, but progress and ultimately profit still rely on the
ingenuity of the end user, according to David J. Lipman, director of NCBI. "It's about
brain ware, not hardware or software."
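The annotation idea mentioned above, tagging records with names that are cross-referenced across databases, can be sketched as a simple index. Everything in this example is hypothetical: the database names, the record identifiers and the "tag" field are invented for illustration and do not describe any real system.

```python
from collections import defaultdict


def build_cross_reference(databases):
    """Map each shared annotation tag to the record IDs that carry it, per database."""
    index = defaultdict(dict)
    for db_name, records in databases.items():
        for record_id, record in records.items():
            index[record["tag"]].setdefault(db_name, []).append(record_id)
    return dict(index)


# Hypothetical databases and identifiers, for illustration only.
databases = {
    "sequences": {"SEQ-0001": {"tag": "kinase-X"}, "SEQ-0002": {"tag": "protease-Y"}},
    "structures": {"STR-9A": {"tag": "kinase-X"}},
    "assays": {"ASY-17": {"tag": "kinase-X"}, "ASY-18": {"tag": "protease-Y"}},
}

xref = build_cross_reference(databases)
print(xref["kinase-X"])
```

The scheme only works while every database applies the same tag to the same biological entity, which is exactly where the miscommunication described above tends to creep in.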
Linus Pauling, the chemist, vitamin C-ist and anti atom-bombist, determined the structure
of the other type of molecule, the protein molecule - that is, chains made up of things
called amino acids.
This work inspired James Watson and Francis Crick in 1953 to elucidate the structure of
DNA - the ABC of all known living matter. To cut a long story short over the next years
many people pieced the puzzle together: The building blocks of life are the 20 amino
acids that make up proteins; DNA contains the blueprints for these structures in its own
structure. It is a long strand made of 4 nucleotides - this is the code of life. It goes
ACGTTCCTCCCGGGCTCC, and so on, and so on, and so on. If you know the code you
know the structure of all living things, at least in theory.
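How the four-letter code maps onto the 20 amino acids can be shown with the standard genetic code, in which each group of three nucleotides (a codon) specifies one amino acid or a stop signal. The sketch below translates the fragment quoted above. It assumes the string is a coding strand read from its first base, which is an assumption made purely for illustration, since the text does not say where the fragment comes from.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding strand into one-letter amino acid codes."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ACGTTCCTCCCGGGCTCC"))  # the fragment quoted in the text
```

Read this way, the 18 nucleotides spell the six amino acids TFLPGS (threonine, phenylalanine, leucine, proline, glycine, serine). Starting one base later would give a completely different reading, which is one reason decoding real genomes is harder than the table lookup suggests.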
Restless technology has produced means of reading genes (DNA) almost like a bar code.
The problem is that life is a complicated business, and therefore the code to describe even
the smallest of God's creatures would fill many books. But scientists are very ambitious
people and do lots of overtime. They have started to decode "themselves" in the Human
Genome Project (HGP). In fact, a sort of "average" human is being decoded by
sampling DNA from unknown donors. But the difference in DNA between any one human
and another (or a scientist...) is almost nil. Nevertheless, an average human scientist
is made up of about 2.9 billion (2.9 x 10^9) nucleotides!
This orgy of reductionism presents problems which only big brother can solve: How do I
store all this information in a form that is universally accessible and retrievable? What
started as a Cartesian dream is turning out to Bill Gates' satisfaction: Computers are
needed!
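A rough feel for the storage problem comes from simple arithmetic: four possible bases need only two bits each, so four bases fit in one byte. The sketch below packs a sequence in this way. The encoding is an assumption chosen for illustration, not the format of any particular data bank, and it ignores ambiguity symbols such as N that real sequence files contain.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (2 bits each)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        LETTERS[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


packed = pack("ACGTTCCTC")
print(len(packed), unpack(packed, 9))

genome_bases = 2_900_000_000   # figure quoted in the text
print(genome_bases * 2 // 8)   # bytes needed at 2 bits per base
```

At two bits per base, 2.9 billion nucleotides occupy 725,000,000 bytes, compared with roughly 2.9 billion bytes when each base is stored as one text character.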
Vast computer data banks accessible to you and me store this vast quantity of
information. There are a lot of different data banks where DNA and protein sequence
information are stored. Three examples are listed in the table below.
Name of data bank    Type of sequences stored    Number of sequences (1996)
EMBL / GENBANK       Nucleotide sequences        827174
SWISSPROT            Protein sequences           52205
PDB                  Protein structures          4525
An advantage of these data banks is their flexibility. All this information can be ordered
and combined according to different patterns and tell us an awful lot.
The motto goes: don't just store it, analyze it! By comparing sequences, one can find out
about things like
ancestors of organisms
phylogenetic trees
protein structures
protein function
Phylogenetic trees are genealogical trees built from information gained
by comparing the amino acid sequences of a protein such as cytochrome C,
sampled from different species. Proteins like beta-amylase or hemoglobin cannot be
chosen to get the "full picture", that is the full tree, because they do not occur in all
living organisms. As a result of Darwinian evolution, the protein has a slightly different amino
acid sequence in each species. One phylogenetic tree, for instance, was created
with the sequences of cytochrome C from several plants, animals and fungi. Below, part
of this phylogenetic tree is shown.
Drawing of a phylogenetic tree based on the amino acid sequence data of cytochrome C
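One simple way to turn sequence differences into a tree is to measure the fraction of differing positions between each pair of aligned sequences and then repeatedly join the two closest groups (average-linkage clustering, often called UPGMA). The sketch below is an illustration under stated assumptions: the four sequences are invented toy strings, not real cytochrome C data, and this is not necessarily the method used to draw the tree referred to above.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences):
    """Join the closest clusters step by step; return the tree as nested tuples."""
    # Each cluster is a pair: (tree built so far, names of its members).
    clusters = [(name, [name]) for name in sequences]
    dist = {
        frozenset((a, b)): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }

    def average(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Invented toy sequences, for illustration only.
toy_sequences = {
    "species_A": "MKTAYIAKQR",
    "species_B": "MKTAYIAKQK",
    "species_C": "MKSAYIGKQR",
    "species_D": "LRSAFIGRQE",
}
print(upgma(toy_sequences))
```

With these toy strings, species A and B differ at one position and are joined first, species C attaches to that pair next, and species D, which differs most from the others, branches off at the root.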
Sequence comparison is a very powerful tool in molecular biology, genetics and protein
chemistry. Frequently it is unknown which protein a new DNA sequence codes for, or whether
it codes for any protein at all. If you compare a new coding sequence with all known
sequences, there is a high probability of finding a similar sequence. Often it is already known
which role the protein in the data bank plays in the cell. If you assume that a similar
sequence implies a similar function, you now have much more knowledge about your
new sequence than before. Proteins of one class often show a few amino acids that
always occur at the same positions in the amino acid sequence. By looking for "patterns"
you will be able to gain information about the activity of a protein of which only the gene
(DNA) is known. Evaluation of such patterns yields information about the architecture of
proteins. Often these patterns are involved in active sites, which are the workbenches of
proteins.
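Searching for such a pattern amounts to matching a short template against the amino acid sequence. The sketch below uses a regular expression for the widely cited ATP/GTP-binding "P-loop" pattern, written in PROSITE notation as [AG]-x(4)-G-K-[ST], where x(4) stands for any four residues. The protein string is an invented example, and a match against a real sequence would only suggest a function, not prove it.

```python
import re

# PROSITE-style pattern [AG]-x(4)-G-K-[ST] expressed as a regular expression.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(pattern, protein):
    """Return (start, matched_text) for each non-overlapping match, 0-based."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]


toy_protein = "MSLQGPDRTGKSWVNE"  # invented sequence, for illustration only
print(find_motif(P_LOOP, toy_protein))
```

In the toy sequence the pattern matches the stretch GPDRTGKS starting at position 4 (counting from zero).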
A lot of complicated algorithms have been created. There are tools for scanning data banks for
similar sequences, such as FASTA and BLAST. There are programs like Clustal and MSA for
comparing sequences. There are hundreds more. Although the development of new tools
is more transparent because of the possibilities of the Internet, it is not easy to keep up
with everything. Exploitation of these possibilities requires a new breed of scientist: those
versed in information technology AND biology, and they may enable us to go where no
man has gone before. Through a new surge of interdisciplinarity it may be possible to
transcend the limits of reductionism; from the vast quantities of bytes and pieces, the
contours of complex structures and relationships might emerge from the genetic alphabet
soup as life itself once emerged from the primordial soup.
In the field that has been dominated by structural biology for the last 20-30 years, we are
now witnessing a dramatic change of focus towards sequence analysis, spurred on by the
advent of the genome projects and the resultant sequence/structure deficit. The central
challenge of bioinformatics is the rationalization of the mass of sequence information,
with a view not only to deriving more efficient means of data storage, but also to
designing more incisive tools. The imperative that drives this analytical process is the
need to convert sequence information into biochemical and biophysical knowledge; to
decipher the structural, functional and evolutionary clues encoded in the language of
biological sequences.
It is clear that the mere acquisition of sequences conveys little more about the intricate
biology of the system from which they are derived than a company phone directory can
reveal about the complexities of the company's business. To extract biological meaning
from sequence information is an exacting science. In essence, we are faced with the task of
decoding an unknown language. This language may be decomposed into
sentences (proteins), words (motifs), and letters, its alphabet (amino acids), and the
code may be tackled at a variety of these levels. By themselves, the letters have no higher
meaning, but their particular combination into words is important. Sometimes, the most
subtle of changes, a single letter within a word perhaps, can change its meaning (e.g.
hog-hag), and hence the meaning of an entire sentence; so it is vital to decipher the code
correctly. Consider, for example, the single base change in the human hemoglobin beta
chain codon for glutamic acid (GAA) to valine (GUA); in homozygous individuals, this
minute difference results in a change from a normal healthy state to sickle cell anemia.
The aim is to understand the words in a sequence sentence that form a particular protein structure,
and perhaps one day to be able to write sentences (design proteins) of our own. Today, the
application of computational methods allows us to recognize words that form
characteristic patterns or signatures, but we do not yet understand the intricate syntax
required to piece the patterns together and build complete protein structures.
In investigating the meaning of sequences, two distinct analytical themes have emerged:
in the first approach, pattern recognition techniques are used to detect similarity between
sequences and hence to infer related structures and functions; in the second, ab initio
prediction methods are used to deduce 3D structure, and ultimately to infer function,
directly from the linear sequence. The development of more powerful pattern recognition
and structure prediction techniques will continue to be dominant themes in bioinformatics
research while the number of experimentally determined protein structures remains
small.
Scope in Bioinformatics
The need for trained manpower in this area is rising sharply, but there are very few
institutions in the world where such training is provided. The National Bioinformatics
Institute is one of the few such institutions in the world.
In short, bioinformatics deals with database creation, data analysis and modeling. Data
are captured not only from printed material but also from network resources.
Databases in biology generally hold multimedia data organized in a relational
database model. Modeling is done not only on single biological molecules but also on
multiple systems, thus requiring the use of high-performance computing systems.
1. Skills that have great value on the current bioinformatics-related job market are:
sequence analysis, molecular modeling, Perl programming, Web interface design, data
mining, communication skills, Internet literacy, integration of heterogeneous and
distributed resources and tools, user support, virtual reality systems (esp. for real-time
communication), visualization, UNIX, database retrieval,...
SOME FACTS: