Bioinformatics Tools For Nucleotide Sequence Analysis and Database Exploration
Bioinformatics
Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, archive, analyze, or visualize such data
(Working Definition of NIH Biomedical Information Science & Technology Initiative Consortium)
What is a database?
- A convenient method of collecting vast amounts of information
- Allows proper storing, searching, and retrieval of data
- Before data can be analyzed, they need to be assembled into central, shareable resources
Why databases?
- Means to handle and share large volumes of biological data
- Support large-scale analysis efforts
- Make data access easy and up to date
- Link knowledge obtained from various fields of biology and medicine
Biological Databases
- Libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses
- Cover research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics
- Information includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities of biological sequences and structures
Features
- Most databases have a web interface for searching data
- The common mode of searching is by keywords
- Users can choose to view the data or save them to their own computer
- Cross-references make it easy to navigate from one database to another
Biological Databases
Type of database                                      Information it contains
Bibliographic databases                               Literature
Taxonomic databases                                   Classification
Nucleic acid databases                                DNA information
Genomic databases                                     Gene-level information
Protein databases                                     Protein information
Protein families, domains, and functional sites       Classification of proteins and identification of domains
Enzymes / metabolic pathways                          Metabolic pathways
NCBI GenBank
Bethesda, MD
Established on November 4, 1988, as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), United States.
- Accepts submissions of primary data
- Develops tools to analyze these data
- Creates derivative databases based on the primary data
- Provides free search, link, and retrieval of these data, primarily through the Entrez system
Secondary databases
- Sometimes known as pattern databases
- Contain results from the analysis of the sequences in the primary databases
Composite databases
- Combine different sources of primary databases
- Make querying and searching efficient, without the need to go to each of the primary databases
- Example: nrDB (Non-Redundant DataBase)
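The idea of a composite, non-redundant database can be sketched in a few lines. This is only an illustration of the merging principle, not how nrDB is actually built: the function name `merge_non_redundant`, the source names, and the toy records are all hypothetical.

```python
from typing import Dict, List, Tuple


def merge_non_redundant(
    sources: Dict[str, List[Tuple[str, str]]]
) -> Dict[str, List[str]]:
    """Merge records from several source databases, keeping each sequence once.

    `sources` maps a database name to a list of (record_id, sequence) pairs.
    The result maps each distinct sequence to the list of "database:record_id"
    labels under which it was found.
    """
    merged: Dict[str, List[str]] = {}
    for db_name, records in sources.items():
        for record_id, sequence in records:
            key = sequence.upper()  # treat upper/lower case as the same sequence
            merged.setdefault(key, []).append(f"{db_name}:{record_id}")
    return merged


# Toy, made-up records used purely for illustration
sources = {
    "db_A": [("a1", "ATGGCC"), ("a2", "ATGTTT")],
    "db_B": [("b1", "atggcc"), ("b2", "GGGCCC")],
}
nr = merge_non_redundant(sources)
for sequence, labels in nr.items():
    print(sequence, labels)
```

One search over `nr` then covers both sources, and the labels still point back to the original entries.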
[Diagram labels: DNA, RNA, Protein, cDNA]
DNA databases derived from GenBank containing data for a single gene
- Non-redundant (nr)
- dbGSS (genome survey sequences)
- dbHTGS (high throughput)
- dbSTS (sequence tagged site)
- LocusLink*
- RefSeq

RNA (cDNA) databases derived from GenBank containing data for a single gene
- dbEST (expressed sequence tag)
- UniGene
- LocusLink*
- RefSeq
- ORFfinder
- Electronic PCR
- VecScreen
- Spidey
- Nucleotide BLAST
Map Viewer
- Complete genome maps, from cytogenetic and physical maps down to the sequence level
- Accessible for 110 organisms: vertebrates (17), invertebrates (12), protozoa (18), plants (46), fungi (17)
http://www.ncbi.nlm.nih.gov/mapview/
- Maps can be sequence-based or not (e.g., cytogenetic maps or radiation hybrid maps)
- It is possible to open a map view and zoom into progressively more detailed views
- Maps are linked to several resources, such as UniGene clusters, Evidence Viewer, and Model Maker
Model Maker
- Used to construct transcript models by assembling putative exons
- Exons may be derived from predictions or from alignments of ESTs or mRNAs to the genomic sequence
- Once the transcript is created, potential ORFs (open reading frames) and their translations are shown
SAGEmap
- Online resource to store, retrieve, and compare Serial Analysis of Gene Expression (SAGE) profiles
- SAGE libraries are derived from the Cancer Genome Anatomy Project (CGAP) as well as from GenBank SAGE tags
- SAGEmap accepts user-submitted libraries
- UniGene stores, for a given organism, tissue, organ, or pathological condition, libraries of clustered ESTs
ProtEST
- A tool that uses BLASTX to search sequence databases (Swiss-Prot, PIR, PDB, PRF) with possible translations of UniGene clusters
- Proteomes from eight organisms (human, mouse, rat, Drosophila, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, Escherichia coli) are used for the comparison, and the best match in each organism is presented to the user
ORFfinder
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
- Tool for identifying all ORFs in a user-submitted sequence or in a sequence from the GenBank database
- If an open reading frame is found, its amino acid translation can be used for a similarity search with BLAST or against the COGs database
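What an ORF search does can be illustrated with a minimal six-frame scan. This is a sketch under simplifying assumptions, not ORFfinder's actual implementation: it accepts only ATG as a start codon, uses the three standard stop codons, and the names `find_orfs` and `min_codons` are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code (assumption)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, str]]:
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns (strand, frame, orf_sequence) tuples; the ORF includes its stop
    codon, and `min_codons` counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
# [('+', 2, 'ATGAAATTTTAG')]
```

Each reported ORF could then be translated and passed to a protein similarity search, as described above.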
Electronic PCR
http://www.ncbi.nlm.nih.gov/sutils/e-pcr/
- Looks for potential STSs given a pair of PCR primers and a DNA sequence
- Looks for DNA subsequences that closely match the primers and checks whether their order, orientation, and spacing are correct
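The order, orientation, and spacing checks can be sketched as follows. This is a simplified illustration rather than the e-PCR program itself: it searches only the given strand, allows substitutions but not gaps, and the names `approx_sites` and `predict_products` are hypothetical.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def approx_sites(template: str, primer: str, max_mismatches: int) -> List[int]:
    """Start positions where `primer` matches with at most `max_mismatches`."""
    sites = []
    for i in range(len(template) - len(primer) + 1):
        window = template[i:i + len(primer)]
        mismatches = sum(1 for a, b in zip(window, primer) if a != b)
        if mismatches <= max_mismatches:
            sites.append(i)
    return sites


def predict_products(template: str, forward: str, reverse: str,
                     min_size: int, max_size: int,
                     max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) of predicted products on the given strand."""
    template = template.upper()
    forward, reverse = forward.upper(), reverse.upper()
    fwd_sites = approx_sites(template, forward, max_mismatches)
    # Orientation: the reverse primer anneals to the opposite strand, so its
    # reverse complement is what appears on the template strand.
    rev_sites = approx_sites(template, reverse_complement(reverse), max_mismatches)
    products = []
    for f in fwd_sites:
        for r in rev_sites:
            end = r + len(reverse)
            size = end - f
            # Order: reverse site lies downstream of the forward site.
            # Spacing: product size falls within the expected range.
            if r >= f + len(forward) and min_size <= size <= max_size:
                products.append((f, end))
    return products


template = "TTTTACGTACGGGGGGGGGGTTCAGGAAAA"   # made-up toy sequence
print(predict_products(template, "ACGTAC", "CCTGAA", 10, 50, max_mismatches=0))
# [(4, 26)]
```

Tightening the size window (for example `max_size=15`) rejects the same primer pair, which is the spacing check in action.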
Two ways:
1. Forward (searching an STS database with a sequence): useful for mapping a sequence on a genome using a large database of known STSs (UniSTS)
2. Reverse (searching a sequence database with an STS): for predicting PCR products in a selected genome given one or more pairs of primers
VecScreen
http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
- A system for identifying segments of a nucleic acid sequence that may result from contamination of vector origin (plasmid, phage, cosmid, YAC DNA), as well as linkers, adapters, and primers
- Aims to minimize the incidence and impact of such contamination in public sequence databases
Spidey
Tool for aligning one or more mRNAs (FASTA-format sequences or accession numbers) to a single eukaryotic genomic sequence, determining the exon/intron structure of the query messenger
http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/
- Uses BLAST searches to identify a genome window that covers the entire mRNA length, then refines the alignment to align each exon, taking predicted splice sites into account
- Four splice-site matrices can be used: vertebrate, Drosophila, C. elegans, and plant
- Spidey output is an alignment for each exon, each one evaluated for its quality
BLAST Implementations
BLAST
- Basic Local Alignment Search Tool
- Program for sequence similarity searching, developed at NCBI
- Instrumental in identifying genes and genetic features
- Executes sequence searches against a database of stored sequences
FASTA vs BLAST
- BLAST is faster than FASTA
- Both use a similar search strategy
- Sensitivity:
  - Protein searches: BLAST and FASTA are comparable
  - Nucleotide searches: FASTA is more sensitive
BLAST USES
- Provides the identity and function of a query sequence
- Helps direct experimental design to prove the function of the sequence
- Finds similar sequences in other organisms
- Compares genomes against each other to find similarities and differences
Database: DNA or protein
- BlastN: nucleotide query versus nucleotide database
- BlastP: protein query versus protein database
- BlastX: translated nucleotide query versus protein database
- tBlastN: protein query versus translated nucleotide database
- tBlastX: translated nucleotide query versus translated nucleotide database
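The five program variants above amount to a small lookup table keyed on query type and database type. The sketch below encodes that table; the names `BlastMode`, `BLAST_MODES`, and `programs_for` are hypothetical helpers, not part of any BLAST distribution.

```python
from typing import Dict, List, NamedTuple


class BlastMode(NamedTuple):
    query: str             # type of the query sequence
    database: str          # type of the sequences in the database
    translate_query: bool  # query is translated in six frames before comparison
    translate_db: bool     # database is translated in six frames before comparison


BLAST_MODES: Dict[str, BlastMode] = {
    "blastn":  BlastMode("nucleotide", "nucleotide", False, False),
    "blastp":  BlastMode("protein",    "protein",    False, False),
    "blastx":  BlastMode("nucleotide", "protein",    True,  False),
    "tblastn": BlastMode("protein",    "nucleotide", False, True),
    "tblastx": BlastMode("nucleotide", "nucleotide", True,  True),
}


def programs_for(query: str, database: str) -> List[str]:
    """List the programs that accept this query type and database type."""
    return [name for name, mode in BLAST_MODES.items()
            if mode.query == query and mode.database == database]


print(programs_for("nucleotide", "protein"))      # ['blastx']
print(programs_for("nucleotide", "nucleotide"))   # ['blastn', 'tblastx']
```

A nucleotide query against a nucleotide database has two options: a direct comparison (blastn) or a comparison at the protein level after translating both sides (tblastx).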
Nucleotide BLAST
BLASTn
General-purpose nucleotide search and alignment program that is sensitive and can be used to align tRNA or rRNA sequences as well as mRNA or genomic DNA sequences containing a mix of coding and noncoding regions.
MegaBLAST
- 10 times faster than blastn
- Designed to align sequences that are nearly identical, differing by only a few percent from one another
- Allows rapid mapping of a transcript onto a typical 3-billion-base mammalian genome in seconds, and is useful for processing large batches of sequences
Discontiguous MegaBLAST
- Uses a discontiguous template to define an initial word in which characters in some positions, such as those in the wobble base position of codons, need not match
- Allows rapid cross-species mappings involving coding regions in cases where species differences in codon usage would prevent alignments using the original MegaBLAST program
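The effect of a discontiguous template can be shown with a toy seed search. This is a sketch of the principle only: the 12-position template below ("1" = must match, "0" = ignored) is made up for illustration and is not one of the templates the NCBI program uses, and `seed_hits` is a hypothetical name.

```python
from typing import List, Tuple


def template_match(a: str, b: str, template: str) -> bool:
    """True if a and b agree at every position marked '1' in the template."""
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


def seed_hits(query: str, subject: str, template: str,
              contiguous: bool = False) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where an initial word matches.

    With contiguous=True every position must match (classic exact word);
    otherwise only the positions marked '1' in the template are compared.
    """
    width = len(template)
    mask = "1" * width if contiguous else template
    hits = []
    for i in range(len(query) - width + 1):
        for j in range(len(subject) - width + 1):
            if template_match(query[i:i + width], subject[j:j + width], mask):
                hits.append((i, j))
    return hits


# Two toy coding sequences for the same peptide (Ala-Glu-Lys-Leu) that differ
# only at the third (wobble) position of each codon.
query = "GCTGAAAAACTG"
subject = "GCCGAGAAGCTC"
template = "110110110110"   # illustrative: ignore every third position

print(seed_hits(query, subject, template, contiguous=True))   # []
print(seed_hits(query, subject, template))                    # [(0, 0)]
```

The exact 12-base word finds nothing, while the template that skips wobble positions seeds the alignment, which is why this mode suits cross-species comparison of coding regions.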
BLAST OUTPUT
Results 1 - Distribution
Graphical representation of hits
BLASTn
MegaBLAST
Discontiguous MegaBLAST
6e-62 = 6 × 10^-62
Link to Entrez
Results 3 - Alignments
- blastn is more sensitive than MegaBLAST because it uses a shorter default word size. As a result, blastn is better than MegaBLAST at finding alignments to related nucleotide sequences from other organisms
- MegaBLAST is the tool of choice for identifying a nucleotide sequence; it is specifically designed to efficiently find long alignments between very similar sequences
- Discontiguous MegaBLAST is better at finding nucleotide sequences that are similar, but not identical, to a nucleotide query
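Why a shorter word size gives higher sensitivity can be seen in a toy calculation: an alignment can only be seeded where the two sequences share an exact run at least as long as the word. The sketch below assumes word sizes of 11 for blastn and 28 for MegaBLAST (the commonly documented defaults; treat them as assumptions here), compares two pre-aligned sequences of equal length, and uses hypothetical helper names and a made-up query.

```python
from typing import Set


def longest_identical_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive identical positions."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


def can_seed(a: str, b: str, word_size: int) -> bool:
    """An exact word of `word_size` exists only if some run is that long."""
    return longest_identical_run(a, b) >= word_size


def mutate(seq: str, positions: Set[int]) -> str:
    """Substitute the base at each listed position with a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[c] if i in positions else c for i, c in enumerate(seq))


BLASTN_WORD, MEGABLAST_WORD = 11, 28          # assumed default word sizes

query = "ACGTTGCAAGCTTAGGCATC" * 3            # 60 bases, arbitrary toy sequence
near_identical = mutate(query, {30})          # 1 mismatch in 60 bases
diverged = mutate(query, {14, 29, 44, 59})    # 1 mismatch every 15 bases

for label, subject in (("near-identical", near_identical), ("diverged", diverged)):
    print(label,
          "blastn seed:", can_seed(query, subject, BLASTN_WORD),
          "megablast seed:", can_seed(query, subject, MEGABLAST_WORD))
```

For the near-identical pair both word sizes find a seed; for the diverged pair the longest exact run is 14 bases, so only the shorter word can start an alignment.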