Bioinformatics Tools For Nucleotide Sequence Analysis and Database Exploration

Varij Nayan and Anuradha Bhardwaj

Bioinformatics
Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

(Working Definition of NIH Biomedical Information Science & Technology Initiative Consortium)

What is a database?
- A convenient method of collecting vast amounts of information
- Allows proper storing, searching, and retrieval of data
- Before data can be analyzed, they need to be assembled into central, shareable resources

Why databases?
- A means to handle and share large volumes of biological data
- Support large-scale analysis efforts
- Make data access easy and keep data updated
- Link knowledge obtained from various fields of biology and medicine

Biological Databases
- Libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses
- Contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics
- Information includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities of biological sequences and structures

Features
- Most of the databases have a web interface to search for data
- The common mode of searching is by keywords
- Users can choose to view the data or save it to their computer
- Cross-references help to navigate easily from one database to another

Biological Databases
Type of database: information it contains
- Bibliographic databases: literature
- Taxonomic databases: classification
- Nucleic acid databases: DNA information
- Genomic databases: gene-level information
- Protein databases: protein information
- Protein families, domains and functional sites: classification of proteins and identifying domains
- Enzymes/metabolic pathways: metabolic pathways

Types Of Biological Databases Accessible


- Primary databases
- Secondary databases
- Composite databases

Primary databases (archival/annotated)


- Contain sequence data such as nucleic acid or protein sequences
- Annotation implies extraction, definition, and interpretation of features on the genome sequence
- Examples of nucleic acid databases are EMBL, DDBJ, and NCBI GenBank

International Nucleotide Sequence Database Collaboration


- DDBJ: DNA Data Bank of Japan
- CIB-DDBJ: Center for Information Biology and DNA Data Bank of Japan
- NIG: National Institute of Genetics
- EBI: European Bioinformatics Institute
- EMBL: European Molecular Biology Laboratory
- NCBI: National Center for Biotechnology Information
- NLM: National Library of Medicine
- IAC: International Advisory Committee
- ICM: International Collaborative Meeting

EMBL Nucleotide Sequence Database


- The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource
- It is part of the Protein and Nucleotide Database Group (PANDA)
- www.ebi.ac.uk/embl/

DNA Data Bank of Japan (DDBJ)


- DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG), with the endorsement of the Ministry of Education, Science, Sport and Culture
- It is the sole DNA data bank in Japan that is officially certified to collect DNA sequences from researchers and to issue internationally recognized accession numbers to data submitters
- www.ddbj.nig.ac.jp

NCBI GenBank
Bethesda, MD

NCBI was established on November 4, 1988, as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), United States.

The National Center for Biotechnology Information

- Accepts submissions of primary data
- Develops tools to analyze these data
- Creates derivative databases based on the primary data
- Provides free search, link, and retrieval of these data, primarily through the Entrez system

Secondary databases Curated and Composite databases


Secondary databases
- Sometimes known as pattern databases
- Contain results from the analysis of the sequences in the primary databases

Composite databases
- Combine different primary database sources
- Make querying and searching efficient, without the need to go to each of the primary databases
- Example: nrDB (Non-Redundant DataBase)

Secondary Databases and Composite Databases

[Figure: DNA, RNA, cDNA, and protein]

DNA databases derived from GenBank containing data for a single gene:
- Non-redundant (nr)
- dbGSS (genome survey sequences)
- dbHTGS (high throughput)
- dbSTS (sequence tagged site)
- LocusLink*
- RefSeq

RNA (cDNA) databases derived from GenBank containing data for a single gene:
- dbEST (expressed sequence tag)
- UniGene
- LocusLink*
- RefSeq

RefSeq (Reference Sequence)


- Curated collection of DNA, RNA, and protein sequences built by NCBI
- Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms, ranging from viruses to bacteria to eukaryotes
- Limited to major organisms for which sufficient data are available

GenBank versus RefSeq


GenBank                                  | RefSeq
Not curated                              | Curated
Author submits                           | NCBI creates from existing data
Only author can revise                   | NCBI revises as new data emerge
Multiple records for same loci common    | Single records for each molecule of major organisms
No limit to species included             | Limited to model organisms
Data exchanged among INSDC members       | Exclusive NCBI database
Akin to primary literature               | Akin to review articles
Proteins identified and linked           | Proteins and transcripts identified and linked
Access via NCBI Nucleotide databases     | Access via Nucleotide & Protein databases

Other nucleotide sequence databases


- UniGene
- SGD (Saccharomyces Genome Database)
- EBI Genomes: completed genomes and information about ongoing projects
- Genome Biology: available complete genomes
- Ensembl: joint project between EMBL-EBI and the Sanger Centre

Nucleotide sequence analysis


- Map Viewer
- Model Maker
- SAGEmap
- UniGene, ProtEST, and DDD
- ORFfinder
- Electronic PCR
- VecScreen
- Spidey
- Nucleotide BLAST

Map Viewer
- Complete genome maps, from cytogenetic and physical maps down to the sequence level
- Accessible for 110 organisms: vertebrates (17), invertebrates (12), protozoa (18), plants (46), fungi (17)

http://www.ncbi.nlm.nih.gov/mapview/

Human PAPP-A Gene


(Spotted on Chromosome 9 using Map Viewer)


- Maps can be sequence-based or not (e.g., cytogenetic maps or radiation hybrid maps)
- It is possible to access a map view and zoom into progressively more detailed views
- Maps are linked to several resources, such as UniGene clusters, Evidence Viewer, and Model Maker

Model Maker
- Used for the construction of transcript models by the assembly of putative exons
- Exons may be derived from predictions or from alignments of ESTs or mRNAs to the genomic sequence
- Once the transcript is created, potential ORFs (open reading frames) and their translations are shown

SAGEmap
- Online resource to store, retrieve, and compare Serial Analysis of Gene Expression (SAGE) profiles
- SAGE libraries are derived from the Cancer Genome Anatomy Project (CGAP) as well as from GenBank SAGE tags
- SAGEmap accepts user-submitted libraries
- Different libraries can be compared

http://www.ncbi.nlm.nih.gov/SAGE/

UniGene, ProtEST, and DDD


UniGene (http://www.ncbi.nlm.nih.gov/UniGene):
- A system for the automatic clustering of GenBank sequences and ESTs into nonredundant groups
- The UniGene project tries to identify all ESTs generated from the same gene, overcoming problems due to EST sequence errors
- For a given organism, tissue, organ, or pathological condition, UniGene then stores libraries of clustered ESTs

ProtEST
- A tool that uses BLASTX to search protein sequence databases (Swiss-Prot, PIR, PDB, PRF) with possible translations of UniGene clusters
- Proteomes from eight organisms (human, mouse, rat, Drosophila, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, Escherichia coli) are used for the comparison, and the best match in each organism is presented to the user

DDD (Digital Differential Display)


- A tool for comparing EST-based expression profiles among different UniGene libraries
- Aim: finding genes related to
1. tissue-specific or organ-specific processes
2. specific pathologies
3. different developmental stages

ORFfinder
http://www.ncbi.nlm.nih.gov/gorf/gorf.html

- A tool for identifying all ORFs in a user-submitted sequence or in a sequence from the GenBank database
- If an open reading frame is found, its amino acid translation can be used for a similarity search with BLAST or against the COGs database
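
The idea of scanning a sequence for ORFs can be illustrated with a minimal Python sketch. It is not the ORFfinder implementation: it assumes an ORF runs from an ATG to the first in-frame stop codon (TAA, TAG, TGA), scans all six reading frames, and uses hypothetical function names and a hypothetical `min_codons` threshold.

```python
# Illustrative sketch only; not NCBI ORFfinder's algorithm.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples for ATG...stop ORFs.

    start and end are 0-based, end-exclusive offsets on the strand that was
    scanned. min_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk the frame codon by codon.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC"):
        print(orf)
```

Each reported ORF could then be translated and used as a protein query, as described above.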

Electronic PCR


http://www.ncbi.nlm.nih.gov/sutils/e-pcr/

- Looks for potential STSs given a pair of PCR primers and a DNA sequence
- Looks for DNA subsequences that are closely similar to the primers, and checks whether order, orientation, and spacing are correct

Two ways:
1. Forward (searching an STS database with a sequence): useful to map a sequence on a genome using a large database of known STSs (UniSTS)
2. Reverse (searching a sequence database with an STS): for the prediction of PCR products in a selected genome given one or more pairs of primers
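
The primer check described above (similarity to the primers, then order, orientation, and spacing) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the e-PCR program: the function names, the mismatch threshold, and the size window are all hypothetical, and only one orientation of the template is examined.

```python
# Illustrative sketch only; not the NCBI e-PCR implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Count differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(template, primer, max_mismatch):
    """Start positions where the primer matches within max_mismatch."""
    n = len(primer)
    return [i for i in range(len(template) - n + 1)
            if mismatches(template[i:i + n], primer) <= max_mismatch]


def electronic_pcr(template, fwd, rev, min_size, max_size, max_mismatch=0):
    """Return (start, end, size) for each predicted product on the template.

    fwd and rev are both written 5'->3'. The reverse primer anneals to the
    opposite strand, so its reverse complement is searched on the template.
    """
    template = template.upper()
    fwd = fwd.upper()
    rev_site = reverse_complement(rev.upper())
    products = []
    for f in find_sites(template, fwd, max_mismatch):
        for r in find_sites(template, rev_site, max_mismatch):
            end = r + len(rev_site)
            size = end - f
            # Order and spacing: reverse site must lie downstream of the
            # forward primer, and the product size must fall in the window.
            if r >= f + len(fwd) and min_size <= size <= max_size:
                products.append((f, end, size))
    return products


if __name__ == "__main__":
    template = "TTTTACGTACGGGGGGGGCCATTGAAAA"
    print(electronic_pcr(template, "ACGTAC", "CAATGG", 10, 50))
```

A "reverse" search in the sense above would run the same check for one primer pair over every sequence in a database.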

VecScreen

http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html

- A system for identifying segments of a nucleic acid sequence that may be the result of contamination: sequence of vector origin (plasmid, phage, cosmid, YAC DNA) as well as linkers, adapters, and primers
- Helps minimize the incidence and impact of such contamination in public sequence databases

Spidey
- A tool for aligning one or more mRNAs (FASTA format sequences or accession numbers) to a single eukaryotic genomic sequence, determining the exon/intron structure of the query messenger
- Uses BLAST searches to identify a genome window that covers the entire mRNA length, then refines the alignment to align each exon, taking into account predicted splice sites
- Four splice-site matrices can be used: vertebrate, Drosophila, C. elegans, and plant
- Spidey output is an alignment for each exon, each one evaluated for its quality

http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/

Blast Implementations

BLAST
- Basic Local Alignment Search Tool
- Program for sequence similarity searching developed at NCBI
- Instrumental in identifying genes and genetic features
- Executes sequence searches against a database of stored sequences

Local and global alignments


[Figure: global versus local alignment]

FASTA vs BLAST
- BLAST is faster than FASTA
- Both use a similar search strategy
- Sensitivity:
  - Protein searches: BLAST and FASTA are comparable
  - Nucleotide searches: FASTA is more sensitive
- Smith-Waterman (S-W) is the most sensitive, but time-consuming
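
To make the local versus global distinction and the cost of Smith-Waterman concrete, here is a minimal Python sketch of both dynamic-programming scores. The scoring values (match +2, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties. Both functions fill a full table, so their running time grows with the product of the two sequence lengths, which is why exhaustive alignment is slow against large databases.

```python
# Illustrative sketch with assumed scoring parameters.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment may start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Global alignment: both sequences must be aligned end to end.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    a, b = "GGGGACGT", "ACGTCCCC"
    print("local :", smith_waterman_score(a, b))
    print("global:", needleman_wunsch_score(a, b))
```

In this example the two sequences share only the short region ACGT, so the local score is high while the global score is negative.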


BLAST USES
- Helps infer the identity and function of a query sequence
- Helps to direct experimental design to prove the function of the sequence
- Finds similar sequences in other organisms
- Compares genomes against each other to find similarities and differences

Blast: A Family of Programs


The query can be DNA or protein, and the database can be DNA or protein:
- BlastN: nt versus nt database
- BlastP: protein versus protein database
- BlastX: translated nt versus protein database
- tBlastN: protein versus translated nt database
- tBlastX: translated nt versus translated nt database
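
The choice among the five programs follows mechanically from the query and database types. The small Python helper below is a hypothetical illustration of that decision table (the function name and argument values are assumptions, not part of any BLAST interface).

```python
# Hypothetical helper encoding the program table above.

def choose_blast_program(query, database, translate_nucleotides=False):
    """Pick a BLAST program name from the query and database types.

    query and database are "nucleotide" or "protein". For a nucleotide
    query against a nucleotide database, translate_nucleotides selects
    tblastx (compare translations) instead of blastn.
    """
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translate_nucleotides else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    try:
        return programs[(query, database)]
    except KeyError:
        raise ValueError(f"unknown combination: {query!r}, {database!r}")


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```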

Nucleotide BLAST

Compares a nucleotide sequence against a database of nucleotide sequences

BLASTn
General-purpose nucleotide search and alignment program that is sensitive and can be used to align tRNA or rRNA sequences as well as mRNA or genomic DNA sequences containing a mix of coding and noncoding regions.

MegaBLAST
- 10 times faster than blastn
- Designed to align sequences that are nearly identical, differing by only a few percent from one another
- Allows the rapid mapping of a transcript onto a typical 3-billion-base mammalian genome in seconds, and is useful for processing large batches of sequences

discontiguous MegaBLAST
- Uses a discontiguous template to define an initial word in which characters in some positions, such as those in the wobble base position of codons, need not match
- Allows rapid cross-species mappings involving coding regions in cases where species differences in codon usage would prevent alignments using the original MegaBLAST program
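
The effect of a discontiguous template can be shown with a small Python sketch. It compares exact contiguous word hits with hits under a template in which `1` marks positions that must match and `0` marks positions that are ignored. The template `110110` and the word size of 6 are assumptions chosen to keep the example short; they are not the templates or word sizes used by MegaBLAST.

```python
# Illustrative sketch; template and word size are assumed toy values.

def contiguous_seed_hits(query, subject, word_size):
    """(i, j) pairs where query and subject share an exact word."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    return [(i, j)
            for i in range(len(query) - word_size + 1)
            for j in index.get(query[i:i + word_size], [])]


def spaced_key(seq, start, template):
    """Characters of seq under the '1' positions of the template."""
    return "".join(seq[start + k] for k, flag in enumerate(template) if flag == "1")


def discontiguous_seed_hits(query, subject, template):
    """(i, j) pairs that agree at every '1' position of the template."""
    span = len(template)
    index = {}
    for j in range(len(subject) - span + 1):
        index.setdefault(spaced_key(subject, j, template), []).append(j)
    return [(i, j)
            for i in range(len(query) - span + 1)
            for j in index.get(spaced_key(query, i, template), [])]


if __name__ == "__main__":
    # Two coding sequences that differ only at third codon positions.
    query = "ATGGCTAAAGGTCTG"    # ATG GCT AAA GGT CTG
    subject = "ATGGCCAAGGGACTC"  # ATG GCC AAG GGA CTC
    print(contiguous_seed_hits(query, subject, 6))
    print(discontiguous_seed_hits(query, subject, "110110"))
```

Here the contiguous search finds no shared 6-base word, while the template that skips every third base still produces seed hits at the codon boundaries.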

How to run a BLAST query


FASTA format
- The query DNA or protein sequence must be in FASTA format
- A FASTA record starts with a definition line ("def line") that begins with a >, followed by some text that briefly describes the query sequence, all on a single line
- The sequence follows, with up to 80 nucleotide bases or amino acids per line

>DinoDNA "Dinosaur DNA" from Crichton's JURASSIC PARK p. 103 nt 1-1200
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT
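
A FASTA record such as the one above can be read and written with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes plain text input in which each record starts with a `>` def line, and it does no validation of the sequence alphabet.

```python
# Minimal FASTA reader/writer sketch; function names are illustrative.

def parse_fasta(text):
    """Return a list of (definition, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new def line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at `width` characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = ">seq1 example\nACGT\nTTGA\n>seq2\nGG\n"
    for name, seq in parse_fasta(text):
        print(name, len(seq))
```

Wrapping at 60 characters, as in the example record, stays within the 80-character limit mentioned above.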


How to run a BLAST query


1. Select nucleotide BLAST
2. Paste the sequence into the search box
3. Select the database
4. Click to run the search

BLAST OUTPUT


Results 1: Distribution
Graphical representation of hits

BLASTn

MEGABLAST

Discontiguous MEGABLAST

Results 2: Sequences with specific alignments
- Description
- Links to relevant records in other databases
- Link to Entrez
- E-value: estimate of statistical significance (e.g., 6e-62 = 6 × 10^-62)

Results 3: Alignments

Shows the actual alignments

What do the numbers mean?


Bit score:
- Indicates how good the alignment is; the higher the score, the better the alignment
- Calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences

E-value (Expect value):
- Describes the number of hits one can expect to see by chance when searching a database of a particular size
- Essentially, the E-value describes the random background noise that exists for matches between sequences
- The lower the E-value, or the closer it is to 0, the more significant the match
- Short query sequences can match almost identically and still have a relatively high E-value, because shorter sequences have a higher probability of occurring in the database purely by chance
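
The dependence of the E-value on bit score and search-space size can be sketched with the commonly cited relation E = m × n × 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The Python function below is a simplified illustration with a hypothetical name; BLAST itself uses corrected "effective" lengths, so reported E-values will differ somewhat from this approximation.

```python
# Simplified approximation; BLAST applies effective-length corrections.

def expect_value(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same bit score, larger database: more chance hits are expected.
    print(expect_value(40, 100, 1_000_000))
    print(expect_value(40, 100, 1_000_000_000))
```

Each additional bit of score halves the expected number of chance hits, and doubling the database size doubles it, which is why the same alignment can look more or less significant depending on the database searched.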

- blastn is more sensitive than MEGABLAST because it uses a shorter default word size. Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms
- MEGABLAST is the tool of choice for identifying a nucleotide sequence; it is specifically designed to efficiently find long alignments between very similar sequences
- Discontiguous MEGABLAST is better at finding nucleotide sequences that are similar, but not identical, to the nucleotide query