BLAST


BLAST tool

With the growth of DNA and protein sequence databases, there is an increasing need for faster
and more efficient methods to analyze this large amount of data. One of the most commonly
used bioinformatics tools for studying DNA and protein sequences is BLAST.
BLAST stands for Basic Local Alignment Search Tool. It is a widely used bioinformatics
program that was first introduced by Stephen Altschul et al. in 1990 and has since become one of
the most popular tools for sequence similarity search.
Since its initial release, BLAST has undergone continuous updates to improve its speed and
accuracy. It is now considered a crucial tool in the field of bioinformatics: it has played a
vital role in numerous research studies and has paved the way for the development of other
sequence comparison tools.

Types of BLAST
There are five main types (variants) of BLAST, differentiated by the type of sequence
(DNA or protein) of the query and of the database.
BLASTN compares a nucleotide query sequence to a nucleotide sequence database. Primarily, it
is used to identify similarities and locate homologous regions in DNA sequences.
BLASTP compares a protein query sequence to a protein sequence database. It facilitates the
identification of similar protein sequences, which can shed light on protein function, structure,
and evolution.
BLASTX compares a nucleotide query sequence to a protein sequence database by translating
the query sequence in all six reading frames and aligning the translations with the protein
sequences. BLASTX is particularly effective when searching DNA sequences for protein-coding
genes.
TBLASTN compares a protein query sequence to a nucleotide sequence database by translating
the database sequences in all six reading frames and aligning the translations with the protein
query. TBLASTN is frequently used to search for potential protein homologs in DNA sequences.
TBLASTX compares a nucleotide query sequence to a nucleotide sequence database at the protein
level. It translates both the query and the database sequences in all six reading frames,
compares the resulting amino acid sequences, and reports any similarities found. TBLASTX is
frequently used to search for protein-level similarities between nucleotide sequences.
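The six-frame translation used by the translated variants can be illustrated with a short Python
sketch. It is not BLAST's own code: it assumes the standard genetic code, an uppercase or
lowercase DNA string containing only A, C, G, and T, and hypothetical function names chosen for
this example.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[p:p + 3], "X") for p in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```

In this sketch a stop codon is written as `*` and translation continues past it, so each frame
yields one string that a protein-level comparison could then scan.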

How BLAST Works


BLAST works by comparing a query sequence to a database of sequences to find regions of
similarity. Rather than computing an optimal alignment against every database sequence, it uses
a heuristic approach to search for similarities, which makes it much faster than exhaustive
alignment methods.
BLAST performs sequence alignment through the following steps.
Step 1: The first step is to create a lookup table or list of words from the query sequence. This
step is also called seeding. First, BLAST takes the query sequence and breaks it into short
segments called words. For protein sequences, each word is usually three amino acids long, and
for DNA sequences, each word is usually eleven nucleotides long.
Step 2: The second step is to scan a database of known sequences for the words in the lookup
table. Each database sequence that contains a matching word is kept as a candidate, and the
position of the match serves as a starting point (seed) for alignment.
Step 3: BLAST then scores the similarity of the matching words. Each word pair is scored using
a given substitution matrix. If the score of a word pair is above a certain threshold, it is
considered a match.
Two commonly used substitution matrices for protein sequences are PAM (Percent Accepted
Mutations) and BLOSUM (Blocks Substitution Matrix). For nucleotide sequences, the scoring
matrix is based on match-mismatch scoring.
Step 4: The fourth step involves pairwise alignment by extending the matched words in both
directions while accumulating the alignment score using the same substitution matrix. If the
score drops by more than a set amount because of mismatches between the sequences, the
extension stops. The resulting aligned segment pair without gaps is called a high-scoring
segment pair (HSP).
BLAST also calculates a statistical significance value for each alignment, called the E-value or
Expect value. The E-value is the number of alignments with an equal or better score that would
be expected to occur by random chance in a database of the same size. A lower E-value indicates
that the sequence match is less likely to be a result of random occurrence. Hence, the lower the
E-value, the higher the level of significance.
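The four steps can be sketched in Python for the nucleotide case. This is a simplified
illustration and not BLAST's actual implementation: it assumes exact word matches, match and
mismatch scores of +1 and -3, and a drop-off rule that stops extending once the running score
falls more than `x_drop` below the best score seen so far. The function names, the score values,
and the `x_drop` default are all assumptions made for this example.

```python
from collections import defaultdict

MATCH, MISMATCH = 1, -3  # illustrative nucleotide scores (assumption)


def build_word_index(query, word_size):
    """Step 1: map each word of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    return index


def find_seeds(index, subject, word_size):
    """Step 2: yield (query_pos, subject_pos) for every exact word match."""
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            yield i, j


def extend_one_way(query, subject, i, j, step, x_drop):
    """Extend without gaps in one direction; return (best_score, best_length)."""
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += MATCH if query[i] == subject[j] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, best_len


def extend_seed(query, subject, i, j, word_size, x_drop=5):
    """Steps 3-4: score the seed, extend both ways, and return the HSP as
    (score, query_start, query_end, subject_start), with query_end exclusive."""
    seed_score = word_size * MATCH
    right, right_len = extend_one_way(
        query, subject, i + word_size, j + word_size, 1, x_drop)
    left, left_len = extend_one_way(query, subject, i - 1, j - 1, -1, x_drop)
    return (seed_score + left + right, i - left_len,
            i + word_size + right_len, j - left_len)


query, subject = "TTACGTACGGA", "CCACGTACGCC"
index = build_word_index(query, 4)
hsps = [extend_seed(query, subject, i, j, 4) for i, j in find_seeds(index, subject, 4)]
print(max(hsps))  # best HSP: (7, 2, 9, 2), i.e. query[2:9] == "ACGTACG"
```

A word size of 4 is used here only to keep the example short. For protein searches the seed
scoring would use a substitution matrix such as BLOSUM or PAM instead of fixed match and
mismatch values.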

BLAST Scores and Statistics


Once BLAST has identified a similar sequence to the query in the database, it is useful to
determine whether the alignment is “good” and whether it depicts a possible biological
relationship, or whether the similarity observed is due to chance alone.
BLAST uses statistical theory to generate a bit score and an expect value (E-value) for each
alignment pair (query to match). The bit score indicates the quality of the alignment; the
higher the score, the better the alignment. In general, this score is computed using a formula
that takes into account the alignment of similar or identical residues, as well as any gaps
introduced to align the sequences.
The “substitution matrix,” which assigns a score for aligning any possible pair of residues, is a
crucial component of this calculation. The exceptions to this are blastn and MegaBLAST, which
perform nucleotide–nucleotide comparisons and therefore do not use protein-specific matrices.
Bit scores are normalized, allowing bit scores from various alignments to be compared despite
the use of different scoring matrices. The E-value indicates the statistical significance of a given
pairwise alignment and reflects the database size and scoring system employed.
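The lower the E-value, the greater the significance of the match. An E-value of 0.05 for a
sequence alignment means that about 0.05 alignments with an equal or better score are expected
by chance alone in a search of this size; for small E-values this corresponds to roughly a 5 in
100 (1 in 20) probability of seeing at least one such alignment by chance.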
Although a statistician may consider this to be significant, it may not represent a biologically
meaningful result; an alignment analysis (see below) is required to ascertain “biological”
significance.
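The relationship between raw score, bit score, and E-value can be made concrete with the
standard Karlin-Altschul formulas: the bit score is S' = (lambda * S - ln K) / ln 2, the E-value
is E = m * n * 2^(-S'), and the probability of at least one chance hit is P = 1 - e^(-E). The
Python sketch below applies them directly. The lambda and K values are illustrative numbers of
the kind quoted for ungapped BLOSUM62 scoring, and the sequence lengths are arbitrary; both are
assumptions for this example. BLAST itself also applies edge-effect corrections to the lengths,
which this sketch omits.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)


# Illustrative parameters (assumptions, not taken from a real search).
LAMBDA, K = 0.318, 0.134
bits = bit_score(raw_score=100, lam=LAMBDA, k=K)
expect = e_value(bits, query_len=250, db_len=50_000_000)
print(f"bit score = {bits:.1f}, E-value = {expect:.3g}, P = {p_value(expect):.3g}")
```

Because the bit score already absorbs lambda and K, two alignments scored with different
matrices can be compared by their bit scores, and the E-value then depends only on the bit score
and the size of the search space.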

Characteristics of BLAST
Several key features of BLAST make it a widely used tool in bioinformatics. Some of these are:
• Speed and Efficiency: BLAST is designed to perform sequence similarity searches
quickly and efficiently. It utilizes heuristic algorithms and indexing techniques to
expedite the identification of local alignments, making it suitable for searching large
sequence databases in a reasonable amount of time.
• Sensitivity and Specificity: BLAST strikes a balance between sensitivity and
specificity in sequence comparisons. It is sensitive enough to identify even short
regions of similarity between sequences.
• Focus on Local Alignments: BLAST focuses on identifying local rather than global
alignments. It aims to identify regions of local similarity between the query sequence and
the database sequence, rather than attempting to align the entire sequences.
• Iterative Method: Some BLAST variants, such as PSI-BLAST, employ an iterative
method. They conduct multiple cycles of searching, and after each cycle they use the
significant hits to build a position-specific scoring profile that replaces the query in the
next cycle. This iterative procedure facilitates the detection of more distant homologs
and increases sensitivity.
• Flexibility: BLAST is versatile and can be applied to numerous categories of biological
sequences, such as DNA, RNA, and proteins. Different BLAST variants are tailored to
specific sequence types and search criteria, allowing for versatility in sequence analysis
tasks.
• User-Friendly Interface: BLAST tools typically feature user-friendly interfaces that
enable researchers to readily input query sequences, select databases, and configure
search parameters. This accessibility enables users with differing degrees of
bioinformatics knowledge to conduct efficient sequence similarity searches.
• Extensive Database Compatibility: BLAST is compatible with a vast array of sequence
databases, including public databases such as GenBank, UniProt, and the NCBI’s
non-redundant (nr) database. This compatibility enables researchers to compare their
sequences to comprehensive collections of previously identified sequences.
• Community Support and Updates: BLAST has a sizable user community, which has
aided in its ongoing development. Regular updates and bug fixes ensure that BLAST
remains a trustworthy and current sequence analysis tool.
Report

Most BLAST users are familiar with the “traditional” BLAST report. The report is divided
into three sections: (1) the header, which contains information about the query sequence
and the database searched (the web version also includes a graphical overview);
(2) one-line descriptions of each database sequence found to match the query sequence,
which provide a quick overview for browsing; and (3) the alignments for each matched
database sequence (there may be multiple alignments for a single database sequence).
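Besides the traditional report, the command-line BLAST+ programs can write one hit per line in
a tab-separated layout (`-outfmt 6`), which is easier to process with a script. The Python
sketch below reads that layout and keeps hits under an E-value cutoff. It assumes the default
twelve columns of that format; the `Hit` type, the function name, and the 1e-5 cutoff are
choices made for this example and are not part of BLAST.

```python
import csv
from typing import NamedTuple

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bitscore: float


def parse_tabular(handle, max_evalue=1e-5):
    """Read tab-separated BLAST hits and return those with E-value <= max_evalue,
    sorted from most to least significant."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(FIELDS, row))
        hit = Hit(record["qseqid"], record["sseqid"], float(record["pident"]),
                  float(record["evalue"]), float(record["bitscore"]))
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.evalue)
```

A typical use would be to open the results file and pass the file object to `parse_tabular`,
then inspect the top few hits before looking at their full alignments.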

Applications of BLAST
BLAST has a wide range of applications. Some of the most common applications are:
• BLAST can be used to identify unknown sequences by comparing them with known
sequences in a database, which helps in predicting the functions of proteins or genes.
• BLAST can also be used in phylogenetic analysis, which is important for understanding
the evolutionary relationships between different species.
• BLAST can also be used to identify functionally conserved domains within proteins,
which is important for predicting the functions of proteins.
