A substitution matrix is a scoring system that quantifies the likelihood of one amino acid being substituted for another in an alignment. It is derived from statistical analysis of reliable alignments of highly related protein sequences. There are two main types of amino acid substitution matrices: those based on amino acid properties and those empirically derived like PAM and BLOSUM matrices from actual sequence alignments. PAM and BLOSUM matrices assign scores reflecting the observed frequency of substitutions versus the expected random frequency, converted to log-odds ratios. Higher scores indicate more evolutionarily conserved substitutions.
Substitution Matrix
It is a scoring system consisting of a set of values that quantify the likelihood of one residue being substituted by another in an alignment. It is derived from statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences.

SCORING MATRICES
• Scoring matrices for nucleotide sequences are relatively simple.
• A positive value or high score is given for a match and a negative value or low score for a mismatch.
• Such matrices assume that the frequencies of mutation are equal for all bases.
• However, this assumption may not be realistic; observations show that transitions (substitutions between purines and purines or between pyrimidines and pyrimidines) occur more frequently than transversions (substitutions between purines and pyrimidines).
• Scoring matrices for amino acids are more complicated because scoring has to reflect the physicochemical properties of amino acid residues, as well as the likelihood of certain residues being substituted among true homologous sequences.
• Amino acids with similar physicochemical properties can be substituted for one another more easily than those without similar characteristics.
• Substitutions among similar residues are likely to preserve the essential functional and structural features. Substitutions between residues with different physicochemical properties are more likely to disrupt structure and function. This type of disruptive substitution is less likely to be selected in evolution because it tends to render proteins nonfunctional.

Amino Acid Scoring Matrices
• Amino acid substitution matrices, which are 20 × 20 matrices, have been devised to reflect the likelihood of residue substitutions.
• There are essentially two types of amino acid substitution matrices.
• One type is based on interchangeability of the genetic code or amino acid properties, and the other is derived from empirical studies of amino acid substitutions.
• Although the two approaches coincide to a certain extent, the first approach, which is based on the genetic code or the physicochemical features of amino acids, has been shown to be less accurate than the second approach, which is based on surveys of actual amino acid substitutions among related proteins.
• The empirical matrices, which include PAM and BLOSUM matrices, are derived from actual alignments of highly similar sequences. By analyzing the probabilities of amino acid substitutions in these alignments, a scoring system can be developed by giving a high score for a more likely substitution and a low score for a rare substitution.
• For a given substitution matrix, a positive score means that the frequency of amino acid substitutions found in a data set of homologous sequences is greater than would have occurred by random chance. A zero score means that the frequency is equal to that expected by chance. A negative score means that the frequency is less than would have occurred by random chance.

Log-odds Ratio
• The substitution matrices apply logarithmic conversions to describe the probability of amino acid substitutions.
• The converted values are the so-called log-odds scores (or log-odds ratios), which are logarithms of the observed mutation frequency divided by the probability of substitution expected by random chance.
• The conversion can be either to base 10 or to base 2.
• For example, in an alignment of ten sequences, each with a single aligned position, nine of the sequences have F (phenylalanine) and the remaining one has I (isoleucine). The observed frequency of I being substituted by F is one in ten (0.1), whereas the probability of I being substituted by F by random chance is one in twenty (0.05). Thus, the ratio of the two probabilities is 2 (0.1/0.05).
Taking the logarithm of this ratio to base 2 gives a log-odds score of 1.

PAM Matrices
• PAM stands for “point accepted mutation”. The first PAM matrices were constructed by Margaret Dayhoff, who compiled alignments of seventy-one groups of very closely related protein sequences.
• One PAM unit is defined as a change in 1% of the amino acid positions.
• Construction of the PAM1 matrix involves alignment of full-length sequences and subsequent construction of phylogenetic trees using the parsimony principle. This allows computation of ancestral sequences for each internal node of the trees.
• Ancestral sequence information is used to count the number of substitutions along each branch of a tree.
• The PAM score for a particular residue pair is derived from a multistep procedure: calculation of relative mutability (the number of mutational changes from a common ancestor for a particular amino acid residue divided by the total number of such residues occurring in an alignment), normalization of the expected residue substitution frequencies by random chance, and logarithmic transformation to base 10 of the normalized mutability value divided by the frequency of a particular residue.
• The resulting value is rounded to the nearest integer and entered into the substitution matrix, which reflects the likelihood of amino acid substitutions. This completes the log-odds score computation.
• After compiling all substitution probabilities of possible amino acid mutations, a 20 × 20 PAM matrix is established.
• Positive scores in the matrix denote substitutions that occur more frequently than expected among evolutionarily conserved replacements. Negative scores correspond to substitutions that occur less frequently than expected.
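As a minimal sketch of the log-odds conversion described in this section, the Python snippet below reproduces the F/I example and shows how the sign of the score follows from the ratio of observed to expected frequency. The function name `log_odds` and its `base` parameter are illustrative assumptions, not part of any published PAM procedure, and the sketch covers only the final logarithmic step, not the mutability calculations.

```python
import math


def log_odds(observed, expected, base=2.0):
    """Return the logarithm (in the given base) of observed / expected.

    `observed` is the substitution frequency seen in the alignments and
    `expected` is the frequency expected by random chance.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(observed / expected) / math.log(base)


# Worked example from the text: nine F and one I at one aligned position.
observed_f_i = 1 / 10   # observed frequency of the I/F substitution
expected_f_i = 1 / 20   # chance probability used in the text

score_bits = log_odds(observed_f_i, expected_f_i, base=2)    # base 2
score_log10 = log_odds(observed_f_i, expected_f_i, base=10)  # base 10

# Matrix entries are rounded to the nearest integer before being stored.
print(round(score_bits), round(score_log10, 3))

# The sign of the score mirrors the interpretation given above:
# positive (more often than chance), zero (as expected), negative (rarer).
for obs in (0.10, 0.05, 0.01):
    print(obs, log_odds(obs, expected_f_i))
```

The base only rescales the scores; it does not change which substitutions come out positive or negative.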
• A PAM unit is defined as 1% amino acid change, or one mutation per 100 residues. Increasing PAM numbers correspond to increasing PAM units and thus to greater evolutionary distances between protein sequences.
• For example, PAM250, which corresponds to 20% amino acid identity, represents 250 mutations per 100 residues; the count can exceed 100 because a position may mutate more than once. In theory, this number of evolutionary changes corresponds approximately to an expected evolutionary span of 2,500 million years.
• Thus, the PAM250 matrix is normally used for divergent sequences. Accordingly, PAM matrices with lower serial numbers are more suitable for aligning more closely related sequences.
• The extrapolated values of the PAM250 amino acid substitution matrix are shown in Figure.

BLOSUM Matrices
• BLOSUM is the series of blocks amino acid substitution matrices, all of which are derived from direct observation of every possible amino acid substitution in multiple sequence alignments.
• These were constructed from more than 2,000 conserved amino acid patterns representing 500 groups of protein sequences. The sequence patterns, also called blocks, are ungapped alignments of fewer than sixty amino acid residues in length.
• The frequencies of amino acid substitutions of the residues in these blocks are calculated to produce a numerical table, or block substitution matrix.
• Instead of relying on extrapolation, each BLOSUM matrix is built directly from sequences at a given identity level, and its number is the percentage identity of the sequences selected for constructing it. For example, BLOSUM62 indicates that the sequences selected for constructing the matrix share an average identity value of 62%.
• Other BLOSUM matrices based on sequence groups of various identity levels have also been constructed. In reverse of the PAM numbering system, the lower the BLOSUM number, the more divergent the sequences it represents.
• The BLOSUM score for a particular residue pair is derived from the log ratio of the observed residue substitution frequency versus the expected probability of a particular residue.
The log odds is taken to base 2, rather than base 10 as in the PAM matrices.
• The resulting value is rounded to the nearest integer and entered into the substitution matrix. As in the PAM matrices, positive and negative values correspond to substitutions that occur, respectively, more or less frequently than expected among evolutionarily conserved replacements. The values of the BLOSUM62 matrix are shown in Figure.
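The block-based scoring described above can be sketched roughly as follows. This is a simplified illustration, not the published BLOSUM procedure: it handles a single ungapped block, skips the step of grouping sequences by percentage identity, and derives the expected frequencies from the block itself. The function name `block_log_odds`, the `scale` parameter, and the three-column toy block are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def block_log_odds(block, scale=1.0):
    """Rounded base-2 log-odds scores from one ungapped block.

    `block` is a list of equal-length strings (one per sequence).
    `scale` multiplies the score before rounding (an assumption here;
    a value of 2 would express scores in half-bit units).
    """
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Observed frequency of each residue pair.
    observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Residue frequencies implied by the observed pairs.
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2

    # Log-odds: observed pair frequency versus chance expectation.
    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Hypothetical toy block: four sequences, three ungapped columns.
toy_block = ["ASA", "ASA", "ASA", "ASS"]
print(block_log_odds(toy_block))
```

In this toy block the identical pairs (A with A, S with S) receive positive scores and the A/S substitution a negative one, matching the interpretation of positive and negative values given above. Only pairs that actually occur in the block are scored; a real matrix is built from many blocks so that all 20 × 20 entries are covered.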