

Three Problems Of Hidden Markov Models

1) Scoring Problem

We want to find the probability of an observed sequence given an HMM. One way to calculate this probability would be to enumerate every possible sequence of hidden states, compute the probability of the observation along each, and sum these probabilities. Because the number of paths grows exponentially with the length of the sequence, this is impractical; the Forward algorithm computes the same sum efficiently by dynamic programming.

[Figure: a small HMM with insert states I0, I1 and match states M1, M2, M3]

Consider the HMM shown above. In this figure, several paths exist that could generate the protein sequence ACCY.

The Forward algorithm employs a matrix, shown below. The columns of the matrix are indexed by the states

in the model, and the rows are indexed by the sequence. The elements of the matrix are initialized to zero and

then computed with these steps:

1. The probability that the amino acid A was generated by state I0 is computed and entered as the first element

of the matrix. This is .4*.3 = .12

2. The probabilities that C is emitted in state M1 (multiplied by the probability of the transition from state I0 to state M1) and in state I1 (multiplied by the probability of the transition from state I0 to state I1) are entered into the matrix elements indexed by C and I1/M1.

3. The sum of the two probabilities, sum(I1, M1), is calculated.

4. A pointer is set from the winner back to state I0. (The sums computed in step 3 are all the Forward algorithm itself needs; the back-pointers become important in the Viterbi algorithm described below.)


5. Steps 2-4 are repeated until the matrix is filled.

The probability of the sequence is found by summing the probabilities in the row for the final symbol.

Matrix for the Forward algorithm
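The recursion above is easy to express in code. Below is a minimal sketch of the Forward algorithm in Python for a generic HMM; the two-state model, its probabilities, and the symbol encoding are invented for illustration and are not the values from the figure above.

```python
import numpy as np

def forward(obs, trans, emit, init):
    """P(obs | model): fill the matrix row by row, summing over
    all predecessor states (steps 1-5 above, with sums in step 3)."""
    f = np.zeros((len(obs), trans.shape[0]))
    f[0] = init * emit[:, obs[0]]          # step 1: first symbol
    for t in range(1, len(obs)):
        # steps 2-3: weight every predecessor by its transition
        # probability and sum, then multiply by the emission probability
        f[t] = emit[:, obs[t]] * (f[t - 1] @ trans)
    return f[-1].sum()                     # sum the entries for the last symbol

# Invented two-state example: 3 emission symbols, arbitrary probabilities.
trans = np.array([[0.6, 0.4],
                  [0.3, 0.7]])
emit = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.4, 0.5]])
init = np.array([0.8, 0.2])
print(forward([0, 1, 1, 2], trans, emit, init))
```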

2) Alignment Problem

We often wish to take a particular HMM, and determine from an observation sequence the most likely sequence

of underlying hidden states that might have generated it. This is the alignment problem and the Viterbi

Algorithm is used to solve this problem.

The Viterbi algorithm is similar to the Forward algorithm; however, in step 3 a maximum rather than a sum is calculated. The most likely path through the model can then be found by following the back-pointers.

Matrix for the Viterbi algorithm

Once the most probable path through the model is known, the probability of the sequence along that path can be computed by multiplying all emission and transition probabilities along it. Note that this Viterbi score counts only the single best path; the full probability of the sequence given the model is the sum over all paths computed by the Forward algorithm.
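A sketch of the corresponding Viterbi recursion follows, reusing the matrix conventions of forward() above; the only changes are the max in place of the sum and the back-pointer bookkeeping. Again, this is an illustrative implementation, not code from the original source.

```python
import numpy as np

def viterbi(obs, trans, emit, init):
    """Most probable hidden-state path: the forward() recursion with
    max instead of sum, plus back-pointers for the traceback."""
    v = np.zeros((len(obs), trans.shape[0]))
    ptr = np.zeros((len(obs), trans.shape[0]), dtype=int)
    v[0] = init * emit[:, obs[0]]
    for t in range(1, len(obs)):
        scores = v[t - 1][:, None] * trans   # scores[i, j]: reach j via i
        ptr[t] = scores.argmax(axis=0)       # pointer back to the winner
        v[t] = emit[:, obs[t]] * scores.max(axis=0)
    path = [int(v[-1].argmax())]             # best final state
    for t in range(len(obs) - 1, 0, -1):     # follow the back-pointers
        path.append(int(ptr[t][path[-1]]))
    return path[::-1], float(v[-1].max())
```

The returned score is the probability of the single best path; multiplying the probabilities along that path by hand gives the same number.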

3) Training Problem

Another tricky problem is how to create an HMM in the first place, given a particular set of related training

sequences. It is necessary to estimate the amino acid emission distributions in each state and all state-to-state

transition probabilities from a set of related training sequences. This is done by using the Baum-Welch

Algorithm or the Forward Backward Algorithm.

The algorithm proceeds by making an initial guess of the parameters (which may well be entirely wrong) and then refining it: it assesses how well the current model explains the training data and re-estimates the parameters to explain it better. Each iteration is guaranteed not to decrease the likelihood of the training data, so the procedure behaves as a local hill-climbing search on the likelihood, analogous to gradient descent on an error measure, and is an instance of expectation-maximization (EM).
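The sketch below shows one possible form of this re-estimation loop for a single training sequence, with the forward and backward variables computed explicitly. It follows the standard Baum-Welch recurrences, but the code itself is illustrative and omits refinements such as scaling for long sequences.

```python
import numpy as np

def baum_welch(obs, trans, emit, init, n_iter=20):
    """Iteratively re-estimate (trans, emit, init) from one sequence.
    Each pass cannot decrease the likelihood of the training data."""
    obs = np.asarray(obs)
    trans, emit, init = trans.copy(), emit.copy(), init.copy()
    N, T = trans.shape[0], len(obs)
    for _ in range(n_iter):
        # E-step: forward (f) and backward (b) variables
        f = np.zeros((T, N)); b = np.zeros((T, N))
        f[0] = init * emit[:, obs[0]]
        for t in range(1, T):
            f[t] = emit[:, obs[t]] * (f[t - 1] @ trans)
        b[-1] = 1.0
        for t in range(T - 2, -1, -1):
            b[t] = trans @ (emit[:, obs[t + 1]] * b[t + 1])
        like = f[-1].sum()
        gamma = f * b / like                 # expected state occupancies
        xi = np.zeros((N, N))                # expected transition counts
        for t in range(T - 1):
            xi += (f[t][:, None] * trans *
                   (emit[:, obs[t + 1]] * b[t + 1])[None, :]) / like
        # M-step: new parameters from the expected counts
        init = gamma[0]
        trans = xi / gamma[:-1].sum(axis=0)[:, None]
        for k in range(emit.shape[1]):
            emit[:, k] = gamma[obs == k].sum(axis=0)
        emit /= gamma.sum(axis=0)[:, None]
    return trans, emit, init
```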

1. Applications of HMMs

In this section, we will look in greater depth at specific problems in the area of computational biology and examine the role of HMMs.

1.1. Gene finding and prediction

We introduce here gene-prediction HMMs that can be used to predict the structure of a gene. Our objective is to find the coding and non-coding regions of an unlabeled string of DNA nucleotides.

The motivation behind this is to

 assist in the annotation of genomic data produced by genome sequencing methods

 gain insight into the mechanisms involved in transcription, splicing and other processes

As shown in the diagram above, a string of DNA nucleotides containing a gene will have separate regions

 Introns – non-coding regions within a gene

 Exons – coding regions

These regions are separated by functional sites

 Start and stop codons

 Splice sites – acceptors and donors

During transcription and splicing, only the exons are retained to form the protein sequence, as depicted below.

Many problems in biological sequence analysis have a grammatical structure, and HMMs are very useful for modelling such grammars. The input to such an HMM is the genomic DNA sequence, and the output, in the simplest case, is a parse of the DNA sequence into exons and introns.

Shown below is a simple model for unspliced genes that recognizes the start codon, stop codon (only one of the three possible stop codons is shown) and the coding/non-coding regions. This model has been trained with a set of known gene sequences.

Having such a model, how can we predict genes in a sequence of anonymous DNA? We simply use the Viterbi algorithm to find the most probable path through the model, as in the toy sketch below.
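As a toy illustration of Viterbi decoding for gene finding, the sketch below labels each nucleotide of a sequence as coding or non-coding with a two-state HMM. The states, transition probabilities, and emission tables are all invented for the example (a real gene model would include start/stop-codon and splice-site states), and viterbi() is the sketch from the alignment section above.

```python
import numpy as np

nuc = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
trans = np.array([[0.9, 0.1],      # state 0 = non-coding, tends to persist
                  [0.1, 0.9]])     # state 1 = coding, tends to persist
emit = np.array([[0.25, 0.25, 0.25, 0.25],   # non-coding: uniform bases
                 [0.15, 0.35, 0.35, 0.15]])  # coding: GC-rich (assumed)
init = np.array([0.5, 0.5])

dna = "ATGCGCGCTAAT"
path, score = viterbi([nuc[c] for c in dna], trans, emit, init)
print(dna)
print(''.join('C' if s else 'N' for s in path))  # per-base coding call
```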

1.2. Protein Profile HMMs

As we have seen earlier, protein structural similarities make it possible to create a statistical model of a protein family, called a profile. The idea is that, given a target amino-acid sequence of unknown structure, we want to infer the structure of the protein from the family whose profile it matches. The profile HMM is built by analyzing the distribution of amino acids in a training set of related proteins. This HMM models position-dependent gap penalties in a natural way.

The basic topology of a profile HMM is shown above. Each position, or module, in the model has three states.

[Figure: a profile HMM module with its matching, insertion, and deletion states]
A state shown as a rectangular box is a match state that models the distribution of letters in the corresponding

column of an alignment. A state shown by a diamond-shaped box models insertions of random letters between

two alignment positions, and a state shown by a circle models a deletion, corresponding to a gap in an

alignment. States of neighboring positions are connected, as shown by lines. For each of these lines there is an

associated 'transition probability', which is the probability of going from one state to the other.

The match state represents a consensus amino acid for this position in the protein family. The delete state is a

non-emitting state, and represents skipping this consensus position in the multiple alignment. Finally, the insert

state models the insertion of any number of residues after this consensus position.

A repository of protein profile HMMs can be found in the PFAM Database (http://pfam.wustl.edu).

From a family of proteins (or DNA sequences), a profile HMM can be built and used to search a database for other members of the family. As we have seen before in the section on HMM problems, profile HMMs can also be used for the following:

Scoring a sequence

The probability of a sequence given a profile is calculated by simply multiplying the emission and transition probabilities along the path, as in the sketch below.
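A minimal sketch of this calculation for a hypothetical two-match-state profile is shown below; the state names, transition table, and emission values are invented for illustration.

```python
# Hypothetical profile parameters (not taken from any real model).
transitions = {('begin', 'M1'): 1.0, ('M1', 'M2'): 0.9, ('M2', 'end'): 1.0}
emissions = {'M1': {'A': 0.6, 'C': 0.4},
             'M2': {'C': 0.7, 'Y': 0.3}}

def score_path(seq, path):
    """Multiply emission and transition probabilities along one path."""
    p, prev = 1.0, 'begin'
    for residue, state in zip(seq, path):
        p *= transitions[(prev, state)] * emissions[state][residue]
        prev = state
    return p * transitions[(prev, 'end')]

print(score_path("AC", ["M1", "M2"]))   # 1.0*0.6 * 0.9*0.7 * 1.0 = 0.378
```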

Classifying sequences in a database

Given an HMM for a protein family and some unknown sequences, we try to find a path through the model where each new sequence fits in; that is, we try to 'align' the sequence to the model. An alignment to the model is an assignment of states to each residue in the sequence. There are many such alignments, and the Viterbi algorithm is used to find the most probable one and the probability of the sequence for that alignment.

Creating a multiple sequence alignment

HMMs can be used to automatically create a multiple alignment from a group of unaligned sequences. By taking a close look at the alignment, we can see traces of the family's evolutionary history. One great advantage of HMMs is that they can be estimated from sequences without having to align the sequences first. The sequences used to estimate or train the model are called the training sequences, and any reserved sequences used to evaluate the model are called the test sequences. The model estimation is done with the forward-backward algorithm, also known as the Baum-Welch algorithm. It is an iterative algorithm that maximizes the likelihood of the training sequences.

1.3. Prediction of protein secondary structure using HMMs

Prediction of secondary structure is needed for the prediction of protein function. As an alternative to direct X-ray analysis, an HMM can be used to

 Analyze the amino-acid sequences of proteins

 Learn secondary structures such as helix, sheet and turn

 Predict the secondary structures of sequences

The method is to train four HMMs, one per kind of secondary structure – helix, sheet, turn and other – on training sequences of known structure, using the Baum-Welch method. After training, the helix HMM produces helix-like sequences with high probability, and similarly for the other three models. These HMMs can then be used to predict the secondary structure of a test sequence: the forward-backward algorithm computes the probability of each HMM outputting the test sequence, and the sequence is assigned the secondary structure whose HMM shows the highest probability of outputting it, as sketched below.
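The classification step can be sketched as follows, reusing forward() from the scoring section; the per-class parameters here are random placeholders standing in for Baum-Welch-trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def placeholder_hmm(n_states=3, n_symbols=20):
    """Random stand-in for a trained model (one symbol per amino acid)."""
    trans = rng.random((n_states, n_states))
    emit = rng.random((n_states, n_symbols))
    init = rng.random(n_states)
    return (trans / trans.sum(1, keepdims=True),
            emit / emit.sum(1, keepdims=True),
            init / init.sum())

# One HMM per secondary-structure class (placeholders, not trained values).
models = {name: placeholder_hmm() for name in ('helix', 'sheet', 'turn', 'other')}

def classify(obs):
    """Assign the class whose HMM gives the sequence the highest probability."""
    return max(models, key=lambda name: forward(obs, *models[name]))

print(classify([3, 7, 7, 1, 0]))   # indices of a toy peptide's residues
```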

2. HMM implementation

Two publicly available HMM software implementations are:

HMMER - http://hmmer.wustl.edu/

SAM system - http://www.cse.ucsc.edu/research/compbio/sam.html

3. Advantages of HMMs

 HMMs can accommodate variable-length sequences.

Because most biological data has variable length, machine learning techniques that require a fixed-length input, such as neural networks or support vector machines, are less successful in biological sequence analysis.

 Allows position-dependent gap penalties.

HMMs treat insertions and deletions in a statistical manner that depends on position.

4. Limitations of HMMs

 Linear Model

Because the model is linear, it is unable to capture higher-order correlations among amino acids, such as interactions between residues that are distant in the sequence but close in the folded structure.

 Markov Chain assumption of independent events

The probabilities of the states are assumed to be independent of one another, which is not true in biology: the residue emitted at one position x and the residue emitted at a distant position y are modelled as independent, i.e. P(x, y) = P(x) P(y), even when the two positions actually interact.

 Standard Machine Learning Problems

In the training problem we need to watch out for local maxima, so the model may not converge to a truly optimal parameter set for a given training set. Secondly, the model is only as good as its training set, which may lead to over-fitting.

5. Open areas for research in HMMs in biology

 Integration of structural information into profile HMMs.

Although it seems obvious that structural information about a member of a protein family, when it exists, should be used to improve the parameterization of the HMM, this has been extremely hard to achieve in practice.

 Model architecture

The architectures of HMMs have largely been chosen to be the simplest architectures that can fit the observed

data. Is this the best architecture to use? Can one use protein structure knowledge to make better architecture

decisions, or, in limited regions, to learn the architecture directly from the data? Will these implied architectures

have implications for our structural understanding?

 Biological mechanism

In gene prediction, HMMs may be getting close to replicating the accuracy of the biological machinery itself (the HMMs have the additional task of finding the gene in its genomic DNA context, which is not handled by the biological machinery that processes the RNA). What constraints does our statistical model place on the biological mechanism? In particular, can we conceive of a biological mechanism that could use the same information as the HMM?

6. References

1. L. R. Rabiner and B. H. Juang, An Introduction to Hidden Markov Models, IEEE ASSP Magazine, January 1986,

pp. 1-16.

2. K. Asai, S. Hayamizu and H. Handa, Prediction of protein secondary structures by hidden Markov models, Computer Applications in the Biosciences (CABIOS), Vol. 9, No. 2, 1993, pp. 141-146.

3. Krogh, A., M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. Hidden Markov Models in Computational

Biology: Applications to Protein Modeling, J. Mol. Biol., Vol. 235, pp. 1501-1531, 1994.

4. S. Eddy. Profile hidden Markov models. Bioinformatics, 14:755-763, 1998.

5. L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-285, February 1989.

6. Baldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. (1993) Hidden Markov Models in Molecular Biology:

New Algorithms and Applications, In Advances in Neural Information Processing Systems 5, Eds. S.J. Hanson,

J.D. Cowan and C. Lee Giles, (Morgan Kaufmann) pp 747-754

7. Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A. (1994) Hidden Markov Models of Biological Primary Sequence Information, Proceedings of the National Academy of Sciences, USA 91: 1059-1063.

8. David Kulp, David Haussler, Martin G. Reese, and Frank H. Eeckman, A generalized hidden Markov model for the recognition of human genes in DNA, Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (Menlo Park, CA), AAAI Press, 1996.

9. R. Hughey and A. Krogh. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Computer Applications in the Biosciences, 12:95-107, 1996. http://www.cse.ucsc.edu/research/compbio/html_format_papers/hughkrogh96/cabios.html

10. Henderson, J., Salzberg, S. and Fasman, K. Finding genes in human DNA with a hidden Markov model. Journal of Computational Biology, 4:127-141, 1997.

11. An Introduction to Hidden Markov Models for Biological Sequences by A. Krogh. In S. L. Salzberg et al., eds.,

Computational Methods in Molecular Biology, 45-63. Elsevier, 1998.

12. E. Birney. Hidden Markov models in biological sequence analysis. IBM Journal of Research and Development (issue on deep computing for the life sciences), Vol. 45, Nos. 3/4, 2001. http://www.research.ibm.com/journal/rd/453/birney.html

13. HMMER - http://hmmer.wustl.edu/

14. SAM system - http://www.cse.ucsc.edu/research/compbio/sam.html
