Example of The Baum-Welch Algorithm
Larry Moss
1 Our corpus c
We start with a very simple corpus. We take the set Y of unanalyzed words to be {ABBA, BAB}, and c to
be given by c(ABBA) = 10, c(BAB) = 20.
Note that the total value of the corpus is Σ_{u∈Y} c(u) = 10 + 20 = 30.
2 The HMM h1
[Diagram: states s and t, with go(s, s) = .3, go(s, t) = .7, go(t, s) = .1, go(t, t) = .9.]
Starting probability of s is .85, of t is .15. In s, Pr(A) = .4, Pr(B) = .6. In t, Pr(A) = .5, Pr(B) = .5.
3 α(y, j, s)
Let y ∈ Y , and let n be the length of y. For 1 ≤ j ≤ n and s one of our states, we define α(y, j, s) to be the
probability in the space of analyzed words that the first j symbols match those of y, and the ending state is s.
This is related to the computations in the Forward Algorithm because the overall probability of y in the
HMM h is Σ_{u∈S} α(y, n, u). This number is written as Pr_h(y).
Writing y as A_1 A_2 · · · A_n, we have

α(y, 1, s) = start(s) out(s, A_1)
α(y, j + 1, s) = Σ_{t∈S} α(y, j, t) go(t, s) out(s, A_{j+1})
ABBA
α(ABBA, 1, s) = (.85)(.4) = 0.34.
α(ABBA, 1, t) = (.15)(.5) = 0.075, which we round to 0.08.
α(ABBA, 2, s) = (0.34)(.3)(.6) + (0.08)(.1)(.6) = 0.06120 + 0.00480 = 0.06600.
α(ABBA, 2, t) = (0.34)(.7)(.5) + (0.08)(.9)(.5) = 0.11900 + 0.03600 = 0.15500.
α(ABBA, 3, s) = (0.06600)(.3)(.6) + (0.15500)(.1)(.6) = 0.01188 + 0.00930 = 0.02118.
α(ABBA, 3, t) = (0.06600)(.7)(.5) + (0.15500)(.9)(.5) = 0.02310 + 0.06975 = 0.09285.
α(ABBA, 4, s) = (0.02118)(.3)(.4) + (0.09285)(.1)(.4) = 0.00254 + 0.00371 = 0.00625.
α(ABBA, 4, t) = (0.02118)(.7)(.5) + (0.09285)(.9)(.5) = 0.00741 + 0.04178 = 0.04919.
Total probability of ABBA is 0.00625 + 0.04919 = 0.05544.
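The α recursion is exactly the Forward Algorithm, so the table above can be checked mechanically. Below is a small Python sketch; the dictionary encoding of start, out, and go is mine, but the parameters are those of h1. Note that with exact arithmetic α(ABBA, 1, t) = 0.075 rather than the rounded 0.08, so the total comes out near 0.0548 rather than 0.05544.

```python
# Forward algorithm for the two-state HMM h1 (dictionary encoding is mine).
start = {'s': 0.85, 't': 0.15}
out = {'s': {'A': 0.4, 'B': 0.6}, 't': {'A': 0.5, 'B': 0.5}}
go = {'s': {'s': 0.3, 't': 0.7}, 't': {'s': 0.1, 't': 0.9}}
states = ['s', 't']

def forward(y):
    """Return alpha with alpha[j][s] = α(y, j+1, s) (0-indexed positions)."""
    alpha = [{s: start[s] * out[s][y[0]] for s in states}]
    for sym in y[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[t] * go[t][s] for t in states) * out[s][sym]
                      for s in states})
    return alpha

alpha = forward("ABBA")
prob = sum(alpha[-1].values())   # Pr_h(ABBA), about 0.0548 in exact arithmetic
```

The difference from the 0.05544 above is entirely due to rounding 0.075 up to 0.08 at the first step.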
BAB
α(BAB, 1, s) = (.85)(.6) = 0.51.
α(BAB, 1, t) = (.15)(.5) = 0.075, which we round to 0.08.
α(BAB, 2, s) = (0.51)(.3)(.4) + (0.08)(.1)(.4) = 0.0612 + 0.0032 = 0.0644.
α(BAB, 2, t) = (0.51)(.7)(.5) + (0.08)(.9)(.5) = 0.1785 + 0.0360 = 0.2145.
α(BAB, 3, s) = (0.0644)(.3)(.6) + (0.2145)(.1)(.6) = 0.0116 + 0.0129 = 0.0245.
α(BAB, 3, t) = (0.0644)(.7)(.5) + (0.2145)(.9)(.5) = 0.0225 + 0.0965 = 0.1190.
Total probability of BAB is 0.0245 + 0.1190 = 0.1435.
4 β(y, j, s)
Let y ∈ Y have length n. For 1 ≤ j ≤ n and s one of our states, β(y, j, s) is the probability that, starting from state s at position j, the machine goes on to output the remaining symbols A_{j+1} · · · A_n of y. Thus β(y, n, s) = 1, and for j < n,

β(y, j, s) = Σ_{u∈S} go(s, u) out(u, A_{j+1}) β(y, j + 1, u)

4.2 β(BAB, j, s) for 1 ≤ j ≤ 3
β(BAB, 3, s) = 1.
β(BAB, 3, t) = 1.
β(BAB, 2, s) = Σ_{u∈S} go(s, u) out(u, B) β(BAB, 3, u) = (.3)(.6)(1) + (.7)(.5)(1) = 0.53000
β(BAB, 2, t) = Σ_{u∈S} go(t, u) out(u, B) β(BAB, 3, u) = (.1)(.6)(1) + (.9)(.5)(1) = 0.51000
β(BAB, 1, s) = Σ_{u∈S} go(s, u) out(u, A) β(BAB, 2, u) = (.3)(.4)(0.53000) + (.7)(.5)(0.51000) = 0.24210
β(BAB, 1, t) = Σ_{u∈S} go(t, u) out(u, A) β(BAB, 2, u) = (.1)(.4)(0.53000) + (.9)(.5)(0.51000) = 0.25070
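The β recursion (the Backward Algorithm) can likewise be run mechanically. This sketch uses my own dictionary encoding of h1's parameters, and then cross-checks the result against the identity Pr_h(y) = Σ_{u∈S} start(u) out(u, A_1) β(y, 1, u).

```python
# Backward algorithm for h1 (dictionary encoding is mine).
start = {'s': 0.85, 't': 0.15}
out = {'s': {'A': 0.4, 'B': 0.6}, 't': {'A': 0.5, 'B': 0.5}}
go = {'s': {'s': 0.3, 't': 0.7}, 't': {'s': 0.1, 't': 0.9}}
states = ['s', 't']

def backward(y):
    """Return beta with beta[j][s] = β(y, j+1, s); the last row is all 1s."""
    beta = [{s: 1.0 for s in states}]
    for sym in reversed(y[1:]):          # symbol output at the next position
        nxt = beta[0]
        beta.insert(0, {s: sum(go[s][u] * out[u][sym] * nxt[u] for u in states)
                        for s in states})
    return beta

beta = backward("BAB")
# β(BAB, 1, s) = 0.2421 and β(BAB, 1, t) = 0.2507, as computed above.
p = sum(start[u] * out[u]['B'] * beta[0][u] for u in states)  # Pr_h(BAB)
```

With exact arithmetic this gives Pr_h(BAB) ≈ 0.1423; the α computation above, which rounds 0.075 up to 0.08 at the first step, comes out slightly larger.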
5 γ(y, j, s, t)
Let y ∈ Y, and write y as A_1 · · · A_n. We want the probability in the subspace A(y) that an analyzed word
has s as its jth state, A_{j+1} as its (j + 1)st symbol, and t as its (j + 1)st state. (This only makes sense when
1 ≤ j < n.)
This probability is called γ(y, j, s, t). It is given by

γ(y, j, s, t) = α(y, j, s) go(s, t) out(t, A_{j+1}) β(y, j + 1, t) / Pr_h(y)

In other words, γ(y, j, s, t) is the probability that a word in A(y) has an s as its jth state and a t as its
(j + 1)st state.
It is important to see that for different unanalyzed words, say y and z, γ(y, j, s, t) and γ(z, j, s, t) are
probabilities in different spaces.
For example,

γ(ABBA, 1, t, s) = α(ABBA, 1, t) go(t, s) out(s, B) β(ABBA, 2, s) / Pr_h(ABBA) = (0.08)(.1)(.6)(0.25610) / 0.05544 = 0.02217.
The values are
γ(ABBA, 1, s, s) = 0.28271 γ(ABBA, 2, s, s) = 0.10071 γ(ABBA, 3, s, s) = 0.04584
γ(ABBA, 1, s, t) = 0.53383 γ(ABBA, 2, s, t) = 0.20417 γ(ABBA, 3, s, t) = 0.13371
γ(ABBA, 1, t, s) = 0.02217 γ(ABBA, 2, t, s) = 0.07884 γ(ABBA, 3, t, s) = 0.06699
γ(ABBA, 1, t, t) = 0.16149 γ(ABBA, 2, t, t) = 0.61648 γ(ABBA, 3, t, t) = 0.75365
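Plugging numbers from the α and β tables into the formula for γ reproduces the sample entry. The variable names below are mine; the rounded values 0.08, 0.25610, and 0.05544 are the ones used in these notes.

```python
# γ(ABBA, 1, t, s) from the rounded table values in the notes.
alpha_1_t = 0.08          # α(ABBA, 1, t)
go_t_s = 0.1              # go(t, s)
out_s_B = 0.6             # out(s, B); the 2nd symbol of ABBA is B
beta_2_s = 0.25610        # β(ABBA, 2, s)
pr_ABBA = 0.05544         # Pr_h(ABBA)

gamma_1_t_s = alpha_1_t * go_t_s * out_s_B * beta_2_s / pr_ABBA
print(round(gamma_1_t_s, 5))   # 0.02217
```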
6 δ(y, j, s)
This is the probability that an analyzed word in A(y) has s as its jth state. For j < length(y), δ(y, j, s) =
Σ_{u∈S} γ(y, j, s, u). Also, δ(y, n, s) = α(y, n, s) / Pr_h(y).
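For instance, summing across a row of the γ table gives a δ value; a quick check (the variable names are mine):

```python
# δ(ABBA, 1, s) = γ(ABBA, 1, s, s) + γ(ABBA, 1, s, t), from the table above.
delta_1_s = 0.28271 + 0.53383   # = 0.81654
delta_1_t = 0.02217 + 0.16149   # = 0.18366
# The two should sum to 1; rounding in the table leaves a small discrepancy.
print(delta_1_s + delta_1_t)
```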
7 A new model
We now use the γ and δ values to build a new HMM, which we call h2. The probability of going from state s to state s will be K/(K + L), where K is the expected number of s-to-s transitions in the corpus, K = Σ_{y∈Y} c(y) Σ_j γ(y, j, s, s), and L is the expected number of s-to-t transitions, defined the same way from γ(y, j, s, t).
So the new value of go(s, s) is 0.298. Similarly, the new value of go(s, t) is 0.702. The probability of going
from state t to state s will be M/(M + N), where M and N are the expected numbers of t-to-s and t-to-t transitions, respectively.
So the new value of go(t, s) is 0.106. Similarly, the new value of go(t, t) is 0.894.
Turning to the outputs, the probability that in state s we output A is K/(K + L), where K is the expected number of times that state s outputs A, K = Σ_{y∈Y} c(y) Σ_{j : A_j = A} δ(y, j, s), and L is the expected number of times that s outputs B.
Thus the probability is 0.357. Similarly, the probability that we output B in state s is 0.643.
The probability that in state t we output A is M/(M + N), defined analogously.
Thus the probability is 0.4292. Similarly, the probability that we output B in state t is N/(M + N) = 0.5708.
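The whole update fits in a few lines of code. The sketch below is my own encoding: the K, L, M, N bookkeeping is expressed directly as γ and δ sums weighted by the corpus counts. Run without intermediate rounding, it lands on the same transition values 0.298 and 0.106 reported above, and it also exhibits the EM guarantee that the corpus log likelihood does not decrease.

```python
import math

# One Baum-Welch iteration for the two-state HMM (encoding is mine).
states = ['s', 't']
symbols = ['A', 'B']
corpus = {'ABBA': 10, 'BAB': 20}

def forward(y, start, out, go):
    a = [{s: start[s] * out[s][y[0]] for s in states}]
    for sym in y[1:]:
        a.append({s: sum(a[-1][t] * go[t][s] for t in states) * out[s][sym]
                  for s in states})
    return a

def backward(y, out, go):
    b = [{s: 1.0 for s in states}]
    for sym in reversed(y[1:]):
        b.insert(0, {s: sum(go[s][u] * out[u][sym] * b[0][u] for u in states)
                     for s in states})
    return b

def log_likelihood(start, out, go):
    return sum(c * math.log(sum(forward(y, start, out, go)[-1].values()))
               for y, c in corpus.items())

def baum_welch_step(start, out, go):
    tr_num = {s: {t: 0.0 for t in states} for s in states}   # expected s->t counts
    em_num = {s: {a: 0.0 for a in symbols} for s in states}  # expected outputs
    st_num = {s: 0.0 for s in states}                        # expected starts
    for y, c in corpus.items():
        a, b = forward(y, start, out, go), backward(y, out, go)
        p = sum(a[-1].values())                              # Pr_h(y)
        for j in range(len(y)):
            for s in states:
                d = a[j][s] * b[j][s] / p                    # δ(y, j+1, s)
                em_num[s][y[j]] += c * d
                if j == 0:
                    st_num[s] += c * d
                if j + 1 < len(y):
                    for t in states:                         # γ(y, j+1, s, t)
                        tr_num[s][t] += (c * a[j][s] * go[s][t]
                                         * out[t][y[j + 1]] * b[j + 1][t] / p)
    new_go = {s: {t: tr_num[s][t] / sum(tr_num[s].values()) for t in states}
              for s in states}
    new_out = {s: {a2: em_num[s][a2] / sum(em_num[s].values()) for a2 in symbols}
               for s in states}
    total = sum(corpus.values())
    new_start = {s: st_num[s] / total for s in states}
    return new_start, new_out, new_go

start = {'s': 0.85, 't': 0.15}
out = {'s': {'A': 0.4, 'B': 0.6}, 't': {'A': 0.5, 'B': 0.5}}
go = {'s': {'s': 0.3, 't': 0.7}, 't': {'s': 0.1, 't': 0.9}}

ll_before = log_likelihood(start, out, go)
start, out, go = baum_welch_step(start, out, go)
ll_after = log_likelihood(start, out, go)   # EM: ll_after >= ll_before
```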
[Diagram of h2: go(s, s) = 0.298, go(s, t) = 0.702, go(t, s) = 0.106, go(t, t) = 0.894.]
8 Again
At this point, we do all the calculations over again. I have hidden them, and only report the probabilities of
the elements of Y and the log likelihood of the corpus.
Total probability of ABBA is 0.00635 + 0.04690 = 0.05325.
Total probability of BAB is 0.0223 + 0.1250 = 0.1473.
8.2 Again a new model
Using h2 , we then do all the calculations and construct a new HMM which we call h3 :
[Diagram of h3: go(s, s) = 0.292, go(s, t) = 0.708, go(t, s) = 0.109, go(t, t) = 0.891.]
9 Again Again
Total probability of ABBA is 0.00653 + 0.04672 = 0.05325.
Total probability of BAB is 0.0223 + 0.1254 = 0.1477.
[Diagram of the new model: go(s, s) = 0.287, go(s, t) = 0.713, go(t, s) = 0.111, go(t, t) = 0.889.]
Starting probability of s is 0.841, of t is 0.159. In s, Pr(A) = 0.3637, Pr(B) = 0.6363. In t, Pr(A) = 0.4243,
Pr(B) = 0.5757.
10 The likelihoods
The log likelihood of c in h1 was −68.2611.
The log likelihood of c in h2 was −67.6333.
The log likelihood of c in h3 was −67.5790.
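Each of these numbers is just Σ_{y∈Y} c(y) ln Pr_h(y). For h2, for example, using the word probabilities reported in Section 8 (variable names are mine):

```python
import math

# Log likelihood of the corpus under h2, from the Section 8 probabilities.
probs = {'ABBA': 0.05325, 'BAB': 0.1473}
counts = {'ABBA': 10, 'BAB': 20}
ll = sum(counts[y] * math.log(probs[y]) for y in probs)
print(round(ll, 3))   # about -67.633, matching the value for h2 above
```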
In playing around with different starting values, I found that the likelihood of h3 was sometimes worse than
that of h2 (contrary to what we'll prove in class). I believe this is due to rounding errors in the calculations of
the starting probabilities in the different states. I also noticed that most of the updating went to those
starting probabilities, with the other parameters changing only a little.