CHAPTER 3: N-Gram Language Models
Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next few words someone
is going to say? What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:

on guys all I of notice sidewalk three a sudden standing the
Why would you want to predict upcoming words, or assign probabilities to sen-
tences? Probabilities are essential in any task in which we have to identify words in
noisy, ambiguous input, like speech recognition. For a speech recognizer to realize
that you said I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish. For writing
tools like spelling correction or grammatical error correction, we need to find and
correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved.
The phrase There are will be much more probable than Their are, and has improved
than has improve, allowing us to help users by detecting and correcting these errors.
Assigning probabilities to sequences of words is also essential in machine trans-
lation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “its water is so transparent that” and we
want to know the probability that the next word is the:

P(the | its water is so transparent that)     (3.1)
One way to estimate this probability is from relative frequency counts: take a
very large corpus, count the number of times we see its water is so transparent that,
and count the number of times this is followed by the. This would be answering the
question “Out of the times we saw the history h, how many times was it followed by
the word w?”, as follows:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)     (3.2)
With a large enough corpus, such as the web, we can compute these counts and
estimate the probability from Eq. 3.2. You should pause now, go to the web, and
compute this estimate for yourself.
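In code, this relative frequency estimate is just two substring counts. The toy sketch below is purely illustrative (the corpus string and function name are ours, and naive substring matching ignores tokenization issues), but it mirrors Eq. 3.2:

```python
def relative_frequency(corpus_text, history, word):
    """Estimate P(word | history) as C(history word) / C(history), as in Eq. 3.2."""
    history_count = corpus_text.count(history)
    if history_count == 0:
        return 0.0
    return corpus_text.count(history + " " + word) / history_count

corpus_text = ("its water is so transparent that the fish are visible . "
               "its water is so transparent that you can see the bottom .")
print(relative_frequency(corpus_text, "its water is so transparent that", "the"))  # 0.5
```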
While this method of estimating probabilities directly from counts works fine in
many cases, it turns out that even the web isn’t big enough to give us good estimates
in most cases. This is because language is creative; new sentences are created all the
time, and we won’t always be able to count entire sentences. Even simple extensions
of the example sentence may have counts of zero on the web (such as “Walden
Pond’s water is so transparent that the”; well, used to have counts of zero).
The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words:

P(w_1^n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})     (3.4)

Equation 3.4 suggests that we could estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities. But
using the chain rule doesn’t really seem to help us! We don’t know any way to
compute the exact probability of a word given a long sequence of preceding words, P(w_n \mid w_1^{n-1}). As we said above, we can’t just estimate by counting the number of
times every word occurs following every long string, because language is creative
and any particular context might have never occurred before!
The intuition of the n-gram model is that instead of computing the probability of
a word given its entire history, we can approximate the history by just the last few
words.
The bigram model, for example, approximates the probability of a word given all the previous words P(w_n \mid w_1^{n-1}) by using only the conditional probability of the preceding word P(w_n \mid w_{n-1}). In other words, instead of computing the probability

P(the | Walden Pond’s water is so transparent that)     (3.5)

we approximate it with the probability

P(the | that)     (3.6)
When we use a bigram model to predict the conditional probability of the next
word, we are thus making the following approximation:
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})     (3.7)
The assumption that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks n − 1 words into the past).
Thus, the general equation for this n-gram approximation to the conditional probability of the next word in a sequence is

P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})     (3.8)

Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:

P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})     (3.9)
How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1.¹
For example, to compute a particular bigram probability of a word y given a
previous word x, we’ll compute the count of the bigram C(xy) and normalize by the
sum of all the bigrams that share the same first word x:
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_w C(w_{n-1} w)}     (3.10)
We can simplify this equation, since the sum of all bigram counts that start with
a given word wn−1 must be equal to the unigram count for that word wn−1 (the reader
should take a moment to be convinced of this):
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}     (3.11)
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a special end-symbol </s>.²
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
1 For probabilistic models, normalizing means dividing by some total count so that the resulting prob-
abilities fall legally between 0 and 1.
2 We need the end-symbol to make the bigram grammar a true probability distribution. Without an
end-symbol, the sentence probabilities for all sentences of a given length would sum to one. This model
would define an infinite set of probability distributions, with one distribution per sentence length. See
Exercise 3.5.
Here are the calculations for some of the bigram probabilities from this corpus:

P(I|<s>) = 2/3 = .67     P(Sam|<s>) = 1/3 = .33     P(am|I) = 2/3 = .67
P(</s>|Sam) = 1/2 = .5     P(Sam|am) = 1/2 = .5     P(do|I) = 1/3 = .33
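To make the counting concrete, here is a small illustrative sketch in Python (the function and variable names are ours) that recovers these probabilities from the mini-corpus using Eq. 3.11:

```python
from collections import Counter

# The three-sentence mini-corpus, augmented with <s> and </s>.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    """MLE estimate P(w | w_prev) = C(w_prev w) / C(w_prev), as in Eq. 3.11."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("<s>", "I"))     # 2/3 ≈ .67
print(bigram_prob("I", "am"))      # 2/3 ≈ .67
print(bigram_prob("Sam", "</s>"))  # 1/2 = .5
```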
For the general case of MLE n-gram parameter estimation:
P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}     (3.12)
Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the
observed frequency of a particular sequence by the observed frequency of a prefix.
This ratio is called a relative frequency. We said above that this use of relative
frequencies as a way to estimate probabilities is an example of maximum likelihood
estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood
of the training set T given the model M (i.e., P(T |M)). For example, suppose the
word Chinese occurs 400 times in a corpus of a million words like the Brown corpus.
What is the probability that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is 400/1,000,000 or .0004. Now .0004 is not the best possible estimate of the probability of Chinese
occurring in all situations; it might turn out that in some other corpus or context
Chinese is a very unlikely word. But it is the probability that makes it most likely
that Chinese will occur 400 times in a million-word corpus. We present ways to
modify the MLE estimates slightly to get better probability estimates in Section 3.4.
Let’s move on to some examples from a slightly larger corpus than our 14-word
example above. We’ll use data from the now-defunct Berkeley Restaurant Project,
a dialogue system from the last century that answered questions about a database
of restaurants in Berkeley, California (Jurafsky et al., 1994). Here are some text-
normalized sample user queries (a sample of 9332 sentences is on the website):
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Figure 3.1 shows the bigram counts from a piece of a bigram grammar from the
Berkeley Restaurant Project. Note that the majority of the values are zero. In fact,
we have chosen the sample words to cohere with each other; a matrix selected from
a random set of seven words would be even more sparse.
Figure 3.2 shows the bigram probabilities after normalization (dividing each cell
in Fig. 3.1 by the appropriate unigram for its row, taken from the following set of
unigram counts):
i want to eat chinese food lunch spend
2533 927 2417 746 158 1093 341 278
Here are a few other useful probabilities:
P(i|<s>) = 0.25 P(english|want) = 0.0011
P(food|english) = 0.5 P(</s>|food) = 0.68
Now we can compute the probability of sentences like I want English food or
I want Chinese food by simply multiplying the appropriate bigram probabilities to-
gether, as follows:
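A worked version of this product for I want English food, using the probabilities listed above together with the value P(want|i) ≈ .33 taken from Fig. 3.2 (the figure is not reproduced in this excerpt), is:

P(<s> i want english food </s>)
    = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
    = .25 × .33 × .0011 × 0.5 × 0.68
    ≈ .000031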
the smallest test set that gives us enough statistical power to measure a statistically
significant difference between two potential models. In practice, we often just divide
our data into 80% training, 10% development, and 10% test. Given a large corpus
that we want to divide into training and test, test data can either be taken from some
continuous sequence of text inside the corpus, or we can remove smaller “stripes”
of text from randomly selected parts of our corpus and combine them into a test set.
3.2.1 Perplexity
In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity. The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. For a test set W = w_1 w_2 \ldots w_N:
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}     (3.14)

We can use the chain rule to expand the probability of W:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}     (3.15)

Thus, if we are computing the perplexity of W with a bigram language model, we get:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}     (3.16)
Note that because of the inverse in Eq. 3.15, the higher the conditional probabil-
ity of the word sequence, the lower the perplexity. Thus, minimizing perplexity is
equivalent to maximizing the test set probability according to the language model.
What we generally use for word sequence in Eq. 3.15 or Eq. 3.16 is the entire se-
quence of words in some test set. Since this sequence will cross many sentence
boundaries, we need to include the begin- and end-sentence markers <s> and </s>
in the probability computation. We also need to include the end-of-sentence marker
</s> (but not the beginning-of-sentence marker <s>) in the total count of word to-
kens N.
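As an illustrative sketch (the function names are ours), the following computes the perplexity of a bigram model over a tokenized test set following Eq. 3.16, working in the log domain to avoid underflow; the bigram probability function is assumed to be smoothed so that no test bigram has zero probability:

```python
import math

def bigram_perplexity(test_sentences, bigram_prob):
    """Perplexity of a bigram model over tokenized test sentences (Eq. 3.16).

    `bigram_prob(w_prev, w)` is assumed to return a smoothed P(w | w_prev),
    nonzero for every bigram in the test data. Following the text, </s> is
    counted in N but <s> is not.
    """
    log_prob, N = 0.0, 0
    for tokens in test_sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for w_prev, w in zip(padded, padded[1:]):
            log_prob += math.log2(bigram_prob(w_prev, w))
        N += len(padded) - 1          # every token except the initial <s>
    return 2 ** (-log_prob / N)       # 2 to the per-word cross-entropy

# The uniform ten-digit language discussed below has perplexity 10:
digits = [str(d) for d in range(10)]
print(bigram_perplexity([digits], lambda w_prev, w: 1 / 10))   # ≈ 10.0
```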
There is another way to think about perplexity: as the weighted average branch-
ing factor of a language. The branching factor of a language is the number of possi-
ble next words that can follow any word. Consider the task of recognizing the digits
in English (zero, one, two,..., nine), given that (both in some training set and in some test set) each of the 10 digits occurs with equal probability P = 1/10. The perplexity of
this mini-language is in fact 10. To see that, imagine a test string of digits of length
N, and assume that in the training set all the digits occurred with equal probability.
By Eq. 3.15, the perplexity will be
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \left( \left(\frac{1}{10}\right)^{N} \right)^{-\frac{1}{N}}
      = \left(\frac{1}{10}\right)^{-1}
      = 10     (3.17)
But suppose that the number zero is really frequent and occurs far more often than other numbers. Let’s say that 0 occurred 91 times in the training set, and each of the other digits occurred 1 time. Now we see the following test set: 0 0
0 0 0 3 0 0 0 0. We should expect the perplexity of this test set to be lower since
most of the time the next number will be zero, which is very predictable, i.e. has
a high probability. Thus, although the branching factor is still 10, the perplexity or
weighted branching factor is smaller. We leave this exact calculation as exercise 12.
We see in Section 3.7 that perplexity is also closely related to the information-
theoretic notion of entropy.
Finally, let’s look at an example of how perplexity can be used to compare dif-
ferent n-gram models. We trained unigram, bigram, and trigram grammars on 38
million words (including start-of-sentence tokens) from the Wall Street Journal, us-
ing a 19,979 word vocabulary. We then computed the perplexity of each of these
models on a test set of 1.5 million words with Eq. 3.16. The table below shows the
perplexity of a 1.5 million word WSJ test set according to each of these grammars.
Unigram Bigram Trigram
Perplexity 962 170 109
As we see above, the more information the n-gram gives us about the word
sequence, the lower the perplexity (since as Eq. 3.15 showed, perplexity is related
inversely to the likelihood of the test sequence according to the model).
Note that in computing perplexities, the n-gram model P must be constructed
without any knowledge of the test set or any prior knowledge of the vocabulary of
the test set. Any kind of knowledge of the test set can cause the perplexity to be
artificially low. The perplexity of two language models is only comparable if they
use identical vocabularies.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) im-
provement in the performance of a language processing task like speech recognition
or machine translation. Nonetheless, because perplexity often correlates with such
improvements, it is commonly used as a quick check on an algorithm. But a model’s
improvement in perplexity should always be confirmed by an end-to-end evaluation
of a real task before concluding the evaluation of the model.
1-gram:
–To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
–Hill he late speaks; or! a more to leg less first you enter

2-gram:
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.

3-gram:
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
–This shall forbid it should be branded, if renown made it empty.

4-gram:
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
–It cannot be but so.

Figure 3.3  Eight sentences randomly generated from four n-grams computed from Shakespeare’s works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
The longer the context on which we train the model, the more coherent the sen-
tences. In the unigram sentences, there is no coherent relation between words or any
sentence-final punctuation. The bigram sentences have some local word-to-word
coherence (especially if we consider that punctuation counts as a word). The tri-
gram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a
careful investigation of the 4-gram sentences shows that they look a little too much
like Shakespeare. The words It cannot be but so are directly from King John. This is
because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora
go (N = 884,647, V = 29,066), and our n-gram probability matrices are ridiculously sparse. There are V² = 844 million possible bigrams alone, and the number of possible 4-grams is V⁴ = 7 × 10¹⁷. Thus, once the generator has chosen the first 4-gram
(It cannot be but), there are only five possible continuations (that, I, he, thou, and
so); indeed, for many 4-grams, there is only one continuation.
To get an idea of the dependence of a grammar on its training set, let’s look at an
n-gram grammar trained on a completely different corpus: the Wall Street Journal
(WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so
we might expect some overlap between our n-grams for the two genres. Fig. 3.4 shows sentences generated by unigram, bigram, and trigram grammars trained on 40 million words from WSJ.
Compare these examples to the pseudo-Shakespeare in Fig. 3.3. While they both
model “English-like sentences”, there is clearly no overlap in generated sentences,
and little overlap even in small phrases. Statistical models are likely to be pretty use-
less as predictors if the training sets and the test sets are as different as Shakespeare
and WSJ.
How should we deal with this problem when we build n-gram models? One step
is to be sure to use a training corpus that has a similar genre to whatever task we are
trying to accomplish. To build a language model for translating legal documents,
we need a training corpus of legal documents. To build a language model for a
question-answering system, we need a training corpus of questions.
It is equally important to get training data in the appropriate dialect, especially
when processing social media posts or spoken transcripts. Thus tweets in AAVE
(African American Vernacular English) often use words like finna—an auxiliary
verb that marks immediate future tense —that don’t occur in other dialects, or
spellings like den for then, in tweets like this one (Blodgett and O’Connor, 2017):
(3.18) Bored af den my phone finna die!!!
while tweets from varieties like Nigerian English have markedly different vocabu-
lary and n-gram patterns from American English (Jurgens et al., 2017):
(3.19) @username R u a wizard or wat gan sef: in d mornin - u tweet, afternoon - u
tweet, nyt gan u dey tweet. beta get ur IT placement wiv twitter
Matching genres and dialects is still not sufficient. Our models may still be
subject to the problem of sparsity. For any n-gram that occurred a sufficient number
of times, we might have a good estimate of its probability. But because any corpus is
limited, some perfectly acceptable English word sequences are bound to be missing
from it. That is, we’ll have many cases of putative “zero probability n-grams” that
should really have some non-zero probability. Consider the words that follow the
bigram denied the in the WSJ Treebank3 corpus, together with their counts:
denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1
But suppose our test set has phrases like:

denied the offer
denied the loan

Because these phrases never occurred in the training data, our model will incorrectly estimate their probability as zero.
3.4 Smoothing
What do we do with words that are in our vocabulary (they are not unknown words)
but appear in a test set in an unseen context (for example they appear after a word
they never appeared after in training)? To keep a language model from assigning
zero probability to these unseen events, we’ll have to shave off a bit of probability
mass from some more frequent events and give it to the events we’ve never seen.
This modification is called smoothing or discounting. In this section and the following ones we’ll introduce a variety of ways to do smoothing: add-1 smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing.
3.4.1 Laplace Smoothing

The simplest way to do smoothing is to add one to all the n-gram counts before normalizing them into probabilities; this algorithm is called Laplace smoothing. Recall that the unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N:

P(w_i) = \frac{c_i}{N}
Laplace smoothing merely adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations. (What happens to our P values if we don’t increase the denominator?)
P_{Laplace}(w_i) = \frac{c_i + 1}{N + V}     (3.20)
Instead of changing both the numerator and denominator, it is convenient to
describe how a smoothing algorithm affects the numerator, by defining an adjusted
count c∗ . This adjusted count is easier to compare directly with the MLE counts and
can be turned into a probability like an MLE count by normalizing by N. To define
this count, since we are only changing the numerator in addition to adding 1 we’ll also need to multiply by a normalization factor \frac{N}{N+V}:

c_i^* = (c_i + 1) \frac{N}{N + V}     (3.21)
We can now turn c_i^* into a probability P_i^* by normalizing by N.
A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts. Thus, instead of referring to the discounted counts c^*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:
d_c = \frac{c^*}{c}
Now that we have the intuition for the unigram case, let’s smooth our Berkeley
Restaurant Project bigrams. Figure 3.5 shows the add-one smoothed counts for the
bigrams in Fig. 3.1.
Figure 3.6 shows the add-one smoothed probabilities for the bigrams in Fig. 3.2.
Recall that normal bigram probabilities are computed by normalizing each row of
counts by the unigram count:
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}     (3.22)
For add-one smoothed bigram counts, we need to augment the unigram count by
the number of total word types in the vocabulary V :
P^*_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{\sum_w \left( C(w_{n-1} w) + 1 \right)} = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}     (3.23)
Thus, each of the unigram counts given in the previous section will need to be
augmented by V = 1446. The result is the smoothed bigram probabilities in Fig. 3.6.
It is often convenient to reconstruct the count matrix so we can see how much a
smoothing algorithm has changed the original counts. These adjusted counts can be
computed by Eq. 3.24. Figure 3.7 shows the reconstructed counts.
c^*(w_{n-1} w_n) = \frac{\left[ C(w_{n-1} w_n) + 1 \right] \times C(w_{n-1})}{C(w_{n-1}) + V}     (3.24)
Note that add-one smoothing has made a very big change to the counts. C(want to)
changed from 609 to 238! We can see this in probability space as well: P(to|want)
decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at
the discount d (the ratio between new and old counts) shows us how strikingly the
counts for each prefix word have been reduced; the discount for the bigram want to
is .39, while the discount for Chinese food is .10, a factor of 10!
The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros. One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing:

P^*_{Add\text{-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}     (3.25)
Add-k smoothing requires that we have a method for choosing k; this can be
done, for example, by optimizing on a devset. Although add-k is useful for some
tasks (including text classification), it turns out that it still doesn’t work well for
language modeling, generating counts with poor variances and often inappropriate
discounts (Gale and Church, 1994).
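To make the mechanics concrete, here is a small sketch of add-k smoothed bigram estimation following Eqs. 3.23–3.25 (add-one is the special case k = 1); the function names are illustrative, and the adjusted-count helper simply substitutes k for 1 in Eq. 3.24:

```python
from collections import Counter

def laplace_bigram_model(sentences, k=1.0):
    """Add-k smoothed bigram probabilities and adjusted counts (k = 1 gives add-one).

    `sentences` are token lists already padded with <s> and </s>.
    """
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens)
    V = len(vocab)

    def prob(w_prev, w):
        # P*(w | w_prev) = (C(w_prev w) + k) / (C(w_prev) + kV), as in Eqs. 3.23/3.25.
        return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

    def adjusted_count(w_prev, w):
        # Reconstructed count, Eq. 3.24 with k in place of 1.
        return (bigrams[(w_prev, w)] + k) * unigrams[w_prev] / (unigrams[w_prev] + k * V)

    return prob, adjusted_count

sentences = [s.split() for s in ["<s> I am Sam </s>", "<s> Sam I am </s>"]]
prob, c_star = laplace_bigram_model(sentences)
print(prob("I", "am"))   # (2 + 1) / (2 + 5) ≈ 0.43 on this tiny corpus
```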
How are these λ values set? Both the simple interpolation and conditional interpolation λs are learned from a held-out corpus. A held-out corpus is an additional
training corpus that we use to set hyperparameters like these λ values, by choosing
the λ values that maximize the likelihood of the held-out corpus. That is, we fix
the n-gram probabilities and then search for the λ values that—when plugged into
Eq. 3.26—give us the highest probability of the held-out set. There are various ways
to find this optimal set of λs. One way is to use the EM algorithm, an iterative learning algorithm that converges on locally optimal λs (Jelinek and Mercer, 1980).
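As a sketch of how this might look in practice, the following code linearly interpolates trigram, bigram, and unigram estimates and chooses the λs by a simple grid search over held-out data, a coarser alternative to the EM procedure mentioned above; all names are illustrative, and the component probability functions are assumed to be given:

```python
import itertools
import math

def interpolate(p_tri, p_bi, p_uni, lambdas):
    """P_hat(w | u, v) = l1*P(w|u,v) + l2*P(w|v) + l3*P(w)  (simple linear interpolation)."""
    l1, l2, l3 = lambdas
    return lambda u, v, w: l1 * p_tri(u, v, w) + l2 * p_bi(v, w) + l3 * p_uni(w)

def choose_lambdas(heldout_trigrams, p_tri, p_bi, p_uni, step=0.1):
    """Grid-search lambdas summing to 1 that maximize held-out log likelihood."""
    best, best_ll = None, float("-inf")
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = round(1.0 - l1 - l2, 10)
        if l3 < 0:
            continue
        model = interpolate(p_tri, p_bi, p_uni, (l1, l2, l3))
        ll = 0.0
        for u, v, w in heldout_trigrams:
            p = model(u, v, w)
            if p <= 0:               # this setting assigns zero probability; reject it
                ll = float("-inf")
                break
            ll += math.log(p)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best
```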
In a backoff n-gram model, if the n-gram we need has zero counts, we approxi-
mate it by backing off to the (N-1)-gram. We continue backing off until we reach a
history that has some counts.
In order for a backoff model to give a correct probability distribution, we have to discount the higher-order n-grams to save some probability mass for the lower order n-grams. Just as with add-one smoothing, if the higher-order n-grams aren’t
discounted and we just used the undiscounted MLE probability, then as soon as we
replaced an n-gram which has zero probability with a lower-order n-gram, we would
be adding probability mass, and the total probability assigned to all possible strings
by the language model would be greater than 1! In addition to this explicit discount
factor, we’ll need a function α to distribute this probability mass to the lower order
n-grams.
This kind of backoff with discounting is also called Katz backoff. In Katz backoff we rely on a discounted probability P^* if we’ve seen this n-gram before (i.e., if we have non-zero counts). Otherwise, we recursively back off to the Katz probability for the shorter-history (N-1)-gram. The probability for a backoff n-gram P_{BO} is

P_{BO}(w_n \mid w_{n-N+1}^{n-1}) =
\begin{cases}
P^*(w_n \mid w_{n-N+1}^{n-1}) & \text{if } C(w_{n-N+1}^{n}) > 0 \\
\alpha(w_{n-N+1}^{n-1}) \, P_{BO}(w_n \mid w_{n-N+2}^{n-1}) & \text{otherwise.}
\end{cases}
The astute reader may have noticed that except for the held-out counts for 0
and 1, all the other bigram counts in the held-out set could be estimated pretty well
by just subtracting 0.75 from the count in the training set! Absolute discounting
formalizes this intuition by subtracting a fixed (absolute) discount d from each count.
The intuition is that since we have good estimates already for the very high counts, a
small discount d won’t affect them much. It will mainly modify the smaller counts,
for which we don’t necessarily trust the estimate anyway, and Fig. 3.8 suggests that
in practice this discount is actually a good one for bigrams with counts 2 through 9.
The equation for interpolated absolute discounting applied to bigrams:

P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - d}{\sum_v C(w_{i-1} v)} + \lambda(w_{i-1}) P(w_i)     (3.30)
The first term is the discounted bigram, and the second term is the unigram with
an interpolation weight λ . We could just set all the d values to .75, or we could keep
a separate discount value of 0.5 for the bigrams with counts of 1.
Kneser-Ney discounting (Kneser and Ney, 1995) augments absolute discount-
ing with a more sophisticated way to handle the lower-order unigram distribution.
Consider the job of predicting the next word in this sentence, assuming we are inter-
polating a bigram and a unigram model.
I can’t see without my reading ____________ .
The word glasses seems much more likely to follow here than, say, the word
Kong, so we’d like our unigram model to prefer glasses. But in fact it’s Kong that is
more common, since Hong Kong is a very frequent word. A standard unigram model
will assign Kong a higher probability than glasses. We would like to capture the
intuition that although Kong is frequent, it is mainly only frequent in the phrase Hong
Kong, that is, after the word Hong. The word glasses has a much wider distribution.
In other words, instead of P(w), which answers the question “How likely is
w?”, we’d like to create a unigram model that we might call PCONTINUATION , which
answers the question “How likely is w to appear as a novel continuation?”. How can
we estimate this probability of seeing the word w as a novel continuation, in a new
unseen context? The Kneser-Ney intuition is to base our estimate of PCONTINUATION
on the number of different contexts word w has appeared in, that is, the number of
bigram types it completes. Every bigram type was a novel continuation the first time
it was seen. We hypothesize that words that have appeared in more contexts in the
past are more likely to appear in some new context as well. The number of times a word w appears as a novel continuation can be expressed as the number of distinct bigram types it completes, |\{v : C(vw) > 0\}|. To turn this count into a probability, we normalize by the total number of word bigram types:

P_{CONTINUATION}(w) = \frac{|\{v : C(vw) > 0\}|}{|\{(u', w') : C(u'w') > 0\}|}
The final equation for Interpolated Kneser-Ney smoothing for bigrams is then:

P_{KN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{CONTINUATION}(w_i)     (3.35)

The λ is a normalizing constant that is used to distribute the probability mass we’ve discounted:

\lambda(w_{i-1}) = \frac{d}{\sum_v C(w_{i-1} v)} \left| \{ w : C(w_{i-1} w) > 0 \} \right|     (3.36)
The general recursive formulation is as follows:

P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left( c_{KN}(w_{i-n+1}^{i}) - d, 0 \right)}{\sum_v c_{KN}(w_{i-n+1}^{i-1} v)} + \lambda(w_{i-n+1}^{i-1}) \, P_{KN}(w_i \mid w_{i-n+2}^{i-1})     (3.37)
where the definition of the count cKN depends on whether we are counting the
highest-order n-gram being interpolated (for example trigram if we are interpolating
trigram, bigram, and unigram) or one of the lower-order n-grams (bigram or unigram
if we are interpolating trigram, bigram, and unigram):
c_{KN}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuationcount}(\cdot) & \text{for lower orders} \end{cases}     (3.38)
The continuation count is the number of unique single word contexts for ·.
At the termination of the recursion, unigrams are interpolated with the uniform
distribution, where the parameter ε is the empty string:
P_{KN}(w) = \frac{\max\left( c_{KN}(w) - d, 0 \right)}{\sum_{w'} c_{KN}(w')} + \lambda(\varepsilon) \frac{1}{V}     (3.39)
If we want to include an unknown word <UNK>, it’s just included as a regular vocabulary entry with count zero, and hence its probability will be a lambda-weighted uniform distribution λ(ε)/V.
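A minimal sketch of interpolated Kneser-Ney for bigrams, following Eqs. 3.35–3.36 with a single fixed discount d, might look like the following; the class and attribute names are illustrative, and unknown context words are not handled:

```python
from collections import Counter, defaultdict

class KneserNeyBigram:
    """Interpolated Kneser-Ney bigram model (Eqs. 3.35-3.36), single discount d."""

    def __init__(self, sentences, d=0.75):
        # `sentences` are token lists already padded with <s> ... </s>.
        self.d = d
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.left_contexts = defaultdict(set)   # words seen immediately before w
        self.followers = defaultdict(set)       # words seen immediately after v
        for tokens in sentences:
            self.unigrams.update(tokens)
            for v, w in zip(tokens, tokens[1:]):
                self.bigrams[(v, w)] += 1
                self.left_contexts[w].add(v)
                self.followers[v].add(w)
        self.num_bigram_types = len(self.bigrams)  # denominator of P_CONTINUATION

    def p_continuation(self, w):
        # How likely is w to appear as a novel continuation?
        return len(self.left_contexts[w]) / self.num_bigram_types

    def prob(self, w_prev, w):
        c_prev = self.unigrams[w_prev]           # assumes w_prev was seen in training
        discounted = max(self.bigrams[(w_prev, w)] - self.d, 0) / c_prev
        # lambda(w_prev) spreads the discounted mass over the continuation distribution.
        lam = (self.d / c_prev) * len(self.followers[w_prev])
        return discounted + lam * self.p_continuation(w)

# Tiny usage example on two of the sentences from the mini-corpus in Section 3.1:
model = KneserNeyBigram([s.split() for s in ["<s> I am Sam </s>", "<s> Sam I am </s>"]])
print(model.prob("I", "am"))   # (2 - .75)/2 + (.75/2 * 1) * (1/7) ≈ 0.679
```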
The best-performing version of Kneser-Ney smoothing is called modified Kneser-Ney smoothing, and is due to Chen and Goodman (1998). Rather than use a single fixed discount d, modified Kneser-Ney uses three different discounts d1, d2, and d3+ for n-grams with counts of 1, 2 and three or more, respectively. See Chen and Goodman (1998, p. 19) or Heafield et al. (2013) for the details.
40 times from 1,024,908,267,229 words of running text on the web; this includes
1,176,470,663 five-word sequences using over 13 million unique word types (Franz
and Brants, 2006). Some examples:
4-gram Count
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
Efficiency considerations are important when building language models that use
such large sets of n-grams. Rather than store each word as a string, it is generally
represented in memory as a 64-bit hash number, with the words themselves stored
on disk. Probabilities are generally quantized using only 4-8 bits (instead of 8-byte
floats), and n-grams are stored in reverse tries.
N-grams can also be shrunk by pruning, for example only storing n-grams with
counts greater than some threshold (such as the count threshold of 40 used for the
Google n-gram release) or using entropy to prune less-important n-grams (Stolcke,
1998). Another option is to build approximate language models using techniques
Bloom filters like Bloom filters (Talbot and Osborne 2007, Church et al. 2007). Finally, effi-
cient language model toolkits like KenLM (Heafield 2011, Heafield et al. 2013) use
sorted arrays, efficiently combine probabilities and backoffs in a single value, and
use merge sorts to efficiently build the probability tables in a minimal number of
passes through a large corpus.
Although with these toolkits it is possible to build web-scale language models
using full Kneser-Ney smoothing, Brants et al. (2007) show that with very large lan-
guage models a much simpler algorithm may be sufficient. The algorithm is called
stupid backoff. Stupid backoff gives up the idea of trying to make the language
model a true probability distribution. There is no discounting of the higher-order
probabilities. If a higher-order n-gram has a zero count, we simply backoff to a
lower order n-gram, weighed by a fixed (context-independent) weight. This algo-
rithm does not produce a probability distribution, so we’ll follow Brants et al. (2007)
in referring to it as S:
S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{\text{count}(w_{i-k+1}^{i})}{\text{count}(w_{i-k+1}^{i-1})} & \text{if count}(w_{i-k+1}^{i}) > 0 \\ \lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases}     (3.40)
The backoff terminates in the unigram, which has probability S(w) = count(w)/N. Brants et al. (2007) find that a value of 0.4 worked well for λ.
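A sketch of stupid backoff for contexts up to a fixed order, following Eq. 3.40 with λ = 0.4, might look like this (class and method names are illustrative):

```python
from collections import Counter

class StupidBackoff:
    """Stupid backoff score S (Eq. 3.40); not a true probability distribution."""

    def __init__(self, sentences, max_order=3, lam=0.4):
        self.lam = lam
        self.total_words = 0
        self.counts = Counter()   # n-gram counts for every order up to max_order
        for tokens in sentences:
            self.total_words += len(tokens)
            for n in range(1, max_order + 1):
                for i in range(len(tokens) - n + 1):
                    self.counts[tuple(tokens[i:i + n])] += 1

    def score(self, word, context=()):
        """S(word | context), where context is a tuple of preceding words."""
        context = tuple(context)
        ngram = context + (word,)
        if context and self.counts[ngram] > 0:
            return self.counts[ngram] / self.counts[context]
        if not context:
            # Unigram base case: S(w) = count(w) / N (zero for unseen words).
            return self.counts[ngram] / self.total_words
        # Zero count: back off to a shorter context, weighted by lambda.
        return self.lam * self.score(word, context[1:])

sb = StupidBackoff([["<s>", "i", "want", "chinese", "food", "</s>"]])
print(sb.score("food", ("i", "want")))   # backs off twice: 0.4 * 0.4 * (1/6) ≈ 0.027
```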
test data, and perplexity is a normalized version of the probability of the test set.
The perplexity measure actually arises from the information-theoretic concept of
cross-entropy, which explains otherwise mysterious properties of perplexity (why the inverse probability, for example?) and its relationship to entropy. Entropy is a measure of information. Given a random variable X ranging over whatever we are
predicting (words, letters, parts of speech, the set of which we’ll call χ) and with a
particular probability function, call it p(x), the entropy of the random variable X is:
H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)     (3.41)
The log can, in principle, be computed in any base. If we use log base 2, the
resulting value of entropy will be measured in bits.
One intuitive way to think about entropy is as a lower bound on the number of
bits it would take to encode a certain decision or piece of information in the optimal
coding scheme.
Consider an example from the standard information theory textbook Cover and
Thomas (1991). Imagine that we want to place a bet on a horse race but it is too
far to go all the way to Yonkers Racetrack, so we’d like to send a short message to
the bookie to tell him which of the eight horses to bet on. One way to encode this
message is just to use the binary representation of the horse’s number as the code;
thus, horse 1 would be 001, horse 2 010, horse 3 011, and so on, with horse 8 coded
as 000. If we spend the whole day betting and each horse is coded with 3 bits, on
average we would be sending 3 bits per race.
Can we do better? Suppose that the spread is the actual distribution of the bets
placed and that we represent it as the prior probability of each horse as follows:
Horse 1: 1/2     Horse 5: 1/64
Horse 2: 1/4     Horse 6: 1/64
Horse 3: 1/8     Horse 7: 1/64
Horse 4: 1/16    Horse 8: 1/64
The entropy of the random variable X that ranges over horses gives us a lower
bound on the number of bits and is
H(X) = -\sum_{i=1}^{i=8} p(i) \log p(i)
     = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{16}\log\tfrac{1}{16} - 4\left(\tfrac{1}{64}\log\tfrac{1}{64}\right)
     = 2 \text{ bits}     (3.42)
A code that averages 2 bits per race can be built with short encodings for more
probable horses, and longer encodings for less probable horses. For example, we
could encode the most likely horse with the code 0, and the remaining horses as 10,
then 110, 1110, 111100, 111101, 111110, and 111111.
What if the horses are equally likely? We saw above that if we used an equal-
length binary code for the horse numbers, each horse took 3 bits to code, so the
average was 3. Is the entropy the same? In this case each horse would have a
probability of 1/8. The entropy of the choice of horses is then

H(X) = -\sum_{i=1}^{i=8} \tfrac{1}{8} \log \tfrac{1}{8} = -\log \tfrac{1}{8} = 3 \text{ bits}     (3.43)
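Both results are easy to verify numerically; the short sketch below simply evaluates Eq. 3.41 for the skewed and uniform horse distributions:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), in bits (Eq. 3.41)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
uniform = [1/8] * 8
print(entropy(skewed))    # 2.0 bits
print(entropy(uniform))   # 3.0 bits
```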
Until now we have been computing the entropy of a single variable. But most of
what we will use entropy for involves sequences. For a grammar, for example, we
will be computing the entropy of some sequence of words W = {w0 , w1 , w2 , . . . , wn }.
One way to do this is to have a variable that ranges over sequences of words. For
example we can compute the entropy of a random variable that ranges over all finite
sequences of words of length n in some language L as follows:
H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)     (3.44)
We could define the entropy rate (we could also think of this as the per-word entropy) as the entropy of this sequence divided by the number of words:

\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)     (3.45)
But to measure the true entropy of a language, we need to consider sequences of infinite length. If we think of a language as a stochastic process L that produces a sequence of words, its entropy rate H(L) is defined as:

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n)
     = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)     (3.46)
The cross-entropy is useful when we don’t know the actual probability distribution p that generated some data; it allows us to use some m, which is a model of p (i.e., an approximation to p). The cross-entropy of m on p is defined by:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n)     (3.48)
That is, we draw sequences according to the probability distribution p, but sum
the log of their probabilities according to m.
Again, following the Shannon-McMillan-Breiman theorem, for a stationary er-
godic process:
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n)     (3.49)
This means that, as for entropy, we can estimate the cross-entropy of a model
m on some distribution p by taking a single sequence that is long enough instead of
summing over all possible sequences.
What makes the cross-entropy useful is that the cross-entropy H(p, m) is an upper bound on the entropy H(p). For any model m:

H(p) \leq H(p, m)     (3.50)
This means that we can use some simplified model m to help estimate the true en-
tropy of a sequence of symbols drawn according to probability p. The more accurate
m is, the closer the cross-entropy H(p, m) will be to the true entropy H(p). Thus,
the difference between H(p, m) and H(p) is a measure of how accurate a model is.
Between two models m1 and m2 , the more accurate model will be the one with the
lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so
a model cannot err by underestimating the true entropy.)
We are finally ready to see the relation between perplexity and cross-entropy as
we saw it in Eq. 3.49. Cross-entropy is defined in the limit, as the length of the
observed word sequence goes to infinity. We will need an approximation to cross-
entropy, relying on a (sufficiently long) sequence of fixed length. This approxima-
tion to the cross-entropy of a model M = P(wi |wi−N+1 ...wi−1 ) on a sequence of
words W is
H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)     (3.51)
The perplexity of a model P on a sequence of words W is now formally defined as 2 raised to the power of this cross-entropy:

\text{Perplexity}(W) = 2^{H(W)}
                     = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
                     = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
                     = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}     (3.52)
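The equivalence in Eq. 3.52 can be checked numerically; the sketch below computes perplexity both as 2 raised to the cross-entropy of Eq. 3.51 and as the N-th root of the inverse probability, given a list of per-word conditional probabilities (the function names and example values are illustrative):

```python
import math

def perplexity_via_cross_entropy(word_probs):
    """2 raised to the per-word cross-entropy H(W) of Eq. 3.51."""
    N = len(word_probs)
    H = -sum(math.log2(p) for p in word_probs) / N
    return 2 ** H

def perplexity_direct(word_probs):
    """The N-th root of the inverse probability of the sequence (Eq. 3.52)."""
    N = len(word_probs)
    return math.prod(1 / p for p in word_probs) ** (1 / N)

probs = [0.1, 0.25, 0.5, 0.05]   # P(w_i | w_1..w_{i-1}) for a 4-word toy sequence
print(perplexity_via_cross_entropy(probs))   # the two values agree
print(perplexity_direct(probs))
```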
3.8 Summary
This chapter introduced language modeling and the n-gram, one of the most widely
used tools in language processing.
• Language models offer a way to assign a probability to a sentence or other
sequence of words, and to predict a word from preceding words.
• n-grams are Markov models that estimate words from a fixed window of pre-
vious words. n-gram probabilities can be estimated by counting in a corpus
and normalizing (the maximum likelihood estimate).
• n-gram language models are evaluated extrinsically in some task, or intrinsi-
cally using perplexity.
• The perplexity of a test set according to a language model is the geometric
mean of the inverse test set probability computed by the model.
• Smoothing algorithms provide a more sophisticated way to estimate the prob-
ability of n-grams. Commonly used smoothing algorithms for n-grams rely on
lower-order n-gram counts through backoff or interpolation.
• Both backoff and interpolation require discounting to create a probability dis-
tribution.
• Kneser-Ney smoothing makes use of the probability of a word being a novel
continuation. The interpolated Kneser-Ney smoothing algorithm mixes a
discounted probability with a lower-order continuation probability.
based on an earlier Add-K suggestion by Johnson (1932). Problems with the add-
one algorithm are summarized in Gale and Church (1994).
A wide variety of different language modeling and smoothing techniques were
proposed in the 80s and 90s, including Good-Turing discounting—first applied to
the n-gram smoothing at IBM by Katz (Nádas 1984, Church and Gale 1991)—
Witten-Bell discounting (Witten and Bell, 1991), and varieties of class-based n-gram models that used information about word classes.
Starting in the late 1990s, Chen and Goodman produced a highly influential
series of papers with a comparison of different language models (Chen and Good-
man 1996, Chen and Goodman 1998, Chen and Goodman 1999, Goodman 2006).
They performed a number of carefully controlled experiments comparing differ-
ent discounting algorithms, cache models, class-based models, and other language
model parameters. They showed the advantages of Modified Interpolated Kneser-
Ney, which has since become the standard baseline for language modeling, espe-
cially because they showed that caches and class-based models provided only minor
additional improvement. These papers are recommended for any reader with further
interest in language modeling.
Two commonly used toolkits for building language models are SRILM (Stolcke,
2002) and KenLM (Heafield 2011, Heafield et al. 2013). Both are publicly available.
SRILM offers a wider range of options and types of discounting, while KenLM is
optimized for speed and memory size, making it possible to build web-scale lan-
guage models.
The highest accuracy language models are neural network language models.
These solve a major problem with n-gram language models: the number of parame-
ters increases exponentially as the n-gram order increases, and n-grams have no way
to generalize from training to test set. Neural language models instead project words
into a continuous space in which words with similar contexts have similar represen-
tations. We’ll introduce both feedforward language models (Bengio et al. 2006,
Schwenk 2007) in Chapter 7, and recurrent language models (Mikolov, 2012) in
Chapter 9.
Exercises
3.1 Write out the equation for trigram probability estimation (modifying Eq. 3.11).
Now write out all the non-zero trigram probabilities for the I am Sam corpus
on page 4.
3.2 Calculate the probability of the sentence i want chinese food. Give two
probabilities, one using Fig. 3.2 and the ‘useful probabilities’ just below it on
page 6, and another using the add-1 smoothed table in Fig. 3.6. Assume the
additional add-1 smoothed probabilities P(i|<s>) = 0.19 and P(</s>|food) =
0.40.
3.3 Which of the two probabilities you computed in the previous exercise is higher,
unsmoothed or smoothed? Explain why.
3.4 We are given the following corpus, modified from the one in the chapter:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
Algoet, P. H. and Cover, T. M. (1988). A sandwich proof of the Shannon-McMillan-Breiman theorem. The Annals of Probability, 16(2), 899–909.
Bahl, L. R., Jelinek, F., and Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2), 179–190.
Baker, J. K. (1975a). The DRAGON system – An overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23(1), 24–29.
Baker, J. K. (1975b). Stochastic modeling for automatic speech understanding. In Reddy, D. R. (Ed.), Speech Recognition. Academic Press.
Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., and Gauvain, J.-L. (2006). Neural probabilistic language models. In Innovations in Machine Learning, 137–186. Springer.
Blodgett, S. L. and O’Connor, B. (2017). Racial disparity in natural language processing: A case study of social media african-american english. In Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) Workshop, KDD.
Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large language models in machine translation. In EMNLP/CoNLL 2007.
Buck, C., Heafield, K., and Van Ooyen, B. (2014). N-gram counts and language models from the common crawl. In Proceedings of LREC.
Chen, S. F. and Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In ACL-96, 310–318.
Chen, S. F. and Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Tech. rep. TR-10-98, Computer Science Group, Harvard University.
Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13, 359–394.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113–124.
Chomsky, N. (1957). Syntactic Structures. Mouton, The Hague.
Church, K. W. and Gale, W. A. (1991). A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5, 19–54.
Church, K. W., Hart, T., and Gao, J. (2007). Compressing trigram language models with Golomb coding. In EMNLP/CoNLL 2007, 199–207.
Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley.
Franz, A. and Brants, T. (2006). All our n-gram are belong to you. http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.
Gale, W. A. and Church, K. W. (1994). What is wrong with adding one? In Oostdijk, N. and de Haan, P. (Eds.), Corpus-Based Research into Language, 189–198. Rodopi.
Goodman, J. (2006). A bit of progress in language modeling: Extended version. Tech. rep. MSR-TR-2001-72, Machine Learning and Applied Statistics Group, Microsoft Research, Redmond, WA.
Heafield, K. (2011). KenLM: Faster and smaller language model queries. In Workshop on Statistical Machine Translation, 187–197.
Heafield, K., Pouzyrevsky, I., Clark, J. H., and Koehn, P. (2013). Scalable modified Kneser-Ney language model estimation. In ACL 2013, 690–696.
Jeffreys, H. (1948). Theory of Probability (2nd Ed.). Clarendon Press. Section 3.23.
Jelinek, F. (1976). Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(4), 532–557.
Jelinek, F. (1990). Self-organized language modeling for speech recognition. In Waibel, A. and Lee, K.-F. (Eds.), Readings in Speech Recognition, 450–506. Morgan Kaufmann. Originally distributed as IBM technical report in 1985.
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In Gelsema, E. S. and Kanal, L. N. (Eds.), Proceedings, Workshop on Pattern Recognition in Practice, 381–397. North Holland.
Johnson, W. E. (1932). Probability: deductive and inductive problems (appendix to). Mind, 41(164), 421–423.
Jurafsky, D., Wooters, C., Tajchman, G., Segal, J., Stolcke, A., Fosler, E., and Morgan, N. (1994). The Berkeley restaurant project. In ICSLP-94, 2139–2142.
Jurgens, D., Tsvetkov, Y., and Jurafsky, D. (2017). Incorporating dialectal variability for socially equitable language identification. In ACL 2017, 51–57.
Kane, S. K., Morris, M. R., Paradiso, A., and Campbell, J. (2017). “at times avuncular and cantankerous, with the reflexes of a mongoose”: Understanding self-expression through augmentative and alternative communication devices. In CSCW 2017, 1166–1179.
Kneser, R. and Ney, H. (1995). Improved backing-off for M-gram language modeling. In ICASSP-95, Vol. 1, 181–184.
Markov, A. A. (1913). Essai d’une recherche statistique sur le texte du roman “Eugene Onegin” illustrant la liaison des epreuve en chain (‘Example of a statistical investigation of the text of “Eugene Onegin” illustrating the dependence between samples in chain’). Izvistia Imperatorskoi Akademii Nauk (Bulletin de l’Académie Impériale des Sciences de St.-Pétersbourg), 7, 153–162.
Mikolov, T. (2012). Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology.
Miller, G. A. and Chomsky, N. (1963). Finitary models of language users. In Luce, R. D., Bush, R. R., and Galanter, E. (Eds.), Handbook of Mathematical Psychology, Vol. II, 419–491. John Wiley.
Miller, G. A. and Selfridge, J. A. (1950). Verbal context and the recall of meaningful material. American Journal of Psychology, 63, 176–185.
Nádas, A. (1984). Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech, Signal Processing, 32(4), 859–861.
Schwenk, H. (2007). Continuous space language models. Computer Speech & Language, 21(3), 492–518.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. Continued in the following volume.