Abstract
This technical report provides a tutorial on the theoretical details of probabilistic topic modeling and gives practical steps for implementing topic models such as Latent Dirichlet Allocation (LDA) through the Markov Chain Monte Carlo approximate inference algorithm Gibbs Sampling.
1 Introduction
Following its publication in 2003, Blei et al.'s Latent Dirichlet Allocation (LDA) [3] has made topic modeling – a subfield of machine learning applied to everything from computational linguistics [4] to bioinformatics [8] and political science [2] – one of the most popular and most successful paradigms for both supervised and unsupervised learning. Despite topic modeling's undisputed popularity, however, it is for many – particularly newcomers – a difficult area to break into due to its relative complexity and the common practice of leaving out implementation details in papers describing new models. While key update equations and other details on inference are often included, the intermediate steps used to arrive at these conclusions are often left out due to space constraints, and what details are given are rarely enough to enable most researchers to test the given results for themselves by implementing their own version of the described model. The purpose of this technical report is to help bridge the gap between the model definitions provided in research publications and the practical implementations that are required for performing learning in this exciting area. Ultimately, it is hoped that this tutorial will help enable the reader to build his or her own novel topic models.
This technical report will describe what topic modeling is, how various models (LDA in particular) work, and most importantly, how to implement a working system to perform learning with topic models. Topic modeling as an area will be introduced through the section on LDA, as it is the "original" topic model and its modularity allows the basics of the model to be used in more complicated topic models.1 Following the introduction to topic modeling through LDA, the problem of posterior inference will be discussed. This section will concentrate first on the theory of the stochastic approximate inference technique Gibbs Sampling and then discuss implementation details for building a topic model Gibbs sampler.
1 While LDA was not the first topic model (pLSI, for example, has its ideological roots in the matrix factorization technique LSI), the topic modeling "revolution" really took off with the introduction of LDA, likely due to its fully probabilistic grounding.

2 Latent Dirichlet Allocation

LDA is a generative probabilistic model of a corpus: each topic k is a distribution φ(k) over the vocabulary, and each document d mixes those topics according to its own topic proportions θd. The generative process is:

1. For k = 1...K:
   (a) φ(k) ∼ Dirichlet(β)
2. For each document d ∈ D:
   (a) θd ∼ Dirichlet(α)
   (b) For each word wi ∈ d:
      i. zi ∼ Discrete(θd)
      ii. wi ∼ Discrete(φ(zi))

The joint distribution defined by the model is then:

\[
p(\phi, \theta, z, w \mid \alpha, \beta) = p(\phi \mid \beta)\, p(\theta \mid \alpha)\, p(z \mid \theta)\, p(w \mid \phi_z) \qquad (1)
\]
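As a purely illustrative aside (not from the original report), the following sketch draws a tiny synthetic corpus by following the generative process above; the corpus dimensions, hyperparameter values, and all variable names are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
K, W, D, N_d = 3, 20, 5, 50        # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1             # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(W, beta), size=K)     # phi^(k) ~ Dirichlet(beta), one row per topic
corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))      # theta_d ~ Dirichlet(alpha)
    z = rng.choice(K, size=N_d, p=theta)          # z_i ~ Discrete(theta_d)
    words = [rng.choice(W, p=phi[k]) for k in z]  # w_i ~ Discrete(phi^(z_i))
    corpus.append(words)

Fitting LDA reverses this process: given only the words, we infer θ, φ, and z, as described in the section on inference below.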
“environment” “travel” “fantasy football”
emission travel game
environmental hotel yard
air roundtrip defense
permit fares allowed
plant special fantasy
facility offer point
unit city passing
epa visit rank
water miles against
station deal team
Table 1: Three topics learned using LDA on the Enron Email Dataset.
The words that we would associate with a given topic receive high probability under that topic, and this is expressed through the topic distributions φ. An example of the top 10 words for 3 topics learned using LDA on the Enron email dataset is shown in Table 1 (the topic labels are added manually).
3 Inference
The key problem in topic modeling is posterior inference. This refers to reversing
the defined generative process and learning the posterior distributions of the
latent variables in the model given the observed data. In LDA, this amounts to
solving the following equation:
\[
p(\theta, \phi, z \mid w, \alpha, \beta) = \frac{p(\theta, \phi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)} \qquad (2)
\]
Unfortunately, this distribution is intractable to compute. The normalization
factor in particular, p(w|α, β), cannot be computed exactly. All is not lost,
however, as there are a number of approximate inference techniques available
that we can apply to the problem including variational inference (as used in the
original LDA paper) and Gibbs Sampling (as we will use here).
3.1 Gibbs Sampling

Gibbs Sampling is a Markov Chain Monte Carlo (MCMC) technique [9]: it constructs a Markov chain whose stationary distribution is the target posterior, so that after enough iterations, samples drawn from the chain approximate samples from the desired posterior. Gibbs Sampling is based on sampling from conditional distributions of the variables of the posterior.
For example, to sample x from the joint distribution p(x) = p(x1, ..., xm), where there is no closed form solution for p(x) but a representation of the conditional distributions is available, using Gibbs Sampling one would perform the following (from [1]):

1. Randomly initialize each xi
2. For t = 1, ..., T:
   (a) x1^(t+1) ∼ p(x1 | x2^(t), x3^(t), ..., xm^(t))
   (b) x2^(t+1) ∼ p(x2 | x1^(t+1), x3^(t), ..., xm^(t))
   (c) ...
   (d) xm^(t+1) ∼ p(xm | x1^(t+1), x2^(t+1), ..., x(m−1)^(t+1))
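As a concrete illustration (this example is not from the report), the sketch below runs exactly this procedure for a two-dimensional Gaussian with correlation rho, a case where both conditional distributions are known in closed form; all names and values are chosen only for illustration.

import numpy as np

# Gibbs sampling for a 2-D standard Gaussian with correlation rho.
# Each conditional is itself Gaussian, so both can be sampled exactly:
#   x1 | x2 ~ N(rho * x2, 1 - rho^2)
#   x2 | x1 ~ N(rho * x1, 1 - rho^2)
def gibbs_bivariate_normal(rho=0.8, iterations=5000, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                    # arbitrary initialization
    samples = np.empty((iterations, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for t in range(iterations):
        x1 = rng.normal(rho * x2, sd)    # sample x1 from p(x1 | x2)
        x2 = rng.normal(rho * x1, sd)    # sample x2 from p(x2 | x1)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal()
print(samples[1000:].mean(axis=0))       # both means approach 0 after burn-in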
This procedure is repeated a number of times until the samples begin to converge to what would be sampled from the true distribution. While convergence is theoretically guaranteed with Gibbs Sampling, there is no way of knowing how many iterations are required to reach the stationary distribution. Therefore, diagnosing convergence is a real problem with the Gibbs Sampling approximate inference method. However, in practice it is quite powerful and has fairly good performance. Typically, an acceptable estimation of convergence can be obtained by calculating the log-likelihood or even, in some situations, by inspection of the posteriors.
For LDA, we are interested in the latent document-topic proportions θd, the topic-word distributions φ(z), and the topic index assignments for each word zi. While conditional distributions – and therefore an LDA Gibbs Sampling algorithm – can be derived for each of these latent variables, we note that both θd and φ(z) can be calculated using just the topic index assignments zi (i.e. z is a sufficient statistic for both these distributions). Therefore, a simpler algorithm can be used if we integrate out the multinomial parameters and simply sample zi. This is called a collapsed Gibbs sampler.
The collapsed Gibbs sampler for LDA needs to compute the probability of a topic z being assigned to a word wi, given all other topic assignments to all other words. Somewhat more formally, we are interested in computing the following posterior up to a constant:

\[
p(z_i \mid z^{(-i)}, \alpha, \beta, w) \qquad (3)
\]

where z^(−i) denotes all topic assignments except the one for zi. Because

\[
p(z_i \mid z^{(-i)}, w) = \frac{p(z_i, z^{(-i)}, w)}{p(z^{(-i)}, w)} \propto p(z_i, z^{(-i)}, w) = p(z, w), \qquad (4)
\]

we begin by deriving the joint distribution p(w, z|α, β). We then have:
\[
p(w, z \mid \alpha, \beta) = \int\!\!\int p(z, w, \theta, \phi \mid \alpha, \beta)\, d\theta\, d\phi \qquad (5)
\]
Following the LDA model defined in equation (1), we can expand the above
equation to get:
\[
p(w, z \mid \alpha, \beta) = \int\!\!\int p(\phi \mid \beta)\, p(\theta \mid \alpha)\, p(z \mid \theta)\, p(w \mid \phi_z)\, d\theta\, d\phi \qquad (6)
\]
\[
= \int p(z \mid \theta)\, p(\theta \mid \alpha)\, d\theta \int p(w \mid \phi_z)\, p(\phi \mid \beta)\, d\phi \qquad (7)
\]
Both terms are multinomials with Dirichlet priors. Because the Dirichlet distribution is conjugate to the multinomial distribution, our work is vastly simplified; multiplying the two results in a Dirichlet distribution with an adjusted parameter. Beginning with the first term, we have:
\[
\begin{aligned}
\int p(z \mid \theta)\, p(\theta \mid \alpha)\, d\theta_d
  &= \int \frac{1}{B(\alpha)} \prod_i \theta_{d,z_i} \prod_k \theta_{d,k}^{\alpha_k}\, d\theta_d \\
  &= \int \frac{1}{B(\alpha)} \prod_k \theta_{d,k}^{\,n_{d,k}+\alpha_k}\, d\theta_d \\
  &= \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \qquad (8)
\end{aligned}
\]
where nd,k is the number of times words in document d are assigned to topic k, a · indicates summing over that index, and B(α) is the multinomial Beta function, B(α) = ∏k Γ(αk) / Γ(∑k αk). Similarly, for the second term (calculating the likelihood of the words given certain topic assignments):
\[
\begin{aligned}
\int p(w \mid \phi_z)\, p(\phi \mid \beta)\, d\phi
  &= \int \prod_d \prod_i \phi_{z_{d,i},\, w_{d,i}} \prod_k \frac{1}{B(\beta)} \prod_w \phi_{k,w}^{\beta_w}\, d\phi_k \\
  &= \prod_k \int \frac{1}{B(\beta)} \prod_w \phi_{k,w}^{\,\beta_w + n_{k,w}}\, d\phi_k \\
  &= \prod_k \frac{B(n_{k,\cdot} + \beta)}{B(\beta)} \qquad (9)
\end{aligned}
\]
Combining equations (8) and (9), the expanded joint distribution is then:
\[
p(w, z \mid \alpha, \beta) = \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(n_{k,\cdot} + \beta)}{B(\beta)} \qquad (10)
\]
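Equation (10) is also what is typically evaluated, in log space, to monitor convergence of the sampler (the log-likelihood check mentioned earlier). Below is a minimal sketch assuming count matrices n_dk (documents × topics) and n_kw (topics × vocabulary words) and vector-valued hyperparameters; the function names are illustrative, not from the report.

import numpy as np
from scipy.special import gammaln

def log_multi_beta(v):
    # log of the multinomial Beta function B(v) = prod_i Gamma(v_i) / Gamma(sum_i v_i)
    return np.sum(gammaln(v)) - gammaln(np.sum(v))

def log_joint(n_dk, n_kw, alpha, beta):
    # log p(w, z | alpha, beta) following equation (10):
    # a sum over documents of log B(n_{d,.} + alpha) - log B(alpha),
    # plus a sum over topics of log B(n_{k,.} + beta) - log B(beta).
    # alpha is a length-K vector, beta a length-W vector.
    ll = 0.0
    for d in range(n_dk.shape[0]):
        ll += log_multi_beta(n_dk[d] + alpha) - log_multi_beta(alpha)
    for k in range(n_kw.shape[0]):
        ll += log_multi_beta(n_kw[k] + beta) - log_multi_beta(beta)
    return ll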
The Gibbs sampling equation for LDA can then be derived using the chain rule (where we leave the hyperparameters α and β out for clarity).4 Note that the superscript (−i) signifies leaving the ith token out of the calculation:

\[
p(z_i = k \mid z^{(-i)}, w) \propto \left(n_{d,k}^{(-i)} + \alpha_k\right) \frac{n_{k,w_i}^{(-i)} + \beta_{w_i}}{n_{k,\cdot}^{(-i)} + \sum_{w} \beta_w} \qquad (11)
\]
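As a sketch of the intermediate step (the full manipulation is the one detailed in [5] and [11]), equation (11) follows from equation (10) by taking the ratio of the joint with and without token i and applying Γ(x + 1) = xΓ(x) to the Beta functions. The factor p(w_i | z^(−i), w^(−i)) relating p(z^(−i), w) to p(z^(−i), w^(−i)) does not depend on k and is dropped, as is the document-side denominator in the last step:

\[
p(z_i = k \mid z^{(-i)}, w)
  \propto \frac{p(z, w \mid \alpha, \beta)}{p(z^{(-i)}, w^{(-i)} \mid \alpha, \beta)}
  = \frac{B(n_{d,\cdot} + \alpha)}{B(n_{d,\cdot}^{(-i)} + \alpha)} \cdot \frac{B(n_{k,\cdot} + \beta)}{B(n_{k,\cdot}^{(-i)} + \beta)}
\]
\[
  = \frac{n_{d,k}^{(-i)} + \alpha_k}{\sum_{k'} \bigl(n_{d,k'}^{(-i)} + \alpha_{k'}\bigr)} \cdot \frac{n_{k,w_i}^{(-i)} + \beta_{w_i}}{n_{k,\cdot}^{(-i)} + \sum_{w} \beta_w}
  \propto \bigl(n_{d,k}^{(-i)} + \alpha_k\bigr) \, \frac{n_{k,w_i}^{(-i)} + \beta_{w_i}}{n_{k,\cdot}^{(-i)} + \sum_{w} \beta_w}
\]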
3.1.2 Implementation
Implementing an LDA collapsed Gibbs sampler is surprisingly straightforward.
It involves setting up the requisite count variables, randomly initializing them,
and then running a loop over the desired number of iterations where on each
loop a topic is sampled for each word instance in the corpus. Following the
Gibbs iterations, the counts can be used to compute the latent distributions θd
and φk .
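The estimates themselves take the usual posterior-mean form (see, e.g., [11]); they are simply the smoothed, normalized counts:

\[
\hat{\theta}_{d,k} = \frac{n_{d,k} + \alpha_k}{\sum_{k'} \bigl(n_{d,k'} + \alpha_{k'}\bigr)},
\qquad
\hat{\phi}_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w'} \bigl(n_{k,w'} + \beta_{w'}\bigr)}
\]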
The only required count variables include nd,k , the number of words assigned
to topic k in document d; and nk,w , the number of times word w is assigned
to topic k. However, for simplicity and efficiency, we also keep a running count
of nk , the total number of times any word is assigned to topic k. Finally, in
addition to the obvious variables such as a representation of the corpus (w), we
need an array z which will contain the current topic assignment for each of the
N words in the corpus.
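As a small illustrative sketch (not from the report), these variables and the random initialization can be set up as follows, assuming the corpus is represented as a list of (document index, word index) pairs, one per token; the array names mirror the counts above but are otherwise arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def initialize(tokens, D, W, K):
    # tokens: one (document id, word id) pair per word instance in the corpus
    N = len(tokens)
    z = rng.integers(0, K, size=N)             # random topic assignment per token
    n_dk = np.zeros((D, K), dtype=np.int64)    # words in doc d assigned to topic k
    n_kw = np.zeros((K, W), dtype=np.int64)    # times word w is assigned to topic k
    n_k = np.zeros(K, dtype=np.int64)          # total words assigned to topic k
    for i, (d, w) in enumerate(tokens):
        k = z[i]
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1
    return z, n_dk, n_kw, n_k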
Because the Gibbs sampling procedure involves sampling from distributions conditioned on all other variables (in LDA this of course includes all other current topic assignments, but not the current one), before building a distribution from equation (11), we must remove the current assignment from the equation. We can do this by decrementing the counts associated with the current assignment because the topic assignments in LDA are exchangeable (i.e. the joint probability distribution is invariant to permutation). We then calculate the (unnormalized) probability of each topic assignment using equation (11). This discrete distribution is then sampled from, the chosen topic is set in the z array, and the appropriate counts are incremented. See Algorithm 1 for the full LDA Gibbs sampling procedure.
4 For the full, nothing-left-out derivation, please see [5] and [11].
Input: words w ∈ documents d
Output: topic assignments z and counts nd,k, nk,w, and nk
begin
    randomly initialize z and increment counters
    foreach iteration do
        for i = 0 → N − 1 do
            word ← w[i]
            topic ← z[i]
            nd,topic −= 1; ntopic,word −= 1; ntopic −= 1
            for k = 0 → K − 1 do
                p(z = k|·) = (nd,k + αk) (nk,word + βword) / (nk + β × W)
            end
            topic ← sample from p(z|·)
            z[i] ← topic
            nd,topic += 1; ntopic,word += 1; ntopic += 1
        end
    end
    return z, nd,k, nk,w, nk
end
Algorithm 1: LDA Gibbs Sampling
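For concreteness, here is a compact Python sketch of Algorithm 1. It assumes symmetric scalar hyperparameters α and β and a corpus given as (document index, word index) token pairs; these representation choices, and all names, are illustrative rather than taken from the report.

import numpy as np

def lda_gibbs(tokens, D, W, K, alpha=0.1, beta=0.01, iterations=1000, seed=0):
    # Collapsed Gibbs sampling for LDA following Algorithm 1.
    # tokens: list of (document id, word id) pairs, one per word instance.
    rng = np.random.default_rng(seed)
    N = len(tokens)
    z = rng.integers(0, K, size=N)            # random initialization
    n_dk = np.zeros((D, K))
    n_kw = np.zeros((K, W))
    n_k = np.zeros(K)
    for i, (d, w) in enumerate(tokens):       # increment counters
        n_dk[d, z[i]] += 1
        n_kw[z[i], w] += 1
        n_k[z[i]] += 1

    for _ in range(iterations):
        for i, (d, w) in enumerate(tokens):
            k = z[i]
            # remove the current assignment from the counts
            n_dk[d, k] -= 1
            n_kw[k, w] -= 1
            n_k[k] -= 1
            # unnormalized conditional p(z_i = k | .) from equation (11)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + beta * W)
            k = rng.choice(K, p=p / p.sum())  # sample a new topic
            z[i] = k
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    return z, n_dk, n_kw, n_k

The θd and φk estimates given earlier can then be computed directly from the returned counts.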
4 Extensions To LDA
While LDA – the “simplest” topic model – is useful in and of itself, a great
deal of novel research surrounds extending the basic LDA model to fit a specific
task or to improve the model by describing a more complex generative process
that results in a better model of the real world. There are countless papers
delineating such extensions and it is not my intention to go through them all
here. Instead, this section will outline some of the ways that LDA can and has
been extended with the goal of explaining how inference changes as a result of
additions to a model and how to implement those changes in a Gibbs sampler.
One common extension is to imagine all stop-words being generated by a "background" distribution [6, 7, 10]. The background distribution is the same as a topic – it is a discrete probability distribution over the corpus vocabulary – but every document draws from the background as well as the topics specific to that document. [7] and [10] use this approach to separate high-content words from less-important words to perform multi-document summarization. [6] uses a similar model for information retrieval where a word can either be generated from a background distribution, a document-specific distribution, or one of T topic distributions shared amongst all the documents. The generative process is similar to that of LDA, except that there is a multinomial variable x associated with each word that ranges over the three different "sources" of words. When x = 0, the background distribution generates the word, when x = 1, the document-specific distribution generates the word, and when x = 2, one of the topic distributions generates the word.
Here, we will describe a simpler model where only a background distribution is added to LDA. A binomial variable x associated with each word decides whether the word will be generated by the topic distributions or by the background. The generative process is then:
1. ζ ∼ Dirichlet(δ)
2. For k = 1...K:
   (a) φ(k) ∼ Dirichlet(β)
3. For each document d ∈ D:
   (a) θd ∼ Dirichlet(α)
   (b) λd ∼ Dirichlet(γ)
   (c) For each word wi ∈ d:
      i. xi ∼ Discrete(λd)
      ii. If xi = 0:
         A. wi ∼ Discrete(ζ)
      iii. Else:
         A. zi ∼ Discrete(θd)
         B. wi ∼ Discrete(φ(zi))
In addition to sampling a topic for each word, the Gibbs sampler for this model must also compute the probability that the model is in the topic-model state (the value of x) for that word. This too, however, is straightforward to implement. A distribution of T + 1 components can be created for each word (on each iteration) where the first component corresponds to the background distribution generating the word and the other T are the probabilities for each topic having generated the word.
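A sketch of that per-word sampling step is given below. It assumes a collapsed sampler with symmetric scalar hyperparameters; the count arrays (n_back for the background word counts, m_d for the per-document counts of x = 0 and x = 1 choices) and the exact form of the weights are my reconstruction of the model described above, not code from the cited papers.

import numpy as np

def sample_word(rng, d, w, n_back, n_dk, n_kw, n_k, m_d,
                alpha, beta, gamma, delta, W, T):
    # Sample (x, z) for one word after its current assignment has been
    # removed from all counts, mirroring the plain LDA sampler.
    p = np.empty(T + 1)
    # component 0: the background distribution generated the word
    p[0] = (m_d[d, 0] + gamma) * (n_back[w] + delta) / (n_back.sum() + delta * W)
    # components 1..T: topic k generated the word
    doc_topic = (n_dk[d] + alpha) / (n_dk[d].sum() + alpha * T)
    topic_word = (n_kw[:, w] + beta) / (n_k + beta * W)
    p[1:] = (m_d[d, 1] + gamma) * doc_topic * topic_word
    choice = rng.choice(T + 1, p=p / p.sum())
    if choice == 0:
        return 0, None        # x = 0: the background generated the word
    return 1, choice - 1      # x = 1: topic (choice - 1) generated the word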
5 Conclusion
LDA and other topic models are an exciting development in machine learning, and the surface has only been scratched on their potential in a number of diverse fields. This report has sought to aid researchers new to the field both in understanding the mathematical underpinnings of topic modeling and in implementing algorithms to make use of this new pattern recognition paradigm.
References
[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[2] David M. Blei and Sean Gerrish. Predicting legislative roll calls from text.
In International Conference on Machine Learning, 2011.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[4] Jordan Boyd-Graber, David Blei, and Xiaojin Zhu. A topic model for
word sense disambiguation. In Empirical Methods in Natural Language
Processing, 2007.
[5] Bob Carpenter. Integrating out multinomial parameters in latent Dirichlet allocation and naive Bayes for collapsed Gibbs sampling. Technical report, LingPipe, Inc., 2010.
[9] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte
Carlo In Practice. Chapman and Hall/CRC, 1999.
[10] Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In NAACL '09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370, Morristown, NJ, USA, 2009. Association for Computational Linguistics.
[11] Gregor Heinrich. Parameter estimation for text analysis. Technical report,
University of Leipzig, Germany, 2008.