Topic Models in Natural Language Processing
Hongning Wang
CS@UVa
Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary
What is a Topic?
Representation: a probability distribution over words, e.g.:
retrieval 0.2
information 0.15
model 0.08
query 0.07
language 0.06
feedback 0.03
……
Simplest Case: 1 topic + 1 “background”
Suppose the parameters λ, p(w|θ_B), and p(w|θ) are all known: what is a reasonable guess of the hidden label z_i for each word? It depends on λ, p(w|θ_B), and p(w|θ). For example (z_i = 1 means the word comes from the background, z_i = 0 means it comes from the topic):

the 1
paper 1
presents 1
a 1
text 0
mining 0
algorithm 0
...

E-step:
$$p(z_i = 1 \mid w_i) = \frac{p(z_i = 1)\, p(w_i \mid z_i = 1)}{p(z_i = 1)\, p(w_i \mid z_i = 1) + p(z_i = 0)\, p(w_i \mid z_i = 0)} = \frac{\lambda\, p(w_i \mid \theta_B)}{\lambda\, p(w_i \mid \theta_B) + (1 - \lambda)\, p_{\text{current}}(w_i \mid \theta)}$$

M-step:
$$p_{\text{new}}(w_i \mid \theta) = \frac{c(w_i, d)\,\big(1 - p(z_i = 1 \mid w_i)\big)}{\sum_{w' \in V} c(w', d)\,\big(1 - p(z_{w'} = 1 \mid w')\big)}$$

θ_B and θ are competing to explain the words in document d!
Initially, set p(w|θ) to some random values, then iterate …
An example of EM computation
Expectation-step (augment the data by guessing the hidden variables):
$$p^{(n)}(z_i = 1 \mid w_i) = \frac{\lambda\, p(w_i \mid \theta_B)}{\lambda\, p(w_i \mid \theta_B) + (1 - \lambda)\, p^{(n)}(w_i \mid \theta)}$$

Maximization-step (with the "augmented data", estimate parameters using maximum likelihood):
$$p^{(n+1)}(w_i \mid \theta) = \frac{c(w_i, d)\,\big(1 - p^{(n)}(z_i = 1 \mid w_i)\big)}{\sum_{w_j \in \text{vocabulary}} c(w_j, d)\,\big(1 - p^{(n)}(z_j = 1 \mid w_j)\big)}$$
Assume λ = 0.5

Word    #   P(w|θ_B)   Iteration 1         Iteration 2         Iteration 3
                       P(w|θ)   P(z=1)     P(w|θ)   P(z=1)     P(w|θ)   P(z=1)
The     4   0.5        0.25     0.67       0.20     0.71       0.18     0.74
Paper   2   0.3        0.25     0.55       0.14     0.68       0.10     0.75
Text    4   0.1        0.25     0.29       0.44     0.19       0.50     0.17
Mining  2   0.1        0.25     0.29       0.22     0.31       0.22     0.31
Log-Likelihood         -16.96              -16.13              -16.02
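The numbers in this table can be reproduced with a short Python sketch of the EM procedure (my own code, not part of the slides; the counts, the background model, and λ = 0.5 are taken from the example, and p(w|θ) is initialized uniformly at 0.25):

```python
import math

# Minimal sketch of the single-topic + background mixture EM from the example.
counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}          # c(w, d)
p_bg   = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}  # p(w|theta_B), known
lam = 0.5                                                         # background weight
p_topic = {w: 1.0 / len(counts) for w in counts}                  # p(w|theta), uniform init

for it in range(1, 4):
    # E-step: posterior probability that each word came from the background
    p_z1 = {w: lam * p_bg[w] / (lam * p_bg[w] + (1 - lam) * p_topic[w])
            for w in counts}
    # Document log-likelihood under the current parameters
    ll = sum(c * math.log(lam * p_bg[w] + (1 - lam) * p_topic[w])
             for w, c in counts.items())
    print(f"iteration {it}: log-likelihood = {ll:.2f}")
    for w in counts:
        print(f"  {w:>6}: p(w|theta) = {p_topic[w]:.2f}, p(z=1|w) = {p_z1[w]:.2f}")
    # M-step: re-estimate p(w|theta) from the fractional counts assigned to the topic
    norm = sum(counts[w] * (1 - p_z1[w]) for w in counts)
    p_topic = {w: counts[w] * (1 - p_z1[w]) / norm for w in counts}
```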
Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary
Discover multiple topics in a collection
Topic word distributions (one per topic, plus a background):
- Topic 1: warning 0.3, system 0.2, ...
- Topic 2: aid 0.1, donation 0.05, support 0.02, ...
- ...
- Topic k: statistics 0.2, loss 0.1, dead 0.05, ...
- Background θ_B: is 0.05, the 0.04, a 0.03, ...

Topic coverage in document d: π_{d,1}, π_{d,2}, ..., π_{d,k}.

"Generating" word w in doc d in the collection: with probability λ_B draw w from the background θ_B; with probability 1 - λ_B first pick topic j with probability π_{d,j}, then draw w from θ_j.

Parameters:
- Global: {θ_k}, k = 1, ..., K
- Local: {π_{d,k}} for each document d
- Manual: λ_B
Probabilistic Latent Semantic Analysis
[Hofmann 99a, 99b]
EM for estimating multiple topics
Known background p(w|θ_B): the 0.2, a 0.1, we 0.01, to 0.02, ...

Unknown topic model p(w|θ_1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, ...

Unknown topic model p(w|θ_2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, ...

E-step: predict the topic label of each observed word using Bayes' rule.
M-step: ML estimator based on the "fractional counts".
Parameter estimation
E-step (posterior, an application of Bayes' rule). Word w in doc d is generated either from topic j or from the background:

$$p(z_{d,w} = j) = \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}$$

$$p(z_{d,w} = B) = \frac{\lambda_B\, p(w \mid \theta_B)}{\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}$$

M-step: re-estimate the mixing weights and the word-topic distributions:

$$\pi_{d,j}^{(n+1)} = \frac{\sum_{w \in V} c(w,d)\,\big(1 - p(z_{d,w} = B)\big)\, p(z_{d,w} = j)}{\sum_{j'} \sum_{w \in V} c(w,d)\,\big(1 - p(z_{d,w} = B)\big)\, p(z_{d,w} = j')}$$

$$p^{(n+1)}(w \mid \theta_j) = \frac{\sum_{d \in C} c(w,d)\,\big(1 - p(z_{d,w} = B)\big)\, p(z_{d,w} = j)}{\sum_{w' \in V} \sum_{d \in C} c(w',d)\,\big(1 - p(z_{d,w'} = B)\big)\, p(z_{d,w'} = j)}$$
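As a sanity check of these updates, here is a minimal NumPy sketch (my own code, not from the slides) that implements the E-step and M-step above for a document-term count matrix C, a fixed background distribution, and k topics; it favors clarity over memory efficiency:

```python
import numpy as np

def plsa_em(C, p_bg, k, lambda_B=0.5, n_iter=100, seed=0):
    """C: D x V count matrix; p_bg: length-V background p(w|theta_B)."""
    rng = np.random.default_rng(seed)
    D, V = C.shape
    theta = rng.dirichlet(np.ones(V), size=k)      # p(w | theta_j),  k x V
    pi = rng.dirichlet(np.ones(k), size=D)         # pi_{d,j},        D x k

    for _ in range(n_iter):
        # E-step: p(z_{d,w} = j) and p(z_{d,w} = B) for every (d, w) pair
        mix = pi @ theta                           # D x V, sum_j pi_{d,j} p(w|theta_j)
        p_z = (pi[:, :, None] * theta[None, :, :]) / mix[:, None, :]        # D x k x V
        p_zB = (lambda_B * p_bg) / (lambda_B * p_bg + (1 - lambda_B) * mix)  # D x V

        # M-step: fractional counts not explained by the background
        frac = C * (1 - p_zB)                      # D x V
        pi = np.einsum('dv,dkv->dk', frac, p_z)
        pi /= pi.sum(axis=1, keepdims=True)
        theta = np.einsum('dv,dkv->kv', frac, p_z)
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta
```

Here pi[d] is the estimated topic coverage π_d of document d and theta[j] is the word distribution of topic j; the background model and λ_B play the same roles as in the formulas above.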
pLSA with prior knowledge
• What if we have some domain knowledge in mind?
– We want to see topics such as "battery" and "memory" for opinions about a laptop
– We want words like "apple" and "orange" to co-occur in a topic
– One topic should be fixed to model background words (an infinitely strong prior!)
• We can easily incorporate such knowledge as priors of the pLSA model
Maximum a Posteriori (MAP) estimation
$$\theta^* = \arg\max_{\theta}\; p(\theta)\, p(\text{Data} \mid \theta)$$

A prior can be placed on π as well (more about this later).

The generation process is the same as before: topics θ_1, ..., θ_k (e.g., warning 0.3, system 0.2, ...; aid 0.1, donation 0.05, support 0.02, ...; statistics 0.2, loss 0.1, dead 0.05, ...) plus the background θ_B (is 0.05, the 0.04, a 0.03, ...); topic coverage π_{d,j} in document d; "generating" word w in doc d with background weight λ_B.

Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum A Posteriori (MAP).
MAP estimation
Some background knowledge
• Conjugate prior
– The posterior distribution is in the same family as the prior
– Examples: Gaussian → Gaussian, Beta → Binomial, Dirichlet → Multinomial
• Dirichlet distribution
– Continuous
– A sample from it is a set of parameters for a multinomial distribution
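For reference (the standard density, not shown on the slide), the Dirichlet distribution over a K-dimensional probability vector θ with parameters α_1, ..., α_K is:

$$\mathrm{Dir}(\theta \mid \alpha_1, \dots, \alpha_K) = \frac{\Gamma\big(\sum_{i=1}^{K} \alpha_i\big)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}, \qquad \theta_i \ge 0, \;\; \sum_{i=1}^{K} \theta_i = 1.$$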
Prior as pseudo counts
Observed doc(s), with a known background p(w|θ_B): the 0.2, a 0.1, we 0.01, to 0.02, ...

Unknown topic model p(w|θ_1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, ...

Unknown topic model p(w|θ_2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, ...

MAP estimator: suppose we know the identity of each word in a pseudo doc of size μ (e.g., a pseudo doc containing "text" and "mining" for the "Text mining" topic). The prior then simply adds these pseudo counts to the observed (fractional) counts.
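Concretely, with a conjugate Dirichlet prior whose mean is the pseudo word distribution p(w|θ'_j) and whose total pseudo count is μ, the MAP M-step for the word-topic distribution becomes (a standard form consistent with the pseudo-count picture above; the exact equation is not on the slide):

$$p^{(n+1)}(w \mid \theta_j) = \frac{\sum_{d \in C} c(w,d)\,\big(1 - p(z_{d,w} = B)\big)\, p(z_{d,w} = j) \;+\; \mu\, p(w \mid \theta'_j)}{\sum_{w' \in V} \sum_{d \in C} c(w',d)\,\big(1 - p(z_{d,w'} = B)\big)\, p(z_{d,w'} = j) \;+\; \mu}$$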
Deficiency of pLSA
• Not a fully generative model
– Can’t compute probability of a new document
• Topic coverage p(π|d) is per-document estimated
– Heuristic workaround is possible
• Many parameters high complexity of
models
– Many local maxima
– Prone to overfitting
Latent Dirichlet Allocation [Blei et al. 02]
• Make pLSA a fully generative model by
imposing Dirichlet priors
– Dirichlet priors over p(π|d)
– Dirichlet priors over p(w|θ)
– A Bayesian version of pLSA
• Provides a mechanism to deal with new documents
– Flexible to model many other observations in a
document
LDA = Imposing Prior on PLSA
pLSA: the topic coverage π_{d,j} is specific to each "training document", so it cannot be used to generate a new document; the {π_{d,j}} are free parameters to be tuned. ("Generating" word w in doc d works as before: pick topic j with probability π_{d,j}, then sample w from θ_j.)

LDA: the topic coverage distribution {π_{d,j}} for any document is sampled from a Dirichlet distribution, which makes it possible to generate a new doc:

$$p(\pi_d) = \mathrm{Dirichlet}(\alpha)$$

In addition, the topic word distributions {θ_j} are also drawn from another Dirichlet prior:

$$p(\theta_i) = \mathrm{Dirichlet}(\beta)$$

The {π_{d,j}} (and {θ_j}) are thereby regularized. The magnitudes of α and β determine the variances of the priors, and thus how concentrated they are: larger values give a stronger prior.
pLSA vs. LDA
Both models share the same word-level mixture for a document:

$$p_d(w \mid \{\theta_j\}, \{\pi_{d,j}\}) = \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)$$

In LDA, the document and collection log-likelihoods integrate over the Dirichlet priors on π_d and {θ_j}:

$$\log p(d \mid \alpha, \{\theta_j\}) = \int \Big( \sum_{w \in V} c(w,d) \log \Big[ \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j) \Big] \Big)\, p(\pi_d \mid \alpha)\, d\pi_d$$

$$\log p(C \mid \alpha, \beta) = \sum_{d \in C} \int \log p(d \mid \alpha, \{\theta_j\}) \prod_{j=1}^{k} p(\theta_j \mid \beta)\, d\theta_1 \cdots d\theta_k$$

This integration over the priors is the regularization added by LDA; in pLSA the π_{d,j} and θ_j are instead fitted as free parameters.
LDA as a graphical model [Blei et al. 03a]
α and β are Dirichlet priors. For each document, θ^(d) ~ Dirichlet(α) is its distribution over topics (the same as π_d on the previous slides); for each word position i, a topic z_i ~ Discrete(θ^(d)) is drawn, and the word is drawn from that topic's word distribution, w_i ~ Discrete(φ^(z_i)), with φ^(j) ~ Dirichlet(β). The plates repeat over the N_d words in each document and the D documents in the collection.

Most approximate inference algorithms aim to infer p(z_i | w, α, β), from which the other interesting variables can be easily computed.
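To make the generative story concrete, here is a small Python sketch (my own illustration, with arbitrary sizes) of the process the plate diagram describes:

```python
import numpy as np

def generate_corpus(D=100, N=50, K=10, V=1000, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)          # phi^(j) ~ Dirichlet(beta)
    docs = []
    for _ in range(D):
        theta_d = rng.dirichlet(np.full(K, alpha))         # theta^(d) ~ Dirichlet(alpha)
        z = rng.choice(K, size=N, p=theta_d)               # z_i ~ Discrete(theta^(d))
        w = np.array([rng.choice(V, p=phi[zi]) for zi in z])  # w_i ~ Discrete(phi^(z_i))
        docs.append(w)
    return docs, phi

docs, phi = generate_corpus()
print(len(docs), docs[0][:10])   # 100 documents; first 10 word ids of doc 0
```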
Approximate inferences for LDA
• Deterministic approximation
– Variational inference
– Expectation propagation
• Markov chain Monte Carlo
– Full Gibbs sampler
– Collapsed Gibbs sampler
(The collapsed Gibbs sampler is the most efficient and quite popular, but it only works with conjugate priors.)
Collapsed Gibbs sampling [Griffiths & Steyvers 04]
• Using conjugacy between Dirichlet and multinomial
distributions, integrate out continuous random
variables
$$P(\mathbf{z}) = \int P(\mathbf{z} \mid \theta)\, p(\theta)\, d\theta = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{j} \Gamma\big(n_{j}^{(d)} + \alpha\big)}{\Gamma\big(n_{\cdot}^{(d)} + T\alpha\big)}$$

where $n_{j}^{(d)}$ is the number of words in document $d$ assigned to topic $j$ and $T$ is the number of topics.
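The matching term for the words given the topic assignments (from the same Griffiths & Steyvers derivation; this part did not survive extraction, so it is reconstructed here) is:

$$P(\mathbf{w} \mid \mathbf{z}) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma\big(n_{j}^{(w)} + \beta\big)}{\Gamma\big(n_{j}^{(\cdot)} + W\beta\big)}$$

where $n_{j}^{(w)}$ counts how many times word $w$ is assigned to topic $j$ and $W$ is the vocabulary size.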
Collapsed Gibbs sampling [Griffiths & Steyvers 04]
• Sample each z_i conditioned on z_{-i}, the assignments of all the other words besides z_i
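The resulting full conditional, which each Gibbs step samples from (the standard collapsed update of Griffiths & Steyvers; the formula itself did not survive extraction), is:

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$

where all counts exclude the current assignment of $z_i$: $n_{-i,j}^{(w_i)}$ is the number of instances of word $w_i$ assigned to topic $j$, and $n_{-i,j}^{(d_i)}$ is the number of words in document $d_i$ assigned to topic $j$.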
Gibbs sampling in LDA
i    w_i            d_i   z_i (iter 1)   z_i (iter 2)
1 MATHEMATICS 1 2 ?
2 KNOWLEDGE 1 2
3 RESEARCH 1 1
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
Gibbs sampling in LDA

i    w_i            d_i   z_i (iter 1)   z_i (iter 2)
1 MATHEMATICS 1 2 ?
2 KNOWLEDGE 1 2
3 RESEARCH 1 1
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2

(Resampling the "?" entry uses two counts: the number of words in d_i currently assigned to topic j, and the number of instances of word w_i currently assigned to topic j.)
Gibbs sampling in LDA
i    w_i            d_i   z_i (iter 1)   z_i (iter 2)
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 ?
3 RESEARCH 1 1
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
Gibbs sampling in LDA
i    w_i            d_i   z_i (iter 1)   z_i (iter 2)
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 1
3 RESEARCH 1 1 ?
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
Gibbs sampling in LDA
i    w_i            d_i   z_i (iter 1)   z_i (iter 2)
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 1
3 RESEARCH 1 1 1
4 WORK 1 2 ?
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
Gibbs sampling in LDA
i    w_i            d_i   z_i (iter 1)   z_i (iter 2)
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 1
3 RESEARCH 1 1 1
4 WORK 1 2 2
5 MATHEMATICS 1 1 ?
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
Gibbs sampling in LDA
i    w_i            d_i   z_i (iter 1)   z_i (iter 2)   …   z_i (iter 1000)
1 MATHEMATICS 1 2 2 2
2 KNOWLEDGE 1 2 1 2
3 RESEARCH 1 1 1 2
4 WORK 1 2 2 1
5 MATHEMATICS 1 1 2 2
6 RESEARCH 1 2 2 2
7 WORK 1 2 2 2
8 SCIENTIFIC 1 1 1 … 1
9 MATHEMATICS 1 2 2 2
10 WORK 1 1 2 2
11 SCIENTIFIC 2 1 1 2
12 KNOWLEDGE 2 1 2 2
. . . . . .
. . . . . .
. . . . . .
50 JOY 5 2 1 1
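A compact Python sketch of the collapsed Gibbs sampler illustrated above (my own code; docs is a list of arrays of word ids, and the priors are symmetric):

```python
import numpy as np

def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))        # words in doc d assigned to topic k
    n_kw = np.zeros((K, V))                # instances of word w assigned to topic k
    n_k = np.zeros(K)                      # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initialization
    for d, doc in enumerate(docs):         # count the initial assignments
        for i, w in enumerate(doc):
            n_dk[d, z[d][i]] += 1; n_kw[z[d][i], w] += 1; n_k[z[d][i]] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                # remove z_i from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional P(z_i = k | z_-i, w) up to a constant
                p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw
```

After burn-in, the topic word distributions and topic proportions can be read off the counts, e.g. p(w|θ_j) ∝ n_kw[j, w] + β.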
Topics learned by LDA
Topic assignments in document
• Based on the topics shown in the last slide
Application of learned topics
• Document classification
– A new type of feature representation
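For illustration, here is a hypothetical scikit-learn pipeline (my own sketch; names like train_texts and train_labels are placeholders, not from the slides) that uses per-document topic proportions as classification features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# train_texts, train_labels, test_texts are assumed to be provided elsewhere
counts = CountVectorizer(stop_words="english")
X_train = counts.fit_transform(train_texts)

# Fit LDA and use the inferred topic proportions as a dense feature vector
lda = LatentDirichletAllocation(n_components=20, random_state=0)
train_topics = lda.fit_transform(X_train)

clf = LogisticRegression(max_iter=1000).fit(train_topics, train_labels)
test_topics = lda.transform(counts.transform(test_texts))
predictions = clf.predict(test_topics)
```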
Application of learned topics
• Collaborative filtering
– A new type of user profile
Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary
Supervised Topic Model [Blei & McAuliffe, NIPS'07]
Sentiment polarity of topics
Author Topic Model [Rosen-Zvi UAI’04]
• Authorship determines the topic mixture
Learned association between words
and authors
Collaborative Topic Model [Wang & Blei, KDD'11]
Correspondence Topic Model [Blei SIGIR’03]
LDA part
Annotation results
Annotation results
Dynamic Topic Model [Blei ICML’06]
• Capture the evolving topics over time
Markov
assumption
about the
topic
dynamics
Evolution of topics
Polylingual Topic Models [Mimno et al., EMNLP'09]
Topics learned in different languages
Correlated Topic Model [Blei & Lafferty, Annals
of Applied Stat’07]
Hierarchical Topic Models [Blei et al. NIPS’04]
Hierarchical structure of topics
Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary
Summary
• Probabilistic Topic Models are a new family of document
modeling approaches, especially useful for
– Discovering latent topics in text
– Analyzing latent structures and patterns of topics
– Extensible for joint modeling and analysis of text and associated non-
textual data
• pLSA & LDA are two basic topic models that tend to function
similarly, with LDA better as a generative model
• Many different models have been proposed with probably
many more to come
• Many demonstrated applications in multiple domains and
many more to come
Summary
• However, all topic models suffer from the problem of multiple local
maxima
– Make it hard/impossible to reproduce research results
– Make it hard/impossible to interpret results in real applications
• Complex models can’t scale up to handle large amounts of text data
– Collapsed Gibbs sampling is efficient, but it only works for conjugate priors
– Variational EM needs to be derived in a model-specific way
– Parallel algorithms are promising
• Many challenges remain….