Topic Models in Natural Language Processing


Probabilistic Topic Models

Hongning Wang
CS@UVa
Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary

What is a Topic?

Topic: a broad concept/theme, semantically coherent, which is hidden in documents
e.g., politics; sports; technology; entertainment; education, etc.

Representation: a probability distribution over words, e.g.,

retrieval    0.2
information  0.15
model        0.08
query        0.07
language     0.06
feedback     0.03
……
Document as a mixture of topics

Example passage: [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …

Topic 1:     government 0.3, response 0.2, ...
Topic 2:     city 0.2, new 0.1, orleans 0.05, ...
…
Topic k:     donate 0.1, relief 0.05, help 0.02, ...
Background:  is 0.05, the 0.04, a 0.03, ...

• How can we discover these topic-word distributions?
• Many applications would be enabled by discovering such topics
– Summarize themes/aspects
– Facilitate navigation/browsing
– Retrieve documents
– Segment documents
– Many other text mining tasks
General idea of probabilistic topic models

• Topic: a multinomial distribution over words


• Document: a mixture of topics
– A document is “generated” by first sampling topics from some prior distribution
– Each time, sample a word from a corresponding topic
– Many variations of how these topics are mixed (a minimal generative sketch follows below)
• Topic modeling
– Fitting the probabilistic model to text
– Answer topic-related questions by computing various kinds
of posterior distributions
• e.g., p(topic|time), p(sentiment|topic)

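A minimal sketch (my own illustration, not from the slides) of this generative view: pick a topic according to the document's topic weights, then pick a word from that topic's word distribution. All distributions here are toy values.

```python
import random

# Toy topic-word distributions and per-document topic weights (illustrative values only)
topics = {
    "retrieval": {"retrieval": 0.4, "information": 0.3, "query": 0.3},
    "mining":    {"text": 0.4, "mining": 0.4, "clustering": 0.2},
}
topic_weights = {"retrieval": 0.7, "mining": 0.3}   # the document's topic mixture

def generate_word():
    # first sample a topic, then sample a word from that topic
    topic = random.choices(list(topic_weights), weights=topic_weights.values())[0]
    dist = topics[topic]
    return random.choices(list(dist), weights=dist.values())[0]

doc = [generate_word() for _ in range(10)]
print(doc)
```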
Simplest Case: 1 Topic + 1 “Background”

Assume words in d are from two topics: 1 document topic + 1 background (e.g., a text mining paper mixed with general background English).

Background topic p(w|θ_B)        Document topic p(w|θ_d)
the       0.03                   the          0.031
a         0.02                   a            0.018
is        0.015                  …
we        0.01                   text         0.04
...                              mining       0.035
food      0.003                  association  0.03
computer  0.00001                clustering   0.005
…                                computer     0.0009
text      0.000006               food         0.000001
…                                …

How can we “get rid of” the common words from the topic to make it more discriminative?
The Simplest Case:
One Topic + One Background Model

Assume p(w|θ_B) and λ are known; λ is the mixing proportion of the background topic in d.

Generating a word in document d:
– With probability λ, sample a background word w from p(w|θ_B)
– With probability 1−λ, sample a topic word w from p(w|θ_d)

p(w) = λ·p(w|θ_B) + (1−λ)·p(w|θ_d)

log p(d|θ_d) = Σ_{w∈V} c(w,d) · log[ λ·p(w|θ_B) + (1−λ)·p(w|θ_d) ]

θ̂_d = argmax_θ log p(d|θ), computed with Expectation Maximization.
How to Estimate θ?

Known background p(w|θ_B):          Unknown topic p(w|θ) for “text mining”:
the     0.2                         text        = ?
a       0.1                         mining      = ?
we      0.01                        association = ?
to      0.02                        word        = ?
…                                   …
text    0.0001
mining  0.00005

Mixing weights: λ = 0.7 (background), 1−λ = 0.3 (topic)

The ML estimator would be straightforward if we knew the identity/label (background vs. topic) of each observed word ... but we don’t!
We guess the topic assignments

Assignment (“hidden”) variable: z_i ∈ {1 (background), 0 (topic)}

Example guesses:
  w_i        z_i
  the        1
  paper      1
  presents   1
  a          1
  text       0
  mining     0
  algorithm  0
  the        1
  paper      0
  ...

Suppose the parameters are all known; what is a reasonable guess of z_i?
– It depends on λ
– It depends on p(w|θ_B) and p(w|θ)

E-step:
p(z_i = 1 | w_i) = λ·p(w_i|θ_B) / [ λ·p(w_i|θ_B) + (1−λ)·p_current(w_i|θ) ]

M-step:
p_new(w_i|θ) = c(w_i,d)·(1 − p(z_i = 1 | w_i)) / Σ_{w'∈V} c(w',d)·(1 − p(z_{w'} = 1 | w'))

θ_B and θ are competing for explaining the words in document d!
Initially, set p(w|θ) to some random values, then iterate …
An example of EM computation

Expectation-Step (augmenting the data by guessing the hidden variables):
p^(n)(z_i = 1 | w_i) = λ·p(w_i|θ_B) / [ λ·p(w_i|θ_B) + (1−λ)·p^(n)(w_i|θ) ]

Maximization-Step (with the “augmented data”, estimate parameters using maximum likelihood):
p^(n+1)(w_i|θ) = c(w_i,d)·(1 − p^(n)(z_i = 1 | w_i)) / Σ_{w_j∈V} c(w_j,d)·(1 − p^(n)(z_j = 1 | w_j))

Assume λ = 0.5

                           Iteration 1       Iteration 2       Iteration 3
Word    #   P(w|θ_B)    P(w|θ)  P(z=1)    P(w|θ)  P(z=1)    P(w|θ)  P(z=1)
The     4   0.5         0.25    0.67      0.20    0.71      0.18    0.74
Paper   2   0.3         0.25    0.55      0.14    0.68      0.10    0.75
Text    4   0.1         0.25    0.29      0.44    0.19      0.50    0.17
Mining  2   0.1         0.25    0.29      0.22    0.31      0.22    0.31
Log-Likelihood                  -16.96            -16.13            -16.02

(A short runnable sketch of this computation follows.)
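Below is a minimal Python sketch (my own, not course code) that reproduces the numbers in the table above; the counts, background probabilities, and λ = 0.5 are taken from the example, while the variable names are mine.

```python
import math

# Word counts in d and the known background distribution, from the table above
counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}
p_bg   = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}
lam    = 0.5                                        # mixing weight of the background topic

p_topic = {w: 1.0 / len(counts) for w in counts}    # uniform initialization of p(w|theta)

for it in range(1, 4):
    # log-likelihood under the current estimate of p(w|theta)
    ll = sum(c * math.log(lam * p_bg[w] + (1 - lam) * p_topic[w]) for w, c in counts.items())
    # E-step: posterior probability that each word came from the background
    p_z1 = {w: lam * p_bg[w] / (lam * p_bg[w] + (1 - lam) * p_topic[w]) for w in counts}
    # M-step: re-estimate p(w|theta) from the topic-attributed fractional counts
    frac = {w: counts[w] * (1 - p_z1[w]) for w in counts}
    total = sum(frac.values())
    p_topic = {w: frac[w] / total for w in counts}
    print(it, {w: round(p, 2) for w, p in p_z1.items()}, round(ll, 2))
```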
Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary

Discover multiple topics in a collection

• Generalize the two-topic mixture to k topics

Topic θ_1:      warning 0.3, system 0.2, ...      (all values to be estimated)
Topic θ_2:      aid 0.1, donation 0.05, support 0.02, ...
…
Topic θ_k:      statistics 0.2, loss 0.1, dead 0.05, ...
Background θ_B: is 0.05, the 0.04, a 0.03, ...

“Generating” word w in doc d in the collection:
– With probability λ_B, sample w from the background θ_B
– With probability 1−λ_B, sample a topic θ_j according to the topic coverage π_{d,j} of document d, then sample w from p(w|θ_j)

Parameters:
Global: topic word distributions {θ_j}, j = 1, …, k
Local:  topic coverage {π_{d,j}} for each document d
Manual: background weight λ_B
Probabilistic Latent Semantic Analysis
[Hofmann 99a, 99b]

• Topic: a multinomial distribution over words


• Document
– Mixture of k topics
– Mixing weights reflect the topic coverage
• Topic modeling
– Word distribution under topic: p(w|θ)
– Topic coverage: p(π|d)

EM for estimating multiple topics

Known background p(w|θ_B): the 0.2, a 0.1, we 0.01, to 0.02, …

Unknown topic model p(w|θ_1) = ?  (“text mining”):            text = ?, mining = ?, association = ?, word = ?, …
Unknown topic model p(w|θ_2) = ?  (“information retrieval”):  information = ?, retrieval = ?, query = ?, document = ?, …

E-Step: predict the topic labels of the observed words using Bayes’ rule
M-Step: ML estimator based on “fractional counts”
Parameter estimation

E-Step: posterior (an application of Bayes’ rule) of whether word w in doc d is generated from topic j or from the background:

p(z_{d,w} = j) = π_{d,j}^(n) · p^(n)(w|θ_j) / Σ_{j'=1}^k π_{d,j'}^(n) · p^(n)(w|θ_{j'})

p(z_{d,w} = B) = λ_B · p(w|θ_B) / [ λ_B · p(w|θ_B) + (1−λ_B) · Σ_{j=1}^k π_{d,j}^(n) · p^(n)(w|θ_j) ]

M-Step: re-estimate

– mixing weights:
π_{d,j}^(n+1) = Σ_{w∈V} c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j) / Σ_{j'} Σ_{w∈V} c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j')

– word-topic distributions (summing over all docs in the collection):
p^(n+1)(w|θ_j) = Σ_{d∈C} c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j) / Σ_{w'∈V} Σ_{d∈C} c(w',d)·(1 − p(z_{d,w'}=B))·p(z_{d,w'}=j)

The numerators are the fractional counts contributing to using topic j in generating d, and to generating w from topic j. (A compact vectorized sketch of one EM iteration follows.)
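Below is a compact NumPy sketch (my own, not the lecture's code) of one E-step/M-step iteration of the equations above; the array shapes and names are assumptions made for illustration.

```python
import numpy as np

def plsa_em_step(counts, theta, pi, p_bg, lam_bg):
    """One EM iteration. counts: D x V term counts; theta: k x V word
    distributions p(w|theta_j); pi: D x k topic coverage; p_bg: length-V
    background distribution; lam_bg: background mixing weight."""
    # E-step: p(z_{d,w} = j) and p(z_{d,w} = B) for every (d, w) pair
    mix = pi[:, :, None] * theta[None, :, :]                 # D x k x V
    topic_total = mix.sum(axis=1)                            # D x V: sum_j pi_{d,j} p(w|theta_j)
    p_z_j = mix / np.maximum(topic_total[:, None, :], 1e-12)
    p_z_bg = lam_bg * p_bg / (lam_bg * p_bg + (1 - lam_bg) * topic_total)
    # M-step: fractional counts c(w,d) * (1 - p(z=B)) * p(z=j)
    frac = counts[:, None, :] * (1 - p_z_bg)[:, None, :] * p_z_j
    pi_new = frac.sum(axis=2)
    pi_new /= pi_new.sum(axis=1, keepdims=True)              # new topic coverage pi_{d,j}
    theta_new = frac.sum(axis=0)
    theta_new /= theta_new.sum(axis=1, keepdims=True)        # new word distributions p(w|theta_j)
    return theta_new, pi_new

# Usage: random initialization, then iterate until the log-likelihood converges
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(10, 50)).astype(float)     # toy document-term matrix
p_bg = counts.sum(axis=0) / counts.sum()                     # background = collection frequencies
theta = rng.dirichlet(np.ones(50), size=3)
pi = rng.dirichlet(np.ones(3), size=10)
for _ in range(20):
    theta, pi = plsa_em_step(counts, theta, pi, p_bg, lam_bg=0.5)
```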
How the algorithm works

Example: two documents and two topics,
d1: aid 7, price 5, oil 6
d2: aid 8, price 7, oil 5
with topic coverages π_{d1,1}, π_{d1,2} (= P(θ_1|d1), P(θ_2|d1)) and π_{d2,1}, π_{d2,2}, and word distributions P(w|θ_1), P(w|θ_2).

– Initialize π_{d,j} and P(w|θ_j) with random values
– Iterate until converging:
  – E-Step: split the word counts among the topics (by computing the z’s), i.e., compute c(w,d)·p(z_{d,w}=B) and c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j)
  – M-Step: re-estimate π_{d,j} and P(w|θ_j) by adding and normalizing the split word counts
Sample pLSA topics from TDT Corpus [Hofmann 99b]

pLSA with prior knowledge
• What if we have some domain knowledge in
mind
– We want to see topics such as “battery” and
“memory” for opinions about a laptop
– We want words like “apple” and “orange” to co-occur in a topic
– One topic should be fixed to model background
words (infinitely strong prior!)
• We can easily incorporate such knowledge as
priors of pLSA model
Maximum a Posteriori (MAP) estimation

θ* = argmax_θ p(θ) · p(Data|θ)

A prior can be placed on π as well (more about this later).

The generative process is the same as before: topics θ_1 (warning 0.3, system 0.2, ...), θ_2 (aid 0.1, donation 0.05, support 0.02, ...), …, θ_k (statistics 0.2, loss 0.1, dead 0.05, ...), plus background θ_B (is 0.05, the 0.04, a 0.03, ...), with topic coverage π_{d,j} in document d and noise level λ_B (manually set). The θ’s and π’s are now estimated with Maximum A Posteriori (MAP).
MAP estimation

• Choosing conjugate priors
– Dirichlet prior for the multinomial distribution, which adds pseudo counts of w from the prior θ'_j:

p^(n+1)(w|θ_j) = [ Σ_{d∈C} c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j) + μ·p(w|θ'_j) ] / [ Σ_{w'∈V} Σ_{d∈C} c(w',d)·(1 − p(z_{d,w'}=B))·p(z_{d,w'}=j) + μ ]

where μ is the sum of all pseudo counts.

– What if μ = 0? What if μ = +∞?
– A consequence of using a conjugate prior is that the prior can be converted into “pseudo data”, which can then be “merged” with the actual data for parameter estimation (see the sketch below).
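Continuing the earlier NumPy sketch (again my own illustration), the only change for MAP estimation is adding the pseudo counts before normalizing in the M-step; `frac`, `prior_theta` (k x V), and `mu` are assumed inputs.

```python
import numpy as np

def map_theta_update(frac, prior_theta, mu):
    """MAP version of the word-distribution M-step. frac: k x V fractional counts
    sum_d c(w,d)(1 - p(z=B)) p(z=j) (as in the earlier sketch); prior_theta: k x V
    prior p(w|theta'_j); mu: total pseudo-count mass."""
    theta_new = frac + mu * prior_theta        # add pseudo counts of w from the prior
    return theta_new / theta_new.sum(axis=1, keepdims=True)
```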
Some background knowledge

• Conjugate prior
– The posterior distribution is in the same family as the prior
– Beta -> Binomial
– Dirichlet -> Multinomial
– Gaussian -> Gaussian
• Dirichlet distribution
– Continuous
– A sample from it can serve as the parameters of a multinomial distribution (illustrated below)
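A short illustration of that last point (my own, not from the slides): a draw from a Dirichlet is a valid parameter vector for a multinomial.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=[2.0, 2.0, 2.0])     # non-negative, sums to 1
words = rng.multinomial(n=100, pvals=theta)      # 100 draws from the multinomial it parameterizes
print(theta, words)
```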
Prior as pseudo counts

Known background p(w|θ_B): the 0.2, a 0.1, we 0.01, to 0.02, …

Unknown topic model p(w|θ_1) = ?  (“text mining”):            text = ?, mining = ?, association = ?, word = ?, …
Unknown topic model p(w|θ_2) = ?  (“information retrieval”):  information = ?, retrieval = ?, query = ?, document = ?, …

Input to the MAP estimator: the observed doc(s) plus a pseudo doc of size μ (e.g., containing “text” and “mining”), for which we suppose we know the identity of each word.
Deficiency of pLSA
• Not a fully generative model
– Can’t compute probability of a new document
• Topic coverage p(π|d) is estimated per document
– Heuristic workaround is possible
• Many parameters → high complexity of the model
– Many local maxima
– Prone to overfitting

Latent Dirichlet Allocation [Blei et al. 02]
• Make pLSA a fully generative model by
imposing Dirichlet priors
– Dirichlet priors over p(π|d)
– Dirichlet priors over p(w|θ)
– A Bayesian version of pLSA
• Provides a mechanism to deal with new documents
– Flexible to model many other observations in a
document

LDA = Imposing Prior on pLSA

pLSA: the topic coverage weights {π_{d,j}} are free parameters, specific to each “training document”, and thus cannot be used to generate a new document. Word w in doc d is still “generated” by choosing a topic according to π_{d,j} and then drawing w from p(w|θ_j).

LDA: the topic coverage distribution {π_{d,j}} for any document is sampled from a Dirichlet distribution, allowing the generation of a new doc; {π_{d,j}} are thereby regularized:

p(π_d) = Dirichlet(α)

In addition, the topic word distributions {θ_j} are also drawn from another Dirichlet prior:

p(θ_j) = Dirichlet(β)

The magnitudes of α and β determine the variances of the priors, and thus the concentration of the prior (larger α and β → stronger prior).
pLSA vs. LDA

pLSA:
p_d(w | {θ_j}, {π_{d,j}}) = Σ_{j=1}^k π_{d,j} · p(w|θ_j)        ← the core assumption in all topic models

log p(d | {θ_j}, {π_{d,j}}) = Σ_{w∈V} c(w,d) · log[ Σ_{j=1}^k π_{d,j} · p(w|θ_j) ]

log p(C | {θ_j}, {π_{d,j}}) = Σ_{d∈C} log p(d | {θ_j}, {π_{d,j}})

LDA:
p_d(w | {θ_j}, {π_{d,j}}) = Σ_{j=1}^k π_{d,j} · p(w|θ_j)

log p(d | α, {θ_j}) = ∫ { Σ_{w∈V} c(w,d) · log[ Σ_{j=1}^k π_{d,j} · p(w|θ_j) ] } p(π_d|α) dπ_d        (the inner sum is the pLSA component)

log p(C | α, β) = ∫ { Σ_{d∈C} log p(d | α, {θ_j}) } Π_{j=1}^k p(θ_j|β) dθ_1 … dθ_k

The Dirichlet priors p(π_d|α) and p(θ_j|β) are the regularization added by LDA.
LDA as a graphical model [Blei et al. 03a]

θ^(d) ~ Dirichlet(α)       distribution over topics for each document (same as π_d on the previous slides), with Dirichlet prior α
φ^(j) ~ Dirichlet(β)       distribution over words for each topic j = 1, …, T (same as θ_j on the previous slides), with Dirichlet prior β
z_i ~ Discrete(θ^(d))      topic assignment for each word position i in document d
w_i ~ Discrete(φ^(z_i))    word generated from the assigned topic

(Plate notation: N_d word positions per document, D documents, T topics.)

Most approximate inference algorithms aim to infer p(z_i | w, α, β), from which the other interesting variables can be easily computed. (A short sketch of the generative process follows.)
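The following is a minimal sketch (mine, not the authors' code) of the LDA generative process just described; the sizes and hyperparameter values are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, n_docs, doc_len = 3, 20, 5, 30      # topics, vocabulary size, documents, words per doc
alpha, beta = 0.5, 0.1                    # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)           # phi^(j): word distribution of each topic
docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(np.full(T, alpha))            # theta^(d): topic distribution of this document
    z = rng.choice(T, size=doc_len, p=theta)            # z_i ~ Discrete(theta^(d))
    w = np.array([rng.choice(V, p=phi[t]) for t in z])  # w_i ~ Discrete(phi^(z_i))
    docs.append(w)
```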
Approximate inferences for LDA
• Deterministic approximation
– Variational inference
– Expectation propagation
• Markov chain Monte Carlo
– Full Gibbs sampler
– Collapsed Gibbs sampler (the most efficient and quite popular, but only works with conjugate priors)

Collapsed Gibbs sampling [Griffiths & Steyvers 04]

• Using the conjugacy between the Dirichlet and multinomial distributions, integrate out the continuous random variables θ and φ:

P(z) = ∫ P(z|θ) p(θ) dθ = Π_{d=1}^D [ Γ(Tα) / Γ(α)^T ] · [ Π_j Γ(n_j^(d) + α) ] / Γ( Σ_j n_j^(d) + Tα )

P(w|z) = ∫ P(w|z,φ) p(φ) dφ = Π_{j=1}^T [ Γ(Wβ) / Γ(β)^W ] · [ Π_w Γ(n_j^(w) + β) ] / Γ( Σ_w n_j^(w) + Wβ )

• This defines a distribution on the topic assignments z:

P(z|w) = P(w|z) P(z) / Σ_z P(w|z) P(z)

where the numerator can be evaluated for any fixed assignment of z.
Collapsed Gibbs sampling [Griffiths & Steyvers 04]

• Sample each z_i conditioned on z_{-i} (all the other topic assignments besides z_i):

P(z_i = j | w, z_{-i}) ∝ [ (n_{w_i,-i}^(j) + β) / (n_{·,-i}^(j) + Wβ) ] · [ (n_{j,-i}^(d_i) + α) / (n_{·,-i}^(d_i) + Tα) ]

The first factor is the word-topic distribution; the second is the topic proportion in document d_i.

– Implementation: the counts can be cached in two sparse matrices; no special functions, simple arithmetic (see the sketch below)
– Distributions on θ and φ can be computed analytically given z and w
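A compact sketch (my own, not the authors' code) of one sweep of collapsed Gibbs sampling following the conditional above; data layout and names are assumptions for illustration.

```python
import numpy as np

def gibbs_sweep(docs, z, n_wt, n_dt, n_t, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling. docs: list of word-id lists;
    z: current topic assignments (same shape as docs); n_wt: W x T word-topic
    counts; n_dt: D x T document-topic counts; n_t: length-T topic totals."""
    W, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # remove the current assignment of this token from the counts (the "-i" in the formula)
            n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
            # unnormalized P(z_i = j | w, z_-i); the per-document denominator is constant in j
            p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
            t = rng.choice(T, p=p / p.sum())
            # record the new assignment and restore the counts
            z[d][i] = t
            n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1

# Usage on a toy corpus: random initialization, then repeated sweeps
rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 0]]
T, W, D = 2, 4, len(docs)
z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
n_wt = np.zeros((W, T)); n_dt = np.zeros((D, T)); n_t = np.zeros(T)
for d, doc in enumerate(docs):
    for w, t in zip(doc, z[d]):
        n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
for _ in range(100):
    gibbs_sweep(docs, z, n_wt, n_dt, n_t, alpha=0.5, beta=0.1, rng=rng)
```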
Gibbs sampling in LDA

Toy example: each word token i has a word w_i, a document id d_i, and a topic assignment z_i (here T = 2 topics), initialized randomly.

Iteration 1:
 i   w_i          d_i  z_i
 1   MATHEMATICS  1    2
 2   KNOWLEDGE    1    2
 3   RESEARCH     1    1
 4   WORK         1    2
 5   MATHEMATICS  1    1
 6   RESEARCH     1    2
 7   WORK         1    2
 8   SCIENTIFIC   1    1
 9   MATHEMATICS  1    2
10   WORK         1    1
11   SCIENTIFIC   2    1
12   KNOWLEDGE    2    1
 .   .            .    .
50   JOY          5    2
To start iteration 2, resample z_1, the assignment of token 1 (MATHEMATICS in document 1), holding all other assignments fixed. What is the most likely topic for w_1 in d_1? The conditional above combines:
– how likely topic j would generate word w_1: the count of instances where w_1 is assigned to topic j (plus β), over the count of all words assigned to topic j (plus Wβ);
– how likely d_1 would choose topic j: the count of words in d_1 assigned to topic j (plus α), over the count of words in d_1 assigned to any topic (plus Tα);
with all counts excluding token 1 itself.
Suppose topic 2 is drawn for z_1. The sampler then moves on to z_2, z_3, …, resampling each assignment in turn from its conditional given all the other current assignments, and sweeps through the whole corpus repeatedly.
After many sweeps, the assignments stabilize:

     iteration          1     2     …     1000
 i   w_i          d_i   z_i   z_i         z_i
 1   MATHEMATICS  1     2     2           2
 2   KNOWLEDGE    1     2     1           2
 3   RESEARCH     1     1     1           2
 4   WORK         1     2     2           1
 5   MATHEMATICS  1     1     2           2
 6   RESEARCH     1     2     2           2
 7   WORK         1     2     2           2
 8   SCIENTIFIC   1     1     1     …     1
 9   MATHEMATICS  1     2     2           2
10   WORK         1     1     2           2
11   SCIENTIFIC   2     1     1           2
12   KNOWLEDGE    2     1     2           2
 .   .            .     .     .           .
50   JOY          5     2     1           1
Topics learned by LDA

Topic assignments in document
• Based on the topics shown in the previous slide
Application of learned topics
• Document classification
– A new type of feature representation: the inferred per-document topic proportions (a sketch follows)
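As a hedged illustration (not from the lecture), the inferred document-topic proportions can be fed directly into any standard classifier; here `theta` is a stand-in array and the labels are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(10), size=200)      # stand-in for inferred topic proportions (D x k)
labels = (theta[:, 0] > 0.2).astype(int)          # synthetic labels, for illustration only
clf = LogisticRegression(max_iter=1000).fit(theta, labels)
print("training accuracy:", clf.score(theta, labels))
```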
Application of learned topics
• Collaborative filtering
– A new type of user profile

Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary

Supervised Topic Model [Blei & McAuliffe, NIPS’07]

• A generative model for classification
– Topics generate both words and labels
Sentiment polarity of topics

Sentiment polarity learned from the classification model.
Author Topic Model [Rosen-Zvi UAI’04]

• Authorship determines the topic mixture
– Each author chooses his/her topics to contribute to the document
Learned association between words and authors
Collaborative Topic Model [Wang & Blei, KDD’11]

• Collaborative filtering in topic space
– A user’s preference over topics determines his/her rating for the item
– Item representation: the item’s topics, shifted with random noise
– User representation: the user’s profile over the topic space
Topic-based recommendation

Correspondence Topic Model [Blei SIGIR’03]

• Simultaneously modeling the generation of multiple types of observations
– E.g., an image and its corresponding text annotations
– An LDA part plus a correspondence part (which can be described with different distributions)
Annotation results

Annotation results

Dynamic Topic Model [Blei ICML’06]

• Capture the evolving topics over time
– Markov assumption about the topic dynamics
Evolution of topics

Polylingual Topic Models [Mimno et al., EMNLP’09]

• Assumption: topics are universal across languages
– Correspondences between documents are known
  • E.g., news reports about the same event in different languages
Topics learned in different languages

Correlated Topic Model [Blei & Lafferty, Annals of Applied Stat’07]

• Non-conjugate priors to capture correlation between topics
– Gaussian as the prior for topic proportions (increases the computational complexity)
Learned structure of topics

Hierarchical Topic Models [Blei et al. NIPS’04]

• Nested Chinese restaurant process as a prior for topic assignment
Hierarchical structure of topics

Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary

Summary
• Probabilistic Topic Models are a new family of document
modeling approaches, especially useful for
– Discovering latent topics in text
– Analyzing latent structures and patterns of topics
– Extensible for joint modeling and analysis of text and associated non-
textual data
• pLSA & LDA are two basic topic models that tend to function
similarly, with LDA better as a generative model
• Many different models have been proposed with probably
many more to come
• Many demonstrated applications in multiple domains and
many more to come

Summary
• However, all topic models suffer from the problem of multiple local
maxima
– Makes it hard/impossible to reproduce research results
– Makes it hard/impossible to interpret results in real applications
• Complex models can’t scale up to handle large amounts of text data
– Collapsed Gibbs sampling is efficient, but only works for conjugate priors
– Variational EM needs to be derived in a model-specific way
– Parallel algorithms are promising
• Many challenges remain….

