Topic Models in Natural Language Processing

Probabilistic Topic Models
Hongning Wang
CS@UVa
Outline
1. General idea of topic models
2. Basic topic models
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
3. Variants of topic models
4. Summary
CS@UVa 2
What is a Topic?
Representation: a probabilistic
distribution over words.
retrieval 0.2
information 0.15
model 0.08
query 0.07
language 0.06
feedback 0.03
……
Topic: A broad concept/theme,

semantically coherent, which is
hidden in documents
e.g., politics; sports; technology;
entertainment; education etc.
CS@UVa 3
Document as a mixture of topics
government 0.3
Topic  1 [ Criticism of government response to the hurricane
response 0.2 primarily consisted of criticism of its response to the
... approach of the storm and its aftermath, specifically in the
delayed response ] to the [ flooding of New Orleans. …
city 0.2
80% of the 1.3 million residents of the greater New Orleans
Topic  2 new 0.1 metropolitan area evacuated ] …[ Over seventy countries
orleans 0.05 pledged monetary donations or other assistance]. …
...
…
• How can we discover these topic-word
donate 0.1 distributions?
Topic k relief 0.05
help 0.02
• Many applications would be enabled
... by discovering such topics
– Summarize themes/aspects
is 0.05 – Facilitate navigation/browsing
Background k the 0.04 – Retrieve documents
a 0.03
...
– Segment documents
– Many other text mining tasks
CS@UVa 4
General idea of probabilistic topic models
• Topic: a multinomial distribution over words

• Document: a mixture of topics
– A document is “generated” by first sampling topics from
some prior distribution
– Each time, sample a word from a corresponding topic
– Many variations of how these topics are mixed
• Topic modeling
– Fitting the probabilistic model to text
– Answer topic-related questions by computing various kinds
of posterior distributions
• e.g., p(topic|time), p(sentiment|topic)
CS@UVa 5
Simplest Case: 1 topic + 1 “background”
B Assume words in d are from

d
General Background two topics: Text mining
English Text 1 topic + 1 background paper
the 0.03 B d the 0.031

a 0.02 a 0.018
is 0.015 …
we 0.01 text 0.04
How can we “get rid of” the mining 0.035
...
common words from the topic association 0.03
food 0.003 clustering 0.005
computer 0.00001 to make it more discriminative? computer 0.0009
… …
text 0.000006 food 0.000001
… …
Background Topic: p(w|B) Document Topic: p(w|d)

CS@UVa 6
The Simplest Case:
One Topic + One Background Model
Assume p(w|B) and  are known
 = mixing proportion of background topic in d
Background words
 P(w|B) w
Topic choice
P(Topic)
Topic words Document d
1-
P(w| ) w
p(w)  p(w | B )  (1  ) p(w |  )
log p(d |  )  c(w, d ) log[p(w | B )  (1  ) p(w |  )]

wV
Expectation Maximization ˆ arg max log p(d | )


CS@UVa 7
How to Estimate ?
the 0.2
a 0.1 Observed
Known we 0.01 =0.7 words
to 0.02
Background …
p(w|B) text 0.0001
mining 0.00005
…
ML
Estimator
Unknown …
text =? =0.3
topic p(w|) mining =?
for “Text association =?
word =?
mining” …
Suppose we know
the identity/label of each word ...
But we don’t!
CS@UVa 8
We guess the topic assignments
Assignment (“hidden”) variable: zi {1 (background), 0(topic)}
zi
Suppose the parameters are all known,
the 1 what’s a reasonable guess of zi?
paper 1 - depends on 
presents 1 - depends on p(w|B) and p(w|)
a 1
p(z i  1) p(w | z 1)
text 0 p(zi  1| wi ) i
p(zi  1) p(w | zi  1)  p(zi  0) p(w | zi  0)
mining 0
algorithm 0 p(w |B)
 E-step
the 1 p(w| )B (1)pcurrent (w |)
paper 0 c(wi , d )(1 p(zi  1| wi ))
... pnew (wi |)  M-step
...
w'V c(w',d )(1 p(z i  1| w'))
B and  are competing for explaining words in document d!
Initially, set p(w| ) to some random values, then iterate …
CS@UVa 9
An example of EM computation
p (n) (z  1| w ) p(wi |B )
i i
p(wi | B)  (1 ) p (n) (wi | ) Expectation-Step:
Augmenting data by guessing hidden variables
c(wi , d)(1 p (n) (z i  1| w i ))
p (n1)
(wi |  ) 
 c(w j, d)(1 p (n) (z j  1| wj ))
w j vocabulary
Maximization-Step
With the “augmented data”, estimate parameters
using maximum likelihood
Assume =0.5
Word # P(w|B) Iteration 1 Iteration 2 Iteration 3
P(w|) P(z=1) P(w|) P(z=1) P(w|) P(z=1)
The 4 0.5 0.25 0.67 0.20 0.71 0.18 0.74
Paper 2 0.3 0.25 0.55 0.14 0.68 0.10 0.75
Text 4 0.1 0.25 0.29 0.44 0.19 0.50 0.17
Mining 2 0.1 0.25 0.29 0.22 0.31 0.22 0.31
Log-Likelihood -16.96 -16.13 -16.02
CS@UVa 10
Outline
4. Summary
CS@UVa 11
Discover multiple topics in a collection
• Generalize the two topic mixture to k topics
warning 0.3
? Topic coverage
Topic 1 system 0.2..
? in document d
Topic  2
aid ?0.1 1 d,1 “Generating” word w
donation 0.05
? in doc d in the collection
support 0.02
? .. 2
… d,2 1 - B
statistics 0.2
? d, k
loss ?0.1 k W
Topic  k dead 0.05
? ..
B Parameters:
is 0.05
Background B
?
the 0.04
?
B Global: {𝜃𝑘}𝐾𝑘 =1
a 0.03
? .. Local: {𝜋𝑑,𝑘}𝐾𝑑 ,𝑘 =
Manual: 𝜆𝐵 1
CS@UVa 12
Probabilistic Latent Semantic Analysis
[Hofmann 99a, 99b]
• Topic: a multinomial distribution over words

• Document
– Mixture of k topics
– Mixing weights reflect the topic coverage
• Topic modeling
– Word distribution under topic: p(w|θ)
– Topic coverage: p(π|d)
CS@UVa 13
EM for estimating multiple topics
the 0.2 E-Step:
Known a 0.1
Background we 0.01 Predict topic labels
p(w | B) to 0.02 using Bayes Rule Observed Words
…
Unknown …
topic model text =?
mining =? M-Step:
p(w|1)=? association =? ML Estimator
word =? based on
“Text mining” … “fractional
… counts”
Unknown …
topic model information =?
p(w|2)=? retrieval =?
query =?
“information document =?
retrieval” …
CS@UVa 14
Parameter estimation
E-Step:
Word w in doc d is generated Posterior: application of Bayes rule
- from topic j
- from background
 d(n), j p (n) (w |  j )
p(zd,w  j) 
  (w | j')
k (n) (n)
j'1 d , j' p
B p(w | B )
p(z d,w  B) 
B p(w | B )  (1  B)  j1 d , j p (w |  j )
k (n) (n)
M-Step:  d,(n1) 
 wV
c(w, d)(1 p(zd ,w  B)) p(zd ,w  j)
Re-estimate
j

j' wV
c(w, d)(1 p(zd ,w  B)) p(zd ,w  j')
- mixing weights
p(n1) (w |  j ) 
 c(w, d)(1 p(z  B)) p(z  j)
dC d ,w d ,w
- word-topic distribution   c(w', d )(1 p(z  B)) p(z  j)
w'V dC d ,w' d ,w'
Sum over all docs Fractional counts contributing to

in the collection - using topic j in generating d
- generating w from topic j
CS@UVa 15
How the algorithm works
c(w,d)(1 - p(zd,w = B))p(zd,w=j)
c(w,d)p(zd,w = B) π d1,1 πd1,2 Topic coverage
c(w, d) ( P(θ1|d1) ) ( P(θ2|d1) )
aid 7
d1 price 5
Initial value
oil 6
πd2,1 πd2,2
aid 8 ( P(θ1|d2) ) ( P(θ2|d2) )
d2 price 7
oil 5 Initial value
P(w| θ) Topic 1 Topic 2 3, π4,

Initializing
Iteration 1:
2: Ed,5,
M and
… P(w|
Step: θj) with πd, j
jStep: split
re-estimate
word counts
aid and
UntilP(w|
random
with θj) bytopics
different
values
converging adding (byand
computing z’ s)
normalizing
the splitted word counts
price Initial value
oil
CS@UVa 16
Sample pLSA topics from TDT Corpus [Hofmann 99b]
CS@UVa 17
pLSA with prior knowledge
• What if we have some domain knowledge in
mind
– We want to see topics such as “battery” and
“memory” for opinions about a laptop
– We want words like “apple” and “orange” co-
occur in a topic
– One topic should be fixed to model background
words (infinitely strong prior!)
• We can easily incorporate such knowledge as
priors of pLSA model
CS@UVa 18
Maximum a Posteriori (MAP) estimation
*  arg max p() p(Data | )

Prior can be placed on  as
Topic coverage well (more about this later)
warning 0.3 in document d
Topic 1 system 0.2..
1 d,1 “Generating” word w
aid 0.1 in doc d in the collection
Topic  2 donation 0.05 2
support 0.02 .. d,2 1 - B
…
d, k
statistics 0.2 k W
loss 0.1
Topic  k dead 0.05 .. B
B
is 0.05
Background B the 0.04 Parameters:
a 0.03 .. B=noise-level (manually set)
’s and ’s are estimated with Maximum A Posteriori (MAP)
CS@UVa 19
MAP estimation
• Choosing conjugate priors Pseudo counts of w from prior ’

– Dirichlet prior for multinomial distribution
p (w | ) 
(n1)  c(w, d)(1 p(z  B)) p(z  j)+p(w|’j)
dC d ,w d ,w
j
  c(w', d)(1 p(z  B)) p(z  j)+
w'V dC d ,w' d,w'
Sum of all pseudo counts

– What if =0? What if =+?
– A consequence of using conjugate prior is that the
prior can be converted into “pseudo data” which
can then be “merged” with the actual data for
parameter estimation
CS@UVa 20
Some background knowledge
• Conjugate prior Gaussian -> Gaussian
– Posterior dist in the same Beta -> Binomial
Dirichlet -> Multinomial
family as prior
• Dirichlet distribution
– Continuous
– Samples from it will be the
parameters in a
multinomial distribution
CS@UVa 21
Prior as pseudo counts
Observed Doc(s)
the 0.2
Known a 0.1
Background we 0.01
p(w | B) to 0.02
…
MAP
Unknown … Suppose, Estimator
topic model text =? we know
mining =?
p(w|1)=? association =? the identity
word =? of each
“Text mining” …
word ...
…
Pseudo Doc
Unknown …
topic model information =?
p(w|2)=? retrieval =?
query =? Size = μ
document =? text
“information
retrieval” …
CS@UVa
mining 22
Deficiency of pLSA
• Not a fully generative model
– Can’t compute probability of a new document
• Topic coverage p(π|d) is per-document estimated
– Heuristic workaround is possible
• Many parameters  high complexity of
models
– Many local maxima
– Prone to overfitting
CS@UVa 23
Latent Dirichlet Allocation [Blei et al. 02]
• Make pLSA a fully generative model by
imposing Dirichlet priors
– Dirichlet priors over p(π|d)
– Dirichlet priors over p(w|θ)
– A Bayesian version of pLSA
• Provide mechanism to deal with new
documents
– Flexible to model many other observations in a
document
CS@UVa 24
LDA = Imposing Prior on PLSA
pLSA: Topic coverage {d,j } are free for tuning
Topic coverage d,j is specific to each in document d
“training document”, thus can’t be
used to generate a new document “Generating” word w
1 d,1 in doc d in the collection
2 W
d,2
LDA: d, k
Topic coverage distribution {d,j } for k
any document is sampled from a
Dirichlet distribution, allowing for {d,j } are regularized
generating a new doc
 
p( d )  Dirichlet( ) Magnitudes of  and 
determine the variances of the prior,
In addition, the topic word distributions thus also the concentration of prior
{j } are also drawn from another (larger  and   stronger prior)
 
Dirichlet prior p( )
i  Dirichlet( )
CS@UVa 25
pLSA v.s. LDA
pLSA k
pd (w |{j },{d , j })  d , j p(w | )j

Core assumption
j1 in all topic models
k
log p(d |{j},{d, j })   c(w, d ) log[ d,j p(w |j )]
wV j1
log p(C | {j},{d, j })   log p(d | {j},{d, j })

dC pLSA component
LDA k
pd (w |{j },{d , j })  d , j p(w |j )
j1
k
  |)d
  d, j
  
log p(d | ,{})j  c(w, d ) log[ p(w | j )] p(
d d
wV j1
   k 
log p(C |,)    log p(d | ,{})  p( | 
j j )d...d
1  k
dC j1 Regularization
added by LDA
CS@UVa 26
LDA as a graphical model [Blei et al. 03a]

distribution over topics
Dirichlet priors for each document
(same as d on the previous slides)
 (d)  (d)  Dirichlet()


topic assignment
distribution over words for
for each word
each topic
(same as  on zi  Discrete( (d) )

j the previous slides)  (j) zi
T word generated from
 (j)  Dirichlet()
assigned topic
wi  Discrete( (zi) )
wi
Nd D
  
Most approximate inference algorithms aim to infer p(z i | w , ,  )
from which other interesting variables can be easily computed
CS@UVa 27
Approximate inferences for LDA
• Deterministic approximation
– Variational inference
– Expectation propagation
• Markov chain Monte Carlo
– Full Gibbs sampler
– Collapsed Gibbs sampler
Most efficient, and quite popular, but can only work with conjugate prior
CS@UVa 28
Collapsed Gibbs sampling [Griffiths & Steyvers 04]
• Using conjugacy between Dirichlet and multinomial
distributions, integrate out continuous random
variables
D
(d )
 (n j   ) (T )
P(z)   P(z | ) p()d  
j
d 1 ( )T ( n (dj )   )

  )
j
T (n (w)
(W)
P(w | z)   P(w | z, ) p()d   w j
j1 ()W ( n (w)

j  )
w
• Define a distribution on topic assignment z
P(w | z)P(z)
With fixed P(z | w)
assignment of z  P(w | z)P(z)
z
CS@UVa 29
Collapsed Gibbs sampling [Griffiths & Steyvers 04]
All the other words
• Sample each zi conditioned on z -i beside zi
nw( izi )   n (dj i ) 

P(z i | w, z i ) n( zi )  W n (d i )  T
 
Word-topic distribution Topic proportion
– Implementation: counts can be cached in two sparse
matrices; no special functions, simple arithmetic
– Distributions on  and  can be analytic computed
given z and w
CS@UVa
30
Gibbs sampling in LDA
iteration
1
i wi di zi
1 MATHEMATICS 1 2
2 KNOWLEDGE 1 2
3 RESEARCH 1 1
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
CS@UVa 31
iteration
1 2
i wi di zi zi
1 MATHEMATICS 1 2 ?
2 KNOWLEDGE 1 2
3 RESEARCH 1 1
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
CS@UVa 32
Gibbs sampling in LDA iteration
1 2
i wi di zi zi
1 MATHEMATICS 1 2 ?
2 KNOWLEDGE 1 2
3 RESEARCH 1 1
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1 words in di assigned with topic j
12 KNOWLEDGE 2 1
. . . .
. . . .
.
Count of instances where w i is
. . .
50 JOY 5 2 assigned with topic j
Count of all words

assigned with topic j
CS@UVa 33
words in di assigned with any topic
iteration
1 2
i wi di zi zi
1 MATHEMATICS 1 2 ?
2 KNOWLEDGE 1 2
3 RESEARCH 1 1
4 WORK 1 2 What’s the most likely topic for wi in di?
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2 How likely would di choose topic j?
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . . How likely would topic j
. . . .
. . . .
generate word wi ?
50 JOY 5 2
CS@UVa 34
iteration
1 2
i wi di zi zi
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 ?
3 RESEARCH 1 1
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
CS@UVa 35
iteration
1 2
i wi di zi zi
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 1
3 RESEARCH 1 1 ?
4 WORK 1 2
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
CS@UVa 36
iteration
1 2
i wi di zi zi
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 1
3 RESEARCH 1 1 1
4 WORK 1 2 ?
5 MATHEMATICS 1 1
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
CS@UVa 37
iteration
1 2
i wi di zi zi
1 MATHEMATICS 1 2 2
2 KNOWLEDGE 1 2 1
3 RESEARCH 1 1 1
4 WORK 1 2 2
5 MATHEMATICS 1 1 ?
6 RESEARCH 1 2
7 WORK 1 2
8 SCIENTIFIC 1 1
9 MATHEMATICS 1 2
10 WORK 1 1
11 SCIENTIFIC 2 1
12 KNOWLEDGE 2 1
. . . .
. . . .
. . . .
50 JOY 5 2
CS@UVa 38
iteration
1 2 … 1000
i wi di zi zi zi
1 MATHEMATICS 1 2 2 2
2 KNOWLEDGE 1 2 1 2
3 RESEARCH 1 1 1 2
4 WORK 1 2 2 1
6 RESEARCH 1 2 2 2
7 WORK 1 2 2 2
8 SCIENTIFIC 1 1 1 … 1
10 WORK 1 1 2 2
11 SCIENTIFIC 2 1 1 2
12 KNOWLEDGE 2 1 2 2
. . . . . .
. . . . . .
. . . . . .
50 JOY 5 2 1 1
CS@UVa 39
Topics learned by LDA
CS@UVa 40
Topic assignments in document
• Based on the topics shown in last slide
CS@UVa 41
Application of learned topics
• Document classification
– A new type of feature representation
CS@UVa 42
Application of learned topics
• Collaborative filtering
– A new type of user profile
CS@UVa 43
Outline
4. Summary
CS@UVa 44
Supervised Topic Model [Blei & McAuliffe, NIPS’02]
• A generative model for classification

– Topic generates both words and labels
CS@UVa 45
Sentiment polarity of topics
Sentiment polarity learned

from classification model
CS@UVa 46
Author Topic Model [Rosen-Zvi UAI’04]
• Authorship determines the topic mixture
Each author chooses

his/her topic to
contribute in the
document
CS@UVa 47
Learned association between words
and authors
CS@UVa 48
Collaborative Topic Model [Wang & Blei,KDD’11]
• Collaborative filtering in topic space

– User’s preference over topics determines his/her
rating for the item
Topics for the item, shifted

with random noise
User profile over topical space

CS@UVa 49
Topic-based recommendation
CS@UVa 50
Correspondence Topic Model [Blei SIGIR’03]
• Simultaneously modeling the generation of

multiple types of observations
– E.g., image and corresponding text annotations
Correspondence part (can be
described with different distributions)
LDA part
CS@UVa 51
Annotation results
CS@UVa 52
Annotation results
CS@UVa 53
Dynamic Topic Model [Blei ICML’06]
• Capture the evolving topics over time
Markov
assumption
about the
topic
dynamics
CS@UVa 54
Evolution of topics
CS@UVa 55
Polylingual Topic Models [Mimmo et al., EMNLP’09]
• Assumption: topics are universal over

languages
– Correspondence between documents are known
• E.g., news report about the same event in different
languages
CS@UVa 56
Topics learned in different languages
CS@UVa 57
Correlated Topic Model [Blei & Lafferty, Annals
of Applied Stat’07]
• Non-conjugate priors to capture correlation

between topics
Gaussian as the prior for topic proportion

(increase the computational complexity)
CS@UVa 58
Learned structure of topics
CS@UVa 59
Hierarchical Topic Models [Blei et al. NIPS’04]
• Nested Chinese restaurant process as a prior

for topic assignment
CS@UVa 60
Hierarchical structure of topics
CS@UVa 61
Outline
4. Summary
CS@UVa 62
Summary
• Probabilistic Topic Models are a new family of document
modeling approaches, especially useful for
– Discovering latent topics in text
– Analyzing latent structures and patterns of topics
– Extensible for joint modeling and analysis of text and associated non-
textual data
• pLSA & LDA are two basic topic models that tend to function
similarly, with LDA better as a generative model
• Many different models have been proposed with probably
many more to come
• Many demonstrated applications in multiple domains and
many more to come
CS@UVa 63
Summary
• However, all topic models suffer from the problem of multiple local
maxima
– Make it hard/impossible to reproduce research results
– Make it hard/impossible to interpret results in real applications
• Complex models can’t scale up to handle large amounts of text data
– Collapsed Gibbs sampling is efficient, but only working for conjugate priors
– Variational EM needs to be derived in a model-specific way
– Parallel algorithms are promising
• Many challenges remain….
CS@UVa 64

Topic Models in Natural Language Processing

Uploaded by

Copyright:

Topic Models in Natural Language Processing

Uploaded by

Document Information

Original Title

Copyright

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Topic Models in Natural Language Processing

Uploaded by

Copyright:

Probabilistic Topic Models

Topic: A broad concept/theme,

• Topic: a multinomial distribution over words

B Assume words in d are from

the 0.03 B d the 0.031

Background Topic: p(w|B) Document Topic: p(w|d)

p(w)  p(w | B )  (1  ) p(w |  )

log p(d |  )  c(w, d ) log[p(w | B )  (1  ) p(w |  )]

Expectation Maximization ˆ arg max log p(d | )

• Generalize the two topic mixture to k topics

• Topic: a multinomial distribution over words

Sum over all docs Fractional counts contributing to

P(w| θ) Topic 1 Topic 2 3, π4,

• Choosing conjugate priors Pseudo counts of w from prior ’

Sum of all pseudo counts

pd (w |{j },{d , j })  d , j p(w | )j

log p(C | {j},{d, j })   log p(d | {j},{d, j })

 (d)  (d)  Dirichlet()

(same as  on zi  Discrete( (d) )

d 1 ( )T ( n (dj )   )

j1 ()W ( n (w)

nw( izi )   n (dj i ) 

Count of all words

• A generative model for classification

Sentiment polarity learned

Each author chooses

• Collaborative filtering in topic space

Topics for the item, shifted

User profile over topical space

• Simultaneously modeling the generation of

• Assumption: topics are universal over

• Non-conjugate priors to capture correlation

Gaussian as the prior for topic proportion

• Nested Chinese restaurant process as a prior

You might also like