NLP: Naive Bayes
Felipe Bravo-Marquez
Text Classification: Definition
Input:
• A document d
• A fixed set of classes C = {c1, c2, . . . , cJ}
Output: A predicted class c ∈ C
Classification Methods: Hand-coded rules
Classification Methods: Supervised Machine Learning
Input:
• A document d
• A fixed set of classes C = {c1, c2, . . . , cJ}
• A training set of m hand-labeled documents: (d1, c1), (d2, c2), . . . , (dm, cm)
Output:
• A learned classifier γ : d → c
Classification with Generative Models
  p(y|x) = p(y) p(x|y) / p(x),   where   p(x) = Σ_y p(y) p(x|y)
  f(x) = argmax_y p(y|x) = argmax_y p(y) p(x|y) / p(x) = argmax_y p(y) p(x|y)
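As a toy illustration (not from the slides), the following Python sketch applies this rule, f(x) = argmax_y p(y) p(x|y), to a two-class problem; the probability tables are made-up values.

# Toy generative classifier: pick the class y maximizing p(y) * p(x | y).
# The probability tables below are invented for illustration only.
prior = {"pos": 0.5, "neg": 0.5}                      # p(y)
likelihood = {                                         # p(x | y) for a single feature x
    "pos": {"good": 0.6, "bad": 0.1, "ok": 0.3},
    "neg": {"good": 0.1, "bad": 0.7, "ok": 0.2},
}

def classify(x):
    # argmax over classes of p(y) * p(x | y); p(x) is constant and can be dropped
    return max(prior, key=lambda y: prior[y] * likelihood[y][x])

print(classify("good"))   # -> "pos"
print(classify("bad"))    # -> "neg"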
Naive Bayes Intuition
• MAP stands for "maximum a posteriori" and denotes the most likely class:
  cMAP = argmax_{c ∈ C} P(c|d) = argmax_{c ∈ C} P(d|c) P(c) / P(d)
• Finally, we can drop the denominator P(d), since it is constant across classes:
  cMAP = argmax_{c ∈ C} P(d|c) P(c)
• Representing the document d by features x1, x2, . . . , xn, the term P(x1, x2, . . . , xn | c) is the "likelihood" of the features given the class.
• The term P(c) is the "prior" probability of the class.
Naive Bayes Classifier
• The Naive Bayes classifier [McCallum et al., 1998] computes the MAP estimate from the likelihood and the prior:
  cMAP = argmax_{c ∈ C} P(x1, x2, . . . , xn | c) P(c)
• The probability of the features given the class, P(x1 , x2 , . . . , xn |c), can be
estimated by counting the relative frequencies in a corpus.
• The prior probability of the class, P(c), represents how often this class occurs.
• Without some simplifying assumptions, estimating the probability of every
possible combination of features in P(x1 , x2 , . . . , xn |c) would require huge
numbers of parameters and impossibly large training sets.
• Naive Bayes classifiers therefore make two simplifying assumptions.
Multinomial Naive Bayes Independence Assumptions
P(x1 , x2 , . . . , xn |c) = P(x1 |c) · P(x2 |c) · P(x3 |c) · . . . · P(xn |c)
Multinomial Naive Bayes Classifier
• The Maximum A Posteriori (MAP) estimate for class c in the Multinomial Naive Bayes classifier combines the class prior with the word likelihoods.
• Applied to text classification, where positions ranges over the word positions in the test document:
  cNB = argmax_{cj ∈ C} P(cj) Π_{i ∈ positions} P(xi | cj)
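In practice, the product over positions is computed as a sum of log probabilities to avoid numerical underflow. Below is a minimal sketch of this decision rule, assuming precomputed logprior and loglikelihood tables (these names are my own, not from the slides).

import math

def naive_bayes_predict(doc_tokens, classes, logprior, loglikelihood):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ].

    logprior[c]          -- log P(c)
    loglikelihood[c][w]  -- log P(w | c); words outside the vocabulary are skipped.
    """
    best_class, best_score = None, -math.inf
    for c in classes:
        score = logprior[c]
        for w in doc_tokens:
            if w in loglikelihood[c]:        # ignore unknown words
                score += loglikelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

Working in log space leaves the argmax unchanged, since the logarithm is monotonic.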
• The prior probability of class cj is estimated as:
  P̂(cj) = Ncj / Ntotal
  where Ncj is the number of documents in class cj and Ntotal is the total number of documents.
• The estimate of the probability of word wi given class cj is calculated as:
  P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
To estimate the parameters of the Multinomial Naive Bayes model, we follow these steps (a counting sketch in Python follows the list):
• Create a mega-document for each topic cj by concatenating all the training documents of that topic.
• Count how often each word wi occurs in this mega-document.
• The estimated probability P̂(wi | cj) is then the fraction of times wi appears among all word tokens in the mega-document of topic cj:
  P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
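These counting steps can be sketched with Python's collections.Counter; the toy corpus and variable names below are illustrative assumptions.

from collections import Counter, defaultdict

# Toy labeled corpus: (document, class) pairs.
training = [
    ("just plain boring", "neg"),
    ("very powerful", "pos"),
]

# "Mega-document" word counts per class, i.e. count(w, c).
counts = defaultdict(Counter)
for doc, c in training:
    counts[c].update(doc.split())

# Maximum-likelihood estimate: P^(w | c) = count(w, c) / sum over w' of count(w', c).
def mle(w, c):
    total = sum(counts[c].values())
    return counts[c][w] / total

print(mle("boring", "neg"))    # 1/3
print(mle("powerful", "pos"))  # 1/2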
• Suppose, however, that the word "fantastic" never occurs in a positive document in the training data. Its count is zero, which leads to a zero probability estimate:
  P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0
• With a zero probability for a word, the entire expression becomes zero,
regardless of other evidence.
Laplace (Add-1) Smoothing for Naive Bayes
  P̂(wi | c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1) = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
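Continuing the counting sketch above, add-1 smoothing only changes the estimate; a minimal illustration (the laplace helper and toy counts are assumptions, not code from the slides):

from collections import Counter, defaultdict

# count(w, c) tables and vocabulary V built from a toy training set.
counts = defaultdict(Counter)
counts["pos"].update("very powerful".split())
counts["neg"].update("just plain boring".split())
vocab = {w for c in counts for w in counts[c]}

def laplace(w, c):
    # P^(w | c) = (count(w, c) + 1) / (sum over w' of count(w', c) + |V|)
    total = sum(counts[c].values())
    return (counts[c][w] + 1) / (total + len(vocab))

print(laplace("fantastic", "pos"))  # unseen word now gets a non-zero probability
print(laplace("powerful", "pos"))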
  P̂(cj) = |docsj| / |total number of documents|
• To calculate P(wk | cj ), we need to extract the vocabulary V from the training
corpus.
Unknown Words
Words that appear in the test data but not in the training vocabulary (unknown words) are simply removed from the test document and ignored.
Stop Words
Stop words are frequently used words like "the" and "a" that are often considered to have little or no significance for text classification. Some systems choose to ignore stop words in the classification process. Here is how a stopword list is typically built (a small sketch follows the list):
• Sort the vocabulary by word frequency in the training set.
• Create a stopword list by selecting the top 10 or 50 most frequent words.
• Remove all stop words from both the training and test sets, treating them as if
they were never there.
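A small sketch of how such a frequency-based stopword list could be built (toy data, purely illustrative):

from collections import Counter

train_docs = ["the movie was the best", "a film with a plot"]  # toy training set

# Sort the training vocabulary by frequency and keep the top-k words.
freq = Counter(w for doc in train_docs for w in doc.split())
stopwords = {w for w, _ in freq.most_common(3)}   # top 10-50 in practice

def remove_stopwords(doc):
    return [w for w in doc.split() if w not in stopwords]

print(stopwords)
print(remove_stopwords("the film was a hit"))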
However, removing stop words doesn’t usually improve the performance of Naive
Bayes classifiers. Therefore, in practice, most Naive Bayes algorithms use all words
and do not utilize stopword lists.
Worked Sentiment Example
Training data:
Category Text
Negative Just plain boring, entirely predictable and lacks energy.
Negative No surprises and very few laughs.
Positive Very powerful.
Positive The most fun film of the summer.
Test:
Category Text
? Predictable with no fun.
Worked Sentiment Example
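The worked computation can be reproduced with the sketch below (my own code, not from the slides): priors from document counts, add-1 smoothed likelihoods with |V| = 20, and the unknown test word "with" dropped.

from collections import Counter, defaultdict
import math

training = [
    ("just plain boring entirely predictable and lacks energy", "neg"),
    ("no surprises and very few laughs", "neg"),
    ("very powerful", "pos"),
    ("the most fun film of the summer", "pos"),
]
test = "predictable with no fun"

# Count documents per class and word occurrences per class.
doc_counts, word_counts = Counter(), defaultdict(Counter)
for doc, c in training:
    doc_counts[c] += 1
    word_counts[c].update(doc.split())
vocab = {w for c in word_counts for w in word_counts[c]}   # |V| = 20

scores = {}
for c in doc_counts:
    # log prior: P(c) = N_c / N_total (here 2/4 for each class)
    score = math.log(doc_counts[c] / sum(doc_counts.values()))
    total = sum(word_counts[c].values())                   # 14 for neg, 9 for pos
    for w in test.split():
        if w not in vocab:                                 # "with" is unknown: drop it
            continue
        # add-1 smoothed likelihood
        score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    scores[c] = score

for c, s in scores.items():
    print(c, math.exp(s))   # neg ≈ 5.1e-05, pos ≈ 4.1e-05

The negative class receives the higher score, so the test sentence "Predictable with no fun." is labeled negative.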
Naive Bayes as a Language Model
• When using individual word features and considering all words in the text, naive
Bayes has an important similarity to language modeling.
• Specifically, a naive Bayes model can be viewed as a set of class-specific unigram language models: the parameters for each class instantiate a separate unigram language model.
• The likelihood features from the naive Bayes model assign a probability to each
word P(word|c), and the model also assigns a probability to each sentence:
  P(s|c) = Π_{i ∈ positions} P(wi | c)
Consider a naive Bayes model with the classes positive (+) and negative (-) and the
following model parameters:
w P(w|+) P(w|−)
I 0.1 0.2
love 0.1 0.001
this 0.01 0.01
fun 0.05 0.005
film 0.1 0.1
... ... ...
Naive Bayes as a Language Model
• Each of the two columns above instantiates a language model that can assign a probability to the sentence "I love this fun film":
  P("I love this fun film" | +) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
  P("I love this fun film" | −) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 0.0000000010
• As it happens, the positive model assigns a much higher probability to the sentence: P(s|+) > P(s|−).
• Note that this is just the likelihood part of the naive Bayes model; once we
multiply in the prior, a full naive Bayes model might well make a different
classification decision.
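The two sentence probabilities can be checked with a few lines of Python; the dictionaries simply transcribe the table above.

from math import prod

# P(w | class) taken from the table above.
p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
p_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".split()
print(prod(p_pos[w] for w in sentence))   # ≈ 5e-07
print(prod(p_neg[w] for w in sentence))   # ≈ 1e-09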
Evaluation
Recall:
  Recall = TP / (TP + FN)
Precision:
  Precision = TP / (TP + FP)
Accuracy:
  Accuracy = (TP + TN) / (TP + FP + TN + FN)
Evaluation: Precision and Recall
Precision measures the percentage of items identified as positive by the system that are in fact positive.
  Precision = True Positives / (True Positives + False Positives)
Recall measures the percentage of items that were correctly identified by the system out of all the items that should have been identified.
  Recall = True Positives / (True Positives + False Negatives)
Why Precision and Recall?
Precision and recall can be combined into a single measure, the F-measure, which trades off precision P against recall R:
  Fβ = (β² + 1) P R / (β² P + R)
With β = 1 (the balanced F1 measure), this reduces to F1 = 2 P R / (P + R).
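These definitions translate directly into code; a small sketch with invented confusion-matrix counts:

def metrics(tp, fp, tn, fn, beta=1.0):
    # Precision, recall, accuracy, and F_beta from the confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, accuracy, f_beta

# Example counts (invented numbers).
print(metrics(tp=30, fp=10, tn=50, fn=10))
# -> precision 0.75, recall 0.75, accuracy 0.8, F1 0.75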
• Cross-validation allows us to use all our data for training and testing without
having a fixed training set, devset, and test set.
• We choose a number k and partition our data into k disjoint subsets called folds.
• For each iteration, one fold is selected as the test set while the remaining k − 1
folds are used to train the classifier.
• We compute the error rate on the test set and repeat this process k times.
• Finally, we average the error rates from the k runs to obtain an overall error rate (a minimal sketch follows below).
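A minimal k-fold cross-validation loop in plain Python; the train_and_evaluate callable is a placeholder assumption standing in for training a classifier and measuring its error rate on the held-out fold.

import random

def cross_validate(data, k, train_and_evaluate, seed=0):
    """Average error rate over k folds.

    data               -- list of labeled examples
    train_and_evaluate -- callable(train_set, test_set) -> error rate on test_set
    """
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k disjoint subsets
    errors = []
    for i in range(k):
        test_idx = set(folds[i])
        test_set = [data[j] for j in folds[i]]
        train_set = [data[j] for j in indices if j not in test_idx]
        errors.append(train_and_evaluate(train_set, test_set))
    return sum(errors) / k

# Dummy usage: an "evaluator" that just returns a constant error rate.
print(cross_validate(list(range(20)), k=5,
                     train_and_evaluate=lambda tr, te: 0.1))   # -> 0.1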
Cross-validation: Multiple Splits