
Natural Language Processing

Text Classification and Naïve Bayes

Felipe Bravo-Marquez

June 20, 2023


Text Classification

• Classification lies at the heart of both human and machine intelligence.


• Deciding what letter, word, or image has been presented to our senses,
recognizing faces or voices, sorting mail, assigning grades to homeworks.
• These are all examples of assigning a category to an input.
• The goal of classification is to take a single observation, extract some useful
features, and thereby classify the observation into one of a set of discrete
classes.
• Most cases of classification in language processing are done via supervised
machine learning.
• These slides are based on the course material by Daniel Jurafsky:
https://web.stanford.edu/~jurafsky/slp3/4.pdf
Example 1: Spam Classification
Example 2: Who wrote which Federalist papers?
• 1787-1788: Anonymous essays attempted to convince New York to ratify the U.S.
Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters is in dispute.
• 1963: Solved by Mosteller and Wallace using Bayesian methods.

Figures: James Madison and Alexander Hamilton.
Example 3: What is the subject of this medical article?
Example 4: Positive or negative movie review?

• + ...zany characters and richly applied satire, and some great plot twists
• − It was pathetic. The worst part about it was the boxing scenes...
• + ...awesome caramel sauce and sweet toasty almonds. I love this place!
• − ...awful pizza and ridiculously overpriced...
Why sentiment analysis?

• Movie: Is this review positive or negative?


• Products: What do people think about the new iPhone?
• Public sentiment: How is consumer confidence?
• Politics: What do people think about this candidate or
issue?
• Prediction: Predict election outcomes or market trends
from sentiment.
Basic Sentiment Classification

Sentiment analysis is the detection of attitudes.


• Simple task we focus on in this class
• Is the attitude of this text positive or negative?
Summary: Text Classification

Text classification can be applied to various tasks, including:


• Sentiment analysis
• Spam detection
• Authorship identification
• Language identification
• Assigning subject categories, topics, or genres
• ...
Text Classification: Definition

Input:
• A document d
• A fixed set of classes C = {c1 , c2 , . . . , cJ }
Output: A predicted class c ∈ C
Classification Methods: Hand-coded rules

Rules based on combinations of words or other features


• Spam: black-list-address OR (“dollars” AND “you have
been selected” )
• Accuracy can be high if rules carefully refined by experts
• But building and maintaining these rules is expensive
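As a rough illustration (not part of the original slides), such a rule can be written directly as code; the blacklist and trigger phrases below are hypothetical:

```python
# Hand-coded spam rule: blacklisted sender OR ("dollars" AND "you have been selected").
# The blacklist and phrases are illustrative, not from a real system.
BLACKLISTED_ADDRESSES = {"spam@example.com"}

def is_spam(sender: str, text: str) -> bool:
    text = text.lower()
    from_blacklist = sender.lower() in BLACKLISTED_ADDRESSES
    scam_phrases = "dollars" in text and "you have been selected" in text
    return from_blacklist or scam_phrases

print(is_spam("friend@mail.com", "You have been selected to receive 1000 dollars"))  # True
print(is_spam("friend@mail.com", "Lunch tomorrow?"))                                 # False
```

Rules like this can be precise, but every new pattern requires another manually written and maintained condition.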
Classification Methods: Supervised Machine Learning

Input:
• A document d
• A fixed set of classes C = {c1 , c2 , . . . , cJ }
• A training set of m hand-labeled documents:
(d1 , c1 ), (d2 , c2 ), . . . , (dm , cm )
Output:
• A learned classifier γ : d → c
Classification Methods: Supervised Machine Learning

Any kind of classifier can be used:


• Naïve Bayes
• Logistic regression
• Neural networks
• k-Nearest Neighbors
Supervised Learning Problems

• We have training examples (x^(i), y^(i)) for i = 1, ..., m. Each x^(i) is an input, each y^(i) is a label.
• The task is to learn a function f mapping inputs x to labels f(x).
• Conditional models:
• Learn a distribution p(y|x) from training examples.
• For any test input x, define f(x) = arg max_y p(y|x).
Generative Models

• Given training examples (x^(i), y^(i)) for i = 1, ..., m, the task is to learn a function f that maps inputs x to labels f(x).
• Generative models:
• Learn the joint distribution p(x, y ) from the training
examples.
• Often, we have p(x, y ) = p(y )p(x|y ).
• Note: We then have

p(y|x) = p(y)p(x|y) / p(x),  where  p(x) = Σ_y p(y)p(x|y).
Classification with Generative Models

• Given training examples (x^(i), y^(i)) for i = 1, ..., m, the task is to learn a function f that maps inputs x to labels f(x).
• Generative models:
• Learn the joint distribution p(x, y ) from the training
examples.
• Often, we have p(x, y ) = p(y )p(x|y ).
• Output from the model:

f(x) = arg max_y p(y|x) = arg max_y p(y)p(x|y) / p(x) = arg max_y p(y)p(x|y)
Naive Bayes Intuition

Naive Bayes is a simple (”naive”) classification method based


on Bayes’ rule.
• Relies on a very simple representation of a document: Bag
of words
The Bag of Words Representation
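As a minimal sketch (not from the slides), a bag-of-words representation can be built by counting word occurrences while discarding word order; the whitespace tokenizer below is a simplification:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase and split on whitespace; real tokenizers also handle punctuation, etc.
    return Counter(text.lower().split())

print(bag_of_words("it was great it was fun"))
# Counter({'it': 2, 'was': 2, 'great': 1, 'fun': 1})
```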
Bayes’ Rule Applied to Documents and Classes

For a document d and a class c:


P(c|d) = P(d|c)P(c) / P(d)
Naive Bayes Classifier (I)

• MAP stands for ”maximum a posteriori,” which represents the most likely class:

cMAP = arg max_{c∈C} P(c|d)

• To calculate the most likely class, we apply Bayes' rule:

cMAP = arg max_{c∈C} P(d|c)P(c) / P(d)

• Finally, we can drop the denominator since it remains constant for all classes:

cMAP = arg max_{c∈C} P(d|c)P(c)
Naive Bayes Classifier (II)

• To classify document d, we use the MAP estimate:

cMAP = arg max_{c∈C} P(d|c)P(c)

• The document d is represented as a set of features x1 , x2 , . . . , xn .


• The classifier calculates the conditional probability of the features given a class
and the prior probability of the class:

cMAP = arg max_{c∈C} P(x1, x2, ..., xn | c) P(c)

• The term P(x1 , x2 , . . . , xn |c) represents the ”likelihood” of the features given the
class.
• The term P(c) represents the ”prior” probability of the class.
Naïve Bayes Classifier (IV)

• The Naïve Bayes classifier [McCallum et al., 1998] calculates the MAP estimate
by considering the likelihood and prior probabilities:

cMAP = arg max_{c∈C} P(x1, x2, ..., xn | c) P(c)

• The probability of the features given the class, P(x1 , x2 , . . . , xn |c), can be
estimated by counting the relative frequencies in a corpus.
• The prior probability of the class, P(c), represents how often this class occurs.
• Without some simplifying assumptions, estimating the probability of every
possible combination of features in P(x1 , x2 , . . . , xn |c) would require huge
numbers of parameters and impossibly large training sets.
• Naive Bayes classifiers therefore make two simplifying assumptions.
Multinomial Naive Bayes Independence Assumptions

• Bag of Words assumption: We assume that the position of words in the


document does not matter.
• Conditional Independence assumption: We assume that the feature probabilities
P(xi |cj ) are independent given the class cj .
• In the Multinomial Naive Bayes classifier, the probability of a document with
features x1 , x2 , . . . , xn given class c can be calculated as:

P(x1, x2, ..., xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · ... · P(xn|c)
Multinomial Naive Bayes Classifier

• The Maximum A Posteriori (MAP) estimate for class c in the Multinomial Naive
Bayes classifier is given by:

cMAP = arg max_{c∈C} P(x1, x2, ..., xn | c) P(c)

• Alternatively, we can write it as:

cNB = arg max_{c∈C} P(c) ∏_{x∈X} P(x|c)

• P(c) represents the prior probability of class c.
• ∏_{x∈X} P(x|c) represents the likelihood of the features x1, x2, ..., xn given class c.
Applying Multinomial Naive Bayes Classifiers to Text Classification

• The Multinomial Naive Bayes classifier for text classification can be applied as follows:

cNB = arg max_{cj∈C} P(cj) ∏_{i∈positions} P(xi|cj)

• cNB represents the predicted class for the test document.
• C is the set of all possible classes.
• P(cj) is the prior probability of class cj.
• ∏_{i∈positions} P(xi|cj) is the likelihood of the word xi observed at each position i given class cj.
• The product is taken over all word positions in the test document.
Problems with Multiplying Lots of Probabilities

• Multiplying lots of probabilities can result in floating-point underflow, especially


when dealing with small probabilities.
• Example: 0.0006 × 0.0007 × 0.0009 × 0.01 × 0.5 × 0.000008 . . .
• Idea: Use logarithms, as log(ab) = log(a) + log(b).
• Instead of multiplying probabilities, we can sum the logarithms of probabilities.
• The Multinomial Naive Bayes classifier can be expressed using logarithms as follows:

cNB = arg max_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi|cj) ]

• By taking logarithms, we avoid the issue of floating-point underflow and perform


calculations in the log space.
• The classifier thus becomes a linear model: the prediction is the argmax of a sum of weights, namely the log prior plus the log conditional probabilities of the observed words.
• Thus, Naïve Bayes is a linear classifier, operating in log space.
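As a minimal sketch of this log-space formulation (not the authors' code), the snippet below scores a document by summing a log prior and the log likelihoods of its known words; the parameter values are hypothetical:

```python
import math

# Hypothetical, already-estimated parameters for two classes.
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    "pos": {"fun": math.log(0.05), "boring": math.log(0.001)},
    "neg": {"fun": math.log(0.005), "boring": math.log(0.05)},
}

def predict(words, classes=("pos", "neg")):
    scores = {}
    for c in classes:
        # Sum of logs instead of a product of probabilities: avoids underflow.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in words if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

print(predict(["fun", "fun", "boring"]))  # 'pos' under these toy parameters
```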
Learning the Multinomial Naive Bayes Model
First attempt: Maximum Likelihood Estimates
• The probabilities are estimated using the observed counts in the training data.
• The prior probability of a class cj is estimated as:

P̂(cj) = Ncj / Ntotal

where Ncj is the number of documents in class cj and Ntotal is the total number of
documents.
• The estimate of the probability of word wi given class cj is calculated as:

P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

where w ∈ V represents a word in the vocabulary V.


• The denominator is the sum of counts of all words in the vocabulary within class
cj .
Parameter Estimation

To estimate the parameters of the Multinomial Naive Bayes model, we follow these
steps:
• Create a mega-document for each topic cj by concatenating all the documents in
that topic.
• We calculate the frequency of word wi in the mega-document, which represents
the fraction of times word wi appears among all words in the documents of topic
cj .
• The estimated probability P̂(wi |cj ) of word wi given class cj is obtained by
dividing the count of occurrences of wi in the mega-document of topic cj by the
total count of words in the mega-document:

P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

Here, count(wi, cj) is the number of times word wi appears in the mega-document of topic cj, and Σ_{w∈V} count(w, cj) is the total count of words in that mega-document.
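The following sketch (toy data of my own choosing) estimates P̂(wi|cj) exactly this way: all documents of a class are merged into one mega-document of counts, and the estimate is a relative frequency. No smoothing is applied yet, so unseen words get probability zero:

```python
from collections import Counter

# Toy labeled documents; purely illustrative.
train = [
    ("pos", "very powerful"),
    ("pos", "the most fun film of the summer"),
    ("neg", "no surprises and very few laughs"),
]

# One "mega-document" of word counts per class.
mega_counts = {}
for label, text in train:
    mega_counts.setdefault(label, Counter()).update(text.lower().split())

def mle_word_prob(word, label):
    # Maximum likelihood estimate: count(w, c) / total word count in class c.
    counts = mega_counts[label]
    return counts[word] / sum(counts.values())

print(mle_word_prob("very", "pos"))  # 1/9 ≈ 0.111
print(mle_word_prob("fun", "neg"))   # 0.0 -- unseen in the negative class
```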
Problem with Maximum Likelihood
Zero probabilities and the issue of unseen words
• Consider the scenario where we have not encountered the word ”fantastic” in
any training documents classified as positive (thumbs-up).
• Using maximum likelihood estimation, the probability P̂(”fantastic” | positive)
would be calculated as:
P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive)

• In this case, the count of the word "fantastic" in positive documents is zero, leading to a zero probability:

P̂("fantastic" | positive) = 0 / Σ_{w∈V} count(w, positive) = 0

• However, zero probabilities cannot be conditioned away, regardless of the other


evidence present.
• This poses a problem when calculating the maximum a posteriori (MAP)
estimate, which is used for classification:
cMAP = arg max_c P̂(c) ∏_i P̂(xi | c)

• With a zero probability for a word, the entire expression becomes zero,
regardless of other evidence.
Laplace (Add-1) Smoothing for Naïve Bayes

Handling zero probabilities with Laplace (Add-1) smoothing


• To address the problem of zero probabilities, we can employ Laplace (Add-1)
smoothing technique.
• The smoothed estimate P̂(wi | c) is calculated as:

P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1) = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)

• Here, an additional count of 1 is added to every word count, in both the numerator and the denominator.
• The denominator therefore grows by the vocabulary size |V|, which ensures the probabilities remain properly normalized.
• By doing so, we prevent zero probabilities and allow some probability mass to be
distributed to unseen words.
• This smoothing technique helps to mitigate the issue of unseen words and
avoids the complete elimination of certain classes during classification.
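A minimal sketch of the smoothed estimate, with toy counts and a toy vocabulary of my own choosing:

```python
from collections import Counter

# Toy word counts for one class and a shared vocabulary.
counts_pos = Counter({"fun": 2, "great": 1})
vocabulary = {"fun", "great", "boring", "awful"}

def laplace_prob(word, counts, vocab):
    # Add-1 smoothing: (count(w, c) + 1) / (total count in class + |V|).
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(laplace_prob("fun", counts_pos, vocabulary))     # (2 + 1) / (3 + 4) ≈ 0.43
print(laplace_prob("boring", counts_pos, vocabulary))  # (0 + 1) / (3 + 4) ≈ 0.14, no longer zero
```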
Multinomial Naïve Bayes: Learning
Learning the Multinomial Naïve Bayes Model
• In order to learn the parameters of the model, we need to calculate the terms
P(cj ) and P(wk | cj ).
• For each class cj in the set of classes C, we perform the following steps:
• Retrieve all the documents docsj that belong to class cj .
• Calculate the term P(wk | cj) for each word wk in the vocabulary V:

P(wk | cj) = (nk + α) / (n + α · |Vocabulary|)

where nk is the number of occurrences of word wk in the concatenated document Textj, n is the total number of word tokens in Textj, and α is the smoothing parameter (α = 1 gives add-1 smoothing).
• Calculate the prior probability P(cj):

P(cj) = |docsj| / (total number of documents)
• To calculate P(wk | cj ), we need to extract the vocabulary V from the training
corpus.
Unknown Words

Dealing with unknown words in the test data:


• When we encounter unknown words in the test data that do not appear in the
training data or vocabulary, we ignore them.
• We remove these unknown words from the test document as if they were not
present at all.
• We do not assign any probability to these unknown words in the classification
process.
Why don’t we build an unknown word model?
• Building a separate model for unknown words is not generally helpful.
• Knowing which class has more unknown words does not provide useful
information for classification.
Stop Words

Stop words are frequently used words like ”the” and ”a” that are often considered to
have little or no significance in text classification. Some systems choose to ignore stop
words in the classification process. Here is how it is typically done:
• Sort the vocabulary by word frequency in the training set.
• Create a stopword list by selecting the top 10 or 50 most frequent words.
• Remove all stop words from both the training and test sets, treating them as if
they were never there.
However, removing stop words doesn’t usually improve the performance of Naive
Bayes classifiers. Therefore, in practice, most Naive Bayes algorithms use all words
and do not utilize stopword lists.
Worked Sentiment Example

Training data:

Category   Text
Negative   Just plain boring, entirely predictable and lacks energy.
Negative   No surprises and very few laughs.
Positive   Very powerful.
Positive   The most fun film of the summer.

Test:
Category Text
? Predictable with no fun.
Worked Sentiment Example
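A sketch of how this example can be worked through end to end: the code trains an add-1 smoothed Multinomial Naïve Bayes model on the four training sentences above and scores the test sentence, assuming lowercasing, stripping punctuation, and ignoring test words (like "with") that are absent from the training vocabulary:

```python
import math
import re
from collections import Counter, defaultdict

# Training data from the slide.
train = [
    ("neg", "Just plain boring, entirely predictable and lacks energy."),
    ("neg", "No surprises and very few laughs."),
    ("pos", "Very powerful."),
    ("pos", "The most fun film of the summer."),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Document counts per class and per-class word counts.
class_docs = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, text in train:
    word_counts[label].update(tokenize(text))
vocab = set().union(*word_counts.values())

def log_posterior(words, label):
    score = math.log(class_docs[label] / sum(class_docs.values()))  # log prior
    total = sum(word_counts[label].values())
    for w in words:
        if w not in vocab:  # unknown test words are ignored
            continue
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

test = tokenize("Predictable with no fun.")
print({c: round(log_posterior(test, c), 3) for c in ("pos", "neg")})
# With these four training documents, the negative class receives the higher score.
```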
Naive Bayes as a Language Model
• When using individual word features and considering all words in the text, naive
Bayes has an important similarity to language modeling.
• Specifically, a naive Bayes model can be viewed as a set of class-specific unigram language models, one per class.
• The likelihood features from the naive Bayes model assign a probability to each
word P(word|c), and the model also assigns a probability to each sentence:
P(s|c) = ∏_{i∈positions} P(wi|c)

Consider a naive Bayes model with the classes positive (+) and negative (-) and the
following model parameters:

w P(w|+) P(w|−)
I 0.1 0.2
love 0.1 0.001
this 0.01 0.01
fun 0.05 0.005
film 0.1 0.1
... ... ...
Naive Bayes as a Language Model

• Each of the two columns above instantiates a language model that can assign a
probability to the sentence ”I love this fun film”:

P(”I love this fun film”|+) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005

P(”I love this fun film”|−) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 0.0000000010
• As it happens, the positive model assigns a higher probability to the sentence:

P(s|pos) > P(s|neg)

• Note that this is just the likelihood part of the naive Bayes model; once we
multiply in the prior, a full naive Bayes model might well make a different
classification decision.
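As a small check of the computation above, the class-conditional unigram table can be dropped into a few lines of code (only the five words listed in the table are included):

```python
# Class-conditional unigram probabilities copied from the table above.
p_word = {
    "+": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "-": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}

def sentence_likelihood(words, c):
    prob = 1.0
    for w in words:
        prob *= p_word[c][w]  # P(s|c) = product over positions of P(w_i|c)
    return prob

sentence = ["I", "love", "this", "fun", "film"]
print(sentence_likelihood(sentence, "+"))  # ≈ 5e-07
print(sentence_likelihood(sentence, "-"))  # ≈ 1e-09
```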
Evaluation

• Let’s consider just binary text classification tasks.


• Imagine you’re the CEO of Delicious Pie Company.
• You want to know what people are saying about your pies.
• So you build a ”Delicious Pie” tweet detector with the
following classes:
• Positive class: tweets about Delicious Pie Co
• Negative class: all other tweets
The 2-by-2 Confusion Matrix
System Positive System Negative
Gold Positive True Positive (TP) False Negative (FN)
Gold Negative False Positive (FP) True Negative (TN)

Recall (also known as Sensitivity or True Positive Rate):

Recall = TP / (TP + FN)

Precision:

Precision = TP / (TP + FP)

Accuracy:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
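A minimal sketch of these three metrics, using hypothetical counts for the pie-tweet detector (70 true positives, 30 false positives, 30 false negatives out of one million tweets):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts over one million tweets.
tp, fp, fn, tn = 70, 30, 30, 999_870
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))
# 0.7 0.7 0.99994
```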
Evaluation: Accuracy

Why don’t we use accuracy as our metric?


Imagine we saw 1 million tweets:
• 100 of them talked about Delicious Pie Co.
• 999,900 talked about something else.
We could build a dumb classifier that just labels every tweet
”not about pie”:
• It would get 99.99% accuracy!!! Wow!!!!
• But it would be useless! It doesn’t return the comments we
are looking for!
That’s why we use precision and recall instead.
Evaluation: Precision and Recall

Precision measures the percentage of items the system


detected (i.e., items the system labeled as positive) that are in
fact positive (according to the human gold labels).

Precision = True Positives / (True Positives + False Positives)
Recall measures the percentage of items that were correctly
identified by the system out of all the items that should have
been identified.

Recall = True Positives / (True Positives + False Negatives)
Why Precision and Recall?

Consider our dumb pie-classifier that just labels nothing as


”about pie.”
• Accuracy = 99.99% (it correctly labels most tweets as not
about pie)
• Recall = 0 (it doesn’t detect any of the 100 pie-related
tweets)
Precision and recall, unlike accuracy, emphasize true positives:
• They focus on finding the things that we are supposed to
be looking for.
A Combined Measure: F-measure
The F-measure is a single number that combines precision (P)
and recall (R), defined as:

Fβ = (β² + 1)PR / (β²P + R)

The F-measure, defined with the parameter β, differentially


weights the importance of recall and precision.
• β > 1 favors recall
• β < 1 favors precision
When β = 1, precision and recall are weighted equally, and we have the balanced F1 measure:

F1 = 2PR / (P + R)
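A short sketch of Fβ, using hypothetical precision and recall values:

```python
def f_measure(p, r, beta=1.0):
    # F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.7, 0.5                           # hypothetical precision and recall
print(round(f_measure(p, r), 3))          # 0.583 (balanced F1)
print(round(f_measure(p, r, beta=2), 3))  # 0.53 (beta > 1 weights recall more)
```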
Development Test Sets (”Devsets”)

• To avoid overfitting and provide a more conservative estimate of performance, we commonly use a three-way split: training set, devset, and test set.

• Training set: Used to train the model.


• Devset: Used to tune the model and select the best hyperparameters.
• Testset: Used to report the final performance of the model.
• This approach ensures that the model is not tuned specifically to the test set,
avoiding overfitting.
• However, it creates a paradox: we want as much data as possible for training, but
also for the devset.
• How do we split the data?
Cross-validation: Multiple Splits

• Cross-validation allows us to use all our data for training and testing without
having a fixed training set, devset, and test set.
• We choose a number k and partition our data into k disjoint subsets called folds.
• For each iteration, one fold is selected as the test set while the remaining k − 1
folds are used to train the classifier.
• We compute the error rate on the test set and repeat this process k times.
• Finally, we average the error rates from these k runs to obtain an average error
rate.
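A minimal sketch of how the k disjoint folds could be produced (indices only, no shuffling; a real setup would usually shuffle first and then average the k error rates):

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_items
        test = indices[start:end]                # the held-out fold
        train = indices[:start] + indices[end:]  # the remaining k - 1 folds
        yield train, test

# With 10 items and k = 5, each fold holds out 2 items for testing.
for train_idx, test_idx in k_fold_splits(10, 5):
    print(test_idx)  # [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```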
Cross-validation: Multiple Splits

• 10-fold cross-validation, for example, involves training 10 models on 90% of the


data and testing each model separately.
• The resulting error rates are averaged to obtain the final performance estimate.
• However, cross-validation requires the entire corpus to remain blind: because every part of the data is eventually used for testing, we cannot examine any of it to suggest features or to understand system behavior.
• To address this, a fixed training set and test set are created, and 10-fold
cross-validation is performed within the training set.
• The error rate is computed conventionally in the test set.
Cross-validation: Multiple Splits
Confusion Matrix for 3-class classification
Combining Binary Metrics

How to combine binary metrics (Precision, Recall, F1 ) from


more than 2 classes to get one metric:
• Macroaveraging:
• Compute the performance metrics (Precision, Recall, F1 )
for each class individually.
• Average the metrics over all classes.
• Microaveraging:
• Collect the decisions for all classes into one confusion
matrix.
• Compute Precision and Recall from the confusion matrix.
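The contrast can be made concrete with a small sketch; the per-class true-positive and false-positive counts below are hypothetical:

```python
# Hypothetical per-class (TP, FP) counts for a 3-class task.
per_class = {"urgent": (8, 10), "normal": (60, 55), "spam": (200, 33)}

# Macroaveraging: compute precision per class, then average the class values.
macro_precision = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Microaveraging: pool all decisions into one table, then compute precision once.
total_tp = sum(tp for tp, _ in per_class.values())
total_fp = sum(fp for _, fp in per_class.values())
micro_precision = total_tp / (total_tp + total_fp)

print(round(macro_precision, 2), round(micro_precision, 2))  # 0.61 0.73
# Microaveraging is dominated by the frequent class; macroaveraging weights classes equally.
```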
Macroaveraging and Microaveraging
Questions?

Thanks for your Attention!


References I

McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48. Madison, WI.
