Naive Bayes
Machine Intelligence
Lecture # 4
Spring 2024
1
Tentative Course Topics
2
Agenda
Background and Probability Basics
Summary
3
4
https://youtu.be/Q8l0Vip5YUw?si=DodN5Y0HEOU1myo9
5
6
Rev Thomas Bayes
7
Things We’d Like to Do
• Spam Classification
–Given an email, predict whether it is spam or not
• Medical Diagnosis
–Given a list of symptoms, predict whether a patient has disease X or not
• Weather
–Based on temperature, humidity, etc., predict whether it will rain tomorrow
Naive Bayes
• At the heart of the Naive Bayes algorithm is a probabilistic model that computes the conditional probabilities of the input features given each class and uses them to assign a probability to each of the possible classes.
13
Bayesian Classification
• Problem statement:
–Given features X1,X2,…,Xn
–Predict a label Y
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(x)
– Conditional probability: P(x1|x2), P(x2|x1)
– Joint probability: x = (x1, x2), P(x) = P(x1, x2)
– Relationship: P(x1, x2) = P(x2|x1) P(x1) = P(x1|x2) P(x2)
– Independence: P(x2|x1) = P(x2), P(x1|x2) = P(x1), P(x1, x2) = P(x1) P(x2)
• Bayesian Rule
P(c|x) = P(x|c) P(c) / P(x)   (Posterior = Likelihood × Prior / Evidence)
16
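To make the rule concrete, here is a minimal sketch in Python for the medical-diagnosis setting mentioned earlier; all numbers (1% prior, 90% sensitivity, 5% false-positive rate) are hypothetical illustrations, not from the slides.

```python
# Bayes rule: P(c|x) = P(x|c) * P(c) / P(x), with hypothetical numbers.
p_disease = 0.01                # prior P(c)
p_pos_given_disease = 0.90      # likelihood P(x|c): test sensitivity
p_pos_given_healthy = 0.05      # false-positive rate P(x|not c)

# Evidence P(x) by the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(c|x): probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.154
```

Even with a positive test, the posterior stays modest here because the prior is low.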
Bayes theorem - Example
17
Bayes theorem - Example
18
Bayes theorem - Example
19
Probabilistic Classification Principle
P(c1|x), P(c2|x), ..., P(cL|x)
[Figure: Discriminative Probabilistic Classifier]
• To train a discriminative classifier, regardless of its probabilistic or non-probabilistic nature, all training examples of the different classes must be used jointly to build up a single discriminative classifier.
21
Probabilistic Classification Principle
23
Naïve Bayes
• Bayes classification
P(c|x) ∝ P(x|c) P(c) = P(x1, ..., xn|c) P(c), for c = c1, ..., cL
Difficulty: learning the joint probability P(x1, ..., xn|c) is infeasible!
• Naïve Bayes classification
– Assume all input features are class-conditionally independent!
P(x1, x2, ..., xn|c) = P(x1|x2, ..., xn, c) P(x2, ..., xn|c)
                     = P(x1|c) P(x2, ..., xn|c)      (applying the independence assumption)
                     = P(x1|c) P(x2|c) ... P(xn|c)
– Apply the MAP classification rule: assign x' = (a1, a2, ..., an) to c* if
[P(a1|c*) ... P(an|c*)] P(c*) > [P(a1|c) ... P(an|c)] P(c), for c ≠ c*, c = c1, ..., cL
(i.e., compare the estimate of P(a1, ..., an|c*) P(c*) against the estimate of P(a1, ..., an|c) P(c))
24
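A minimal sketch of this MAP rule in Python, assuming categorical features and already-estimated probability tables; the function name and dictionary layout are illustrative assumptions, not code from the lecture.

```python
import math

def naive_bayes_predict(x, priors, cond_probs):
    """MAP rule: pick the class c* maximizing P(c) * prod_j P(a_j | c).

    x: dict feature -> observed value, e.g. {"Outlook": "Sunny"}
    priors: dict class -> P(c)
    cond_probs: dict (feature, value, class) -> P(value | class)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # Sum log probabilities instead of multiplying raw ones
        # to avoid numerical underflow with many features.
        score = math.log(prior)
        for feature, value in x.items():
            score += math.log(cond_probs[(feature, value, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Unseen (feature, value, class) combinations would raise a KeyError here; the zero-probability slides later in the lecture address exactly that case.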
Bayes theorem - Example
25
Bayes theorem - Example
26
Bayes theorem
Posterior
Prior
27
Naïve Bayes for text classification
[Table from the slide figure: per-class word probabilities; surviving values: 0.1 vs 0.3, 0.1 vs 0.5, 0.8 vs 0.2]
28
Who is the sender of “life deal”?
[Same per-class word-probability table as on the previous slide: 0.1 vs 0.3, 0.1 vs 0.5, 0.8 vs 0.2]
29
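A worked version of this decision (the assignment of the surviving table values to particular words and senders is an assumption, since the table labels were lost): with two senders A and B, equal priors P(A) = P(B) = 0.5, and hypothetical word likelihoods P("life"|A) = 0.1, P("deal"|A) = 0.8, P("life"|B) = 0.3, P("deal"|B) = 0.2, the Naïve Bayes scores are
Score(A) = 0.5 · 0.1 · 0.8 = 0.04
Score(B) = 0.5 · 0.3 · 0.2 = 0.03
so sender A would be predicted for the message "life deal".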
Naïve Bayes
• MAP rule: for a test instance x' = (a1, ..., an), assign class c* if
[P̂(a1|c*) ... P̂(an|c*)] P̂(c*) > [P̂(a1|ci) ... P̂(an|ci)] P̂(ci), for ci ≠ c*, ci = c1, ..., cL
30
Example (Will Nadal Play Tennis?)
• Columns denote features Xi
• Rows denote labeled instances
• Class label denotes whether a tennis game was played
31
Learning Phase
32
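The learning-phase tables (the slide image did not survive extraction) are plain relative-frequency counts; below is a minimal Python sketch of how they could be estimated, assuming the play-tennis data is given as a list of dicts keyed by the feature names used on the next slides. The returned layout matches the prediction sketch shown after the Naïve Bayes slide.

```python
from collections import Counter, defaultdict

def learn_naive_bayes(examples, label_key="Play"):
    """Estimate P(c) and P(feature=value | c) by relative frequency.

    examples: list of dicts, e.g. {"Outlook": "Sunny", ..., "Play": "Yes"}
              (a hypothetical encoding of the play-tennis table).
    """
    class_counts = Counter(ex[label_key] for ex in examples)
    priors = {c: n / len(examples) for c, n in class_counts.items()}

    # Count how often each (feature, value) pair occurs within each class.
    value_counts = defaultdict(Counter)   # (feature, class) -> Counter of values
    for ex in examples:
        c = ex[label_key]
        for feature, value in ex.items():
            if feature != label_key:
                value_counts[(feature, c)][value] += 1

    cond_probs = {
        (feature, value, c): count / class_counts[c]
        for (feature, c), counter in value_counts.items()
        for value, count in counter.items()
    }
    return priors, cond_probs
```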
Example
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9 P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9 P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9 P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9 P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14
33
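Completing the test-phase computation with the values above (unnormalized MAP scores):
P(Yes) · P(Sunny|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.0053
P(No) · P(Sunny|No) · P(Cool|No) · P(High|No) · P(Strong|No) = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.0206
Since 0.0206 > 0.0053, the MAP rule predicts Play = No for x'.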
Zero conditional probability
• What if a given class and feature value never occur together in the training set?
– If no training example contains the feature value x_j = a_jk for class c_i, then P̂(a_jk|c_i) = 0 and we face a zero conditional probability problem at test time:
P̂(x1|c_i) ··· P̂(a_jk|c_i) ··· P̂(xn|c_i) = 0
– As a remedy, class conditional probabilities are re-estimated with the m-estimate:
P̂(a_jk|c_i) = (n_c + m·p) / (n + m)
n_c : number of training examples for which x_j = a_jk and c = c_i
n : number of training examples for which c = c_i
p : prior estimate (usually p = 1/t for t possible values of x_j)
m : weight given to the prior (the number of "virtual" examples, m ≥ 1)
36
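A minimal Python sketch of the m-estimate itself (the parameter names are assumptions, chosen to mirror the definitions above):

```python
def m_estimate(n_c, n, t, m=1):
    """m-estimate of P(x_j = a_jk | c_i).

    n_c: training examples with this feature value and this class
    n:   training examples with this class
    t:   number of possible values of the feature (prior p = 1/t)
    m:   weight of the prior, i.e. the number of "virtual" examples
    """
    p = 1.0 / t
    return (n_c + m * p) / (n + m)
```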
Zero conditional probability
• Example: in the play-tennis dataset
– Adding m "virtual" examples (m: up to 1% of the number of training examples)
• In this dataset, # of training examples for the “no” class is 5.
• We can only add m=1 “virtual” example in our m-estimate remedy.
– The “outlook” feature can take only 3 values. So p=1/3.
– Re-estimate P(outlook|no) with the m-estimate
37
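As a worked illustration (assuming the standard play-tennis counts, where Outlook=Overcast never occurs with Play=No, so the Outlook counts for the "No" class are Sunny: 3, Overcast: 0, Rain: 2 out of n = 5), with m = 1 and p = 1/3:
P̂(Outlook=Sunny|No) = (3 + 1·1/3) / (5 + 1) ≈ 0.56
P̂(Outlook=Overcast|No) = (0 + 1·1/3) / (5 + 1) ≈ 0.06 (no longer zero)
P̂(Outlook=Rain|No) = (2 + 1·1/3) / (5 + 1) ≈ 0.39
Note that the three smoothed estimates still sum to 1.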
Pros & Cons
Ease of Implementation
Low training time (good for large datasets or low-budget hardware)
Low prediction time: since all the probabilities are pre-computed during training, prediction is very fast.
Transparency: because predictions are built from per-feature conditional probabilities, it is easy to see which features influence the result.
38
Summary
• Probabilistic Classification Principle
– Discriminative vs. Generative models: learning P(c|x) vs. P(x|c)
– Generative models for classification: MAP and Bayesian rule
• Naïve Bayes: the conditional independence assumption
– Training and testing are very efficient.
– Different input data types (e.g., discrete vs. continuous features) lead to two different learning algorithms.
– It works well sometimes for data violating the assumption!
• Naïve Bayes: a popular model for classification
– Performance competitive with many state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…
39