Naive Bayes


CSCI 417

Machine Intelligence
Lecture # 4

Spring 2024

1
Tentative Course Topics

1. Machine Learning Basics
2. Classifying with k-Nearest Neighbors
3. Splitting datasets one feature at a time: decision trees
4. Classifying with probability theory: naïve Bayes
5. Logistic regression
6. Support vector machines
7. Model Evaluation and Improvement: Cross-validation, Grid Search, Evaluation Metrics, and Scoring
8. Ensemble learning and improving classification with the AdaBoost meta-algorithm
9. Introduction to Neural Networks - Building NN for classification (binary/multiclass)
10. Convolutional Neural Network (CNN)
11. Pretrained models (VGG, AlexNet, ...)
12. Machine learning pipeline and use cases

2
Agenda
Background and Probability Basics
Probabilistic Classification Principle
  – Probabilistic discriminative models
  – Generative models and their application to classification
  – MAP and converting generative into discriminative
Zero Conditional Probability and Treatment
Summary

3
4
https://youtu.be/Q8l0Vip5YUw?si=DodN5Y0HEOU1myo9
Rev Thomas Bayes

• A religious man who tried to prove God’s existence.
• In doing so, he arrived at what is now a very common algorithm for producing a linear decision surface.

7
Things We’d Like to Do

• Spam Classification
  – Given an email, predict whether it is spam or not

• Medical Diagnosis
  – Given a list of symptoms, predict whether a patient has disease X or not

• Weather
  – Based on temperature, humidity, etc., predict whether it will rain tomorrow
Naive Bayes

• Naive Bayes is a supervised machine learning algorithm that can be trained to classify data into multi-class categories.

• At the heart of the Naive Bayes algorithm is a probabilistic model that computes the conditional probabilities of the input features and assigns a probability to each of the possible classes.

13
Bayesian Classification
• Problem statement:
– Given features X1, X2, ..., Xn
– Predict a label Y
Probability Basics
• Prior, conditional, and joint probability for random variables
  – Prior probability: P(x)
  – Conditional probability: P(x1 | x2), P(x2 | x1)
  – Joint probability: x = (x1, x2), P(x) = P(x1, x2)
  – Relationship: P(x1, x2) = P(x2 | x1) P(x1) = P(x1 | x2) P(x2)
  – Independence: P(x2 | x1) = P(x2), P(x1 | x2) = P(x1), P(x1, x2) = P(x1) P(x2)

• Bayesian Rule

    P(c | x) = P(x | c) P(c) / P(x)        (Posterior = Likelihood × Prior / Evidence)

16
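To make the rule concrete, here is a minimal Python sketch of the Bayesian rule above; the numbers in the example call are hypothetical, chosen only to illustrate the computation.

def posterior(likelihood, prior, evidence):
    # Bayesian rule: P(c | x) = P(x | c) * P(c) / P(x)
    return likelihood * prior / evidence

# Hypothetical values: P(x | c) = 0.9, P(c) = 0.01, P(x) = 0.108
print(posterior(0.9, 0.01, 0.108))   # ~0.083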
Bayes theorem - Example (slides 17-19)
Probabilistic Classification Principle

• Establishing a probabilistic model for classification

  – Discriminative model: P(c | x), c ∈ {c1, ..., cL}, x = (x1, ..., xn)

  (Figure: a discriminative probabilistic classifier takes the input x = (x1, x2, ..., xn) and outputs the L posterior probabilities P(c1 | x), P(c2 | x), ..., P(cL | x).)

• To train a discriminative classifier, regardless of its probabilistic or non-probabilistic nature, all training examples of the different classes must be used jointly to build up a single discriminative classifier.

• A probabilistic classifier outputs L probabilities for the L class labels, whereas a non-probabilistic classifier outputs a single label.

21
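A minimal sketch of the two output conventions described above, assuming only a generic non-negative scoring function score(x, c) from some already trained model (the function name is a placeholder, not a specific library API):

def predict_proba(x, classes, score):
    # Probabilistic classifier: one probability per class label, summing to 1.
    raw = {c: score(x, c) for c in classes}
    total = sum(raw.values())
    return {c: s / total for c, s in raw.items()}

def predict_label(x, classes, score):
    # Non-probabilistic classifier: a single label only.
    return max(classes, key=lambda c: score(x, c))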
Probabilistic Classification Principle

• Maximum A Posteriori (MAP) classification rule
  – For an input x, find the largest of the L probabilities output by a discriminative probabilistic classifier, P(c1 | x), ..., P(cL | x).
  – Assign x to the label c* for which P(c* | x) is the largest.

• Generative classification with the MAP rule
  – Apply the Bayesian rule to convert the class-conditional probabilities into posterior probabilities:

      P(ci | x) = P(x | ci) P(ci) / P(x)  ∝  P(x | ci) P(ci),   for i = 1, 2, ..., L

    (P(x) is a common factor for all L probabilities and can be dropped.)
  – Then apply the MAP rule to assign a label.

23
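A minimal sketch of the generative MAP rule above: P(x) is the same for every class, so only P(x | ci) P(ci) needs to be compared. The dictionaries are assumed to hold the priors and the class-conditional likelihoods already evaluated at the given x.

def map_classify(priors, likelihoods):
    # priors[c] = P(c); likelihoods[c] = P(x | c) for the input x under consideration.
    scores = {c: likelihoods[c] * priors[c] for c in priors}   # proportional to P(c | x)
    return max(scores, key=scores.get)                         # label with the largest posterior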
Naïve Bayes

• Bayes classification

    P(c | x) ∝ P(x | c) P(c) = P(x1, ..., xn | c) P(c),   for c = c1, ..., cL.

  Difficulty: learning the joint probability P(x1, ..., xn | c) is infeasible!

• Naïve Bayes classification
  – Assume all input features are class-conditionally independent:

      P(x1, x2, ..., xn | c) = P(x1 | x2, ..., xn, c) P(x2, ..., xn | c)
                             = P(x1 | c) P(x2, ..., xn | c)              (applying the independence assumption)
                             = P(x1 | c) P(x2 | c) ... P(xn | c)

  – Apply the MAP classification rule: assign x' = (a1, a2, ..., an) to c* if

      [P(a1 | c*) ... P(an | c*)] P(c*) > [P(a1 | c) ... P(an | c)] P(c),   for all c ≠ c*, c = c1, ..., cL

    (each bracketed product is the naïve Bayes estimate of P(a1, ..., an | c*) and of P(a1, ..., an | c), respectively)
24
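A quick back-of-the-envelope check of why the full joint P(x1, ..., xn | c) is infeasible to learn while the factored form is not; the feature count and the binary-feature assumption below are illustrative only.

n, L = 30, 2                       # assumed: 30 binary features, 2 classes
joint_params = L * (2**n - 1)      # one probability per feature configuration, per class
naive_params = L * n               # one P(xj | c) per feature, per class
print(joint_params, naive_params)  # 2147483646 vs 60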
Bayes theorem - Example (slides 25-26)

Bayes theorem (slide 27: the formula with the posterior and prior terms labeled)
Naïve Bayes for text classification

(Table of per-word conditional probabilities for each sender, shown as a figure.)

28
Who is the sender of “life deal”?

(The same per-word probability tables, shown as a figure.)

29
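The per-word tables on slides 28-29 appear only as figures, so the words and probabilities below are hypothetical placeholders; the sketch just shows how the two senders would be scored for the message "life deal" with the naïve Bayes product.

# Hypothetical per-word conditional probabilities and priors (placeholder values).
p_word = {
    "spam":   {"life": 0.1, "deal": 0.8, "money": 0.1},
    "friend": {"life": 0.3, "deal": 0.2, "money": 0.5},
}
prior = {"spam": 0.5, "friend": 0.5}

def likely_sender(words):
    # Score each sender by prior * product of per-word likelihoods, then take the max.
    scores = {}
    for sender, table in p_word.items():
        score = prior[sender]
        for w in words:
            score *= table[w]
        scores[sender] = score
    return max(scores, key=scores.get)

print(likely_sender(["life", "deal"]))   # "spam" under these made-up numbers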
Naïve Bayes

Learning phase: given a training set S,

  For each target value ci (ci = c1, ..., cL):
    P̂(ci) ← estimate P(ci) with the examples in S;
    For every value xjk of each feature xj (j = 1, ..., F; k = 1, ..., Nj):
      P̂(xj = xjk | ci) ← estimate P(xjk | ci) with the examples in S;

Test phase: given an unknown instance x' = (a1, ..., an), assign the label c* if

  [P̂(a1 | c*) ... P̂(an | c*)] P̂(c*) > [P̂(a1 | ci) ... P̂(an | ci)] P̂(ci),   for all ci ≠ c*, ci = c1, ..., cL

30
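A minimal Python sketch of the learning and test phases above for categorical features; it estimates the probabilities by simple counting and applies the MAP rule (function and variable names are mine, not a particular library's).

from collections import Counter, defaultdict

def nb_fit(X, y):
    # Learning phase: estimate P(ci) and P(xj = xjk | ci) from the training set S = (X, y).
    class_counts = Counter(y)
    priors = {c: cnt / len(y) for c, cnt in class_counts.items()}
    value_counts = defaultdict(Counter)              # (feature index j, class c) -> value counts
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            value_counts[(j, c)][v] += 1
    cond = {key: {v: cnt / class_counts[key[1]] for v, cnt in ctr.items()}
            for key, ctr in value_counts.items()}
    return priors, cond

def nb_predict(x, priors, cond):
    # Test phase: assign the label c* maximising P(a1|c) ... P(an|c) P(c).
    scores = {}
    for c, p in priors.items():
        for j, v in enumerate(x):
            p *= cond.get((j, c), {}).get(v, 0.0)    # unseen value -> 0 (see the remedy on slide 36)
        scores[c] = p
    return max(scores, key=scores.get)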
Example (Will Nadal Play Tennis?)
• Columns denote features Xi
• Rows denote labeled instances
• Class label denotes whether a tennis game was played
  (14 labeled examples, shown as a table in the slide.)

31
Learning Phase

Outlook       Play=Yes   Play=No
Sunny         2/9        3/5
Overcast      4/9        0/5
Rain          3/9        2/5

Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5

Humidity      Play=Yes   Play=No
High          3/9        4/5
Normal        6/9        1/5

Wind          Play=Yes   Play=No
Strong        3/9        3/5
Weak          6/9        2/5

P(Play=Yes) = 9/14        P(Play=No) = 5/14

32
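The training table itself appears only as a figure, so the (Outlook, Play) pairs below are reconstructed to be consistent with the counts in the table above; the sketch just shows how such a table entry comes from counting.

from collections import Counter

# 14 (Outlook, Play) pairs consistent with the Outlook table above.
pairs = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
         ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
         ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
         ("Overcast", "Yes"), ("Rain", "No")]

label_counts = Counter(play for _, play in pairs)     # {"Yes": 9, "No": 5}
joint_counts = Counter(pairs)
for outlook in ("Sunny", "Overcast", "Rain"):
    for play in ("Yes", "No"):
        print(f"P(Outlook={outlook} | Play={play}) =",
              joint_counts[(outlook, play)], "/", label_counts[play])
# Prints 2/9, 3/5, 4/9, 0/5, 3/9, 2/5 -- matching the Outlook table.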
Example
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14

– Decision-making with the MAP rule


P(Yes|x’) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’)  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Since P(Yes|x’) < P(No|x’), we label x’ as “No”.

33
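A quick numerical check of the two MAP scores above, plugging in the values from the look-up tables:

p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)  P(No)
print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
print("Yes" if p_yes > p_no else "No")           # No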
Zero conditional probability
• If a given class and a feature value never occur together in the training set (no training example of that class contains that feature value), we face a zero conditional probability problem at test time: a single zero factor wipes out the whole product,

    P̂(x1 | ci) × ... × P̂(ajk | ci) × ... × P̂(xn | ci) = 0   whenever xj = ajk and P̂(ajk | ci) = 0

• As a remedy, the class-conditional probabilities are re-estimated with the m-estimate:

    P̂(ajk | ci) = (nc + m·p) / (n + m)

    nc : number of training examples for which xj = ajk and c = ci
    n  : number of training examples for which c = ci
    p  : prior estimate (usually p = 1/t for t possible values of xj)
    m  : weight given to the prior (the number of “virtual” examples, m ≥ 1)

36
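A one-line helper implementing the m-estimate above (a sketch; the argument names simply mirror the symbols nc, n, p, m):

def m_estimate(n_c, n, p, m=1):
    # m-estimate of P(a_jk | c_i): (n_c + m*p) / (n + m)
    return (n_c + m * p) / (n + m)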
Zero conditional probability
• Example: the play-tennis dataset
  – Add m “virtual” examples (m: up to 1% of the number of training examples).
    • In this dataset, the number of training examples for the “No” class is 5.
    • So we can only add m = 1 “virtual” example in our m-estimate remedy.
  – The “Outlook” feature can take only 3 values, so p = 1/3.
  – Re-estimate P(Outlook | No) with the m-estimate, as worked out below.

37
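Worked out for the Outlook feature of the “No” class (n = 5, m = 1, p = 1/3), using the counts behind the learning-phase table:

P̂(Sunny | No)    = (3 + 1·1/3) / (5 + 1) = 10/18 ≈ 0.556
P̂(Overcast | No) = (0 + 1·1/3) / (5 + 1) =  1/18 ≈ 0.056
P̂(Rain | No)     = (2 + 1·1/3) / (5 + 1) =  7/18 ≈ 0.389

The zero entry for Overcast becomes a small positive value, and the three estimates still sum to 1.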
Pros & Cons
Pros:
• Ease of implementation.
• Low training time (good for large datasets or low-budget hardware).
• Low prediction time: all the probabilities are pre-computed, so prediction is very efficient.
• Transparency: predictions are based on the posterior probability of each conditional feature, so it is easy to understand which features are influencing a prediction.

Cons:
• A strong dependency on the application (very specific).

38
Summary
• Probabilistic Classification Principle
– Discriminative vs. Generative models: learning P(c|x) vs. P(x|c)
– Generative models for classification: MAP and Bayesian rule
• Naïve Bayes: the conditional independence assumption
– Training and testing are very efficient.
– Two different data types lead to two different learning algorithms.
– It sometimes works well even for data that violate the assumption!
• Naïve Bayes: a popular model for classification
– Performance competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…

39
