Naive Bayes


CSCI 417

Machine Intelligence
Lecture # 4

Spring 2024

1
Tentative Course Topics

1. Machine Learning Basics
2. Classifying with k-Nearest Neighbors
3. Splitting datasets one feature at a time: decision trees
4. Classifying with probability theory: naïve Bayes
5. Logistic regression
6. Support vector machines
7. Model Evaluation and Improvement: Cross-validation, Grid Search, Evaluation Metrics, and Scoring
8. Ensemble learning and improving classification with the AdaBoost meta-algorithm
9. Introduction to Neural Networks - Building NN for classification (binary/multiclass)
10. Convolutional Neural Network (CNN)
11. Pretrained models (VGG, AlexNet, ...)
12. Machine learning pipeline and use cases

2
Agenda
Background and Probability Basics
Probabilistic Classification Principle
  – Probabilistic discriminative models
  – Generative models and their application to classification
  – MAP and converting generative into discriminative
Zero Conditional Probability and Treatment
Summary

3
4
https://youtu.be/Q8l0Vip5YUw?si=DodN5Y0HEOU1myo9
Rev Thomas Bayes

• A religious man who tried to prove God’s existence.
• In doing so, he arrived at what is now a very common algorithm for producing a linear decision surface.

7
Things We’d Like to Do

• Spam Classification
  – Given an email, predict whether it is spam or not

• Medical Diagnosis
  – Given a list of symptoms, predict whether a patient has disease X or not

• Weather
  – Based on temperature, humidity, etc., predict whether it will rain tomorrow
Naive Bayes

• Naive Bayes is a supervised machine learning algorithm that can be trained to classify data into multi-class categories.

• At the heart of the Naive Bayes algorithm is a probabilistic model that computes the conditional probabilities of the input features and assigns a probability to each of the possible classes.

13
Bayesian Classification
• Problem statement:
– Given features X1, X2, ..., Xn
– Predict a label Y
Probability Basics
• Prior, conditional, and joint probability for random variables
  – Prior probability: P(x)
  – Conditional probability: P(x1 | x2), P(x2 | x1)
  – Joint probability: x = (x1, x2), P(x) = P(x1, x2)
  – Relationship: P(x1, x2) = P(x2 | x1) P(x1) = P(x1 | x2) P(x2)
  – Independence: P(x2 | x1) = P(x2), P(x1 | x2) = P(x1), P(x1, x2) = P(x1) P(x2)

• Bayesian Rule

    P(c | x) = P(x | c) P(c) / P(x)        (Posterior = Likelihood × Prior / Evidence)

16
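To make the rule concrete, here is a minimal Python sketch of the Bayesian rule above; the numbers in the example call are hypothetical, chosen only to illustrate the computation.

def posterior(likelihood, prior, evidence):
    # Bayesian rule: P(c | x) = P(x | c) * P(c) / P(x)
    return likelihood * prior / evidence

# Hypothetical values: P(x | c) = 0.9, P(c) = 0.01, P(x) = 0.108
print(posterior(0.9, 0.01, 0.108))   # ~0.083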
Bayes theorem - Example (slides 17-19)
Probabilistic Classification Principle

• Establishing a probabilistic model for classification

  – Discriminative model: P(c | x), c ∈ {c1, ..., cL}, x = (x1, ..., xn)

  (Figure: a discriminative probabilistic classifier takes the input x = (x1, x2, ..., xn) and outputs the L posterior probabilities P(c1 | x), P(c2 | x), ..., P(cL | x).)

• To train a discriminative classifier, regardless of its probabilistic or non-probabilistic nature, all training examples of the different classes must be used jointly to build up a single discriminative classifier.

• A probabilistic classifier outputs L probabilities for the L class labels, whereas a non-probabilistic classifier outputs a single label.

21
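A minimal sketch of the two output conventions described above, assuming only a generic non-negative scoring function score(x, c) from some already trained model (the function name is a placeholder, not a specific library API):

def predict_proba(x, classes, score):
    # Probabilistic classifier: one probability per class label, summing to 1.
    raw = {c: score(x, c) for c in classes}
    total = sum(raw.values())
    return {c: s / total for c, s in raw.items()}

def predict_label(x, classes, score):
    # Non-probabilistic classifier: a single label only.
    return max(classes, key=lambda c: score(x, c))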
Probabilistic Classification Principle

• Maximum A Posteriori (MAP) classification rule
  – For an input x, find the largest of the L probabilities output by a discriminative probabilistic classifier, P(c1 | x), ..., P(cL | x).
  – Assign x to the label c* for which P(c* | x) is the largest.

• Generative classification with the MAP rule
  – Apply the Bayesian rule to convert the class-conditional probabilities into posterior probabilities:

      P(ci | x) = P(x | ci) P(ci) / P(x)  ∝  P(x | ci) P(ci),   for i = 1, 2, ..., L

    (P(x) is a common factor for all L probabilities and can be dropped.)
  – Then apply the MAP rule to assign a label.

23
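A minimal sketch of the generative MAP rule above: P(x) is the same for every class, so only P(x | ci) P(ci) needs to be compared. The dictionaries are assumed to hold the priors and the class-conditional likelihoods already evaluated at the given x.

def map_classify(priors, likelihoods):
    # priors[c] = P(c); likelihoods[c] = P(x | c) for the input x under consideration.
    scores = {c: likelihoods[c] * priors[c] for c in priors}   # proportional to P(c | x)
    return max(scores, key=scores.get)                         # label with the largest posterior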
Naïve Bayes

• Bayes classification

    P(c | x) ∝ P(x | c) P(c) = P(x1, ..., xn | c) P(c),   for c = c1, ..., cL.

  Difficulty: learning the joint probability P(x1, ..., xn | c) is infeasible!

• Naïve Bayes classification
  – Assume all input features are class-conditionally independent:

      P(x1, x2, ..., xn | c) = P(x1 | x2, ..., xn, c) P(x2, ..., xn | c)
                             = P(x1 | c) P(x2, ..., xn | c)              (applying the independence assumption)
                             = P(x1 | c) P(x2 | c) ... P(xn | c)

  – Apply the MAP classification rule: assign x' = (a1, a2, ..., an) to c* if

      [P(a1 | c*) ... P(an | c*)] P(c*) > [P(a1 | c) ... P(an | c)] P(c),   for all c ≠ c*, c = c1, ..., cL

    (each bracketed product is the naïve Bayes estimate of P(a1, ..., an | c*) and of P(a1, ..., an | c), respectively)
24
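A quick back-of-the-envelope check of why the full joint P(x1, ..., xn | c) is infeasible to learn while the factored form is not; the feature count and the binary-feature assumption below are illustrative only.

n, L = 30, 2                       # assumed: 30 binary features, 2 classes
joint_params = L * (2**n - 1)      # one probability per feature configuration, per class
naive_params = L * n               # one P(xj | c) per feature, per class
print(joint_params, naive_params)  # 2147483646 vs 60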
Bayes theorem - Example (slides 25-26)

Bayes theorem (slide 27: the formula with the posterior and prior terms labeled)
Naïve Bayes for text classification

(Table of per-word conditional probabilities for each sender, shown as a figure.)

28
Who is the sender of “life deal”?

(The same per-word probability tables, shown as a figure.)

29
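The per-word tables on slides 28-29 appear only as figures, so the words and probabilities below are hypothetical placeholders; the sketch just shows how the two senders would be scored for the message "life deal" with the naïve Bayes product.

# Hypothetical per-word conditional probabilities and priors (placeholder values).
p_word = {
    "spam":   {"life": 0.1, "deal": 0.8, "money": 0.1},
    "friend": {"life": 0.3, "deal": 0.2, "money": 0.5},
}
prior = {"spam": 0.5, "friend": 0.5}

def likely_sender(words):
    # Score each sender by prior * product of per-word likelihoods, then take the max.
    scores = {}
    for sender, table in p_word.items():
        score = prior[sender]
        for w in words:
            score *= table[w]
        scores[sender] = score
    return max(scores, key=scores.get)

print(likely_sender(["life", "deal"]))   # "spam" under these made-up numbers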
Naïve Bayes

Learning phase: given a training set S,

  For each target value ci (ci = c1, ..., cL):
    P̂(ci) ← estimate P(ci) with the examples in S;
    For every value xjk of each feature xj (j = 1, ..., F; k = 1, ..., Nj):
      P̂(xj = xjk | ci) ← estimate P(xjk | ci) with the examples in S;

Test phase: given an unknown instance x' = (a1, ..., an), assign the label c* if

  [P̂(a1 | c*) ... P̂(an | c*)] P̂(c*) > [P̂(a1 | ci) ... P̂(an | ci)] P̂(ci),   for all ci ≠ c*, ci = c1, ..., cL

30
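A minimal Python sketch of the learning and test phases above for categorical features; it estimates the probabilities by simple counting and applies the MAP rule (function and variable names are mine, not a particular library's).

from collections import Counter, defaultdict

def nb_fit(X, y):
    # Learning phase: estimate P(ci) and P(xj = xjk | ci) from the training set S = (X, y).
    class_counts = Counter(y)
    priors = {c: cnt / len(y) for c, cnt in class_counts.items()}
    value_counts = defaultdict(Counter)              # (feature index j, class c) -> value counts
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            value_counts[(j, c)][v] += 1
    cond = {key: {v: cnt / class_counts[key[1]] for v, cnt in ctr.items()}
            for key, ctr in value_counts.items()}
    return priors, cond

def nb_predict(x, priors, cond):
    # Test phase: assign the label c* maximising P(a1|c) ... P(an|c) P(c).
    scores = {}
    for c, p in priors.items():
        for j, v in enumerate(x):
            p *= cond.get((j, c), {}).get(v, 0.0)    # unseen value -> 0 (see the remedy on slide 36)
        scores[c] = p
    return max(scores, key=scores.get)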
Example (Will Nadal Play Tennis?)
• Columns denote features Xi
• Rows denote labeled instances
• Class label denotes whether a tennis game was played
  (14 labeled examples, shown as a table in the slide.)

31
Learning Phase

Outlook       Play=Yes   Play=No
Sunny         2/9        3/5
Overcast      4/9        0/5
Rain          3/9        2/5

Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5

Humidity      Play=Yes   Play=No
High          3/9        4/5
Normal        6/9        1/5

Wind          Play=Yes   Play=No
Strong        3/9        3/5
Weak          6/9        2/5

P(Play=Yes) = 9/14        P(Play=No) = 5/14

32
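The training table itself appears only as a figure, so the (Outlook, Play) pairs below are reconstructed to be consistent with the counts in the table above; the sketch just shows how such a table entry comes from counting.

from collections import Counter

# 14 (Outlook, Play) pairs consistent with the Outlook table above.
pairs = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
         ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
         ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
         ("Overcast", "Yes"), ("Rain", "No")]

label_counts = Counter(play for _, play in pairs)     # {"Yes": 9, "No": 5}
joint_counts = Counter(pairs)
for outlook in ("Sunny", "Overcast", "Rain"):
    for play in ("Yes", "No"):
        print(f"P(Outlook={outlook} | Play={play}) =",
              joint_counts[(outlook, play)], "/", label_counts[play])
# Prints 2/9, 3/5, 4/9, 0/5, 3/9, 2/5 -- matching the Outlook table.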
Example
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14

– Decision-making with the MAP rule


P(Yes|x’) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’)  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Since P(Yes|x’) < P(No|x’), we label x’ as “No”.

33
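A quick numerical check of the two MAP scores above, plugging in the values from the look-up tables:

p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)  P(No)
print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
print("Yes" if p_yes > p_no else "No")           # No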
Zero conditional probability
• If a given class and a feature value never occur together in the training set (no training example of that class contains that feature value), we face a zero conditional probability problem at test time: a single zero factor wipes out the whole product,

    P̂(x1 | ci) × ... × P̂(ajk | ci) × ... × P̂(xn | ci) = 0   whenever xj = ajk and P̂(ajk | ci) = 0

• As a remedy, the class-conditional probabilities are re-estimated with the m-estimate:

    P̂(ajk | ci) = (nc + m·p) / (n + m)

    nc : number of training examples for which xj = ajk and c = ci
    n  : number of training examples for which c = ci
    p  : prior estimate (usually p = 1/t for t possible values of xj)
    m  : weight given to the prior (the number of “virtual” examples, m ≥ 1)

36
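A one-line helper implementing the m-estimate above (a sketch; the argument names simply mirror the symbols nc, n, p, m):

def m_estimate(n_c, n, p, m=1):
    # m-estimate of P(a_jk | c_i): (n_c + m*p) / (n + m)
    return (n_c + m * p) / (n + m)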
Zero conditional probability
• Example: the play-tennis dataset
  – Add m “virtual” examples (m: up to 1% of the number of training examples).
    • In this dataset, the number of training examples for the “No” class is 5.
    • So we can only add m = 1 “virtual” example in our m-estimate remedy.
  – The “Outlook” feature can take only 3 values, so p = 1/3.
  – Re-estimate P(Outlook | No) with the m-estimate, as worked out below.

37
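Worked out for the Outlook feature of the “No” class (n = 5, m = 1, p = 1/3), using the counts behind the learning-phase table:

P̂(Sunny | No)    = (3 + 1·1/3) / (5 + 1) = 10/18 ≈ 0.556
P̂(Overcast | No) = (0 + 1·1/3) / (5 + 1) =  1/18 ≈ 0.056
P̂(Rain | No)     = (2 + 1·1/3) / (5 + 1) =  7/18 ≈ 0.389

The zero entry for Overcast becomes a small positive value, and the three estimates still sum to 1.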
Pros & Cons
Pros:
• Ease of implementation.
• Low training time (good for large datasets or low-budget hardware).
• Low prediction time: all the probabilities are pre-computed, so prediction is very efficient.
• Transparency: predictions are based on the posterior probability of each conditional feature, so it is easy to understand which features are influencing a prediction.

Cons:
• A strong dependency on the application (very specific).

38
Summary
• Probabilistic Classification Principle
– Discriminative vs. Generative models: learning P(c|x) vs. P(x|c)
– Generative models for classification: MAP and Bayesian rule
• Naïve Bayes: the conditional independence assumption
– Training and testing are very efficient.
– Two different data types lead to two different learning algorithms.
– It sometimes works well even for data that violate the assumption!
• Naïve Bayes: a popular model for classification
– Performance competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…

39
