Lecture 4


MAP for Gaussian mean and variance

• Conjugate priors
– Mean: Gaussian prior
– Variance: Wishart distribution (on the precision; equivalently, inverse-Wishart on the variance)

• Prior for mean:

P(μ) = N(η, λ²)

1
MAP for Gaussian Mean

(Assuming known variance σ²)

Independent of σ² if λ² = σ²/s
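In this notation (η, λ² the prior mean and variance, x1,…,xN the observations; a standard reconstruction):

    μMAP = (λ² Σj xj + σ² η) / (N λ² + σ²)  =  (Σj xj + s·η) / (N + s)   when λ² = σ²/s

i.e., the prior acts like s extra "virtual" observations equal to η, and σ² cancels out.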

MAP under Gauss-Wishart prior - Homework

2
Bayes Optimal Classifier

Aarti Singh

Machine Learning 10-701/15-781


Sept 15, 2010
Classification
Goal: learn a mapping from features X to labels Y

[Figure: example documents (features X) mapped to labels Y – Sports, Science, News]

Probability of Error
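In standard notation (a reconstruction, consistent with the Bayes risk R(f*) on the next slide), the probability of error, or risk, of a classifier f is

    R(f) = P( f(X) ≠ Y )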

4
Optimal Classification
Optimal predictor:
(Bayes classifier)
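Written out (standard form):

    f*(x) = arg maxy P(Y = y | X = x)

For binary Y, this predicts the class whose posterior probability exceeds 0.5.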

[Figure: plot of P(Y = 1 | X = x) against x, with the 0.5 decision threshold and the Bayes risk marked]

• Even the optimal classifier makes mistakes R(f*) > 0


• Optimal classifier depends on unknown distribution
5
Optimal Classifier
Bayes Rule:

Optimal classifier:

(class-conditional density × class prior)
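Written out (standard forms, matching the annotation above):

    P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)        (Bayes rule)

    f*(x) = arg maxy P(X = x | Y = y) P(Y = y)                     (optimal classifier)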
6
Example Decision Boundaries
• Gaussian class conditional densities (1-dimension/feature)

Decision Boundary
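For two classes with Gaussian class-conditional densities, the boundary is the set of x where P(X = x | Y = 1) P(Y = 1) = P(X = x | Y = 0) P(Y = 0); with equal variances this reduces to a single threshold on x, and with unequal variances the boundary is quadratic in x.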
7
Example Decision Boundaries
• Gaussian class conditional densities (2-dimensions/features)

Decision Boundary

8
Learning the Optimal Classifier
Optimal classifier:

(class-conditional density × class prior)

Need to know:
– Prior: P(Y = y) for all y
– Likelihood: P(X=x|Y = y) for all x, y
9
Learning the Optimal Classifier
Task: Predict whether or not a picnic spot is enjoyable

Training Data: n rows, each with features X = (X1 X2 X3 … Xd) and label Y

Let's learn P(Y|X) – how many parameters?

Prior: P(Y = y) for all y                     → K – 1, if K labels
Likelihood: P(X=x|Y = y) for all x, y         → (2^d – 1)K, if d binary features
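For example (an illustrative count, not from the slides): with d = 30 binary features and K = 2 classes, the likelihood alone needs (2^30 – 1)·2 ≈ 2.1 billion parameters.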
10
Learning the Optimal Classifier
Task: Predict whether or not a picnic spot is enjoyable

Training Data: n rows, each with features X = (X1 X2 X3 … Xd) and label Y

Let's learn P(Y|X) – how many parameters?

2^d K – 1 (K classes, d binary features)
Need n >> 2^d K – 1 training examples to learn all the parameters
11
Conditional Independence
• X is conditionally independent of Y given Z: the probability distribution
  governing X is independent of the value of Y, given the value of Z:
  P(X=x | Y=y, Z=z) = P(X=x | Z=z)  for all x, y, z

• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
  Note: does NOT mean Thunder is independent of Rain
12
Conditional vs. Marginal
Independence
• C calls A and B separately and tells them a number n ∈ {1,…,10}
• Due to noise in the phone, A and B each imperfectly (and
independently) draw a conclusion about what the number was.
• A thinks the number was na and B thinks it was nb.
• Are na and nb marginally independent?
– No, we expect e.g. P(na = 1 | nb = 1) > P(na = 1)
• Are na and nb conditionally independent given n?
– Yes, because if we know the true number, the outcomes na
and nb are purely determined by the noise in each phone.
P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)

13
Prediction using Conditional
Independence
• Predict Lightning
• From two conditionally independent features:
– Thunder
– Rain

# parameters needed to learn the likelihood given L:

P(T,R|L):                                     (2^2 – 1)·2 = 6

With conditional independence assumption:
P(T,R|L) = P(T|L) P(R|L):                     (2 – 1)·2 + (2 – 1)·2 = 4
14
Naïve Bayes Assumption
• Naïve Bayes assumption:
– Features are independent given class:

– More generally:

• Suppose X is composed of d binary features

• How many parameters now? (2 – 1)dK vs. (2^d – 1)K
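Written out, the assumption is

    P(X1,…,Xd | Y) = ∏i P(Xi | Y)        (for two features: P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y))

so with d binary features, each class needs only d likelihood parameters instead of 2^d – 1.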
15
Naïve Bayes Classifier
• Given:
– Class Prior P(Y)
– d conditionally independent features X given the class Y
– For each Xi, we have likelihood P(Xi|Y)

• Decision rule (written out below)

• If the conditional independence assumption holds, NB is the
  optimal classifier! But it can do worse otherwise.
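The decision rule, in its standard form:

    yNB = arg maxy P(Y = y) ∏i P(Xi = xi | Y = y)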
16
Naïve Bayes Algo – Discrete features
• Training Data
• Maximum Likelihood Estimates
– For Class Prior

– For Likelihood

• NB Prediction for test data
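The estimates and the prediction, written out (a standard reconstruction; n = number of training examples, xji = value of feature i in example j):

    P̂(Y = b) = #{j : yj = b} / n

    P̂(Xi = a | Y = b) = #{j : xji = a and yj = b} / #{j : yj = b}

    Prediction for a new x:  ŷ = arg maxb P̂(Y = b) ∏i P̂(Xi = xi | Y = b)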

17
Subtlety 1 – Violation of NB
Assumption
• Usually, features are not conditionally independent:
  P(X1,…,Xd | Y) ≠ ∏i P(Xi | Y)

• Actual probabilities P(Y|X) often biased towards 0 or 1 (Why?)
• Nonetheless, NB is the single most used classifier out there
– NB often performs well, even when assumption is violated
– [Domingos & Pazzani ’96] discuss some conditions for good
performance

18
Subtlety 2 – Insufficient training data

• What if you never see a training instance where


X1=a when Y=b?
– e.g., Y={SpamEmail}, X1={‘Earn’}
– P(X1=a | Y=b) = 0

• Thus, no matter what values X2,…,Xd take:

  – P(Y=b | X1=a, X2,…,Xd) = 0
    (one zero factor in the NB product zeroes out the whole posterior)

• What now???
19
MLE vs. MAP

What if we toss the coin too few times?


• You say: Probability next toss is a head = 0
• Billionaire says: You’re fired! …with prob 1 

• Beta prior equivalent to extra coin flips
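Concretely (notation assumed: αH heads and αT tails observed, Beta(βH, βT) prior):

    θMAP = (αH + βH – 1) / (αH + αT + βH + βT – 2)

so the prior acts like βH – 1 extra heads and βT – 1 extra tails.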


• As N → ∞, the prior is "forgotten"
• But, for small sample size, the prior is important!

20
Naïve Bayes Algo – Discrete features
• Training Data
• Maximum A Posteriori Estimates – add m “virtual” examples
Assume priors

MAP Estimate

# virtual examples
with Y = b
Now, even if a feature value is never observed together with some class in the
training data, the estimated probability is never zero.
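One common concrete form is add-m smoothing (an illustrative choice of prior; the slides' exact Beta/Dirichlet prior may differ):

    P̂(Xi = a | Y = b) = (#{j : xji = a and yj = b} + m) / (#{j : yj = b} + m·|values of Xi|)

A minimal sketch in Python under the same assumption (function and variable names are illustrative):

    from collections import defaultdict
    import math

    def train_nb(X, Y, m=1.0):
        """Discrete Naive Bayes with add-m ("virtual example") smoothing.
        X: list of tuples of discrete feature values; Y: list of class labels."""
        n, d = len(Y), len(X[0])
        classes = sorted(set(Y))
        values = [sorted({x[i] for x in X}) for i in range(d)]    # observed values per feature
        class_counts, counts = defaultdict(float), defaultdict(float)
        for x, y in zip(X, Y):
            class_counts[y] += 1
            for i, a in enumerate(x):
                counts[(i, a, y)] += 1                            # (feature, value, class) counts
        prior = {b: class_counts[b] / n for b in classes}         # ML estimate of P(Y=b)
        lik = {(i, a, b): (counts[(i, a, b)] + m) / (class_counts[b] + m * len(values[i]))
               for i in range(d) for a in values[i] for b in classes}
        return prior, lik, classes

    def predict_nb(x, prior, lik, classes):
        """NB decision rule: argmax_b  log P(Y=b) + sum_i log P(Xi=xi | Y=b)."""
        def score(b):
            s = math.log(prior[b])
            for i, a in enumerate(x):
                s += math.log(lik.get((i, a, b), 1e-12))          # unseen test value: tiny probability
            return s
        return max(classes, key=score)

Calling predict_nb(x_new, *train_nb(X, Y)) returns the most probable class for a new example x_new under the smoothed model.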
21
Case Study: Text Classification
• Classify e-mails
– Y = {Spam,NotSpam}
• Classify news articles
– Y = {what is the topic of the article?}
• Classify webpages
– Y = {Student, professor, project, …}

• What about the features X?


– The text!
22
Features X are the entire document – Xi is the ith word in the article

23
NB for Text Classification
• P(X|Y) is huge!!!
– An article is at least 1000 words long: X = {X1,…,X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire
  vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc.

• NB assumption helps a lot!!!


– P(Xi=xi|Y=y) is just the probability of observing word xi at the ith position
in a document on topic y
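A rough count (illustrative): without the NB assumption, P(X1,…,X1000 | Y = y) is a distribution over 10,000^1000 possible documents per class; with it, each of the 1000 positions needs about 10,000 probabilities per class, i.e., on the order of 10^7 parameters per class.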

24
Bag of words model
• Typical additional assumption – Position in document doesn’t
matter: P(Xi=xi|Y=y) = P(Xk=xi|Y=y)
– “Bag of words” model – order of words on the page ignored
– Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the


person sitting next to you in the lecture room.

25
Bag of words model
• Typical additional assumption – Position in document doesn’t
matter: P(Xi=xi|Y=y) = P(Xk=xi|Y=y)
– “Bag of words” model – order of words on the page ignored
– Sounds really silly, but often works very well!

in is lecture lecture next over person remember room


sitting the the the to to up wake when you
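Under this assumption the document likelihood depends only on word counts (a standard way to write the bag-of-words model):

    P(X1,…,Xd | Y = y) = ∏w P(w | Y = y)^(count of word w in the document)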

26
Bag of words approach
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1

Zaire 0

27
NB with Bag of Words for text
classification
• Learning phase:
  – Class Prior P(Y) (explore in HW)
  – P(Xi|Y)

• Test phase:
– For each document
• Use naïve Bayes decision rule
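That is (standard form; nw = count of word w in the test document):

    ŷ = arg maxy P̂(Y = y) ∏w P̂(w | Y = y)^nw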

28
Twenty news groups results

29
Learning curve for twenty news
groups

30
What if features are continuous?
E.g., character recognition: Xi is the intensity at the ith pixel

Gaussian Naïve Bayes (GNB):

Different mean and variance for each class k and each pixel i.
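The class-conditional model, in its standard form:

    P(Xi = x | Y = k) = (1 / √(2π σik²)) exp( –(x – μik)² / (2 σik²) )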
Sometimes assume the variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ)

31
Estimating parameters:
Y discrete, Xi continuous
Maximum likelihood estimates:

(indices: k = class, j = training image, i = pixel)
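With nk = #{j : yj = k} and xij the intensity of pixel i in training image j, the standard ML estimates are:

    μ̂ik = (1/nk) Σ{j : yj = k} xij

    σ̂ik² = (1/nk) Σ{j : yj = k} (xij – μ̂ik)²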

32
Example: GNB for classifying mental
states [Mitchell et al.]

~1 mm resolution
~2 images per sec.
15,000 voxels/image
non-invasive, safe
measures Blood Oxygen
Level Dependent (BOLD)
response

33
Gaussian Naïve Bayes: Learned μvoxel,word
[Mitchell et al.]

15,000 voxels or features

10 training examples or subjects per class

34
Learned Naïve Bayes Models –
Means for P(BrainActivity | WordCategory)
Pairwise classification accuracy: 85% [Mitchell et al.]
People words Animal words

35
What you should know…
• Optimal decision using Bayes Classifier
• Naïve Bayes classifier
– What’s the assumption
– Why we use it
– How do we learn it
– Why is Bayesian estimation important
• Text classification
– Bag of words model
• Gaussian NB
– Features are still conditionally independent
– Each feature has a Gaussian distribution given class

36
