Lecture 4
• Conjugate priors
– Mean: Gaussian prior
– Variance: Wishart Distribution
p(μ) = N(η, λ²)
MAP for Gaussian Mean
(Assuming known variance σ²)

μ̂_MAP = (σ²η + λ² Σ_i x_i) / (σ² + nλ²)

Independent of σ² if λ² = σ²/m, i.e., if the prior is treated as m virtual examples with value η:

μ̂_MAP = (mη + Σ_i x_i) / (m + n)
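A minimal numerical sketch of the estimate above, assuming a made-up toy sample and prior; it checks that choosing λ² = σ²/m makes the MAP estimate independent of σ².

```python
import numpy as np

def map_gaussian_mean(x, sigma2, eta, lam2):
    """MAP estimate of a Gaussian mean with known variance sigma2
    and Gaussian prior N(eta, lam2) on the mean."""
    n = len(x)
    return (sigma2 * eta + lam2 * np.sum(x)) / (sigma2 + n * lam2)

rng = np.random.default_rng(0)
sigma2 = 4.0                                   # known data variance
x = rng.normal(3.0, np.sqrt(sigma2), size=20)  # toy sample

eta, m = 0.0, 5          # prior mean, number of "virtual" examples
lam2 = sigma2 / m        # choosing lam2 = sigma2/m ...

mu_map = map_gaussian_mean(x, sigma2, eta, lam2)
# ... makes the estimate equal to (m*eta + sum(x)) / (m + n), independent of sigma2
assert np.isclose(mu_map, (m * eta + x.sum()) / (m + len(x)))
print("MLE:", x.mean(), "MAP:", mu_map)
```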
Bayes Optimal Classifier
Aarti Singh
[Figure: example documents labeled Sports, Science, News; features X = document text, labels Y = topic]
Probability of Error

P(error) = P(f(X) ≠ Y), the probability that the classifier f mislabels a random example.
Optimal Classification
Optimal predictor (Bayes classifier):
f* = arg min_f P(f(X) ≠ Y)

Optimal classifier:
f*(x) = arg max_y P(Y = y | X = x)

[Figure: class-conditional densities and the resulting decision boundary]
Example Decision Boundaries
• Gaussian class conditional densities (2-dimensions/features)
[Figure: decision boundaries induced by the Gaussian class-conditional densities]
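A small sketch of this setting, assuming two made-up 2-D Gaussian class-conditional densities with equal priors: the Bayes classifier picks the class with the larger P(X=x|Y=y)P(Y=y), and the points where the two scores tie form the decision boundary.

```python
import numpy as np

# Made-up class-conditional densities P(X | Y=y): 2-D Gaussians
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
covs  = {0: np.eye(2),            1: np.array([[1.5, 0.3], [0.3, 0.5]])}
priors = {0: 0.5, 1: 0.5}         # class priors P(Y=y)

def gaussian_pdf(x, mean, cov):
    # Density of a multivariate Gaussian at point x
    d = len(mean)
    diff = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def bayes_classify(x):
    # Bayes classifier: argmax_y P(X=x | Y=y) P(Y=y), proportional to P(Y=y | X=x)
    scores = {y: gaussian_pdf(x, means[y], covs[y]) * priors[y] for y in priors}
    return max(scores, key=scores.get)

print(bayes_classify(np.array([0.2, -0.1])))   # near class 0's mean -> 0
print(bayes_classify(np.array([2.1, 0.9])))    # near class 1's mean -> 1
```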
Learning the Optimal Classifier

Optimal classifier:
f*(x) = arg max_y P(Y = y | X = x) = arg max_y P(X = x | Y = y) P(Y = y)

Need to estimate the class prior P(Y) and the class-conditional P(X|Y) from a training set of n rows. With d binary features, a full table for P(X|Y) has K(2^d − 1) free parameters, far more than n rows of data can support.

Conditional Independence

• X is conditionally independent of Y given Z if P(X | Y, Z) = P(X | Z)
• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Note: does NOT mean Thunder is independent of Rain
Conditional vs. Marginal Independence

• C calls A and B separately and tells them a number n ∈ {1,…,10}
• Due to noise in the phone, A and B each imperfectly (and
independently) draw a conclusion about what the number was.
• A thinks the number was na and B thinks it was nb.
• Are na and nb marginally independent?
– No, we expect e.g. P(na = 1 | nb = 1) > P(na = 1)
• Are na and nb conditionally independent given n?
– Yes, because if we know the true number, the outcomes na
and nb are purely determined by the noise in each phone.
P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)
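A quick simulation of this scenario, assuming a made-up noise model (each listener hears the true number with probability 0.7, otherwise a uniformly random one); it shows na and nb are marginally dependent but conditionally independent given n.

```python
import random
random.seed(0)

def hear(n, p_correct=0.7):
    # A listener hears the true number with prob. p_correct, else a random one
    return n if random.random() < p_correct else random.randint(1, 10)

trials = [(n, hear(n), hear(n)) for n in (random.randint(1, 10) for _ in range(200000))]

# Marginal: P(na=1 | nb=1) vs P(na=1)
p_na1 = sum(a == 1 for _, a, _ in trials) / len(trials)
nb1 = [a for _, a, b in trials if b == 1]
p_na1_given_nb1 = sum(a == 1 for a in nb1) / len(nb1)

# Conditional: P(na=1 | nb=1, n=2) vs P(na=1 | n=2)
n2 = [(a, b) for n, a, b in trials if n == 2]
p_na1_given_n2 = sum(a == 1 for a, _ in n2) / len(n2)
n2_nb1 = [a for a, b in n2 if b == 1]
p_na1_given_n2_nb1 = sum(a == 1 for a in n2_nb1) / len(n2_nb1)

print(f"P(na=1)       ~ {p_na1:.3f}   P(na=1 | nb=1)      ~ {p_na1_given_nb1:.3f}")
print(f"P(na=1 | n=2) ~ {p_na1_given_n2:.3f}   P(na=1 | nb=1, n=2) ~ {p_na1_given_n2_nb1:.3f}")
```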
Prediction using Conditional Independence

• Predict Lightning
• From two conditionally independent features
  – Thunder
  – Rain
  P(Thunder, Rain | Lightning) = P(Thunder | Lightning) P(Rain | Lightning)
  – More generally: P(X1, …, Xd | Y) = Π_i P(Xi | Y)
• Decision rule (naïve Bayes classifier), sketched below:
  y* = arg max_y P(Y = y) Π_i P(Xi = xi | Y = y)
  – For the likelihood, we only need to estimate each P(Xi | Y) separately, instead of the full joint P(X1, …, Xd | Y)
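A small sketch of the naïve Bayes decision rule for this example; the probability tables for P(Lightning), P(Thunder | Lightning), and P(Rain | Lightning) are made up for illustration.

```python
# Made-up conditional probability tables (values are illustrative only)
p_lightning = {True: 0.05, False: 0.95}                 # P(Lightning = y)
p_thunder   = {True: {True: 0.9,  False: 0.1},          # P(Thunder = t | Lightning = y)
               False: {True: 0.05, False: 0.95}}
p_rain      = {True: {True: 0.7,  False: 0.3},          # P(Rain = r | Lightning = y)
               False: {True: 0.2,  False: 0.8}}

def predict_lightning(thunder, rain):
    # Naive Bayes decision rule: argmax_y P(y) * P(thunder | y) * P(rain | y)
    scores = {y: p_lightning[y] * p_thunder[y][thunder] * p_rain[y][rain]
              for y in (True, False)}
    return max(scores, key=scores.get)

print(predict_lightning(thunder=True,  rain=True))    # -> True
print(predict_lightning(thunder=False, rain=False))   # -> False
```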
Subtlety 1 – Violation of NB Assumption

• Usually, features are not conditionally independent:
  P(X1, …, Xd | Y) ≠ Π_i P(Xi | Y)
• Nonetheless, naïve Bayes often performs well in practice even when the assumption is violated.
Subtlety 2 – Insufficient training data

• What if you never observe some feature value together with some class in the training data, e.g., Xi = a with Y = b? Then the MLE gives P(Xi = a | Y = b) = 0, and the posterior for class b is zero no matter what the other features say.
• What now???
MLE vs. MAP

MAP estimate (with a Beta/Dirichlet prior):

P̂(Y = b) = (#D{Y = b} + m_b) / (n + Σ_b' m_b')

where m_b is the number of virtual examples with Y = b.

Now, even if you never observe a class/feature, the posterior probability is never zero.
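A sketch of this smoothed estimate, assuming m virtual examples are spread uniformly over the possible values; with m > 0 no estimated probability can be exactly zero, even for values never seen in training.

```python
from collections import Counter

def map_estimate(observations, values, m=1):
    """P_hat(v) = (#{obs == v} + m) / (n + m * |values|),
    i.e., the MLE plus m 'virtual' examples for every possible value."""
    counts = Counter(observations)
    n = len(observations)
    return {v: (counts[v] + m) / (n + m * len(values)) for v in values}

# A value never observed in training still gets nonzero probability:
obs = ["sports", "sports", "news", "sports", "news"]
print(map_estimate(obs, values=["sports", "news", "science"], m=1))
# {'sports': 0.5, 'news': 0.375, 'science': 0.125}
```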
Case Study: Text Classification
• Classify e-mails
– Y = {Spam, NotSpam}
• Classify news articles
– Y = {what is the topic of the article?}
• Classify webpages
– Y = {Student, professor, project, …}
NB for Text Classification
• P(X|Y) is huge!!!
– An article contains at least 1000 words: X = {X1, …, X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more), 10,000 words, etc.
Bag of words model
• Typical additional assumption – Position in document doesn’t
matter: P(Xi=xi|Y=y) = P(Xk=xi|Y=y)
– “Bag of words” model – order of words on the page ignored
– Sounds really silly, but often works very well!
Bag of words approach
word      count
aardvark  0
about     2
all       2
Africa    1
apple     0
anxious   0
...
gas       1
...
oil       1
…
Zaire     0
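A minimal sketch of turning a document into this kind of word-count vector; the vocabulary and the document text here are made up.

```python
import re
from collections import Counter

vocab = ["aardvark", "about", "all", "africa", "apple", "anxious", "gas", "oil", "zaire"]

def bag_of_words(document, vocab):
    # Tokenize, then count how often each vocabulary word appears; word order is ignored
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    return {w: counts[w] for w in vocab}

doc = "All about oil: gas prices in Africa worry all, says report about Zaire"
print(bag_of_words(doc, vocab))
# {'aardvark': 0, 'about': 2, 'all': 2, 'africa': 1, 'apple': 0, 'anxious': 0,
#  'gas': 1, 'oil': 1, 'zaire': 1}
```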
NB with Bag of Words for text classification

• Learning phase:
  – Class prior P(Y) (explore in HW)
  – Likelihoods P(Xi|Y)
• Test phase:
  – For each document, use the naïve Bayes decision rule (sketched below)
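A compact sketch of both phases under the bag-of-words assumptions, with a tiny made-up training set; log-probabilities avoid underflow, and m = 1 virtual count per vocabulary word keeps the likelihoods nonzero.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_nb(docs, labels, m=1):
    """Learning phase: estimate class prior P(Y) and per-word likelihoods P(word | Y)."""
    n = len(docs)
    prior = {y: labels.count(y) / n for y in set(labels)}
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(tokenize(doc))
    vocab = {w for c in word_counts.values() for w in c}
    likelihood = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())
        # Smoothed estimate with m virtual counts per vocabulary word
        likelihood[y] = {w: (counts[w] + m) / (total + m * len(vocab)) for w in vocab}
    return prior, likelihood, vocab

def classify_nb(doc, prior, likelihood, vocab):
    """Test phase: naive Bayes decision rule, argmax_y log P(y) + sum_i log P(word_i | y)."""
    words = [w for w in tokenize(doc) if w in vocab]
    scores = {y: math.log(prior[y]) + sum(math.log(likelihood[y][w]) for w in words)
              for y in prior}
    return max(scores, key=scores.get)

docs   = ["cheap pills buy now", "meeting agenda attached",
          "buy cheap watches now", "project meeting tomorrow"]
labels = ["Spam", "NotSpam", "Spam", "NotSpam"]
prior, likelihood, vocab = train_nb(docs, labels)
print(classify_nb("buy pills now", prior, likelihood, vocab))                   # -> Spam
print(classify_nb("agenda for the project meeting", prior, likelihood, vocab))  # -> NotSpam
```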
Twenty news groups results
Learning curve for twenty news groups
What if features are continuous?

E.g., character recognition: Xi is the intensity at the ith pixel.

Gaussian Naïve Bayes (GNB): P(Xi = x | Y = k) = N(x; μ_ik, σ²_ik), with a different mean and variance for each class k and each pixel i.

Sometimes assume the variance
• is independent of Y (i.e., σ_i),
• or independent of Xi (i.e., σ_k),
• or both (i.e., σ)
Estimating parameters:
Y discrete, Xi continuous

Maximum likelihood estimates:

μ̂_ik = (1 / #D{Y = k}) Σ_{j: y_j = k} x_ij

σ̂²_ik = (1 / #D{Y = k}) Σ_{j: y_j = k} (x_ij − μ̂_ik)²

where k indexes the class, j indexes the training image, and x_ij is the ith pixel in the jth training image.
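A sketch of these maximum likelihood estimates on made-up data: rows are training "images", columns are pixels, and each class gets its own per-pixel mean and variance, which are then plugged into the naïve Bayes decision rule.

```python
import numpy as np

def fit_gnb(X, y):
    """MLE of per-class, per-pixel mean and variance for Gaussian naive Bayes.
    X: (n_examples, n_pixels) array, y: (n_examples,) array of class labels."""
    classes = np.unique(y)
    mu    = np.array([X[y == k].mean(axis=0) for k in classes])
    var   = np.array([X[y == k].var(axis=0)  for k in classes])  # divide by count (MLE)
    prior = np.array([np.mean(y == k) for k in classes])
    return classes, prior, mu, var

def predict_gnb(x, classes, prior, mu, var, eps=1e-9):
    # argmax_k log P(Y=k) + sum_i log N(x_i; mu_ik, var_ik)
    log_lik = -0.5 * (np.log(2 * np.pi * (var + eps)) + (x - mu) ** 2 / (var + eps)).sum(axis=1)
    return classes[np.argmax(np.log(prior) + log_lik)]

rng = np.random.default_rng(1)
X0 = rng.normal(0.2, 0.1, size=(50, 16))   # class 0: dim pixels (toy images)
X1 = rng.normal(0.8, 0.1, size=(50, 16))   # class 1: bright pixels
X, y = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)

classes, prior, mu, var = fit_gnb(X, y)
print(predict_gnb(np.full(16, 0.75), classes, prior, mu, var))   # -> 1
```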
Example: GNB for classifying mental states [Mitchell et al.]

fMRI data: ~1 mm resolution, ~2 images per sec., 15,000 voxels/image; non-invasive, safe; measures the Blood Oxygen Level Dependent (BOLD) response.
Gaussian Naïve Bayes: Learned μ_voxel,word [Mitchell et al.]

15,000 voxels (features); 10 training examples (subjects) per class.
Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)

Pairwise classification accuracy: 85% [Mitchell et al.]
[Figure panels: People words vs. Animal words]
What you should know…
• Optimal decision using Bayes Classifier
• Naïve Bayes classifier
– What’s the assumption
– Why we use it
– How do we learn it
– Why is Bayesian estimation important
• Text classification
– Bag of words model
• Gaussian NB
– Features are still conditionally independent
– Each feature has a Gaussian distribution given the class