
Supervised Learning – Classification (1)

COMP9417 Machine Learning and Data Mining

March 14, 2017

COMP9417 ML & DM Classification (1) March 14, 2017 1 / 130


Acknowledgements

Material derived from slides for the book


“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/

Material derived from slides for the book


“Machine Learning: A Probabilistic Perspective” by K. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook

Material derived from slides for the book


“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook

Material derived from slides for the book


“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml

Material derived from slides for the book


“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html

Material derived from slides for the course


“Machine Learning” by A. Srinivasan

BITS Pilani, Goa, India (2016)

COMP9417 ML & DM Classification (1) March 14, 2017 2 / 130


Aims

This lecture will introduce you to machine learning approaches to the
problem of classification. Following it you should be able to
reproduce theoretical results, outline algorithmic techniques and describe
practical applications for the topics:
• describe distance measures and how they are used in classification
• outline the basic k-nearest neighbour classification method
• define MAP and ML inference using Bayes theorem
• define the Bayes optimal classification rule in terms of MAP inference
• outline the Naive Bayes classification algorithm
• describe typical applications of Naive Bayes for text classification
• outline the Perceptron classification algorithm

COMP9417 ML & DM Classification (1) March 14, 2017 3 / 130


Introduction

Introduction

Classification methods dominate machine learning . . .


. . . however, they often don’t have nice theoretical properties like
regression, so they are more complicated to analyse. We will mostly focus on
their advantages and disadvantages as learning methods first, and point to
unifying ideas and concepts where applicable. In this lecture we focus on
classification methods that are essentially linear models.

COMP9417 ML & DM Classification (1) March 14, 2017 4 / 130


Nearest neighbour classification

Nearest neighbour classification


• Related to the simplest form of learning: rote learning or
memorization
• Training instances are searched for the instance that most closely
resembles the new or query instance
• The instances themselves represent the knowledge
• Called: instance-based, memory-based learning or case-based learning;
often a form of local learning
• The similarity or distance function defines “learning”, i.e., how to go
beyond simple memorization
• Intuitive idea — instances “close by”, i.e., neighbours or exemplars,
should be classified similarly
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, IBk, . . .
• Also important for unsupervised methods, e.g., clustering (later
lectures)
COMP9417 ML & DM Classification (1) March 14, 2017 5 / 130
Distance-based models

Minkowski distance

Minkowski distance: If $\mathcal{X} = \mathbb{R}^d$, the Minkowski distance of order $p > 0$ is
defined as
$$\mathrm{Dis}_p(\mathbf{x}, \mathbf{y}) = \left( \sum_{j=1}^{d} |x_j - y_j|^p \right)^{1/p} = \|\mathbf{x} - \mathbf{y}\|_p$$
where $\|\mathbf{z}\|_p = \left( \sum_{j=1}^{d} |z_j|^p \right)^{1/p}$ is the $p$-norm (sometimes denoted the $L_p$
norm) of the vector $\mathbf{z}$.

COMP9417 ML & DM Classification (1) March 14, 2017 6 / 130
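
As an illustration (not from the original slides), here is a minimal Python sketch of the Minkowski distance defined above; the function name and the example points are our own.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if p == math.inf:  # Chebyshev distance as the limit p -> infinity
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))         # Manhattan distance: 7.0
print(minkowski(x, y, 2))         # Euclidean distance: 5.0
print(minkowski(x, y, math.inf))  # Chebyshev distance: 4.0
```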


Distance-based models

Minkowski distance

• The 2-norm refers to the familiar Euclidean distance


$$\mathrm{Dis}_2(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2} = \sqrt{(\mathbf{x} - \mathbf{y})^{\mathrm{T}}(\mathbf{x} - \mathbf{y})}$$

which measures distance ‘as the crow flies’.


• The 1-norm denotes Manhattan distance, also called cityblock
distance:
$$\mathrm{Dis}_1(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{d} |x_j - y_j|$$

This is the distance if we can only travel along coordinate axes.

COMP9417 ML & DM Classification (1) March 14, 2017 7 / 130


Distance-based models

Minkowski distance

• If we now let p grow larger, the distance will be more and more
dominated by the largest coordinate-wise distance, from which we can
infer that Dis∞ (x, y) = maxj |xj − yj |; this is also called Chebyshev
distance.
• You will sometimes see references to the 0-norm (or L0 norm) which
counts the number of non-zero elements in a vector. The
corresponding distance then counts the number of positions in which
vectors x and y differ. This is not strictly a Minkowski distance;
however, we can define it as
$$\mathrm{Dis}_0(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{d} (x_j - y_j)^0 = \sum_{j=1}^{d} I[x_j \neq y_j]$$
under the understanding that $x^0 = 0$ for $x = 0$ and 1 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 8 / 130


Distance-based models

Minkowski distance

Sometimes the data X is not naturally in Rd , but if we can turn it into


Boolean features, or character sequences, we can still apply distance
measures. For example:
• If x and y are binary strings, this is also called the Hamming
distance. Alternatively, we can see the Hamming distance as the
number of bits that need to be flipped to change x into y.
• For non-binary strings of unequal length this can be generalised to the
notion of edit distance or Levenshtein distance.

COMP9417 ML & DM Classification (1) March 14, 2017 9 / 130


Distance-based models

Circles and ellipses

Lines connecting points at order-p Minkowski distance 1 from the origin


for (from inside) p = 0.8; p = 1 (Manhattan distance, the rotated square
in red); p = 1.5; p = 2 (Euclidean distance, the violet circle); p = 4;
p = 8; and p = ∞ (Chebyshev distance, the blue rectangle). Notice that
for points on the coordinate axes all distances agree. For the other points,
our reach increases with p; however, if we require a rotation-invariant
distance metric then Euclidean distance is our only choice.
COMP9417 ML & DM Classification (1) March 14, 2017 10 / 130
Distance-based models

Distance metric

Distance metric Given an instance space X , a distance metric is a


function Dis : X × X → R such that for any x, y, z ∈ X :
• distances between a point and itself are zero: Dis(x, x) = 0;
• all other distances are larger than zero: if x ≠ y then Dis(x, y) > 0;
• distances are symmetric: Dis(y, x) = Dis(x, y);
• detours can not shorten the distance:
Dis(x, z) ≤ Dis(x, y) + Dis(y, z).
If the second condition is weakened to a non-strict inequality – i.e.,
Dis(x, y) may be zero even if x ≠ y – the function Dis is called a
pseudo-metric.

COMP9417 ML & DM Classification (1) March 14, 2017 11 / 130


Distance-based models

The triangle inequality – Minkowski distance for p = 2

[Figure: points A, B and C, with the green, orange and red circles described below.]

The green circle connects points the same Euclidean distance (i.e.,
Minkowski distance of order p = 2) away from the origin as A. The orange
circle shows that B and C are equidistant from A. The red circle
demonstrates that C is closer to the origin than B, which conforms to the
triangle inequality.
COMP9417 ML & DM Classification (1) March 14, 2017 12 / 130
Distance-based models

The triangle inequality – Minkowski distance for p ≤ 1

[Figure: points A, B and C, as above, with Minkowski “circles” for p = 1 and p = 0.8.]

With Manhattan distance (p = 1), B and C are equally close to the origin
and also equidistant from A. With p < 1 (here, p = 0.8) C is further away
from the origin than B; since both are again equidistant from A, it follows
that travelling from the origin to C via A is quicker than going there
directly, which violates the triangle inequality.
COMP9417 ML & DM Classification (1) March 14, 2017 13 / 130
Distance-based models

Mahalanobis distance

Often, the shape of the ellipse is estimated from data as the inverse of the
covariance matrix: M = Σ−1 . This leads to the definition of the
Mahalanobis distance
$$\mathrm{Dis}_M(\mathbf{x}, \mathbf{y} \mid \Sigma) = \sqrt{(\mathbf{x} - \mathbf{y})^{\mathrm{T}} \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$$

Using the covariance matrix in this way has the effect of decorrelating and
normalising the features.

Clearly, Euclidean distance is a special case of Mahalanobis distance with


the identity matrix I as covariance matrix: Dis2 (x, y) = DisM (x, y|I).

COMP9417 ML & DM Classification (1) March 14, 2017 14 / 130
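
A small NumPy sketch of the Mahalanobis distance above (illustrative only; the data, variable names and the use of the sample covariance are our own assumptions).

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y given a covariance matrix."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated 2-D data
Sigma = np.cov(X, rowvar=False)                                     # covariance estimated from data

x, y = X[0], X[1]
print(mahalanobis(x, y, Sigma))      # decorrelated, normalised distance
print(mahalanobis(x, y, np.eye(2)))  # identity covariance: reduces to Euclidean distance
print(np.linalg.norm(x - y))         # matches the Euclidean distance above
```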


Distance-based models Neighbours and exemplars

Means and distances

The arithmetic mean minimises squared Euclidean distance: the
arithmetic mean $\mu$ of a set of data points $D$ in a Euclidean space is the
unique point that minimises the sum of squared Euclidean distances to
those data points.

Proof. We will show that $\arg\min_{\mathbf{y}} \sum_{\mathbf{x} \in D} \|\mathbf{x} - \mathbf{y}\|^2 = \mu$, where $\|\cdot\|$
denotes the 2-norm. We find this minimum by taking the gradient (the
vector of partial derivatives with respect to $y_i$) of the sum and setting it to
the zero vector:
$$\nabla_{\mathbf{y}} \sum_{\mathbf{x} \in D} \|\mathbf{x} - \mathbf{y}\|^2 = -2 \sum_{\mathbf{x} \in D} (\mathbf{x} - \mathbf{y}) = -2 \sum_{\mathbf{x} \in D} \mathbf{x} + 2|D|\mathbf{y} = \mathbf{0}$$
from which we derive $\mathbf{y} = \frac{1}{|D|} \sum_{\mathbf{x} \in D} \mathbf{x} = \mu$.

COMP9417 ML & DM Classification (1) March 14, 2017 15 / 130


Distance-based models Neighbours and exemplars

Means and distances

• Notice that minimising the sum of squared Euclidean distances of a


given set of points is the same as minimising the average squared
Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t
it be more natural to take the point that minimises total Euclidean
distance as exemplar?
• This point is known as the geometric median, as for univariate data it
corresponds to the median or ‘middle value’ of a set of numbers.
However, for multivariate data there is no closed-form expression for
the geometric median, which needs to be calculated by successive
approximation.

COMP9417 ML & DM Classification (1) March 14, 2017 16 / 130
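
The geometric median has no closed form and must be found by successive approximation, as noted above. Below is a hedged sketch of one standard scheme (a Weiszfeld-style fixed-point iteration); it is not part of the slides, and the tolerances and toy data are illustrative.

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-8):
    """Approximate the point minimising the sum of (unsquared) Euclidean distances."""
    y = X.mean(axis=0)                      # start from the arithmetic mean
    for _ in range(n_iter):
        d = np.linalg.norm(X - y, axis=1)
        d = np.where(d < eps, eps, d)       # avoid division by zero at data points
        w = 1.0 / d
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])  # one outlier
print(X.mean(axis=0))       # the mean is pulled towards the outlier
print(geometric_median(X))  # the geometric median stays near the bulk of the points
```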


Distance-based models Neighbours and exemplars

Means and distances

• In certain situations it makes sense to restrict an exemplar to be one


of the given data points. In that case, we speak of a medoid, to
distinguish it from a centroid which is an exemplar that doesn’t have
to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the
total distance to all other data points, in order to choose the point
that minimises it. Regardless of the distance metric used, this is an
O(n2 ) operation for n points.
• So for medoids there is no computational reason to prefer one
distance metric over another.
• There may be more than one medoid.

COMP9417 ML & DM Classification (1) March 14, 2017 17 / 130


Distance-based models Neighbours and exemplars

Centroids and medoids

[Figure: scatter plot of the data points, with markers for the squared 2-norm centroid (mean),
the 2-norm centroid (geometric median), the squared 2-norm medoid, the 2-norm medoid and
the 1-norm medoid.]

A small data set of 10 points, with circles indicating centroids and squares
indicating medoids (the latter must be data points), for different distance
metrics. Notice how the outlier on the bottom-right ‘pulls’ the mean away
from the geometric median; as a result the corresponding medoid changes
as well.
COMP9417 ML & DM Classification (1) March 14, 2017 18 / 130
Distance-based models Neighbours and exemplars

The basic linear classifier is distance-based

• The basic linear classifier constructs the decision boundary as the


perpendicular bisector of the line segment connecting the two
exemplars (one for each class).
• An alternative, distance-based way to classify instances without direct
reference to a decision boundary is by the following decision rule: if x
is nearest to µ⊕ then classify it as positive, otherwise as negative; or
equivalently, classify an instance to the class of the nearest exemplar.
• If we use Euclidean distance as our closeness measure, simple
geometry tells us we get exactly the same decision boundary.
• So the basic linear classifier can be interpreted from a distance-based
perspective as constructing exemplars that minimise squared
Euclidean distance within each class, and then applying a
nearest-exemplar decision rule.

COMP9417 ML & DM Classification (1) March 14, 2017 19 / 130


Distance-based models Neighbours and exemplars

Two-exemplar decision boundaries

(left) For two exemplars the nearest-exemplar decision rule with Euclidean
distance results in a linear decision boundary coinciding with the
perpendicular bisector of the line connecting the two exemplars. (right)
Using Manhattan distance the circles are replaced by diamonds.
COMP9417 ML & DM Classification (1) March 14, 2017 20 / 130
Distance-based models Neighbours and exemplars

Three-exemplar decision boundaries

(left) Decision regions defined by the 2-norm nearest-exemplar decision


rule for three exemplars. (right) With Manhattan distance the decision
regions become non-convex.
COMP9417 ML & DM Classification (1) March 14, 2017 21 / 130
Distance-based models Neighbours and exemplars

Distance-based models

To summarise, the main ingredients of distance-based models are


• distance metrics, which can be Euclidean, Manhattan, Minkowski or
Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a chosen
distance metric, or medoids that find the most centrally located data
point; and
• distance-based decision rules, which take a vote among the k nearest
exemplars.

COMP9417 ML & DM Classification (1) March 14, 2017 22 / 130


Distance-based models k-Nearest neighbour Classification

Nearest Neighbour

Stores all training examples $\langle x_i, f(x_i) \rangle$.

Nearest neighbour:
• Given query instance $x_q$, first locate nearest training example $x_n$, then
estimate $\hat{f}(x_q) \leftarrow f(x_n)$
k-Nearest neighbour:
• Given $x_q$, take vote among its $k$ nearest neighbours (if discrete-valued
target function)
• take mean of $f$ values of $k$ nearest neighbours (if real-valued)
$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$$

COMP9417 ML & DM Classification (1) March 14, 2017 23 / 130


Distance-based models k-Nearest neighbour Classification

k-Nearest Neighbour Algorithm

Training algorithm:
• For each training example $\langle x_i, f(x_i) \rangle$, add the example to the list
training examples.
Classification algorithm:
• Given a query instance $x_q$ to be classified,
• Let $x_1 \ldots x_k$ be the $k$ instances from training examples that are
nearest to $x_q$ by the distance function
• Return
$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$
where $\delta(a, b) = 1$ if $a = b$ and 0 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 24 / 130
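
A minimal Python sketch of the k-Nearest Neighbour classification algorithm above, using Euclidean distance and an unweighted majority vote; all names and the toy data are our own.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, x_q, k=3):
    """training is a list of (feature_vector, label) pairs; return the majority
    label among the k training instances nearest to the query x_q."""
    neighbours = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((3.0, 3.2), '-'), ((3.1, 2.9), '-')]
print(knn_classify(training, (1.1, 1.0), k=3))   # '+'
```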


Distance-based models k-Nearest neighbour Classification

“Hypothesis Space” for Nearest Neighbour




[Figure: (left) + and − training instances around the query point xq; (right) Voronoi tessellation.]

2 classes, + and − and query point xq . On left, note effect of varying k.


On right, 1−NN induces a Voronoi tessellation of the instance space.
Formed by the perpendicular bisectors of lines between points.

COMP9417 ML & DM Classification (1) March 14, 2017 25 / 130


Distance-based models k-Nearest neighbour Classification

Distance function again

The distance function defines what is learned.


Instance x is described by a feature vector (list of attribute-value pairs)

$\langle a_1(x), a_2(x), \ldots, a_m(x) \rangle$

where $a_r(x)$ denotes the value of the $r$-th attribute of $x$.


Most commonly used distance function is Euclidean distance . . .
• distance between two instances xi and xj is defined to be
$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{m} (a_r(x_i) - a_r(x_j))^2}$$

COMP9417 ML & DM Classification (1) March 14, 2017 26 / 130


Distance-based models k-Nearest neighbour Classification

Distance function again

Many other distance functions could be used . . .


• e.g., Manhattan or city-block distance (sum of absolute values of
differences between attributes)
$$d(x_i, x_j) = \sum_{r=1}^{m} |a_r(x_i) - a_r(x_j)|$$

Vector-based formalization – use norm L1 , L2 , . . .


The idea of distance functions will appear again in kernel methods.

COMP9417 ML & DM Classification (1) March 14, 2017 27 / 130


Distance-based models k-Nearest neighbour Classification

Normalization and other issues

• Different attributes measured on different scales


• Need to be normalized (why ?)

$$a_r = \frac{v_r - \min v_r}{\max v_r - \min v_r}$$
where $v_r$ is the actual value of attribute $r$
• Nominal attributes: distance either 0 or 1
• Common policy for missing values: assumed to be maximally distant
(given normalized attributes)

COMP9417 ML & DM Classification (1) March 14, 2017 28 / 130
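
A small sketch of the min-max normalization above (illustrative only; the attribute values are made up).

```python
def min_max_normalize(column):
    """Rescale a list of numeric attribute values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant attribute: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

temperatures = [64, 68, 72, 80, 85]
print(min_max_normalize(temperatures))  # values now lie between 0.0 and 1.0
```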


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

• Instances map to points in $\mathbb{R}^n$


• Less than 20 attributes per instance
• or number of attributes can be reduced . . .
• Lots of training data
• No requirement for “explanatory” model to be learned

COMP9417 ML & DM Classification (1) March 14, 2017 29 / 130


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

Advantages:
• Statisticians have used k-NN since early 1950s
• Can be very accurate
• As $n \to \infty$, $k \to \infty$ and $k/n \to 0$, error approaches the minimum (Bayes) error
• Training is very fast
• Can learn complex target functions
• Don’t lose information by generalization - keep all instances

COMP9417 ML & DM Classification (1) March 14, 2017 30 / 130


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

Disadvantages:
• Slow at query time: basic algorithm scans entire training data to
derive a prediction
• Assumes all attributes are equally important, so easily fooled by
irrelevant attributes
• Remedy: attribute selection or weights
• Problem of noisy instances:
• Remedy: remove from data set
• not easy – how to know which are noisy ?

COMP9417 ML & DM Classification (1) March 14, 2017 31 / 130


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

What is the inductive bias of k-NN ?


• an assumption that the classification of query instance xq will be
most similar to the classification of other instances that are nearby
according to the distance function
k-NN uses terminology from statistical pattern recognition (see below)
• Regression: approximating a real-valued target function
• Residual: the error $\hat{f}(x) - f(x)$ in approximating the target function
• Kernel function: the function of distance used to determine the weight of each
training example, i.e., the kernel function is the function $K$ s.t.
$w_i = K(d(x_i, x_q))$

COMP9417 ML & DM Classification (1) March 14, 2017 32 / 130


Distance-based models k-Nearest neighbour Classification

Nearest-neighbour classifier

• kNN uses the training data as exemplars, so training is O(n) (but


prediction is also O(n)!)
• 1NN perfectly separates training data, so low bias but high variance
• By increasing the number of neighbours k we increase bias and
decrease variance (what happens when k = n?)
• Easily adapted to real-valued targets, and even to structured objects
(nearest-neighbour retrieval). Can also output probabilities when
k>1
• Warning: in high-dimensional spaces everything is far away from
everything and so pairwise distances are uninformative (curse of
dimensionality)

COMP9417 ML & DM Classification (1) March 14, 2017 33 / 130


Distance-based models k-Nearest neighbour Classification

Distance-Weighted kNN

• Might want to weight nearer neighbours more heavily ...


• Use distance function to construct a weight wi
• Replace the final line of the classification algorithm by:

$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$$
where
$$w_i \equiv \frac{1}{d(x_q, x_i)^2}$$
and $d(x_q, x_i)$ is the distance between $x_q$ and $x_i$

COMP9417 ML & DM Classification (1) March 14, 2017 34 / 130


Distance-based models k-Nearest neighbour Classification

Distance-Weighted kNN

For real-valued target functions replace the final line of the algorithm by:
$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$$

(denominator normalizes contribution of individual weights).


Now we can consider using all the training examples instead of just k
→ using all examples (i.e., when k = n) with the rule above is called
Shepard’s method

COMP9417 ML & DM Classification (1) March 14, 2017 35 / 130
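
A hedged sketch combining the two distance-weighted rules above (weighted vote for discrete targets, weighted average for real-valued targets); with k equal to the number of training examples, the real-valued version corresponds to Shepard's method. Names and data are our own.

```python
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_knn(training, x_q, k=3, regression=False):
    """training: list of (feature_vector, target). Weights are 1/d^2; an exact
    match is returned immediately to avoid division by zero."""
    neighbours = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    votes, num, den = defaultdict(float), 0.0, 0.0
    for x_i, f_i in neighbours:
        d = euclidean(x_i, x_q)
        if d == 0.0:
            return f_i
        w = 1.0 / d ** 2
        if regression:
            num += w * f_i
            den += w
        else:
            votes[f_i] += w
    return num / den if regression else max(votes, key=votes.get)

data = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0)]
print(weighted_knn(data, (1.5,), k=3, regression=True))  # weighted average, about 2.42
```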


Distance-based models k-Nearest neighbour Classification

Evaluation

Lazy learners do not construct an explicit model, so how do we evaluate


the output of the learning process ?
• 1-NN – training set error is always zero !
• each training example is always closest to itself
• k-NN – overfitting may be hard to detect

Leave-one-out cross-validation (LOOCV) – leave out each example and


predict it given the rest:

(x1 , y1 ), (x2 , y2 ), . . . , (xi−1 , yi−1 ), (xi+1 , yi+1 ), . . . , (xn , yn )

Error is mean over all predicted examples. Fast – no models to be built !

COMP9417 ML & DM Classification (1) March 14, 2017 36 / 130
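
A sketch of leave-one-out cross-validation for a k-NN classifier, along the lines described above; the helper functions and toy data are our own.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, x_q, k):
    neighbours = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def loocv_error(data, k):
    """Leave each example out in turn, predict it from the rest, return the error rate."""
    errors = 0
    for i, (x_i, y_i) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_classify(rest, x_i, k) != y_i:
            errors += 1
    return errors / len(data)

data = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((0.9, 1.1), '+'),
        ((3.0, 3.2), '-'), ((3.1, 2.9), '-'), ((2.8, 3.0), '-')]
print(loocv_error(data, k=3))   # 0.0 on this well-separated toy set
```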


Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

Bellman (1960) coined this term in the context of dynamic programming


Imagine instances described by 20 attributes, but only 2 are relevant to
target function — “similar” examples will appear “distant”.

Curse of dimensionality: nearest neighbour is easily misled when X is
high-dimensional – the problem of irrelevant attributes

One approach:
• Stretch jth axis by weight zj , where z1 , . . . , zn chosen to minimize
prediction error
• Use cross-validation to automatically choose weights z1 , . . . , zn
• Note setting zj to zero eliminates this dimension altogether

COMP9417 ML & DM Classification (1) March 14, 2017 37 / 130


Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

• number of “cells” in the instance space grows exponentially in the


number of features
• with exponentially many cells we would need exponentially many data
points to ensure that each cell is sufficiently populated to make
nearest-neighbour predictions reliably
COMP9417 ML & DM Classification (1) March 14, 2017 38 / 130
Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross Validation Error”

Instance-based methods (IBk)


• attribute weighting: class-specific weights may be used (can result in
unclassified instances and multiple classifications)
• get Euclidean distance with weights
$$\sqrt{\sum_r w_r \, (a_r(x_q) - a_r(x))^2}$$

• Updating of weights based on nearest neighbour


• Class correct/incorrect: weight increased/decreased
• |ar (xq ) − ar (x)| small/large: amount large/small

COMP9417 ML & DM Classification (1) March 14, 2017 39 / 130


Distance-based models k-Nearest neighbour Classification

Instance-based learning (IBk)

Recap – Practical problems of 1-NN scheme:


• Slow (but fast k-d tree-based approaches exist)
• Remedy: removing irrelevant data
• Noise (but k-NN copes quite well with noise)
• Remedy: removing noisy instances
• All attributes deemed equally important
• Remedy: attribute weighting (or simply selection)
• Doesn’t perform explicit generalization
• Remedy: rule-based NN approach (like LWR)

COMP9417 ML & DM Classification (1) March 14, 2017 40 / 130


Distance-based models k-Nearest neighbour Classification

Edited NN

• Edited NN classifiers discard some of the training instances before


making predictions
• Saves memory and speeds up classification
• IB2: incremental NN learner: only incorporates misclassified instances
into the classifier
• Problem: noisy data gets incorporated
• IB3: store classification performance information with each instance
& only use in prediction if above a threshold

COMP9417 ML & DM Classification (1) March 14, 2017 41 / 130


Distance-based models k-Nearest neighbour Classification

Dealing with noise

Use larger values of k (why ?) How to find the “right” k ?


• One way: cross-validation-based k-NN classifier (but slow)
• Different approach: discarding instances that don’t perform well by
keeping success records of how well an instance does at prediction
(IB3)
• Computes confidence interval for an instance’s success rate and for
default accuracy of its class
• If lower limit of first interval is above upper limit of second one,
instance is accepted (IB3: 5%-level)
• If upper limit of first interval is below lower limit of second one,
instance is rejected (IB3: 12.5%-level)

COMP9417 ML & DM Classification (1) March 14, 2017 42 / 130


Naive Bayes

Uncertainty

As far as the laws of mathematics refer to reality, they are not


certain; as far as they are certain, they do not refer to reality.
–Albert Einstein

COMP9417 ML & DM Classification (1) March 14, 2017 43 / 130


Naive Bayes

Two Roles for Bayesian Methods

Provides practical learning algorithms:


• Naive Bayes classifier learning
• Bayesian network learning, etc.
• Combines prior knowledge (prior probabilities) with observed data
• How to get prior probabilities ?

Provides useful conceptual framework:


• Provides a “gold standard” for evaluating other learning algorithms
• Some additional insight into Occam’s razor

COMP9417 ML & DM Classification (1) March 14, 2017 44 / 130


Naive Bayes

Bayes Theorem

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
where
P (h) = prior probability of hypothesis h
P (D) = prior probability of training data D
P (h|D) = probability of h given D
P (D|h) = probability of D given h

COMP9417 ML & DM Classification (1) March 14, 2017 45 / 130


Naive Bayes

Choosing Hypotheses

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
Generally want the most probable hypothesis given the training data
Maximum a posteriori hypothesis $h_{MAP}$:
$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

COMP9417 ML & DM Classification (1) March 14, 2017 46 / 130


Naive Bayes

Choosing Hypotheses

If we assume $P(h_i) = P(h_j)$ for all $i$ and $j$, then we can further simplify, and choose the
Maximum likelihood (ML) hypothesis
$$h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$$

COMP9417 ML & DM Classification (1) March 14, 2017 47 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.

P(cancer) =                    P(¬cancer) =
P(⊕ | cancer) =                P(⊖ | cancer) =
P(⊕ | ¬cancer) =               P(⊖ | ¬cancer) =

COMP9417 ML & DM Classification (1) March 14, 2017 48 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.

P(cancer) = .008               P(¬cancer) = .992
P(⊕ | cancer) = .98            P(⊖ | cancer) = .02
P(⊕ | ¬cancer) = .03           P(⊖ | ¬cancer) = .97

COMP9417 ML & DM Classification (1) March 14, 2017 49 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


We can find the maximum a posteriori (MAP) hypothesis

P (⊕ | cancer)P (cancer) = 0.98 × 0.008 = 0.00784


P (⊕ | ¬cancer)P (¬cancer) = 0.03 × 0.992 = 0.02976

Thus hM AP = . . ..

COMP9417 ML & DM Classification (1) March 14, 2017 50 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


We can find the maximum a posteriori (MAP) hypothesis

P (⊕ | cancer)P (cancer) = 0.98 × 0.008 = 0.00784


P (⊕ | ¬cancer)P (¬cancer) = 0.03 × 0.992 = 0.02976

Thus hM AP = ¬cancer.

Also note: posterior probability of hypothesis cancer higher than prior.

COMP9417 ML & DM Classification (1) March 14, 2017 51 / 130


Naive Bayes

Applying Bayes Theorem

How to get the posterior probability of a hypothesis h ?


Divide by P (⊕), probability of data, to normalize result for h:

$$P(h|D) = \frac{P(D|h)\,P(h)}{\sum_{h_i \in H} P(D|h_i)\,P(h_i)}$$

Denominator ensures we obtain posterior probabilities that sum to 1.


Sum for all possible numerator values, since hypotheses are mutually
exclusive (e.g., patient either has cancer or does not).
Marginal likelihood (marginalizing out over hypothesis space) — prior
probability of the data.

COMP9417 ML & DM Classification (1) March 14, 2017 52 / 130
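
A small numeric check of the cancer example, normalising the two unnormalised scores into posterior probabilities as described above (purely illustrative arithmetic).

```python
# Prior and test characteristics from the example
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalised scores P(+|h) P(h)
score_cancer = p_pos_given_cancer * p_cancer   # 0.00784
score_not = p_pos_given_not * p_not            # 0.02976

# Normalise by P(+) = sum of the scores (theorem of total probability)
p_pos = score_cancer + score_not
print(round(score_cancer / p_pos, 3))   # P(cancer | +)  ~ 0.209
print(round(score_not / p_pos, 3))      # P(~cancer | +) ~ 0.791
```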


Naive Bayes

Basic Formulas for Probabilities

Product Rule: probability P (A ∧ B) of a conjunction of two events A and


B:
P (A ∧ B) = P (A|B)P (B) = P (B|A)P (A)
Sum Rule: probability of a disjunction of two events A and B:

P (A ∨ B) = P (A) + P (B) − P (A ∧ B)

Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive
with $\sum_{i=1}^{n} P(A_i) = 1$, then:
$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$

COMP9417 ML & DM Classification (1) March 14, 2017 53 / 130


Naive Bayes

Basic Formulas for Probabilities

Also worth remembering:


• Conditional Probability: probability of A given B:

$$P(A|B) = \frac{P(A \wedge B)}{P(B)}$$
• Rearrange sum rule to get:

P (A ∧ B) = P (A) + P (B) − P (A ∨ B)

Exercise: Derive Bayes Theorem.

COMP9417 ML & DM Classification (1) March 14, 2017 54 / 130


Naive Bayes

Brute Force MAP Hypothesis Learner

Idea: view learning as finding the most probable hypothesis

• For each hypothesis h in H, calculate the posterior probability

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

• Output the hypothesis $h_{MAP}$ with the highest posterior probability
$$h_{MAP} = \arg\max_{h \in H} P(h|D)$$

COMP9417 ML & DM Classification (1) March 14, 2017 55 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Consider our usual concept learning task:

• instance space X, hypothesis space H, training examples D


• consider a learning algorithm that outputs the most specific hypothesis
from the version space $VS_{H,D}$ (the set of all consistent or “zero-error”
classification rules)

What would Bayes rule produce as the MAP hypothesis?

Does this algorithm output a MAP hypothesis??

COMP9417 ML & DM Classification (1) March 14, 2017 56 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Brute Force MAP Framework for Concept Learning:

Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$

Assume $D$ is the set of classifications $D = \langle c(x_1), \ldots, c(x_m) \rangle$

Choose $P(h)$ to be the uniform distribution:
• $P(h) = \frac{1}{|H|}$ for all $h$ in $H$

Choose $P(D|h)$:
• $P(D|h) = 1$ if $h$ consistent with $D$
• $P(D|h) = 0$ otherwise

COMP9417 ML & DM Classification (1) March 14, 2017 57 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Then:
$$P(h|D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$

COMP9417 ML & DM Classification (1) March 14, 2017 58 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Note that since the likelihood is zero if $h$ is inconsistent, the posterior is
also zero. But how did we obtain the posterior for consistent $h$?
$$P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$$

COMP9417 ML & DM Classification (1) March 14, 2017 59 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

How did we obtain $P(D)$? From the theorem of total probability:
$$P(D) = \sum_{h_i \in H} P(D|h_i)\,P(h_i) = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} + \sum_{h_i \notin VS_{H,D}} 0 \cdot \frac{1}{|H|} = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}$$

COMP9417 ML & DM Classification (1) March 14, 2017 60 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Evolution of Posterior Probabilities

[Figure: evolution of posterior probabilities over the hypothesis space: (a) P(h), (b) P(h|D1), (c) P(h|D1, D2).]

COMP9417 ML & DM Classification (1) March 14, 2017 61 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Every hypothesis consistent with D is a MAP hypothesis, if we assume


• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)
So, this learning algorithm will output a MAP hypothesis, even though it
does not explicitly use probabilities in learning.

COMP9417 ML & DM Classification (1) March 14, 2017 62 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

E.g., learning a linear target function f from noisy examples:

[Figure: noisy training examples of a linear target function f, with the maximum likelihood hypothesis hML fit to them.]
COMP9417 ML & DM Classification (1) March 14, 2017 63 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

Consider any real-valued target function $f$

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value
• $d_i = f(x_i) + e_i$
• $e_i$ is a random variable (noise) drawn independently for each $x_i$
according to some Gaussian (normal) distribution with mean 0
Then the maximum likelihood hypothesis $h_{ML}$ is the one that
minimizes the sum of squared errors:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$

COMP9417 ML & DM Classification (1) March 14, 2017 64 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

How did we obtain this?
$$h_{ML} = \arg\max_{h \in H} p(D|h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i|h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$$

Recall that we treat each probability $p(D|h)$ as if $h = f$, i.e., we assume
$\mu = f(x_i) = h(x_i)$, which is the key idea behind maximum likelihood!

COMP9417 ML & DM Classification (1) March 14, 2017 65 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

Maximize the natural log to give a simpler expression:
$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \right] = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 = \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2$$

Equivalently, we can minimize the positive version of the expression:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$

COMP9417 ML & DM Classification (1) March 14, 2017 66 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Discriminative and generative probabilistic models


• Discriminative models model the posterior probability distribution
P (Y |X), where Y is the target variable and X are the features. That
is, given X they return a probability distribution over Y .
• Generative models model the joint distribution P (Y, X) of the target
Y and the feature vector X. Once we have access to this joint
distribution we can derive any conditional or marginal distribution
involving the same variables. In particular, since
$P(X) = \sum_y P(Y = y, X)$ it follows that the posterior distribution
can be obtained as $P(Y|X) = \frac{P(Y,X)}{\sum_y P(Y = y, X)}$.
• Alternatively, generative models can be described by the likelihood
function P (X|Y ), since P (Y, X) = P (X|Y )P (Y ) and the target or
prior distribution (usually abbreviated to ‘prior’) can be easily
estimated or postulated.
• Such models are called ‘generative’ because we can sample from the
joint distribution to obtain new data points together with their labels.
Alternatively, we can use P (Y ) to sample a class and P (X|Y ) to
sample a data point for that class.

COMP9417 ML & DM Classification (1) March 14, 2017 67 / 130
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Assessing uncertainty in estimates


Suppose we want to estimate the probability θ that an arbitrary e-mail is
spam, so that we can use the appropriate prior distribution.
• The natural thing to do is to inspect n e-mails, determine the number
of spam e-mails d, and set θ̂ = d/n; we don’t really need any
complicated statistics to tell us that.
• However, while this is the most likely estimate of θ – the maximum a
posteriori (MAP) estimate – this doesn’t mean that other values of θ
are completely ruled out.
• We model this by a probability distribution over θ (a Beta distribution
in this case) which is updated each time new information comes in.
This is further illustrated in the figure for a distribution that is more
and more skewed towards spam.
• For each curve, its bias towards spam is given by the area under the
curve and to the right of θ = 1/2.

COMP9417 ML & DM Classification (1) March 14, 2017 68 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Assessing uncertainty in estimates


[Figure: posterior distributions over θ, as described below.]

Each time we inspect an e-mail, we are reducing our uncertainty regarding


the prior spam probability θ. After we inspect two e-mails and observe one
spam, the possible θ values are characterised by a symmetric distribution
around 1/2. If we inspect a third, fourth, . . . , tenth e-mail and each time
(except the first one) it is spam, then this distribution narrows and shifts a
little bit to the right each time. The distribution for n e-mails reaches its
maximum at $\hat{\theta}_{MAP} = \frac{n-1}{n}$ (e.g., $\hat{\theta}_{MAP} = 0.8$ for $n = 5$).
COMP9417 ML & DM Classification (1) March 14, 2017 69 / 130
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

The Bayesian perspective


Explicitly modelling the posterior distribution over the parameter θ has a
number of advantages that are usually associated with the ‘Bayesian’
perspective:
• We can precisely characterise the uncertainty that remains about our
estimate by quantifying the spread of the posterior distribution.
• We can obtain a generative model for the parameter by sampling
from the posterior distribution, which contains much more
information than a summary statistic such as the MAP estimate can
convey – so, rather than using a single e-mail with θ = θMAP , our
generative model can contain a number of e-mails with θ sampled
from the posterior distribution.
• We can quantify the probability of statements such as ‘e-mails are
biased towards ham’ (the tiny shaded area in the figure demonstrates
that after observing one ham and nine spam e-mails this probability is
very small, about 0.6%).
• We can use one of these distributions to encode our prior beliefs: e.g.,
if we believe spam and ham to be roughly equally likely we can start
from a prior distribution peaked around θ = 1/2.
COMP9417 ML & DM Classification (1) March 14, 2017 70 / 130
Minimum Description Length Principle

Minimum Description Length Principle

Once again, the MAP hypothesis
$$h_{MAP} = \arg\max_{h \in H} P(D|h)\,P(h)$$

Which is equivalent to
$$h_{MAP} = \arg\max_{h \in H} \log_2 P(D|h) + \log_2 P(h)$$

Or
$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D|h) - \log_2 P(h)$$

COMP9417 ML & DM Classification (1) March 14, 2017 71 / 130


Minimum Description Length Principle

Minimum Description Length Principle

Interestingly, this is an expression about a quantity of bits.

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D|h) - \log_2 P(h) \qquad (1)$$

From information theory:


The optimal (shortest expected coding length) code for an event
with probability p is − log2 p bits.

COMP9417 ML & DM Classification (1) March 14, 2017 72 / 130


Minimum Description Length Principle

Minimum Description Length Principle

So interpret (1):
• − log2 P (h) is length of h under optimal code
• − log2 P (D|h) is length of D given h under optimal code
Note well: assumes optimal encodings, when the priors and likelihoods
are known. In practice, this is difficult, and makes a difference.

COMP9417 ML & DM Classification (1) March 14, 2017 73 / 130


Minimum Description Length Principle

Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis

MDL: prefer the hypothesis h that minimizes

$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D|h)$$

where $L_C(x)$ is the description length of $x$ under optimal encoding $C$

COMP9417 ML & DM Classification (1) March 14, 2017 74 / 130


Minimum Description Length Principle

Minimum Description Length Principle

Example: H = decision trees, D = training data labels


• $L_{C_1}(h)$ is # bits to describe tree $h$
• $L_{C_2}(D|h)$ is # bits to describe $D$ given $h$
• Note $L_{C_2}(D|h) = 0$ if examples classified perfectly by $h$. Need only
describe exceptions
• Hence $h_{MDL}$ trades off tree size for training errors
• i.e., prefer the hypothesis that minimizes
$$\mathrm{length}(h) + \mathrm{length}(\mathit{misclassifications})$$

COMP9417 ML & DM Classification (1) March 14, 2017 75 / 130


Bayes Optimal Classifier

Most Probable Classification of New Instances

So far we’ve sought the most probable hypothesis given the data $D$ (i.e.,
$h_{MAP}$)

Given new instance $x$, what is its most probable classification?

• $h_{MAP}(x)$ is not the most probable classification!

COMP9417 ML & DM Classification (1) March 14, 2017 76 / 130


Bayes Optimal Classifier

Most Probable Classification of New Instances

Consider:
• Three possible hypotheses:
P (h1 |D) = .4, P (h2 |D) = .3, P (h3 |D) = .3
• Given new instance x,
h1 (x) = +, h2 (x) = −, h3 (x) = −
• What’s most probable classification of x?

COMP9417 ML & DM Classification (1) March 14, 2017 77 / 130


Bayes Optimal Classifier

Bayes Optimal Classifier

Bayes optimal classification:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D)$$

Example:
$P(h_1|D) = .4$, $P(-|h_1) = 0$, $P(+|h_1) = 1$
$P(h_2|D) = .3$, $P(-|h_2) = 1$, $P(+|h_2) = 0$
$P(h_3|D) = .3$, $P(-|h_3) = 1$, $P(+|h_3) = 0$

COMP9417 ML & DM Classification (1) March 14, 2017 78 / 130


Bayes Optimal Classifier

Bayes Optimal Classifier

therefore
$$\sum_{h_i \in H} P(+|h_i)\,P(h_i|D) = .4$$
$$\sum_{h_i \in H} P(-|h_i)\,P(h_i|D) = .6$$
and
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) = -$$

No other classification method using the same hypothesis space and same
prior knowledge can outperform this method on average

COMP9417 ML & DM Classification (1) March 14, 2017 79 / 130
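
A tiny sketch reproducing the Bayes optimal classification arithmetic above; encoding the hypotheses as dictionaries is our own choice.

```python
# Posterior of each hypothesis given D, and each hypothesis's class probabilities
posterior = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
class_given_h = {'h1': {'+': 1.0, '-': 0.0},
                 'h2': {'+': 0.0, '-': 1.0},
                 'h3': {'+': 0.0, '-': 1.0}}

def bayes_optimal(classes=('+', '-')):
    """Return the class maximising sum_h P(v|h) P(h|D), plus the scores."""
    scores = {v: sum(class_given_h[h][v] * posterior[h] for h in posterior)
              for v in classes}
    return max(scores, key=scores.get), scores

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```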


Naive Bayes

Naive Bayes Classifier

Along with decision trees, neural networks, nearest neighbour, one of the
most practical learning methods.

When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given
classification
Successful applications:
• Classifying text documents
• Gaussian Naive Bayes for real-valued data

COMP9417 ML & DM Classification (1) March 14, 2017 80 / 130


Naive Bayes

Naive Bayes Classifier

Assume target function $f : X \to V$, where each instance $x$ is described by
attributes $\langle a_1, a_2 \ldots a_n \rangle$.
Most probable value of $f(x)$ is:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j|a_1, a_2 \ldots a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2 \ldots a_n|v_j)\,P(v_j)}{P(a_1, a_2 \ldots a_n)} = \arg\max_{v_j \in V} P(a_1, a_2 \ldots a_n|v_j)\,P(v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 81 / 130


Naive Bayes

Naive Bayes Classifier

Naive Bayes assumption:
$$P(a_1, a_2 \ldots a_n|v_j) = \prod_i P(a_i|v_j)$$

• Attributes are statistically independent (given the class value)
• which means knowledge about the value of a particular attribute tells
us nothing about the value of another attribute (if the class is known)
which gives the
$$\text{Naive Bayes classifier: } v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 82 / 130


Naive Bayes

Naive Bayes Algorithm

Naive_Bayes_Learn(examples)

    For each target value $v_j$
        $\hat{P}(v_j) \leftarrow$ estimate $P(v_j)$
        For each attribute value $a_i$ of each attribute $a$
            $\hat{P}(a_i|v_j) \leftarrow$ estimate $P(a_i|v_j)$

Classify_New_Instance(x)

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 83 / 130
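
A hedged Python sketch of Naive_Bayes_Learn and Classify_New_Instance for nominal attributes, using simple add-one pseudo-counts at classification time to avoid the zero-frequency problem discussed a few slides later; all names and the toy examples are our own.

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, class_label). Returns class counts,
    per-(class, attribute) value counts, the values seen per attribute, and n."""
    class_counts = Counter(v for _, v in examples)
    value_counts = defaultdict(Counter)   # (class, attribute) -> Counter of values
    values_seen = defaultdict(set)        # attribute -> set of observed values
    for attrs, v in examples:
        for a, val in attrs.items():
            value_counts[(v, a)][val] += 1
            values_seen[a].add(val)
    return class_counts, value_counts, values_seen, len(examples)

def naive_bayes_classify(model, x):
    class_counts, value_counts, values_seen, n = model
    best_v, best_score = None, -1.0
    for v, n_v in class_counts.items():
        score = n_v / n                   # prior estimate P(v)
        for a, val in x.items():
            counts = value_counts[(v, a)]
            # add-one smoothed estimate of P(a = val | v)
            score *= (counts[val] + 1) / (n_v + len(values_seen[a]))
        if score > best_score:
            best_v, best_score = v, score
    return best_v

examples = [({'outlook': 'sunny', 'windy': 'false'}, 'no'),
            ({'outlook': 'overcast', 'windy': 'true'}, 'yes'),
            ({'outlook': 'rainy', 'windy': 'false'}, 'yes'),
            ({'outlook': 'sunny', 'windy': 'true'}, 'no')]
model = naive_bayes_learn(examples)
print(naive_bayes_classify(model, {'outlook': 'sunny', 'windy': 'true'}))  # 'no'
```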


Naive Bayes

Naive Bayes Example

Consider PlayTennis again . . .


Outlook Temperature Humidity Windy
Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1

Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Play
Yes No
9 5
9/14 5/14

COMP9417 ML & DM Classification (1) March 14, 2017 84 / 130


Naive Bayes

Naive Bayes Example

Say we have the new instance:

$\langle Outlk = sun, Temp = cool, Humid = high, Wind = true \rangle$

We want to compute:
$$v_{NB} = \arg\max_{v_j \in \{\text{“yes”},\text{“no”}\}} P(v_j) \prod_i P(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 85 / 130


Naive Bayes

Naive Bayes Example

So we first calculate the likelihood of the two classes, “yes” and “no”

for “yes” = $P(y) \times P(sun|y) \times P(cool|y) \times P(high|y) \times P(true|y)$
$$0.0053 = \frac{9}{14} \times \frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9}$$

for “no” = $P(n) \times P(sun|n) \times P(cool|n) \times P(high|n) \times P(true|n)$
$$0.0206 = \frac{5}{14} \times \frac{3}{5} \times \frac{1}{5} \times \frac{4}{5} \times \frac{3}{5}$$

COMP9417 ML & DM Classification (1) March 14, 2017 86 / 130


Naive Bayes

Naive Bayes Example

Then convert to a probability by normalisation

$$P(\text{“yes”}) = \frac{0.0053}{0.0053 + 0.0206} = 0.205$$
$$P(\text{“no”}) = \frac{0.0206}{0.0053 + 0.0206} = 0.795$$

The Naive Bayes classification is “no”.

COMP9417 ML & DM Classification (1) March 14, 2017 87 / 130
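
A short numeric check of the calculation above, using the relative frequencies from the PlayTennis table (arithmetic only).

```python
# Relative frequencies read off the PlayTennis counts table
p_yes, p_no = 9/14, 5/14
lik_yes = p_yes * (2/9) * (3/9) * (3/9) * (3/9)   # sunny, cool, high, windy=true | yes
lik_no  = p_no  * (3/5) * (1/5) * (4/5) * (3/5)   # sunny, cool, high, windy=true | no

print(round(lik_yes, 4), round(lik_no, 4))                  # 0.0053 0.0206
total = lik_yes + lik_no
print(round(lik_yes / total, 3), round(lik_no / total, 3))  # 0.205 0.795 -> classify "no"
```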


Naive Bayes

Naive Bayes: Subtleties


Conditional independence assumption is often violated
$$P(a_1, a_2 \ldots a_n|v_j) = \prod_i P(a_i|v_j)$$

• ...but it works surprisingly well anyway. Note we don’t need the estimated
posteriors $\hat{P}(v_j|x)$ to be correct; we need only that
$$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n|v_j)$$
i.e. maximum probability is assigned to the correct class


• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors often unrealistically close to 1 or 0
• adding too many redundant attributes will cause problems (e.g.
identical attributes)
COMP9417 ML & DM Classification (1) March 14, 2017 88 / 130
Naive Bayes

Naive Bayes: “zero-frequency” problem

What if none of the training instances with target value vj have attribute
value ai ? Then
$\hat{P}(a_i|v_j) = 0$, and...
$$\hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = 0$$

Pseudo-counts add 1 to each count (a version of the Laplace Estimator)

COMP9417 ML & DM Classification (1) March 14, 2017 89 / 130


Naive Bayes

Naive Bayes: “zero-frequency” problem

• In some cases adding a constant different from 1 might be more
appropriate
• Example: attribute outlook for class yes

  Sunny: $\frac{2 + \mu/3}{9 + \mu}$   Overcast: $\frac{4 + \mu/3}{9 + \mu}$   Rainy: $\frac{3 + \mu/3}{9 + \mu}$

• Weights don’t need to be equal (if they sum to 1) – a form of prior

  Sunny: $\frac{2 + \mu p_1}{9 + \mu}$   Overcast: $\frac{4 + \mu p_2}{9 + \mu}$   Rainy: $\frac{3 + \mu p_3}{9 + \mu}$

COMP9417 ML & DM Classification (1) March 14, 2017 90 / 130


Naive Bayes

Naive Bayes: “zero-frequency” problem

This generalisation is a Bayesian estimate for $\hat{P}(a_i|v_j)$
$$\hat{P}(a_i|v_j) \leftarrow \frac{n_c + mp}{n + m}$$
where
• $n$ is the number of training examples for which $v = v_j$,
• $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$
• $p$ is a prior estimate for $\hat{P}(a_i|v_j)$
• $m$ is the weight given to the prior (i.e. number of “virtual” examples)
This is called the m-estimate of probability.

COMP9417 ML & DM Classification (1) March 14, 2017 91 / 130


Naive Bayes

Naive Bayes: missing values

• Training: instance is not included in frequency count for attribute


value-class combination
• Classification: attribute will be omitted from calculation

COMP9417 ML & DM Classification (1) March 14, 2017 92 / 130


Naive Bayes

Naive Bayes: numeric attributes

• Usual assumption: attributes have a normal or Gaussian probability


distribution (given the class)
• The probability density function for the normal distribution is defined
by two parameters:
The sample mean $\mu$:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The standard deviation $\sigma$:
$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}$$

COMP9417 ML & DM Classification (1) March 14, 2017 93 / 130


Naive Bayes

Naive Bayes: numeric attributes

Then we have the density function $f(x)$:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Example: continuous attribute temperature with mean = 73 and standard
deviation = 6.2. Density value
$$f(\textit{temperature} = 66\,|\,\text{“yes”}) = \frac{1}{\sqrt{2\pi}\cdot 6.2} e^{-\frac{(66-73)^2}{2 \times 6.2^2}} = 0.0340$$
Missing values during training are not included in calculation of mean and
standard deviation.

COMP9417 ML & DM Classification (1) March 14, 2017 94 / 130
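
A two-line check of the density value computed above, with the mean and standard deviation from the example.

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(round(gaussian_density(66, mu=73, sigma=6.2), 4))   # 0.0340
```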


Naive Bayes

Naive Bayes: numeric attributes

Note: the normal distribution is based on the simple exponential function
$$f(x) = e^{-|x|^m}$$
As the power $m$ in the exponent increases, the function approaches a step
function.
With $m = 2$,
$$f(x) = e^{-|x|^2}$$
and this is the basis of the normal distribution – the various constants are
the result of scaling so that the integral (the area under the curve from
$-\infty$ to $+\infty$) is equal to 1.
from “Statistical Computing” by Michael J. Crawley (2002) Wiley.

COMP9417 ML & DM Classification (1) March 14, 2017 95 / 130


Naive Bayes

Categorical random variables


Categorical variables or features (also called discrete or nominal) are
ubiquitous in machine learning.
• Perhaps the most common form of the Bernoulli distribution models
whether or not a word occurs in a document. That is, for the i-th
word in our vocabulary we have a random variable Xi governed by a
Bernoulli distribution. The joint distribution over the bit vector
X = (X1 , . . . , Xk ) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for
example, every word position in an e-mail corresponds to a categorical
variable with k outcomes, where k is the size of the vocabulary. The
multinomial distribution manifests itself as a count vector: a
histogram of the number of occurrences of all vocabulary words in a
document. This establishes an alternative way of modelling text
documents that allows the number of occurrences of a word to
influence the classification of a document.
COMP9417 ML & DM Classification (1) March 14, 2017 96 / 130
Naive Bayes

Categorical random variables

Both these document models are in common use. Despite their


differences, they both assume independence between word occurrences,
generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of
the multinomial distribution, which assumes that words at different
word positions are drawn independently from the same categorical
distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit
vector are statistically independent, which allows us to compute the
joint probability of a particular bit vector (x1 , . . . , xk ) as the product
of the probabilities of each component P (Xi = xi ).
• In practice, such word independence assumptions are often not true:
if we know that an e-mail contains the word ‘Viagra’, we can be quite
sure that it will also contain the word ‘pill’. Violated independence
assumptions reduce the quality of probability estimates but may still
allow good classification performance.
COMP9417 ML & DM Classification (1) March 14, 2017 97 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Example application: Learning to Classify Text

In machine learning, the classic example of applications of Naive Bayes is


learning to classify text documents.

Here is a simplified version in the multinomial document model.

COMP9417 ML & DM Classification (1) March 14, 2017 98 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

Naive Bayes is among most effective algorithms

What attributes shall we use to represent text documents??

COMP9417 ML & DM Classification (1) March 14, 2017 99 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Target concept Interesting? : Document → {+, −}


1 Represent each document by vector of words
• one attribute per word position in document
2 Learning: Use training examples to estimate
• P (+)
• P (−)
• P (doc|+)
• P (doc|−)

COMP9417 ML & DM Classification (1) March 14, 2017 100 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Naive Bayes conditional independence assumption
$$P(doc|v_j) = \prod_{i=1}^{\mathrm{length}(doc)} P(a_i = w_k|v_j)$$
where $P(a_i = w_k|v_j)$ is the probability that the word in position $i$ is $w_k$, given $v_j$

one more assumption: $P(a_i = w_k|v_j) = P(a_m = w_k|v_j), \forall i, m$

“bag of words”

COMP9417 ML & DM Classification (1) March 14, 2017 101 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Learn_naive_Bayes_text(Examples, V)

// collect all words and other tokens that occur in Examples
Vocabulary ← all distinct words and other tokens in Examples
// calculate the required P(vj) and P(wk|vj) probability terms
for each target value vj in V do
    docsj ← subset of Examples for which the target value is vj
    P(vj) ← |docsj| / |Examples|
    Textj ← a single document created by concatenating all members of docsj
    n ← total number of words in Textj (counting duplicate words multiple times)
    for each word wk in Vocabulary
        nk ← number of times word wk occurs in Textj
        P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

COMP9417 ML & DM Classification (1) March 14, 2017 102 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Classify_naive_Bayes_text(Doc)

• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return $v_{NB}$, where
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 103 / 130
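
A hedged Python sketch of Learn_naive_Bayes_text and Classify_naive_Bayes_text above, using word counts with Laplace smoothing and log-probabilities to avoid underflow; the toy documents and all names are our own.

```python
import math
from collections import Counter, defaultdict

def learn_naive_bayes_text(examples, classes):
    """examples: list of (document_string, class). Returns the vocabulary,
    log class priors, and per-class smoothed word log-probabilities."""
    vocabulary = set(w for doc, _ in examples for w in doc.split())
    log_prior, log_p_word = {}, defaultdict(dict)
    for v in classes:
        docs_v = [doc for doc, c in examples if c == v]
        log_prior[v] = math.log(len(docs_v) / len(examples))
        counts = Counter(w for doc in docs_v for w in doc.split())
        n = sum(counts.values())
        for w in vocabulary:
            log_p_word[v][w] = math.log((counts[w] + 1) / (n + len(vocabulary)))
    return vocabulary, log_prior, log_p_word

def classify_naive_bayes_text(model, doc):
    vocabulary, log_prior, log_p_word = model
    words = [w for w in doc.split() if w in vocabulary]   # ignore out-of-vocabulary tokens
    scores = {v: log_prior[v] + sum(log_p_word[v][w] for w in words) for v in log_prior}
    return max(scores, key=scores.get)

examples = [("win money now", "spam"), ("cheap money win", "spam"),
            ("meeting agenda today", "ham"), ("project meeting notes", "ham")]
model = learn_naive_bayes_text(examples, ["spam", "ham"])
print(classify_naive_bayes_text(model, "win cheap money"))         # 'spam'
print(classify_naive_bayes_text(model, "agenda for the meeting"))  # 'ham'
```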


Naive Bayes Learning and using a naive Bayes model for classification

Application: 20 Newsgroups
Given: 1000 training documents from each group
Learning task: classify each new document by newsgroup it came from
comp.graphics misc.forsale
comp.os.ms-windows.misc rec.autos
comp.sys.ibm.pc.hardware rec.motorcycles
comp.sys.mac.hardware rec.sport.baseball
comp.windows.x rec.sport.hockey
alt.atheism sci.space
soc.religion.christian sci.crypt
talk.religion.misc sci.electronics
talk.politics.mideast sci.med
talk.politics.misc
talk.politics.guns

Naive Bayes: 89% classification accuracy


COMP9417 ML & DM Classification (1) March 14, 2017 104 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Article from rec.sport.hockey

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: [email protected] (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate
for pleasant surprise is Alex Zhitnik. He came highly touted as
a defensive defenseman, but he’s clearly much more than that.
Great skater and hard shot (though wish he were more accurate).
In fact, he pretty much allowed the Kings to trade away that
huge defensive liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any good to begin
with. But, at best, he’s only a mediocre goaltender. A better
choice would be Tomas Sandstrom, though not through any fault
of his own, but because some thugs in Toronto decided ...

COMP9417 ML & DM Classification (1) March 14, 2017 105 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning Curve for 20 Newsgroups

[Figure: learning curves on 20News for Bayes, TFIDF and PRTFIDF.]

Accuracy vs. Training set size (1/3 withheld for test)

COMP9417 ML & DM Classification (1) March 14, 2017 106 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Probabilistic decision rules

We have chosen one of the possible distributions to model our data X as


coming from either class.
• The more different P (X|Y = spam) and P (X|Y = ham) are, the
more useful the features X are for classification.
• Thus, for a specific e-mail x we calculate both P (X = x|Y = spam)
and P (X = x|Y = ham), and apply one of several possible decision
rules:
maximum likelihood (ML) – predict $\arg\max_y P(X = x|Y = y)$;
maximum a posteriori (MAP) – predict $\arg\max_y P(X = x|Y = y)\,P(Y = y)$.
The relation between these two decision rules is that ML classification is
equivalent to MAP classification with a uniform class distribution.

COMP9417 ML & DM Classification (1) March 14, 2017 107 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Probabilistic decision rules

We again use the example of Naive Bayes for text classification to


illustrate, using both the multinomial and multivariate Bernoulli models.

COMP9417 ML & DM Classification (1) March 14, 2017 108 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model

Suppose our vocabulary contains three words a, b and c, and we use a
multivariate Bernoulli model for our e-mails, with parameters
$$\theta^{\oplus} = (0.5, 0.67, 0.33) \qquad \theta^{\ominus} = (0.67, 0.33, 0.33)$$
This means, for example, that the presence of b is twice as likely in spam
(⊕), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is
described by the bit vector x = (1, 1, 0). We obtain likelihoods
$$P(x|\oplus) = 0.5 \cdot 0.67 \cdot (1 - 0.33) = 0.222$$
$$P(x|\ominus) = 0.67 \cdot 0.33 \cdot (1 - 0.33) = 0.148$$
The ML classification of x is thus spam.

COMP9417 ML & DM Classification (1) March 14, 2017 109 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model


In the case of two classes it is often convenient to work with likelihood
ratios and odds.
• The likelihood ratio can be calculated as
$$\frac{P(x|\oplus)}{P(x|\ominus)} = \frac{0.5}{0.67}\,\frac{0.67}{0.33}\,\frac{1-0.33}{1-0.33} = 3/2 > 1$$
• This means that the MAP classification of x is also spam if the prior
odds are more than 2/3, but ham if they are less than that.
• For example, with 33% spam and 67% ham the prior odds are
$\frac{P(\oplus)}{P(\ominus)} = \frac{0.33}{0.67} = 1/2$, resulting in posterior odds of
$$\frac{P(\oplus|x)}{P(\ominus|x)} = \frac{P(x|\oplus)}{P(x|\ominus)}\,\frac{P(\oplus)}{P(\ominus)} = 3/2 \cdot 1/2 = 3/4 < 1$$
In this case the likelihood ratio for x is not strong enough to push the
decision away from the prior.
COMP9417 ML & DM Classification (1) March 14, 2017 110 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model


Alternatively, we can employ a multinomial model. The parameters of a
multinomial establish a distribution over the words in the vocabulary, say
θ⊕ = (0.3, 0.5, 0.2)        θ⊖ = (0.6, 0.2, 0.2)
The e-mail to be classified contains three occurrences of word a, a single
occurrence of word b and no occurrences of word c, and hence is described
by the count vector x = (3, 1, 0). The total number of vocabulary word
occurrences is n = 4. We obtain likelihoods
P (x|⊕) = 4! · (0.3³/3!) · (0.5¹/1!) · (0.2⁰/0!) = 0.054
P (x|⊖) = 4! · (0.6³/3!) · (0.2¹/1!) · (0.2⁰/0!) = 0.1728

The likelihood ratio is (0.3/0.6)³ · (0.5/0.2)¹ · (0.2/0.2)⁰ = 5/16. The ML classification
of x is thus ham, the opposite of the multivariate Bernoulli model. This is
mainly because of the three occurrences of word a, which provide strong
evidence for ham.
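A minimal sketch of the multinomial likelihood computation (the function name is my own choice):

from math import factorial, prod

# Multinomial naive Bayes likelihood of a count vector given class parameters theta.
def multinomial_likelihood(theta, counts):
    n = sum(counts)                      # total number of vocabulary word occurrences
    return factorial(n) * prod(t ** c / factorial(c) for t, c in zip(theta, counts))

theta_spam = (0.3, 0.5, 0.2)
theta_ham = (0.6, 0.2, 0.2)
x = (3, 1, 0)

print(multinomial_likelihood(theta_spam, x))   # ≈ 0.054
print(multinomial_likelihood(theta_ham, x))    # ≈ 0.1728  -> ML decision: ham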
COMP9417 ML & DM Classification (1) March 14, 2017 111 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Training data for naive Bayes

A small e-mail data set described by count vectors.

E-mail #a #b #c Class
e1 0 3 0 +
e2 0 3 3 +
e3 3 0 0 +
e4 2 3 0 +
e5 4 3 0 −
e6 4 0 3 −
e7 3 0 0 −
e8 0 0 0 −

COMP9417 ML & DM Classification (1) March 14, 2017 112 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training data for naive Bayes

The same data set described by bit vectors.

E-mail a? b? c? Class
e1 0 1 0 +
e2 0 1 1 +
e3 1 0 0 +
e4 1 1 0 +
e5 1 1 0 −
e6 1 0 1 −
e7 1 0 0 −
e8 0 0 0 −

COMP9417 ML & DM Classification (1) March 14, 2017 113 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

Consider the following e-mails consisting of five words a, b, c, d, e:


e1: b d e b b d e            e5: a b a b a b a e d
e2: b c e b b d d e c c      e6: a c a c a c a e d
e3: a d a d e a e e          e7: e a e d a e a
e4: b a d b e d a b          e8: d e d e d
We are told that the e-mails on the left are spam and those on the right
are ham, and so we use them as a small training set to train our Bayesian
classifier.
• First, we decide that d and e are so-called stop words that are too
common to convey class information.
• The remaining words, a, b and c, constitute our vocabulary. (A short sketch of turning each e-mail into the count and bit vectors of the previous slides follows.)
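A minimal Python sketch (not part of the slides); since every "word" here is a single character, simple character counting suffices:

# Build the count vectors (multinomial) and bit vectors (Bernoulli) from the raw e-mails.
VOCAB = ("a", "b", "c")                    # after removing the stop words d and e
emails = {
    "e1": "bdebbde",   "e2": "bcebbddecc", "e3": "adadeaee", "e4": "badbedab",   # spam
    "e5": "abababaed", "e6": "acacacaed",  "e7": "eaedaea",  "e8": "deded",      # ham
}

def count_vector(text):
    return tuple(text.count(w) for w in VOCAB)

def bit_vector(text):
    return tuple(int(w in text) for w in VOCAB)

for name, text in emails.items():
    print(name, count_vector(text), bit_vector(text))
# e.g. e1 (0, 3, 0) (0, 1, 0) and e4 (2, 3, 0) (1, 1, 0), matching the earlier tables.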

COMP9417 ML & DM Classification (1) March 14, 2017 114 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

For the multinomial model, we represent each e-mail as a count vector, as


before.
• In order to estimate the parameters of the multinomial, we sum up
the count vectors for each class, which gives (5, 9, 3) for spam and
(11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for
each vocabulary word, which brings the total number of occurrences
of vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
θ̂⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam and
θ̂⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham, as checked in the sketch below.
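A minimal sketch of this smoothed estimate (the function name and pseudo-count argument are my own):

# Smoothed multinomial parameters: add one pseudo-count per vocabulary word, then normalise.
def multinomial_params(total_counts, pseudo=1):
    smoothed = [c + pseudo for c in total_counts]
    n = sum(smoothed)
    return tuple(c / n for c in smoothed)

print(multinomial_params((5, 9, 3)))    # (0.3, 0.5, 0.2)  -- spam
print(multinomial_params((11, 3, 3)))   # (0.6, 0.2, 0.2)  -- ham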

COMP9417 ML & DM Classification (1) March 14, 2017 115 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

In the multivariate Bernoulli model e-mails are represented by bit vectors,


as before.
• Adding the bit vectors for each class results in (2, 3, 1) for spam and
(3, 1, 1) for ham.
• Each count is to be divided by the number of documents in a class, in
order to get an estimate of the probability of a document containing a
particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one
containing each word and one containing none of them.
• This results in the estimated parameter vectors
θ̂⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam and
θ̂⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham, as checked in the sketch below.
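A matching sketch for the Bernoulli estimates (again with my own function name):

# Smoothed Bernoulli parameters: two pseudo-documents per class, one containing every
# vocabulary word and one containing none.
def bernoulli_params(docs_containing_word, n_docs, pseudo=1):
    return tuple((c + pseudo) / (n_docs + 2 * pseudo) for c in docs_containing_word)

print(bernoulli_params((2, 3, 1), 4))   # (0.5, 0.667, 0.333)   -- spam
print(bernoulli_params((3, 1, 1), 4))   # (0.667, 0.333, 0.333) -- ham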

COMP9417 ML & DM Classification (1) March 14, 2017 116 / 130


Linear discriminants

Linear discriminants

Many forms of linear discriminant from statistics and machine learning,


e.g.,
• Fisher’s Linear Discriminant Analysis
• the basic linear classifier in lecture 1 is a version of this
• Logistic Regression
• a probabilistic linear classifier
• Perceptron
• a linear threshold classifier

• an early version of an artificial “neuron”

• still a useful method, and source of ideas

COMP9417 ML & DM Classification (1) March 14, 2017 117 / 130


Linear discriminants

Logistic regression

In the case of a two-class problem, model the probability of one class


P (Y = 1) vs. the alternative (1 − P (Y = 1)):
P (Y = 1|x) = 1 / (1 + e^(−wᵀx))

or

ln( P (Y = 1|x) / (1 − P (Y = 1|x)) ) = wᵀx

The quantity on the left-hand side is called the logit, so all we are doing is fitting a linear
model for the logit.
Note: fitting this model is more complex than fitting linear regression, so we
omit the details.
Generalises to multiple class versions (Y can have more than two values).
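A minimal prediction-only sketch of the two-class model (the weights below are arbitrary placeholders, not fitted values):

from math import exp

# Logistic regression prediction: P(Y = 1 | x) = 1 / (1 + exp(-w.x)),
# with x in homogeneous coordinates (leading 1 for the intercept).
def p_y1(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-z))

w = (-1.0, 2.0, 0.5)          # placeholder weights: (intercept, w1, w2)
x = (1.0, 0.8, -0.3)
p = p_y1(w, x)
print(p, "class 1" if p > 0.5 else "class 0")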

COMP9417 ML & DM Classification (1) March 14, 2017 118 / 130


Perceptrons

Perceptron

[Figure: a perceptron unit. Inputs x1, . . . , xn with weights w1, . . . , wn, together with a
bias weight w0 on a fixed input x0 = 1, feed the weighted sum Σ_{i=0..n} wi xi; the
output o is 1 if this sum is greater than 0 and −1 otherwise.]

o(x1, . . . , xn) = 1 if w0 + w1 x1 + · · · + wn xn > 0, and −1 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 119 / 130


Perceptrons

Perceptron

Sometimes a simpler vector notation is used, with the fixed input x0 = 1 folded into x:

o(x) = 1 if w · x > 0, and −1 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 120 / 130


Perceptrons

Decision Surface of a Perceptron

[Figure: two sets of labelled examples in the (x1, x2) plane. (a) positive and negative
examples that a single linear decision boundary can separate; (b) examples that no single
linear boundary can separate.]

Represents some useful functions


• What weights represent g(x1 , x2 ) = AND(x1 , x2 )? (One possible answer is checked in
the sketch below.)
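A minimal check in Python, assuming inputs in {0, 1}; the particular weights w0 = −0.8, w1 = w2 = 0.5 are one common choice, not given on the slide:

# A perceptron with weights (-0.8, 0.5, 0.5) outputs +1 only when both inputs are 1, i.e. AND.
def perceptron_output(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_output(-0.8, 0.5, 0.5, x1, x2))
# Only (1, 1) gives +1; every other input pair gives -1.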

COMP9417 ML & DM Classification (1) March 14, 2017 121 / 130


Perceptrons

Decision Surface of a Perceptron

But some functions not representable


• e.g., not linearly separable
• a labelled data set is linearly separable if there is a linear decision
boundary that separates the classes
• for non-linearly separable data we’ll need something else
• e.g., networks of these ...
• next lecture

COMP9417 ML & DM Classification (1) March 14, 2017 122 / 130


Perceptrons

Perceptron training rule


Learning is “finding a good set of weights”
Perceptron training rule is a weight-update scheme:

wi ← wi + ∆wi
where
∆wi = η(t − o)xi
Where:
• t = c(x) is target value
• o is perceptron output
• η is a small constant called learning rate
• between 0 and 1
• to simplify things you may assume η = 1
• but in practice usually set at less than 0.2, e.g., 0.1
• η can be varied during learning
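A one-line numeric check of the rule (all values chosen arbitrarily for illustration):

# One application of the perceptron training rule for a single weight w_i.
eta, t, o, x_i = 0.1, 1, -1, 0.8
delta_w = eta * (t - o) * x_i     # 0.1 * (1 - (-1)) * 0.8 = 0.16, so w_i increases
print(delta_w)
# If the output already matches the target (t == o), the update is zero.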

COMP9417 ML & DM Classification (1) March 14, 2017 123 / 130


Perceptrons

Perceptron training rule

Can prove it will converge


• If training data is linearly separable
• and η sufficiently small

COMP9417 ML & DM Classification (1) March 14, 2017 124 / 130


Perceptrons

Perceptron learning

A linear classifier that will achieve perfect separation on linearly separable


data is the perceptron, originally proposed as a simple neural network. The
perceptron iterates over the training set, updating the weight vector every
time it encounters an incorrectly classified example.
• For example, let xi be a misclassified positive example, then we have
yi = +1 and w · xi < t. We therefore want to find w′ such that
w′ · xi > w · xi , which moves the decision boundary towards and
hopefully past xi .
• This can be achieved by calculating the new weight vector as
w′ = w + ηxi , where 0 < η ≤ 1 is the learning rate (again, assume it is
set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
• Similarly, if xj is a misclassified negative example, then we have
yj = −1 and w · xj > t. In this case we calculate the new weight
vector as w′ = w − ηxj , and thus w′ · xj = w · xj − ηxj · xj < w · xj .

COMP9417 ML & DM Classification (1) March 14, 2017 125 / 130


Perceptrons

Perceptron learning

• The two cases can be combined in a single update rule:

w′ = w + ηyi xi

• Here yi acts to change the sign of the update, corresponding to


whether a positive or negative example was misclassified
• This is the basis of the perceptron training algorithm for linear
classification
• The algorithm just iterates over the training examples applying the
weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative
examples, i.e., the data is linearly separable, it can be shown that the
perceptron training algorithm will converge in a finite number of steps.

COMP9417 ML & DM Classification (1) March 14, 2017 126 / 130


Perceptrons

Perceptron training

Algorithm Perceptron(D, η) // perceptron training for linear classification

Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).
1  w ← 0                            // Other initialisations of the weight vector are possible
2  converged ← false
3  while converged = false do
4      converged ← true
5      for i = 1 to |D| do
6          if yi w · xi ≤ 0 then     // i.e., ŷi ≠ yi
7              w ← w + ηyi xi
8              converged ← false     // We changed w so haven't converged yet
9          end
10     end
11 end
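A runnable Python sketch of this algorithm (the helper names and the toy data set are my own; each x carries a leading 1 so the bias is learned as w[0]):

def perceptron_train(D, eta=1.0):
    """Perceptron training on D = [(x, y), ...] with y in {-1, +1} and x in
    homogeneous coordinates (leading 1 acting as the bias input)."""
    w = [0.0] * len(D[0][0])                                       # w <- 0
    converged = False
    while not converged:
        converged = True
        for x, y in D:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:      # misclassified
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]    # w <- w + eta*y*x
                converged = False
    return w

# Toy linearly separable data: positives upper-right, negatives lower-left.
D = [((1, 2.0, 1.5), +1), ((1, 1.5, 2.5), +1),
     ((1, -1.0, -0.5), -1), ((1, -2.0, -1.5), -1)]
w = perceptron_train(D, eta=0.5)
print(w)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in D])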

COMP9417 ML & DM Classification (1) March 14, 2017 127 / 130


Perceptrons

Perceptron training – varying learning rate


[Figure: three scatter plots of the same data set with the perceptron decision boundary
obtained with learning rates η = 0.2 (left), η = 0.5 (middle) and η = 1 (right); see the
caption below.]

(left) A perceptron trained with a small learning rate (η = 0.2). The


circled examples are the ones that trigger the weight update. (middle)
Increasing the learning rate to η = 0.5 leads in this case to a rapid
convergence. (right) Increasing the learning rate further to η = 1 may lead
to too aggressive weight updating, which harms convergence.
The starting point in all three cases was the basic linear classifier.
COMP9417 ML & DM Classification (1) March 14, 2017 128 / 130
Summary

Summary

• Two frameworks for classification by linear models


Distance-based. The key ideas are geometric.
Probabilistic. The key ideas are Bayesian.

COMP9417 ML & DM Classification (1) March 14, 2017 129 / 130


Summary

Summary

We also discussed the Perceptron, a simple form of threshold model.


• The perceptron is a linear classifier
• Early attempt to model artificial “neuron”
• One of the oldest ideas in machine learning, but still useful
• For example, it is an incremental, or online, learning algorithm
• With the “right” features, the target concept may be linearly
separable
• The weight vector is a linear combination of the training instances
• All of the methods we looked at do classification with models that are
essentially linear in the parameters

COMP9417 ML & DM Classification (1) March 14, 2017 130 / 130
