
Supervised Learning – Classification (1)

COMP9417 Machine Learning and Data Mining

March 14, 2017

COMP9417 ML & DM Classification (1) March 14, 2017 1 / 130


Acknowledgements

Material derived from slides for the book


“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/

Material derived from slides for the book


“Machine Learning: A Probabilistic Perspective” by K. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook

Material derived from slides for the book


“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook

Material derived from slides for the book


“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml

Material derived from slides for the book


“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html

Material derived from slides for the course


“Machine Learning” by A. Srinivasan

BITS Pilani, Goa, India (2016)

COMP9417 ML & DM Classification (1) March 14, 2017 2 / 130


Aims

This lecture will introduce you to machine learning approaches to the
problem of classification. Following it you should be able to
reproduce theoretical results, outline algorithmic techniques and describe
practical applications for the topics:
• describe distance measures and how they are used in classification
• outline the basic k-nearest neighbour classification method
• define MAP and ML inference using Bayes theorem
• define the Bayes optimal classification rule in terms of MAP inference
• outline the Naive Bayes classification algorithm
• describe typical applications of Naive Bayes for text classification
• outline the Perceptron classification algorithm

COMP9417 ML & DM Classification (1) March 14, 2017 3 / 130


Introduction

Introduction

Classification methods dominate machine learning . . .


. . . however, they often don’t have nice theoretical properties like
regression, so they are more complicated to analyse. We will mostly focus on
their advantages and disadvantages as learning methods first, and point to
unifying ideas and concepts where applicable. In this lecture we focus on
classification methods that are essentially linear models.

COMP9417 ML & DM Classification (1) March 14, 2017 4 / 130


Nearest neighbour classification

Nearest neighbour classification


• Related to the simplest form of learning: rote learning or
memorization
• Training instances are searched for the instance that most closely
resembles the new or query instance
• The instances themselves represent the knowledge
• Called: instance-based, memory-based learning or case-based learning;
often a form of local learning
• The similarity or distance function defines “learning”, i.e., how to go
beyond simple memorization
• Intuitive idea — instances “close by”, i.e., neighbours or exemplars,
should be classified similarly
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, IBk, . . .
• Also important for unsupervised methods, e.g., clustering (later
lectures)
COMP9417 ML & DM Classification (1) March 14, 2017 5 / 130
Distance-based models

Minkowski distance

Minkowski distance: If $\mathcal{X} = \mathbb{R}^d$, the Minkowski distance of order $p > 0$ is
defined as
$$\mathrm{Dis}_p(\mathbf{x}, \mathbf{y}) = \left( \sum_{j=1}^{d} |x_j - y_j|^p \right)^{1/p} = \|\mathbf{x} - \mathbf{y}\|_p$$
where $\|\mathbf{z}\|_p = \left( \sum_{j=1}^{d} |z_j|^p \right)^{1/p}$ is the $p$-norm (sometimes denoted the $L_p$
norm) of the vector $\mathbf{z}$.

COMP9417 ML & DM Classification (1) March 14, 2017 6 / 130
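
As an illustration (not from the original slides), here is a minimal Python sketch of the Minkowski distance defined above; the function name and the example points are our own.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if p == math.inf:  # Chebyshev distance as the limit p -> infinity
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))         # Manhattan distance: 7.0
print(minkowski(x, y, 2))         # Euclidean distance: 5.0
print(minkowski(x, y, math.inf))  # Chebyshev distance: 4.0
```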


Distance-based models

Minkowski distance

• The 2-norm refers to the familiar Euclidean distance


$$\mathrm{Dis}_2(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2} = \sqrt{(\mathbf{x} - \mathbf{y})^{\mathrm{T}}(\mathbf{x} - \mathbf{y})}$$

which measures distance ‘as the crow flies’.


• The 1-norm denotes Manhattan distance, also called cityblock
distance:
$$\mathrm{Dis}_1(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{d} |x_j - y_j|$$

This is the distance if we can only travel along coordinate axes.

COMP9417 ML & DM Classification (1) March 14, 2017 7 / 130


Distance-based models

Minkowski distance

• If we now let p grow larger, the distance will be more and more
dominated by the largest coordinate-wise distance, from which we can
infer that Dis∞ (x, y) = maxj |xj − yj |; this is also called Chebyshev
distance.
• You will sometimes see references to the 0-norm (or L0 norm) which
counts the number of non-zero elements in a vector. The
corresponding distance then counts the number of positions in which
vectors x and y differ. This is not strictly a Minkowski distance;
however, we can define it as
$$\mathrm{Dis}_0(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{d} (x_j - y_j)^0 = \sum_{j=1}^{d} I[x_j \neq y_j]$$
under the understanding that $x^0 = 0$ for $x = 0$ and 1 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 8 / 130


Distance-based models

Minkowski distance

Sometimes the data X is not naturally in Rd , but if we can turn it into


Boolean features, or character sequences, we can still apply distance
measures. For example:
• If x and y are binary strings, this is also called the Hamming
distance. Alternatively, we can see the Hamming distance as the
number of bits that need to be flipped to change x into y.
• For non-binary strings of unequal length this can be generalised to the
notion of edit distance or Levenshtein distance.

COMP9417 ML & DM Classification (1) March 14, 2017 9 / 130


Distance-based models

Circles and ellipses

Lines connecting points at order-p Minkowski distance 1 from the origin


for (from inside) p = 0.8; p = 1 (Manhattan distance, the rotated square
in red); p = 1.5; p = 2 (Euclidean distance, the violet circle); p = 4;
p = 8; and p = ∞ (Chebyshev distance, the blue rectangle). Notice that
for points on the coordinate axes all distances agree. For the other points,
our reach increases with p; however, if we require a rotation-invariant
distance metric then Euclidean distance is our only choice.
COMP9417 ML & DM Classification (1) March 14, 2017 10 / 130
Distance-based models

Distance metric

Distance metric Given an instance space X , a distance metric is a


function Dis : X × X → R such that for any x, y, z ∈ X :
• distances between a point and itself are zero: Dis(x, x) = 0;
• all other distances are larger than zero: if x ≠ y then Dis(x, y) > 0;
• distances are symmetric: Dis(y, x) = Dis(x, y);
• detours can not shorten the distance:
Dis(x, z) ≤ Dis(x, y) + Dis(y, z).
If the second condition is weakened to a non-strict inequality – i.e.,
Dis(x, y) may be zero even if x ≠ y – the function Dis is called a
pseudo-metric.

COMP9417 ML & DM Classification (1) March 14, 2017 11 / 130


Distance-based models

The triangle inequality – Minkowski distance for p = 2

[Figure: points A, B and C, with the green, orange and red circles described below.]

The green circle connects points the same Euclidean distance (i.e.,
Minkowski distance of order p = 2) away from the origin as A. The orange
circle shows that B and C are equidistant from A. The red circle
demonstrates that C is closer to the origin than B, which conforms to the
triangle inequality.
COMP9417 ML & DM Classification (1) March 14, 2017 12 / 130
Distance-based models

The triangle inequality – Minkowski distance for p ≤ 1

[Figure: points A, B and C, as above, with Minkowski “circles” for p = 1 and p = 0.8.]

With Manhattan distance (p = 1), B and C are equally close to the origin
and also equidistant from A. With p < 1 (here, p = 0.8) C is further away
from the origin than B; since both are again equidistant from A, it follows
that travelling from the origin to C via A is quicker than going there
directly, which violates the triangle inequality.
COMP9417 ML & DM Classification (1) March 14, 2017 13 / 130
Distance-based models

Mahalanobis distance

Often, the shape of the ellipse is estimated from data as the inverse of the
covariance matrix: M = Σ−1 . This leads to the definition of the
Mahalanobis distance
$$\mathrm{Dis}_M(\mathbf{x}, \mathbf{y} \mid \Sigma) = \sqrt{(\mathbf{x} - \mathbf{y})^{\mathrm{T}} \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$$

Using the covariance matrix in this way has the effect of decorrelating and
normalising the features.

Clearly, Euclidean distance is a special case of Mahalanobis distance with


the identity matrix I as covariance matrix: Dis2 (x, y) = DisM (x, y|I).

COMP9417 ML & DM Classification (1) March 14, 2017 14 / 130
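
A small NumPy sketch of the Mahalanobis distance above (illustrative only; the data, variable names and the use of the sample covariance are our own assumptions).

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y given a covariance matrix."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated 2-D data
Sigma = np.cov(X, rowvar=False)                                     # covariance estimated from data

x, y = X[0], X[1]
print(mahalanobis(x, y, Sigma))      # decorrelated, normalised distance
print(mahalanobis(x, y, np.eye(2)))  # identity covariance: reduces to Euclidean distance
print(np.linalg.norm(x - y))         # matches the Euclidean distance above
```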


Distance-based models Neighbours and exemplars

Means and distances

The arithmetic mean minimises squared Euclidean distance: the
arithmetic mean $\mu$ of a set of data points $D$ in a Euclidean space is the
unique point that minimises the sum of squared Euclidean distances to
those data points.

Proof. We will show that $\arg\min_{\mathbf{y}} \sum_{\mathbf{x} \in D} \|\mathbf{x} - \mathbf{y}\|^2 = \mu$, where $\|\cdot\|$
denotes the 2-norm. We find this minimum by taking the gradient (the
vector of partial derivatives with respect to $y_i$) of the sum and setting it to
the zero vector:
$$\nabla_{\mathbf{y}} \sum_{\mathbf{x} \in D} \|\mathbf{x} - \mathbf{y}\|^2 = -2 \sum_{\mathbf{x} \in D} (\mathbf{x} - \mathbf{y}) = -2 \sum_{\mathbf{x} \in D} \mathbf{x} + 2|D|\mathbf{y} = \mathbf{0}$$
from which we derive $\mathbf{y} = \frac{1}{|D|} \sum_{\mathbf{x} \in D} \mathbf{x} = \mu$.

COMP9417 ML & DM Classification (1) March 14, 2017 15 / 130


Distance-based models Neighbours and exemplars

Means and distances

• Notice that minimising the sum of squared Euclidean distances of a


given set of points is the same as minimising the average squared
Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t
it be more natural to take the point that minimises total Euclidean
distance as exemplar?
• This point is known as the geometric median, as for univariate data it
corresponds to the median or ‘middle value’ of a set of numbers.
However, for multivariate data there is no closed-form expression for
the geometric median, which needs to be calculated by successive
approximation.

COMP9417 ML & DM Classification (1) March 14, 2017 16 / 130
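
The geometric median has no closed form and must be found by successive approximation, as noted above. Below is a hedged sketch of one standard scheme (a Weiszfeld-style fixed-point iteration); it is not part of the slides, and the tolerances and toy data are illustrative.

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-8):
    """Approximate the point minimising the sum of (unsquared) Euclidean distances."""
    y = X.mean(axis=0)                      # start from the arithmetic mean
    for _ in range(n_iter):
        d = np.linalg.norm(X - y, axis=1)
        d = np.where(d < eps, eps, d)       # avoid division by zero at data points
        w = 1.0 / d
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])  # one outlier
print(X.mean(axis=0))       # the mean is pulled towards the outlier
print(geometric_median(X))  # the geometric median stays near the bulk of the points
```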


Distance-based models Neighbours and exemplars

Means and distances

• In certain situations it makes sense to restrict an exemplar to be one


of the given data points. In that case, we speak of a medoid, to
distinguish it from a centroid which is an exemplar that doesn’t have
to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the
total distance to all other data points, in order to choose the point
that minimises it. Regardless of the distance metric used, this is an
O(n2 ) operation for n points.
• So for medoids there is no computational reason to prefer one
distance metric over another.
• There may be more than one medoid.

COMP9417 ML & DM Classification (1) March 14, 2017 17 / 130


Distance-based models Neighbours and exemplars

Centroids and medoids

[Figure: scatter plot of the data points, with markers for the squared 2-norm centroid (mean),
the 2-norm centroid (geometric median), the squared 2-norm medoid, the 2-norm medoid and
the 1-norm medoid.]

A small data set of 10 points, with circles indicating centroids and squares
indicating medoids (the latter must be data points), for different distance
metrics. Notice how the outlier on the bottom-right ‘pulls’ the mean away
from the geometric median; as a result the corresponding medoid changes
as well.
COMP9417 ML & DM Classification (1) March 14, 2017 18 / 130
Distance-based models Neighbours and exemplars

The basic linear classifier is distance-based

• The basic linear classifier constructs the decision boundary as the


perpendicular bisector of the line segment connecting the two
exemplars (one for each class).
• An alternative, distance-based way to classify instances without direct
reference to a decision boundary is by the following decision rule: if x
is nearest to µ⊕ then classify it as positive, otherwise as negative; or
equivalently, classify an instance to the class of the nearest exemplar.
• If we use Euclidean distance as our closeness measure, simple
geometry tells us we get exactly the same decision boundary.
• So the basic linear classifier can be interpreted from a distance-based
perspective as constructing exemplars that minimise squared
Euclidean distance within each class, and then applying a
nearest-exemplar decision rule.

COMP9417 ML & DM Classification (1) March 14, 2017 19 / 130


Distance-based models Neighbours and exemplars

Two-exemplar decision boundaries

(left) For two exemplars the nearest-exemplar decision rule with Euclidean
distance results in a linear decision boundary coinciding with the
perpendicular bisector of the line connecting the two exemplars. (right)
Using Manhattan distance the circles are replaced by diamonds.
COMP9417 ML & DM Classification (1) March 14, 2017 20 / 130
Distance-based models Neighbours and exemplars

Three-exemplar decision boundaries

(left) Decision regions defined by the 2-norm nearest-exemplar decision


rule for three exemplars. (right) With Manhattan distance the decision
regions become non-convex.
COMP9417 ML & DM Classification (1) March 14, 2017 21 / 130
Distance-based models Neighbours and exemplars

Distance-based models

To summarise, the main ingredients of distance-based models are


• distance metrics, which can be Euclidean, Manhattan, Minkowski or
Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a chosen
distance metric, or medoids that find the most centrally located data
point; and
• distance-based decision rules, which take a vote among the k nearest
exemplars.

COMP9417 ML & DM Classification (1) March 14, 2017 22 / 130


Distance-based models k-Nearest neighbour Classification

Nearest Neighbour

Stores all training examples $\langle x_i, f(x_i) \rangle$.

Nearest neighbour:
• Given query instance $x_q$, first locate nearest training example $x_n$, then
estimate $\hat{f}(x_q) \leftarrow f(x_n)$
k-Nearest neighbour:
• Given $x_q$, take vote among its $k$ nearest neighbours (if discrete-valued
target function)
• take mean of $f$ values of $k$ nearest neighbours (if real-valued)
$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$$

COMP9417 ML & DM Classification (1) March 14, 2017 23 / 130


Distance-based models k-Nearest neighbour Classification

k-Nearest Neighbour Algorithm

Training algorithm:
• For each training example $\langle x_i, f(x_i) \rangle$, add the example to the list
training examples.
Classification algorithm:
• Given a query instance $x_q$ to be classified,
• Let $x_1 \ldots x_k$ be the $k$ instances from training examples that are
nearest to $x_q$ by the distance function
• Return
$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$
where $\delta(a, b) = 1$ if $a = b$ and 0 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 24 / 130
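
A minimal Python sketch of the k-Nearest Neighbour classification algorithm above, using Euclidean distance and an unweighted majority vote; all names and the toy data are our own.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, x_q, k=3):
    """training is a list of (feature_vector, label) pairs; return the majority
    label among the k training instances nearest to the query x_q."""
    neighbours = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((3.0, 3.2), '-'), ((3.1, 2.9), '-')]
print(knn_classify(training, (1.1, 1.0), k=3))   # '+'
```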


Distance-based models k-Nearest neighbour Classification

“Hypothesis Space” for Nearest Neighbour




[Figure: (left) + and − training instances around the query point xq; (right) Voronoi tessellation.]

2 classes, + and − and query point xq . On left, note effect of varying k.


On right, 1−NN induces a Voronoi tessellation of the instance space.
Formed by the perpendicular bisectors of lines between points.

COMP9417 ML & DM Classification (1) March 14, 2017 25 / 130


Distance-based models k-Nearest neighbour Classification

Distance function again

The distance function defines what is learned.


Instance x is described by a feature vector (list of attribute-value pairs)

$\langle a_1(x), a_2(x), \ldots, a_m(x) \rangle$

where $a_r(x)$ denotes the value of the $r$-th attribute of $x$.


Most commonly used distance function is Euclidean distance . . .
• distance between two instances xi and xj is defined to be
$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{m} (a_r(x_i) - a_r(x_j))^2}$$

COMP9417 ML & DM Classification (1) March 14, 2017 26 / 130


Distance-based models k-Nearest neighbour Classification

Distance function again

Many other distance functions could be used . . .


• e.g., Manhattan or city-block distance (sum of absolute values of
differences between attributes)
$$d(x_i, x_j) = \sum_{r=1}^{m} |a_r(x_i) - a_r(x_j)|$$

Vector-based formalization – use norm L1 , L2 , . . .


The idea of distance functions will appear again in kernel methods.

COMP9417 ML & DM Classification (1) March 14, 2017 27 / 130


Distance-based models k-Nearest neighbour Classification

Normalization and other issues

• Different attributes measured on different scales


• Need to be normalized (why ?)

$$a_r = \frac{v_r - \min v_r}{\max v_r - \min v_r}$$
where $v_r$ is the actual value of attribute $r$
• Nominal attributes: distance either 0 or 1
• Common policy for missing values: assumed to be maximally distant
(given normalized attributes)

COMP9417 ML & DM Classification (1) March 14, 2017 28 / 130
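
A small sketch of the min-max normalization above (illustrative only; the attribute values are made up).

```python
def min_max_normalize(column):
    """Rescale a list of numeric attribute values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant attribute: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

temperatures = [64, 68, 72, 80, 85]
print(min_max_normalize(temperatures))  # values now lie between 0.0 and 1.0
```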


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

• Instances map to points in $\mathbb{R}^n$


• Less than 20 attributes per instance
• or number of attributes can be reduced . . .
• Lots of training data
• No requirement for “explanatory” model to be learned

COMP9417 ML & DM Classification (1) March 14, 2017 29 / 130


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

Advantages:
• Statisticians have used k-NN since early 1950s
• Can be very accurate
• As $n \to \infty$, $k \to \infty$ and $k/n \to 0$, error approaches the minimum (Bayes) error
• Training is very fast
• Can learn complex target functions
• Don’t lose information by generalization - keep all instances

COMP9417 ML & DM Classification (1) March 14, 2017 30 / 130


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

Disadvantages:
• Slow at query time: basic algorithm scans entire training data to
derive a prediction
• Assumes all attributes are equally important, so easily fooled by
irrelevant attributes
• Remedy: attribute selection or weights
• Problem of noisy instances:
• Remedy: remove from data set
• not easy – how to know which are noisy ?

COMP9417 ML & DM Classification (1) March 14, 2017 31 / 130


Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

What is the inductive bias of k-NN ?


• an assumption that the classification of query instance xq will be
most similar to the classification of other instances that are nearby
according to the distance function
k-NN uses terminology from statistical pattern recognition (see below)
• Regression: approximating a real-valued target function
• Residual: the error $\hat{f}(x) - f(x)$ in approximating the target function
• Kernel function: the function of distance used to determine the weight of each
training example, i.e., the kernel function is the function $K$ s.t.
$w_i = K(d(x_i, x_q))$

COMP9417 ML & DM Classification (1) March 14, 2017 32 / 130


Distance-based models k-Nearest neighbour Classification

Nearest-neighbour classifier

• kNN uses the training data as exemplars, so training is O(n) (but


prediction is also O(n)!)
• 1NN perfectly separates training data, so low bias but high variance
• By increasing the number of neighbours k we increase bias and
decrease variance (what happens when k = n?)
• Easily adapted to real-valued targets, and even to structured objects
(nearest-neighbour retrieval). Can also output probabilities when
k>1
• Warning: in high-dimensional spaces everything is far away from
everything and so pairwise distances are uninformative (curse of
dimensionality)

COMP9417 ML & DM Classification (1) March 14, 2017 33 / 130


Distance-based models k-Nearest neighbour Classification

Distance-Weighted kNN

• Might want to weight nearer neighbours more heavily ...


• Use distance function to construct a weight wi
• Replace the final line of the classification algorithm by:

$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$$
where
$$w_i \equiv \frac{1}{d(x_q, x_i)^2}$$
and $d(x_q, x_i)$ is the distance between $x_q$ and $x_i$

COMP9417 ML & DM Classification (1) March 14, 2017 34 / 130


Distance-based models k-Nearest neighbour Classification

Distance-Weighted kNN

For real-valued target functions replace the final line of the algorithm by:
$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$$

(denominator normalizes contribution of individual weights).


Now we can consider using all the training examples instead of just k
→ using all examples (i.e., when k = n) with the rule above is called
Shepard’s method

COMP9417 ML & DM Classification (1) March 14, 2017 35 / 130
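
A hedged sketch combining the two distance-weighted rules above (weighted vote for discrete targets, weighted average for real-valued targets); with k equal to the number of training examples, the real-valued version corresponds to Shepard's method. Names and data are our own.

```python
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_knn(training, x_q, k=3, regression=False):
    """training: list of (feature_vector, target). Weights are 1/d^2; an exact
    match is returned immediately to avoid division by zero."""
    neighbours = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    votes, num, den = defaultdict(float), 0.0, 0.0
    for x_i, f_i in neighbours:
        d = euclidean(x_i, x_q)
        if d == 0.0:
            return f_i
        w = 1.0 / d ** 2
        if regression:
            num += w * f_i
            den += w
        else:
            votes[f_i] += w
    return num / den if regression else max(votes, key=votes.get)

data = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0)]
print(weighted_knn(data, (1.5,), k=3, regression=True))  # weighted average, about 2.42
```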


Distance-based models k-Nearest neighbour Classification

Evaluation

Lazy learners do not construct an explicit model, so how do we evaluate


the output of the learning process ?
• 1-NN – training set error is always zero !
• each training example is always closest to itself
• k-NN – overfitting may be hard to detect

Leave-one-out cross-validation (LOOCV) – leave out each example and


predict it given the rest:

(x1 , y1 ), (x2 , y2 ), . . . , (xi−1 , yi−1 ), (xi+1 , yi+1 ), . . . , (xn , yn )

Error is mean over all predicted examples. Fast – no models to be built !

COMP9417 ML & DM Classification (1) March 14, 2017 36 / 130
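
A sketch of leave-one-out cross-validation for a k-NN classifier, along the lines described above; the helper functions and toy data are our own.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, x_q, k):
    neighbours = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def loocv_error(data, k):
    """Leave each example out in turn, predict it from the rest, return the error rate."""
    errors = 0
    for i, (x_i, y_i) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_classify(rest, x_i, k) != y_i:
            errors += 1
    return errors / len(data)

data = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((0.9, 1.1), '+'),
        ((3.0, 3.2), '-'), ((3.1, 2.9), '-'), ((2.8, 3.0), '-')]
print(loocv_error(data, k=3))   # 0.0 on this well-separated toy set
```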


Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

Bellman (1960) coined this term in the context of dynamic programming


Imagine instances described by 20 attributes, but only 2 are relevant to
target function — “similar” examples will appear “distant”.

Curse of dimensionality: nearest neighbour is easily misled when X is
high-dimensional – the problem of irrelevant attributes

One approach:
• Stretch jth axis by weight zj , where z1 , . . . , zn chosen to minimize
prediction error
• Use cross-validation to automatically choose weights z1 , . . . , zn
• Note setting zj to zero eliminates this dimension altogether

COMP9417 ML & DM Classification (1) March 14, 2017 37 / 130


Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

• number of “cells” in the instance space grows exponentially in the


number of features
• with exponentially many cells we would need exponentially many data
points to ensure that each cell is sufficiently populated to make
nearest-neighbour predictions reliably
COMP9417 ML & DM Classification (1) March 14, 2017 38 / 130
Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross Validation Error”

Instance-based methods (IBk)


• attribute weighting: class-specific weights may be used (can result in
unclassified instances and multiple classifications)
• get Euclidean distance with weights
$$\sqrt{\sum_r w_r \, (a_r(x_q) - a_r(x))^2}$$

• Updating of weights based on nearest neighbour


• Class correct/incorrect: weight increased/decreased
• |ar (xq ) − ar (x)| small/large: amount large/small

COMP9417 ML & DM Classification (1) March 14, 2017 39 / 130


Distance-based models k-Nearest neighbour Classification

Instance-based learning (IBk)

Recap – Practical problems of 1-NN scheme:


• Slow (but fast k-d tree-based approaches exist)
• Remedy: removing irrelevant data
• Noise (but k-NN copes quite well with noise)
• Remedy: removing noisy instances
• All attributes deemed equally important
• Remedy: attribute weighting (or simply selection)
• Doesn’t perform explicit generalization
• Remedy: rule-based NN approach (like LWR)

COMP9417 ML & DM Classification (1) March 14, 2017 40 / 130


Distance-based models k-Nearest neighbour Classification

Edited NN

• Edited NN classifiers discard some of the training instances before


making predictions
• Saves memory and speeds up classification
• IB2: incremental NN learner: only incorporates misclassified instances
into the classifier
• Problem: noisy data gets incorporated
• IB3: store classification performance information with each instance
& only use in prediction if above a threshold

COMP9417 ML & DM Classification (1) March 14, 2017 41 / 130


Distance-based models k-Nearest neighbour Classification

Dealing with noise

Use larger values of k (why ?) How to find the “right” k ?


• One way: cross-validation-based k-NN classifier (but slow)
• Different approach: discarding instances that don’t perform well by
keeping success records of how well an instance does at prediction
(IB3)
• Computes confidence interval for an instance’s success rate and for
default accuracy of its class
• If lower limit of first interval is above upper limit of second one,
instance is accepted (IB3: 5%-level)
• If upper limit of first interval is below lower limit of second one,
instance is rejected (IB3: 12.5%-level)

COMP9417 ML & DM Classification (1) March 14, 2017 42 / 130


Naive Bayes

Uncertainty

As far as the laws of mathematics refer to reality, they are not


certain; as far as they are certain, they do not refer to reality.
–Albert Einstein

COMP9417 ML & DM Classification (1) March 14, 2017 43 / 130


Naive Bayes

Two Roles for Bayesian Methods

Provides practical learning algorithms:


• Naive Bayes classifier learning
• Bayesian network learning, etc.
• Combines prior knowledge (prior probabilities) with observed data
• How to get prior probabilities ?

Provides useful conceptual framework:


• Provides a “gold standard” for evaluating other learning algorithms
• Some additional insight into Occam’s razor

COMP9417 ML & DM Classification (1) March 14, 2017 44 / 130


Naive Bayes

Bayes Theorem

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
where
P (h) = prior probability of hypothesis h
P (D) = prior probability of training data D
P (h|D) = probability of h given D
P (D|h) = probability of D given h

COMP9417 ML & DM Classification (1) March 14, 2017 45 / 130


Naive Bayes

Choosing Hypotheses

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
Generally want the most probable hypothesis given the training data
Maximum a posteriori hypothesis $h_{MAP}$:
$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

COMP9417 ML & DM Classification (1) March 14, 2017 46 / 130


Naive Bayes

Choosing Hypotheses

If we assume $P(h_i) = P(h_j)$ for all $i$ and $j$, then we can further simplify, and choose the
Maximum likelihood (ML) hypothesis
$$h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$$

COMP9417 ML & DM Classification (1) March 14, 2017 47 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.

P(cancer) =                    P(¬cancer) =
P(⊕ | cancer) =                P(⊖ | cancer) =
P(⊕ | ¬cancer) =               P(⊖ | ¬cancer) =

COMP9417 ML & DM Classification (1) March 14, 2017 48 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.

P(cancer) = .008               P(¬cancer) = .992
P(⊕ | cancer) = .98            P(⊖ | cancer) = .02
P(⊕ | ¬cancer) = .03           P(⊖ | ¬cancer) = .97

COMP9417 ML & DM Classification (1) March 14, 2017 49 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


We can find the maximum a posteriori (MAP) hypothesis

P (⊕ | cancer)P (cancer) = 0.98 × 0.008 = 0.00784


P (⊕ | ¬cancer)P (¬cancer) = 0.03 × 0.992 = 0.02976

Thus hM AP = . . ..

COMP9417 ML & DM Classification (1) March 14, 2017 50 / 130


Naive Bayes

Applying Bayes Theorem

Does patient have cancer or not?


We can find the maximum a posteriori (MAP) hypothesis

P (⊕ | cancer)P (cancer) = 0.98 × 0.008 = 0.00784


P (⊕ | ¬cancer)P (¬cancer) = 0.03 × 0.992 = 0.02976

Thus hM AP = ¬cancer.

Also note: posterior probability of hypothesis cancer higher than prior.

COMP9417 ML & DM Classification (1) March 14, 2017 51 / 130


Naive Bayes

Applying Bayes Theorem

How to get the posterior probability of a hypothesis h ?


Divide by P (⊕), probability of data, to normalize result for h:

$$P(h|D) = \frac{P(D|h)\,P(h)}{\sum_{h_i \in H} P(D|h_i)\,P(h_i)}$$

Denominator ensures we obtain posterior probabilities that sum to 1.


Sum for all possible numerator values, since hypotheses are mutually
exclusive (e.g., patient either has cancer or does not).
Marginal likelihood (marginalizing out over hypothesis space) — prior
probability of the data.

COMP9417 ML & DM Classification (1) March 14, 2017 52 / 130
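
A small numeric check of the cancer example, normalising the two unnormalised scores into posterior probabilities as described above (purely illustrative arithmetic).

```python
# Prior and test characteristics from the example
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalised scores P(+|h) P(h)
score_cancer = p_pos_given_cancer * p_cancer   # 0.00784
score_not = p_pos_given_not * p_not            # 0.02976

# Normalise by P(+) = sum of the scores (theorem of total probability)
p_pos = score_cancer + score_not
print(round(score_cancer / p_pos, 3))   # P(cancer | +)  ~ 0.209
print(round(score_not / p_pos, 3))      # P(~cancer | +) ~ 0.791
```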


Naive Bayes

Basic Formulas for Probabilities

Product Rule: probability P (A ∧ B) of a conjunction of two events A and


B:
P (A ∧ B) = P (A|B)P (B) = P (B|A)P (A)
Sum Rule: probability of a disjunction of two events A and B:

P (A ∨ B) = P (A) + P (B) − P (A ∧ B)

Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive
with $\sum_{i=1}^{n} P(A_i) = 1$, then:
$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$

COMP9417 ML & DM Classification (1) March 14, 2017 53 / 130


Naive Bayes

Basic Formulas for Probabilities

Also worth remembering:


• Conditional Probability: probability of A given B:

$$P(A|B) = \frac{P(A \wedge B)}{P(B)}$$
• Rearrange sum rule to get:

P (A ∧ B) = P (A) + P (B) − P (A ∨ B)

Exercise: Derive Bayes Theorem.

COMP9417 ML & DM Classification (1) March 14, 2017 54 / 130


Naive Bayes

Brute Force MAP Hypothesis Learner

Idea: view learning as finding the most probable hypothesis

• For each hypothesis h in H, calculate the posterior probability

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

• Output the hypothesis $h_{MAP}$ with the highest posterior probability
$$h_{MAP} = \arg\max_{h \in H} P(h|D)$$

COMP9417 ML & DM Classification (1) March 14, 2017 55 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Consider our usual concept learning task:

• instance space X, hypothesis space H, training examples D


• consider a learning algorithm that outputs the most specific hypothesis
from the version space $VS_{H,D}$ (the set of all consistent or “zero-error”
classification rules)

What would Bayes rule produce as the MAP hypothesis?

Does this algorithm output a MAP hypothesis??

COMP9417 ML & DM Classification (1) March 14, 2017 56 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Brute Force MAP Framework for Concept Learning:

Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$

Assume $D$ is the set of classifications $D = \langle c(x_1), \ldots, c(x_m) \rangle$

Choose $P(h)$ to be the uniform distribution:
• $P(h) = \frac{1}{|H|}$ for all $h$ in $H$

Choose $P(D|h)$:
• $P(D|h) = 1$ if $h$ consistent with $D$
• $P(D|h) = 0$ otherwise

COMP9417 ML & DM Classification (1) March 14, 2017 57 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Then:
$$P(h|D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$

COMP9417 ML & DM Classification (1) March 14, 2017 58 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Note that since the likelihood is zero if $h$ is inconsistent, the posterior is
also zero. But how did we obtain the posterior for consistent $h$?
$$P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$$

COMP9417 ML & DM Classification (1) March 14, 2017 59 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

How did we obtain $P(D)$? From the theorem of total probability:
$$P(D) = \sum_{h_i \in H} P(D|h_i)\,P(h_i) = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} + \sum_{h_i \notin VS_{H,D}} 0 \cdot \frac{1}{|H|} = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}$$

COMP9417 ML & DM Classification (1) March 14, 2017 60 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Evolution of Posterior Probabilities

[Figure: evolution of posterior probabilities over the hypothesis space: (a) P(h), (b) P(h|D1), (c) P(h|D1, D2).]

COMP9417 ML & DM Classification (1) March 14, 2017 61 / 130


A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Every hypothesis consistent with D is a MAP hypothesis, if we assume


• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)
So, this learning algorithm will output a MAP hypothesis, even though it
does not explicitly use probabilities in learning.

COMP9417 ML & DM Classification (1) March 14, 2017 62 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

E.g., learning a linear target function f from noisy examples:

[Figure: noisy training examples of a linear target function f, with the maximum likelihood hypothesis hML fit to them.]
COMP9417 ML & DM Classification (1) March 14, 2017 63 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

Consider any real-valued target function $f$

Training examples $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value
• $d_i = f(x_i) + e_i$
• $e_i$ is a random variable (noise) drawn independently for each $x_i$
according to some Gaussian (normal) distribution with mean 0
Then the maximum likelihood hypothesis $h_{ML}$ is the one that
minimizes the sum of squared errors:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$

COMP9417 ML & DM Classification (1) March 14, 2017 64 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

How did we obtain this?
$$h_{ML} = \arg\max_{h \in H} p(D|h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i|h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$$

Recall that we treat each probability $p(D|h)$ as if $h = f$, i.e., we assume
$\mu = f(x_i) = h(x_i)$, which is the key idea behind maximum likelihood!

COMP9417 ML & DM Classification (1) March 14, 2017 65 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

Maximize the natural log to give a simpler expression:
$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \right] = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 = \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2$$

Equivalently, we can minimize the positive version of the expression:
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$

COMP9417 ML & DM Classification (1) March 14, 2017 66 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Discriminative and generative probabilistic models


• Discriminative models model the posterior probability distribution
P (Y |X), where Y is the target variable and X are the features. That
is, given X they return a probability distribution over Y .
• Generative models model the joint distribution P (Y, X) of the target
Y and the feature vector X. Once we have access to this joint
distribution we can derive any conditional or marginal distribution
involving the same variables. In particular, since
$P(X) = \sum_y P(Y = y, X)$ it follows that the posterior distribution
can be obtained as $P(Y|X) = \frac{P(Y,X)}{\sum_y P(Y = y, X)}$.
• Alternatively, generative models can be described by the likelihood
function P (X|Y ), since P (Y, X) = P (X|Y )P (Y ) and the target or
prior distribution (usually abbreviated to ‘prior’) can be easily
estimated or postulated.
• Such models are called ‘generative’ because we can sample from the
joint distribution to obtain new data points together with their labels.
Alternatively, we can use P (Y ) to sample a class and P (X|Y ) to
sample a data point for that class.

COMP9417 ML & DM Classification (1) March 14, 2017 67 / 130
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Assessing uncertainty in estimates


Suppose we want to estimate the probability θ that an arbitrary e-mail is
spam, so that we can use the appropriate prior distribution.
• The natural thing to do is to inspect n e-mails, determine the number
of spam e-mails d, and set θ̂ = d/n; we don’t really need any
complicated statistics to tell us that.
• However, while this is the most likely estimate of θ – the maximum a
posteriori (MAP) estimate – this doesn’t mean that other values of θ
are completely ruled out.
• We model this by a probability distribution over θ (a Beta distribution
in this case) which is updated each time new information comes in.
This is further illustrated in the figure for a distribution that is more
and more skewed towards spam.
• For each curve, its bias towards spam is given by the area under the
curve and to the right of θ = 1/2.

COMP9417 ML & DM Classification (1) March 14, 2017 68 / 130


A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Assessing uncertainty in estimates


[Figure: posterior distributions over θ, as described below.]

Each time we inspect an e-mail, we are reducing our uncertainty regarding


the prior spam probability θ. After we inspect two e-mails and observe one
spam, the possible θ values are characterised by a symmetric distribution
around 1/2. If we inspect a third, fourth, . . . , tenth e-mail and each time
(except the first one) it is spam, then this distribution narrows and shifts a
little bit to the right each time. The distribution for n e-mails reaches its
maximum at $\hat{\theta}_{MAP} = \frac{n-1}{n}$ (e.g., $\hat{\theta}_{MAP} = 0.8$ for $n = 5$).
COMP9417 ML & DM Classification (1) March 14, 2017 69 / 130
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

The Bayesian perspective


Explicitly modelling the posterior distribution over the parameter θ has a
number of advantages that are usually associated with the ‘Bayesian’
perspective:
• We can precisely characterise the uncertainty that remains about our
estimate by quantifying the spread of the posterior distribution.
• We can obtain a generative model for the parameter by sampling
from the posterior distribution, which contains much more
information than a summary statistic such as the MAP estimate can
convey – so, rather than using a single e-mail with θ = θMAP , our
generative model can contain a number of e-mails with θ sampled
from the posterior distribution.
• We can quantify the probability of statements such as ‘e-mails are
biased towards ham’ (the tiny shaded area in the figure demonstrates
that after observing one ham and nine spam e-mails this probability is
very small, about 0.6%).
• We can use one of these distributions to encode our prior beliefs: e.g.,
if we believe spam and ham to be roughly equally likely we can start
from a prior distribution peaked around θ = 1/2.
COMP9417 ML & DM Classification (1) March 14, 2017 70 / 130
Minimum Description Length Principle

Minimum Description Length Principle

Once again, the MAP hypothesis
$$h_{MAP} = \arg\max_{h \in H} P(D|h)\,P(h)$$

Which is equivalent to
$$h_{MAP} = \arg\max_{h \in H} \log_2 P(D|h) + \log_2 P(h)$$

Or
$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D|h) - \log_2 P(h)$$

COMP9417 ML & DM Classification (1) March 14, 2017 71 / 130


Minimum Description Length Principle

Minimum Description Length Principle

Interestingly, this is an expression about a quantity of bits.

$$h_{MAP} = \arg\min_{h \in H} -\log_2 P(D|h) - \log_2 P(h) \qquad (1)$$

From information theory:


The optimal (shortest expected coding length) code for an event
with probability p is − log2 p bits.

COMP9417 ML & DM Classification (1) March 14, 2017 72 / 130


Minimum Description Length Principle

Minimum Description Length Principle

So interpret (1):
• − log2 P (h) is length of h under optimal code
• − log2 P (D|h) is length of D given h under optimal code
Note well: assumes optimal encodings, when the priors and likelihoods
are known. In practice, this is difficult, and makes a difference.

COMP9417 ML & DM Classification (1) March 14, 2017 73 / 130


Minimum Description Length Principle

Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis

MDL: prefer the hypothesis h that minimizes

$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D|h)$$

where $L_C(x)$ is the description length of $x$ under optimal encoding $C$

COMP9417 ML & DM Classification (1) March 14, 2017 74 / 130


Minimum Description Length Principle

Minimum Description Length Principle

Example: H = decision trees, D = training data labels


• $L_{C_1}(h)$ is # bits to describe tree $h$
• $L_{C_2}(D|h)$ is # bits to describe $D$ given $h$
• Note $L_{C_2}(D|h) = 0$ if examples classified perfectly by $h$. Need only
describe exceptions
• Hence $h_{MDL}$ trades off tree size for training errors
• i.e., prefer the hypothesis that minimizes
$$\mathrm{length}(h) + \mathrm{length}(\mathit{misclassifications})$$

COMP9417 ML & DM Classification (1) March 14, 2017 75 / 130


Bayes Optimal Classifier

Most Probable Classification of New Instances

So far we’ve sought the most probable hypothesis given the data $D$ (i.e.,
$h_{MAP}$)

Given new instance $x$, what is its most probable classification?

• $h_{MAP}(x)$ is not the most probable classification!

COMP9417 ML & DM Classification (1) March 14, 2017 76 / 130


Bayes Optimal Classifier

Most Probable Classification of New Instances

Consider:
• Three possible hypotheses:
P (h1 |D) = .4, P (h2 |D) = .3, P (h3 |D) = .3
• Given new instance x,
h1 (x) = +, h2 (x) = −, h3 (x) = −
• What’s most probable classification of x?

COMP9417 ML & DM Classification (1) March 14, 2017 77 / 130


Bayes Optimal Classifier

Bayes Optimal Classifier

Bayes optimal classification:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D)$$

Example:
$P(h_1|D) = .4$, $P(-|h_1) = 0$, $P(+|h_1) = 1$
$P(h_2|D) = .3$, $P(-|h_2) = 1$, $P(+|h_2) = 0$
$P(h_3|D) = .3$, $P(-|h_3) = 1$, $P(+|h_3) = 0$

COMP9417 ML & DM Classification (1) March 14, 2017 78 / 130


Bayes Optimal Classifier

Bayes Optimal Classifier

therefore
$$\sum_{h_i \in H} P(+|h_i)\,P(h_i|D) = .4$$
$$\sum_{h_i \in H} P(-|h_i)\,P(h_i|D) = .6$$
and
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) = -$$

No other classification method using the same hypothesis space and same
prior knowledge can outperform this method on average

COMP9417 ML & DM Classification (1) March 14, 2017 79 / 130
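
A tiny sketch reproducing the Bayes optimal classification arithmetic above; encoding the hypotheses as dictionaries is our own choice.

```python
# Posterior of each hypothesis given D, and each hypothesis's class probabilities
posterior = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
class_given_h = {'h1': {'+': 1.0, '-': 0.0},
                 'h2': {'+': 0.0, '-': 1.0},
                 'h3': {'+': 0.0, '-': 1.0}}

def bayes_optimal(classes=('+', '-')):
    """Return the class maximising sum_h P(v|h) P(h|D), plus the scores."""
    scores = {v: sum(class_given_h[h][v] * posterior[h] for h in posterior)
              for v in classes}
    return max(scores, key=scores.get), scores

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```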


Naive Bayes

Naive Bayes Classifier

Along with decision trees, neural networks, nearest neighbour, one of the
most practical learning methods.

When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given
classification
Successful applications:
• Classifying text documents
• Gaussian Naive Bayes for real-valued data

COMP9417 ML & DM Classification (1) March 14, 2017 80 / 130


Naive Bayes

Naive Bayes Classifier

Assume target function $f : X \to V$, where each instance $x$ is described by
attributes $\langle a_1, a_2 \ldots a_n \rangle$.
Most probable value of $f(x)$ is:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j|a_1, a_2 \ldots a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2 \ldots a_n|v_j)\,P(v_j)}{P(a_1, a_2 \ldots a_n)} = \arg\max_{v_j \in V} P(a_1, a_2 \ldots a_n|v_j)\,P(v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 81 / 130


Naive Bayes

Naive Bayes Classifier

Naive Bayes assumption:
$$P(a_1, a_2 \ldots a_n|v_j) = \prod_i P(a_i|v_j)$$

• Attributes are statistically independent (given the class value)
• which means knowledge about the value of a particular attribute tells
us nothing about the value of another attribute (if the class is known)
which gives the
$$\text{Naive Bayes classifier: } v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 82 / 130


Naive Bayes

Naive Bayes Algorithm

Naive_Bayes_Learn(examples)

    For each target value $v_j$
        $\hat{P}(v_j) \leftarrow$ estimate $P(v_j)$
        For each attribute value $a_i$ of each attribute $a$
            $\hat{P}(a_i|v_j) \leftarrow$ estimate $P(a_i|v_j)$

Classify_New_Instance(x)

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 83 / 130
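
A hedged Python sketch of Naive_Bayes_Learn and Classify_New_Instance for nominal attributes, using simple add-one pseudo-counts at classification time to avoid the zero-frequency problem discussed a few slides later; all names and the toy examples are our own.

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, class_label). Returns class counts,
    per-(class, attribute) value counts, the values seen per attribute, and n."""
    class_counts = Counter(v for _, v in examples)
    value_counts = defaultdict(Counter)   # (class, attribute) -> Counter of values
    values_seen = defaultdict(set)        # attribute -> set of observed values
    for attrs, v in examples:
        for a, val in attrs.items():
            value_counts[(v, a)][val] += 1
            values_seen[a].add(val)
    return class_counts, value_counts, values_seen, len(examples)

def naive_bayes_classify(model, x):
    class_counts, value_counts, values_seen, n = model
    best_v, best_score = None, -1.0
    for v, n_v in class_counts.items():
        score = n_v / n                   # prior estimate P(v)
        for a, val in x.items():
            counts = value_counts[(v, a)]
            # add-one smoothed estimate of P(a = val | v)
            score *= (counts[val] + 1) / (n_v + len(values_seen[a]))
        if score > best_score:
            best_v, best_score = v, score
    return best_v

examples = [({'outlook': 'sunny', 'windy': 'false'}, 'no'),
            ({'outlook': 'overcast', 'windy': 'true'}, 'yes'),
            ({'outlook': 'rainy', 'windy': 'false'}, 'yes'),
            ({'outlook': 'sunny', 'windy': 'true'}, 'no')]
model = naive_bayes_learn(examples)
print(naive_bayes_classify(model, {'outlook': 'sunny', 'windy': 'true'}))  # 'no'
```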


Naive Bayes

Naive Bayes Example

Consider PlayTennis again . . .


Outlook Temperature Humidity Windy
Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1

Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Play
Yes No
9 5
9/14 5/14

COMP9417 ML & DM Classification (1) March 14, 2017 84 / 130


Naive Bayes

Naive Bayes Example

Say we have the new instance:

$\langle Outlk = sun, Temp = cool, Humid = high, Wind = true \rangle$

We want to compute:
$$v_{NB} = \arg\max_{v_j \in \{\text{“yes”},\text{“no”}\}} P(v_j) \prod_i P(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 85 / 130


Naive Bayes

Naive Bayes Example

So we first calculate the likelihood of the two classes, “yes” and “no”

for “yes” = $P(y) \times P(sun|y) \times P(cool|y) \times P(high|y) \times P(true|y)$
$$0.0053 = \frac{9}{14} \times \frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9}$$

for “no” = $P(n) \times P(sun|n) \times P(cool|n) \times P(high|n) \times P(true|n)$
$$0.0206 = \frac{5}{14} \times \frac{3}{5} \times \frac{1}{5} \times \frac{4}{5} \times \frac{3}{5}$$

COMP9417 ML & DM Classification (1) March 14, 2017 86 / 130


Naive Bayes

Naive Bayes Example

Then convert to a probability by normalisation

$$P(\text{“yes”}) = \frac{0.0053}{0.0053 + 0.0206} = 0.205$$
$$P(\text{“no”}) = \frac{0.0206}{0.0053 + 0.0206} = 0.795$$

The Naive Bayes classification is “no”.

COMP9417 ML & DM Classification (1) March 14, 2017 87 / 130
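
A short numeric check of the calculation above, using the relative frequencies from the PlayTennis table (arithmetic only).

```python
# Relative frequencies read off the PlayTennis counts table
p_yes, p_no = 9/14, 5/14
lik_yes = p_yes * (2/9) * (3/9) * (3/9) * (3/9)   # sunny, cool, high, windy=true | yes
lik_no  = p_no  * (3/5) * (1/5) * (4/5) * (3/5)   # sunny, cool, high, windy=true | no

print(round(lik_yes, 4), round(lik_no, 4))                  # 0.0053 0.0206
total = lik_yes + lik_no
print(round(lik_yes / total, 3), round(lik_no / total, 3))  # 0.205 0.795 -> classify "no"
```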


Naive Bayes

Naive Bayes: Subtleties


Conditional independence assumption is often violated
$$P(a_1, a_2 \ldots a_n|v_j) = \prod_i P(a_i|v_j)$$

• ...but it works surprisingly well anyway. Note we don’t need the estimated
posteriors $\hat{P}(v_j|x)$ to be correct; we need only that
$$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n|v_j)$$
i.e. maximum probability is assigned to the correct class


• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors often unrealistically close to 1 or 0
• adding too many redundant attributes will cause problems (e.g.
identical attributes)
COMP9417 ML & DM Classification (1) March 14, 2017 88 / 130
Naive Bayes

Naive Bayes: “zero-frequency” problem

What if none of the training instances with target value vj have attribute
value ai ? Then
$\hat{P}(a_i|v_j) = 0$, and...
$$\hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = 0$$

Pseudo-counts add 1 to each count (a version of the Laplace Estimator)

COMP9417 ML & DM Classification (1) March 14, 2017 89 / 130


Naive Bayes

Naive Bayes: “zero-frequency” problem

• In some cases adding a constant different from 1 might be more
appropriate
• Example: attribute outlook for class yes

  Sunny: $\frac{2 + \mu/3}{9 + \mu}$   Overcast: $\frac{4 + \mu/3}{9 + \mu}$   Rainy: $\frac{3 + \mu/3}{9 + \mu}$

• Weights don’t need to be equal (if they sum to 1) – a form of prior

  Sunny: $\frac{2 + \mu p_1}{9 + \mu}$   Overcast: $\frac{4 + \mu p_2}{9 + \mu}$   Rainy: $\frac{3 + \mu p_3}{9 + \mu}$

COMP9417 ML & DM Classification (1) March 14, 2017 90 / 130


Naive Bayes

Naive Bayes: “zero-frequency” problem

This generalisation is a Bayesian estimate for $\hat{P}(a_i|v_j)$
$$\hat{P}(a_i|v_j) \leftarrow \frac{n_c + mp}{n + m}$$
where
• $n$ is the number of training examples for which $v = v_j$,
• $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$
• $p$ is a prior estimate for $\hat{P}(a_i|v_j)$
• $m$ is the weight given to the prior (i.e. number of “virtual” examples)
This is called the m-estimate of probability.

COMP9417 ML & DM Classification (1) March 14, 2017 91 / 130


Naive Bayes

Naive Bayes: missing values

• Training: instance is not included in frequency count for attribute


value-class combination
• Classification: attribute will be omitted from calculation

COMP9417 ML & DM Classification (1) March 14, 2017 92 / 130


Naive Bayes

Naive Bayes: numeric attributes

• Usual assumption: attributes have a normal or Gaussian probability


distribution (given the class)
• The probability density function for the normal distribution is defined
by two parameters:
The sample mean $\mu$:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The standard deviation $\sigma$:
$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}$$

COMP9417 ML & DM Classification (1) March 14, 2017 93 / 130


Naive Bayes

Naive Bayes: numeric attributes

Then we have the density function $f(x)$:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Example: continuous attribute temperature with mean = 73 and standard
deviation = 6.2. Density value
$$f(\textit{temperature} = 66\,|\,\text{“yes”}) = \frac{1}{\sqrt{2\pi}\cdot 6.2} e^{-\frac{(66-73)^2}{2 \times 6.2^2}} = 0.0340$$
Missing values during training are not included in calculation of mean and
standard deviation.

COMP9417 ML & DM Classification (1) March 14, 2017 94 / 130
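
A two-line check of the density value computed above, with the mean and standard deviation from the example.

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(round(gaussian_density(66, mu=73, sigma=6.2), 4))   # 0.0340
```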


Naive Bayes

Naive Bayes: numeric attributes

Note: the normal distribution is based on the simple exponential function
$$f(x) = e^{-|x|^m}$$
As the power $m$ in the exponent increases, the function approaches a step
function.
With $m = 2$,
$$f(x) = e^{-|x|^2}$$
and this is the basis of the normal distribution – the various constants are
the result of scaling so that the integral (the area under the curve from
$-\infty$ to $+\infty$) is equal to 1.
from “Statistical Computing” by Michael J. Crawley (2002) Wiley.

COMP9417 ML & DM Classification (1) March 14, 2017 95 / 130


Naive Bayes

Categorical random variables


Categorical variables or features (also called discrete or nominal) are
ubiquitous in machine learning.
• Perhaps the most common form of the Bernoulli distribution models
whether or not a word occurs in a document. That is, for the i-th
word in our vocabulary we have a random variable Xi governed by a
Bernoulli distribution. The joint distribution over the bit vector
X = (X1 , . . . , Xk ) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for
example, every word position in an e-mail corresponds to a categorical
variable with k outcomes, where k is the size of the vocabulary. The
multinomial distribution manifests itself as a count vector: a
histogram of the number of occurrences of all vocabulary words in a
document. This establishes an alternative way of modelling text
documents that allows the number of occurrences of a word to
influence the classification of a document.
COMP9417 ML & DM Classification (1) March 14, 2017 96 / 130
Naive Bayes

Categorical random variables

Both these document models are in common use. Despite their


differences, they both assume independence between word occurrences,
generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of
the multinomial distribution, which assumes that words at different
word positions are drawn independently from the same categorical
distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit
vector are statistically independent, which allows us to compute the
joint probability of a particular bit vector (x1 , . . . , xk ) as the product
of the probabilities of each component P (Xi = xi ).
• In practice, such word independence assumptions are often not true:
if we know that an e-mail contains the word ‘Viagra’, we can be quite
sure that it will also contain the word ‘pill’. Violated independence
assumptions reduce the quality of probability estimates but may still
allow good classification performance.
COMP9417 ML & DM Classification (1) March 14, 2017 97 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Example application: Learning to Classify Text

In machine learning, the classic example of applications of Naive Bayes is


learning to classify text documents.

Here is a simplified version in the multinomial document model.

COMP9417 ML & DM Classification (1) March 14, 2017 98 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

Naive Bayes is among most effective algorithms

What attributes shall we use to represent text documents??

COMP9417 ML & DM Classification (1) March 14, 2017 99 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Target concept Interesting? : Document → {+, −}


1 Represent each document by vector of words
• one attribute per word position in document
2 Learning: Use training examples to estimate
• P (+)
• P (−)
• P (doc|+)
• P (doc|−)

COMP9417 ML & DM Classification (1) March 14, 2017 100 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Naive Bayes conditional independence assumption
$$P(doc|v_j) = \prod_{i=1}^{\mathrm{length}(doc)} P(a_i = w_k|v_j)$$
where $P(a_i = w_k|v_j)$ is the probability that the word in position $i$ is $w_k$, given $v_j$

one more assumption: $P(a_i = w_k|v_j) = P(a_m = w_k|v_j), \forall i, m$

“bag of words”

COMP9417 ML & DM Classification (1) March 14, 2017 101 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Learn_naive_Bayes_text(Examples, V)

// collect all words and other tokens that occur in Examples
Vocabulary ← all distinct words and other tokens in Examples
// calculate the required P(vj) and P(wk|vj) probability terms
for each target value vj in V do
    docsj ← subset of Examples for which the target value is vj
    P(vj) ← |docsj| / |Examples|
    Textj ← a single document created by concatenating all members of docsj
    n ← total number of words in Textj (counting duplicate words multiple times)
    for each word wk in Vocabulary
        nk ← number of times word wk occurs in Textj
        P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

COMP9417 ML & DM Classification (1) March 14, 2017 102 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Classify_naive_Bayes_text(Doc)

• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return $v_{NB}$, where
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i|v_j)$$

COMP9417 ML & DM Classification (1) March 14, 2017 103 / 130
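
A hedged Python sketch of Learn_naive_Bayes_text and Classify_naive_Bayes_text above, using word counts with Laplace smoothing and log-probabilities to avoid underflow; the toy documents and all names are our own.

```python
import math
from collections import Counter, defaultdict

def learn_naive_bayes_text(examples, classes):
    """examples: list of (document_string, class). Returns the vocabulary,
    log class priors, and per-class smoothed word log-probabilities."""
    vocabulary = set(w for doc, _ in examples for w in doc.split())
    log_prior, log_p_word = {}, defaultdict(dict)
    for v in classes:
        docs_v = [doc for doc, c in examples if c == v]
        log_prior[v] = math.log(len(docs_v) / len(examples))
        counts = Counter(w for doc in docs_v for w in doc.split())
        n = sum(counts.values())
        for w in vocabulary:
            log_p_word[v][w] = math.log((counts[w] + 1) / (n + len(vocabulary)))
    return vocabulary, log_prior, log_p_word

def classify_naive_bayes_text(model, doc):
    vocabulary, log_prior, log_p_word = model
    words = [w for w in doc.split() if w in vocabulary]   # ignore out-of-vocabulary tokens
    scores = {v: log_prior[v] + sum(log_p_word[v][w] for w in words) for v in log_prior}
    return max(scores, key=scores.get)

examples = [("win money now", "spam"), ("cheap money win", "spam"),
            ("meeting agenda today", "ham"), ("project meeting notes", "ham")]
model = learn_naive_bayes_text(examples, ["spam", "ham"])
print(classify_naive_bayes_text(model, "win cheap money"))         # 'spam'
print(classify_naive_bayes_text(model, "agenda for the meeting"))  # 'ham'
```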


Naive Bayes Learning and using a naive Bayes model for classification

Application: 20 Newsgroups
Given: 1000 training documents from each group
Learning task: classify each new document by newsgroup it came from
comp.graphics misc.forsale
comp.os.ms-windows.misc rec.autos
comp.sys.ibm.pc.hardware rec.motorcycles
comp.sys.mac.hardware rec.sport.baseball
comp.windows.x rec.sport.hockey
alt.atheism sci.space
soc.religion.christian sci.crypt
talk.religion.misc sci.electronics
talk.politics.mideast sci.med
talk.politics.misc
talk.politics.guns

Naive Bayes: 89% classification accuracy


COMP9417 ML & DM Classification (1) March 14, 2017 104 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Article from rec.sport.hockey

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: [email protected] (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate
for pleasant surprise is Alex Zhitnik. He came highly touted as
a defensive defenseman, but he’s clearly much more than that.
Great skater and hard shot (though wish he were more accurate).
In fact, he pretty much allowed the Kings to trade away that
huge defensive liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any good to begin
with. But, at best, he’s only a mediocre goaltender. A better
choice would be Tomas Sandstrom, though not through any fault
of his own, but because some thugs in Toronto decided ...

COMP9417 ML & DM Classification (1) March 14, 2017 105 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Learning Curve for 20 Newsgroups

[Figure: learning curves on 20News for Bayes, TFIDF and PRTFIDF.]

Accuracy vs. Training set size (1/3 withheld for test)

COMP9417 ML & DM Classification (1) March 14, 2017 106 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Probabilistic decision rules

We have chosen one of the possible distributions to model our data X as


coming from either class.
• The more different P (X|Y = spam) and P (X|Y = ham) are, the
more useful the features X are for classification.
• Thus, for a specific e-mail x we calculate both P (X = x|Y = spam)
and P (X = x|Y = ham), and apply one of several possible decision
rules:
maximum likelihood (ML) – predict $\arg\max_y P(X = x|Y = y)$;
maximum a posteriori (MAP) – predict $\arg\max_y P(X = x|Y = y)\,P(Y = y)$.
The relation between these two decision rules is that ML classification is
equivalent to MAP classification with a uniform class distribution.

COMP9417 ML & DM Classification (1) March 14, 2017 107 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Probabilistic decision rules

We again use the example of Naive Bayes for text classification to


illustrate, using both the multinomial and multivariate Bernoulli models.

COMP9417 ML & DM Classification (1) March 14, 2017 108 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model

Suppose our vocabulary contains three words a, b and c, and we use a
multivariate Bernoulli model for our e-mails, with parameters
$$\theta^{\oplus} = (0.5, 0.67, 0.33) \qquad \theta^{\ominus} = (0.67, 0.33, 0.33)$$
This means, for example, that the presence of b is twice as likely in spam
(⊕), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is
described by the bit vector x = (1, 1, 0). We obtain likelihoods
$$P(x|\oplus) = 0.5 \cdot 0.67 \cdot (1 - 0.33) = 0.222$$
$$P(x|\ominus) = 0.67 \cdot 0.33 \cdot (1 - 0.33) = 0.148$$
The ML classification of x is thus spam.

COMP9417 ML & DM Classification (1) March 14, 2017 109 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model


In the case of two classes it is often convenient to work with likelihood
ratios and odds.
• The likelihood ratio can be calculated as
$$\frac{P(x|\oplus)}{P(x|\ominus)} = \frac{0.5}{0.67}\,\frac{0.67}{0.33}\,\frac{1-0.33}{1-0.33} = 3/2 > 1$$
• This means that the MAP classification of x is also spam if the prior
odds are more than 2/3, but ham if they are less than that.
• For example, with 33% spam and 67% ham the prior odds are
$\frac{P(\oplus)}{P(\ominus)} = \frac{0.33}{0.67} = 1/2$, resulting in posterior odds of
$$\frac{P(\oplus|x)}{P(\ominus|x)} = \frac{P(x|\oplus)}{P(x|\ominus)}\,\frac{P(\oplus)}{P(\ominus)} = 3/2 \cdot 1/2 = 3/4 < 1$$
In this case the likelihood ratio for x is not strong enough to push the
decision away from the prior.
COMP9417 ML & DM Classification (1) March 14, 2017 110 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model


Alternatively, we can employ a multinomial model. The parameters of a
multinomial establish a distribution over the words in the vocabulary, say
θ⊕ = (0.3, 0.5, 0.2)        θ⊖ = (0.6, 0.2, 0.2)
The e-mail to be classified contains three occurrences of word a, a single
occurrence of word b and no occurrences of word c, and hence is described
by the count vector x = (3, 1, 0). The total number of vocabulary word
occurrences is n = 4. We obtain likelihoods
P (x|⊕) = 4! · (0.3³/3!) · (0.5¹/1!) · (0.2⁰/0!) = 0.054
P (x|⊖) = 4! · (0.6³/3!) · (0.2¹/1!) · (0.2⁰/0!) = 0.1728

The likelihood ratio is (0.3/0.6)³ · (0.5/0.2)¹ · (0.2/0.2)⁰ = 5/16. The ML classification
of x is thus ham, the opposite of the multivariate Bernoulli model. This is
mainly because of the three occurrences of word a, which provide strong
evidence for ham.
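A minimal sketch of the multinomial likelihood computation (the function name is my own choice):

from math import factorial, prod

# Multinomial naive Bayes likelihood of a count vector given class parameters theta.
def multinomial_likelihood(theta, counts):
    n = sum(counts)                      # total number of vocabulary word occurrences
    return factorial(n) * prod(t ** c / factorial(c) for t, c in zip(theta, counts))

theta_spam = (0.3, 0.5, 0.2)
theta_ham = (0.6, 0.2, 0.2)
x = (3, 1, 0)

print(multinomial_likelihood(theta_spam, x))   # ≈ 0.054
print(multinomial_likelihood(theta_ham, x))    # ≈ 0.1728  -> ML decision: ham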
COMP9417 ML & DM Classification (1) March 14, 2017 111 / 130
Naive Bayes Learning and using a naive Bayes model for classification

Training data for naive Bayes

A small e-mail data set described by count vectors.

E-mail #a #b #c Class
e1 0 3 0 +
e2 0 3 3 +
e3 3 0 0 +
e4 2 3 0 +
e5 4 3 0 −
e6 4 0 3 −
e7 3 0 0 −
e8 0 0 0 −

COMP9417 ML & DM Classification (1) March 14, 2017 112 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training data for naive Bayes

The same data set described by bit vectors.

E-mail a? b? c? Class
e1 0 1 0 +
e2 0 1 1 +
e3 1 0 0 +
e4 1 1 0 +
e5 1 1 0 −
e6 1 0 1 −
e7 1 0 0 −
e8 0 0 0 −

COMP9417 ML & DM Classification (1) March 14, 2017 113 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

Consider the following e-mails consisting of five words a, b, c, d, e:


e1: b d e b b d e            e5: a b a b a b a e d
e2: b c e b b d d e c c      e6: a c a c a c a e d
e3: a d a d e a e e          e7: e a e d a e a
e4: b a d b e d a b          e8: d e d e d
We are told that the e-mails on the left are spam and those on the right
are ham, and so we use them as a small training set to train our Bayesian
classifier.
• First, we decide that d and e are so-called stop words that are too
common to convey class information.
• The remaining words, a, b and c, constitute our vocabulary. (A short sketch of turning each e-mail into the count and bit vectors of the previous slides follows.)
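A minimal Python sketch (not part of the slides); since every "word" here is a single character, simple character counting suffices:

# Build the count vectors (multinomial) and bit vectors (Bernoulli) from the raw e-mails.
VOCAB = ("a", "b", "c")                    # after removing the stop words d and e
emails = {
    "e1": "bdebbde",   "e2": "bcebbddecc", "e3": "adadeaee", "e4": "badbedab",   # spam
    "e5": "abababaed", "e6": "acacacaed",  "e7": "eaedaea",  "e8": "deded",      # ham
}

def count_vector(text):
    return tuple(text.count(w) for w in VOCAB)

def bit_vector(text):
    return tuple(int(w in text) for w in VOCAB)

for name, text in emails.items():
    print(name, count_vector(text), bit_vector(text))
# e.g. e1 (0, 3, 0) (0, 1, 0) and e4 (2, 3, 0) (1, 1, 0), matching the earlier tables.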

COMP9417 ML & DM Classification (1) March 14, 2017 114 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

For the multinomial model, we represent each e-mail as a count vector, as


before.
• In order to estimate the parameters of the multinomial, we sum up
the count vectors for each class, which gives (5, 9, 3) for spam and
(11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for
each vocabulary word, which brings the total number of occurrences
of vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
θ̂⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam and
θ̂⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham, as checked in the sketch below.
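A minimal sketch of this smoothed estimate (the function name and pseudo-count argument are my own):

# Smoothed multinomial parameters: add one pseudo-count per vocabulary word, then normalise.
def multinomial_params(total_counts, pseudo=1):
    smoothed = [c + pseudo for c in total_counts]
    n = sum(smoothed)
    return tuple(c / n for c in smoothed)

print(multinomial_params((5, 9, 3)))    # (0.3, 0.5, 0.2)  -- spam
print(multinomial_params((11, 3, 3)))   # (0.6, 0.2, 0.2)  -- ham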

COMP9417 ML & DM Classification (1) March 14, 2017 115 / 130


Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

In the multivariate Bernoulli model e-mails are represented by bit vectors,


as before.
• Adding the bit vectors for each class results in (2, 3, 1) for spam and
(3, 1, 1) for ham.
• Each count is to be divided by the number of documents in a class, in
order to get an estimate of the probability of a document containing a
particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one
containing each word and one containing none of them.
• This results in the estimated parameter vectors
θ̂⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam and
θ̂⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham, as checked in the sketch below.
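A matching sketch for the Bernoulli estimates (again with my own function name):

# Smoothed Bernoulli parameters: two pseudo-documents per class, one containing every
# vocabulary word and one containing none.
def bernoulli_params(docs_containing_word, n_docs, pseudo=1):
    return tuple((c + pseudo) / (n_docs + 2 * pseudo) for c in docs_containing_word)

print(bernoulli_params((2, 3, 1), 4))   # (0.5, 0.667, 0.333)   -- spam
print(bernoulli_params((3, 1, 1), 4))   # (0.667, 0.333, 0.333) -- ham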

COMP9417 ML & DM Classification (1) March 14, 2017 116 / 130


Linear discriminants

Linear discriminants

Many forms of linear discriminant from statistics and machine learning,


e.g.,
• Fisher’s Linear Discriminant Analysis
• the basic linear classifier in lecture 1 is a version of this
• Logistic Regression
• a probabilistic linear classifier
• Perceptron
• a linear threshold classifier

• an early version of an artificial “neuron”

• still a useful method, and source of ideas

COMP9417 ML & DM Classification (1) March 14, 2017 117 / 130


Linear discriminants

Logistic regression

In the case of a two-class problem, model the probability of one class


P (Y = 1) vs. the alternative (1 − P (Y = 1)):
P (Y = 1|x) = 1 / (1 + e^(−wᵀx))

or

ln( P (Y = 1|x) / (1 − P (Y = 1|x)) ) = wᵀx

The quantity on the left-hand side is called the logit, so all we are doing is fitting a linear
model for the logit.
Note: fitting this model is more complex than fitting linear regression, so we
omit the details.
Generalises to multiple class versions (Y can have more than two values).
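A minimal prediction-only sketch of the two-class model (the weights below are arbitrary placeholders, not fitted values):

from math import exp

# Logistic regression prediction: P(Y = 1 | x) = 1 / (1 + exp(-w.x)),
# with x in homogeneous coordinates (leading 1 for the intercept).
def p_y1(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-z))

w = (-1.0, 2.0, 0.5)          # placeholder weights: (intercept, w1, w2)
x = (1.0, 0.8, -0.3)
p = p_y1(w, x)
print(p, "class 1" if p > 0.5 else "class 0")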

COMP9417 ML & DM Classification (1) March 14, 2017 118 / 130


Perceptrons

Perceptron

[Figure: a perceptron unit. Inputs x1, . . . , xn with weights w1, . . . , wn, together with a
bias weight w0 on a fixed input x0 = 1, feed the weighted sum Σ_{i=0..n} wi xi; the
output o is 1 if this sum is greater than 0 and −1 otherwise.]

o(x1, . . . , xn) = 1 if w0 + w1 x1 + · · · + wn xn > 0, and −1 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 119 / 130


Perceptrons

Perceptron

Sometimes a simpler vector notation is used, with the fixed input x0 = 1 folded into x:

o(x) = 1 if w · x > 0, and −1 otherwise.

COMP9417 ML & DM Classification (1) March 14, 2017 120 / 130


Perceptrons

Decision Surface of a Perceptron

[Figure: two sets of labelled examples in the (x1, x2) plane. (a) positive and negative
examples that a single linear decision boundary can separate; (b) examples that no single
linear boundary can separate.]

Represents some useful functions


• What weights represent g(x1 , x2 ) = AND(x1 , x2 )? (One possible answer is checked in
the sketch below.)
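A minimal check in Python, assuming inputs in {0, 1}; the particular weights w0 = −0.8, w1 = w2 = 0.5 are one common choice, not given on the slide:

# A perceptron with weights (-0.8, 0.5, 0.5) outputs +1 only when both inputs are 1, i.e. AND.
def perceptron_output(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_output(-0.8, 0.5, 0.5, x1, x2))
# Only (1, 1) gives +1; every other input pair gives -1.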

COMP9417 ML & DM Classification (1) March 14, 2017 121 / 130


Perceptrons

Decision Surface of a Perceptron

But some functions not representable


• e.g., not linearly separable
• a labelled data set is linearly separable if there is a linear decision
boundary that separates the classes
• for non-linearly separable data we’ll need something else
• e.g., networks of these ...
• next lecture

COMP9417 ML & DM Classification (1) March 14, 2017 122 / 130


Perceptrons

Perceptron training rule


Learning is “finding a good set of weights”
Perceptron training rule is a weight-update scheme:

wi ← wi + ∆wi
where
∆wi = η(t − o)xi
Where:
• t = c(x) is target value
• o is perceptron output
• η is a small constant called learning rate
• between 0 and 1
• to simplify things you may assume η = 1
• but in practice usually set at less than 0.2, e.g., 0.1
• η can be varied during learning
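A one-line numeric check of the rule (all values chosen arbitrarily for illustration):

# One application of the perceptron training rule for a single weight w_i.
eta, t, o, x_i = 0.1, 1, -1, 0.8
delta_w = eta * (t - o) * x_i     # 0.1 * (1 - (-1)) * 0.8 = 0.16, so w_i increases
print(delta_w)
# If the output already matches the target (t == o), the update is zero.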

COMP9417 ML & DM Classification (1) March 14, 2017 123 / 130


Perceptrons

Perceptron training rule

Can prove it will converge


• If training data is linearly separable
• and η sufficiently small

COMP9417 ML & DM Classification (1) March 14, 2017 124 / 130


Perceptrons

Perceptron learning

A linear classifier that will achieve perfect separation on linearly separable


data is the perceptron, originally proposed as a simple neural network. The
perceptron iterates over the training set, updating the weight vector every
time it encounters an incorrectly classified example.
• For example, let xi be a misclassified positive example, then we have
yi = +1 and w · xi < t. We therefore want to find w′ such that
w′ · xi > w · xi , which moves the decision boundary towards and
hopefully past xi .
• This can be achieved by calculating the new weight vector as
w′ = w + ηxi , where 0 < η ≤ 1 is the learning rate (again, assume it is
set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
• Similarly, if xj is a misclassified negative example, then we have
yj = −1 and w · xj > t. In this case we calculate the new weight
vector as w′ = w − ηxj , and thus w′ · xj = w · xj − ηxj · xj < w · xj .

COMP9417 ML & DM Classification (1) March 14, 2017 125 / 130


Perceptrons

Perceptron learning

• The two cases can be combined in a single update rule:

w′ = w + ηyi xi

• Here yi acts to change the sign of the update, corresponding to


whether a positive or negative example was misclassified
• This is the basis of the perceptron training algorithm for linear
classification
• The algorithm just iterates over the training examples applying the
weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative
examples, i.e., the data is linearly separable, it can be shown that the
perceptron training algorithm will converge in a finite number of steps.

COMP9417 ML & DM Classification (1) March 14, 2017 126 / 130


Perceptrons

Perceptron training

Algorithm Perceptron(D, η) // perceptron training for linear classification

Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).
1  w ← 0                            // Other initialisations of the weight vector are possible
2  converged ← false
3  while converged = false do
4      converged ← true
5      for i = 1 to |D| do
6          if yi w · xi ≤ 0 then     // i.e., ŷi ≠ yi
7              w ← w + ηyi xi
8              converged ← false     // We changed w so haven't converged yet
9          end
10     end
11 end
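A runnable Python sketch of this algorithm (the helper names and the toy data set are my own; each x carries a leading 1 so the bias is learned as w[0]):

def perceptron_train(D, eta=1.0):
    """Perceptron training on D = [(x, y), ...] with y in {-1, +1} and x in
    homogeneous coordinates (leading 1 acting as the bias input)."""
    w = [0.0] * len(D[0][0])                                       # w <- 0
    converged = False
    while not converged:
        converged = True
        for x, y in D:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:      # misclassified
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]    # w <- w + eta*y*x
                converged = False
    return w

# Toy linearly separable data: positives upper-right, negatives lower-left.
D = [((1, 2.0, 1.5), +1), ((1, 1.5, 2.5), +1),
     ((1, -1.0, -0.5), -1), ((1, -2.0, -1.5), -1)]
w = perceptron_train(D, eta=0.5)
print(w)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in D])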

COMP9417 ML & DM Classification (1) March 14, 2017 127 / 130


Perceptrons

Perceptron training – varying learning rate


[Figure: three scatter plots of the same data set with the perceptron decision boundary
obtained with learning rates η = 0.2 (left), η = 0.5 (middle) and η = 1 (right); see the
caption below.]

(left) A perceptron trained with a small learning rate (η = 0.2). The


circled examples are the ones that trigger the weight update. (middle)
Increasing the learning rate to η = 0.5 leads in this case to a rapid
convergence. (right) Increasing the learning rate further to η = 1 may lead
to too aggressive weight updating, which harms convergence.
The starting point in all three cases was the basic linear classifier.
COMP9417 ML & DM Classification (1) March 14, 2017 128 / 130
Summary

Summary

• Two frameworks for classification by linear models


Distance-based. The key ideas are geometric.
Probabilistic. The key ideas are Bayesian.

COMP9417 ML & DM Classification (1) March 14, 2017 129 / 130


Summary

Summary

We also discussed the Perceptron, a simple form of threshold model.


• The perceptron is a linear classifier
• Early attempt to model artificial “neuron”
• One of the oldest ideas in machine learning, but still useful
• For example, it is an incremental, or online, learning algorithm
• With the “right” features, the target concept may be linearly
separable
• The weight vector is a linear combination of the training instances
• All of the methods we looked at do classification with models that are
essentially linear in the parameters

COMP9417 ML & DM Classification (1) March 14, 2017 130 / 130
