Introduction
Minkowski distance
The Minkowski distance of order p > 0 between two d-dimensional points x and y is defined as

$$\mathrm{Dis}_p(x, y) = \left( \sum_{j=1}^{d} |x_j - y_j|^p \right)^{1/p} = \|x - y\|_p$$

where $\|z\|_p = \left( \sum_{j=1}^{d} |z_j|^p \right)^{1/p}$ is the p-norm (sometimes denoted Lp norm) of the vector z.
• If we now let p grow larger, the distance is more and more dominated by the largest coordinate-wise difference, from which we can infer that $\mathrm{Dis}_\infty(x, y) = \max_j |x_j - y_j|$; this is also called Chebyshev distance.
• You will sometimes see references to the 0-norm (or L0 norm) which
counts the number of non-zero elements in a vector. The
corresponding distance then counts the number of positions in which
vectors x and y differ. This is not strictly a Minkowski distance;
however, we can define it as
$$\mathrm{Dis}_0(x, y) = \sum_{j=1}^{d} (x_j - y_j)^0 = \sum_{j=1}^{d} I[x_j \neq y_j]$$

using the convention that $0^0 = 0$.
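As a concrete illustration (not from the original slides), here is a minimal Python sketch of the Minkowski family, including the $p = \infty$ and '0-norm' special cases; the function names are our own.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p: p = 1 is Manhattan, p = 2 is
    Euclidean, and p = inf gives the Chebyshev distance max_j |x_j - y_j|."""
    z = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(z.max())
    return float((z ** p).sum() ** (1.0 / p))

def dis0(x, y):
    """The '0-norm' distance: number of positions in which x and y differ."""
    return int((np.asarray(x) != np.asarray(y)).sum())

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 2))          # 5.0 (Euclidean)
print(minkowski(x, y, 1))          # 7.0 (Manhattan)
print(minkowski(x, y, np.inf))     # 4.0 (Chebyshev)
print(dis0([1, 0, 2], [1, 1, 2]))  # 1
```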
Distance metric
[Figure: three points A, B and C in the plane, with circles illustrating the distances discussed below]
The green circle connects points the same Euclidean distance (i.e.,
Minkowski distance of order p = 2) away from the origin as A. The orange
circle shows that B and C are equidistant from A. The red circle
demonstrates that C is closer to the origin than B, which conforms to the
triangle inequality.
Distance-based models
[Figure: the same three points A, B and C, now with Manhattan-distance (p = 1) and p = 0.8 'circles']
With Manhattan distance (p = 1), B and C are equally close to the origin
and also equidistant from A. With p < 1 (here, p = 0.8) C is further away
from the origin than B; since both are again equidistant from A, it follows
that travelling from the origin to C via A is quicker than going there
directly, which violates the triangle inequality.
Distance-based models
Mahalanobis distance
Often, the shape of the ellipse is estimated from data as the inverse of the
covariance matrix: M = Σ−1 . This leads to the definition of the
Mahalanobis distance
$$\mathrm{Dis}_M(x, y \mid \Sigma) = \sqrt{(x - y)^{\mathrm{T}} \Sigma^{-1} (x - y)}$$
Using the covariance matrix in this way has the effect of decorrelating and
normalising the features.
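A small sketch (our own illustration, not from the slides) of how the Mahalanobis distance might be computed with NumPy, estimating $\Sigma$ from data as in the text; the synthetic data and function name are illustrative only.

```python
import numpy as np

def mahalanobis(x, y, sigma):
    """Dis_M(x, y | Sigma) = sqrt((x - y)^T Sigma^-1 (x - y))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(sigma) @ d))

# Estimate Sigma from (synthetic, correlated) data, as in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Sigma = np.cov(X, rowvar=False)
print(mahalanobis(X[0], X.mean(axis=0), Sigma))
```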
Setting the gradient of the total squared Euclidean distance $\sum_{x \in D} \|x - y\|_2^2$ to zero, we derive $y = \frac{1}{|D|} \sum_{x \in D} x = \mu$: the squared 2-norm centroid of a data set D is its mean.
[Figure: a scatter plot of the 10 points on axes from −2 to 2, with legend: data points; squared 2-norm centroid (mean); 2-norm centroid (geometric median); squared 2-norm medoid; 2-norm medoid; 1-norm medoid]
A small data set of 10 points, with circles indicating centroids and squares
indicating medoids (the latter must be data points), for different distance
metrics. Notice how the outlier on the bottom-right ‘pulls’ the mean away
from the geometric median; as a result the corresponding medoid changes
as well.
Neighbours and exemplars
(left) For two exemplars the nearest-exemplar decision rule with Euclidean
distance results in a linear decision boundary coinciding with the
perpendicular bisector of the line connecting the two exemplars. (right)
Using Manhattan distance the circles are replaced by diamonds.
Nearest Neighbour
Training algorithm:
• For each training example $\langle x_i, f(x_i) \rangle$, add the example to the list training_examples.
Classification algorithm:
• Given a query instance $x_q$ to be classified,
• Let $x_1, \ldots, x_k$ be the k instances from training_examples that are nearest to $x_q$ according to the distance function
• Return

$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

where $\delta(a, b) = 1$ if $a = b$ and 0 otherwise.
[Figure: a query point $x_q$ surrounded by positive (+) and negative (−) training examples in two dimensions]
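The classification rule above can be sketched in a few lines of Python; this is an illustrative implementation, not the course's reference code, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k):
    """Majority vote among the k training points nearest to x_q
    (Euclidean distance), i.e. argmax_v sum_i delta(v, f(x_i))."""
    dists = np.linalg.norm(X_train - x_q, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]                # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.]])
y_train = np.array(['-', '-', '-', '+', '+'])
print(knn_classify(X_train, y_train, np.array([0.5, 0.5]), k=3))  # '-'
```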
• Numeric attributes are commonly rescaled to [0, 1]:

$$a_r = \frac{v_r - \min v_r}{\max v_r - \min v_r}$$

where $v_r$ is the actual value of attribute r
• Nominal attributes: distance either 0 or 1
• Common policy for missing values: assumed to be maximally distant
(given normalized attributes)
Advantages:
• Statisticians have used k-NN since the early 1950s
• Can be very accurate
• As $n \to \infty$ and $k/n \to 0$, error approaches the minimum achievable (Bayes) error
• Training is very fast
• Can learn complex target functions
• Don't lose information by generalisation: all instances are kept
Disadvantages:
• Slow at query time: the basic algorithm scans the entire training data to derive a prediction
• Assumes all attributes are equally important, so is easily fooled by irrelevant attributes
• Remedy: attribute selection or attribute weights
• Problem of noisy instances:
• Remedy: remove noisy instances from the data set
• not easy: how do we know which instances are noisy?
Nearest-neighbour classifier
Distance-Weighted kNN
$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i\, \delta(v, f(x_i))$$

where

$$w_i \equiv \frac{1}{d(x_q, x_i)^2}$$

and $d(x_q, x_i)$ is the distance between $x_q$ and $x_i$.
Distance-Weighted kNN
For real-valued target functions replace the final line of the algorithm by:

$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$$
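A hedged sketch of this real-valued, distance-weighted version in Python; the guard for zero distance reflects the usual convention of returning $f(x_i)$ when the query coincides with a training point, and the toy data is our own.

```python
import numpy as np

def dw_knn_regress(X_train, f_train, x_q, k):
    """Distance-weighted k-NN for a real-valued target:
    f^(x_q) = sum_i w_i f(x_i) / sum_i w_i, with w_i = 1 / d(x_q, x_i)^2."""
    d = np.linalg.norm(X_train - x_q, axis=1)
    idx = np.argsort(d)[:k]
    if d[idx[0]] == 0.0:
        # Query coincides with a training point; return its target value
        # (the usual convention, since w_i would be infinite).
        return float(f_train[idx[0]])
    w = 1.0 / d[idx] ** 2
    return float(np.sum(w * f_train[idx]) / np.sum(w))

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
f_train = np.array([0.0, 1.0, 4.0, 9.0])
print(dw_knn_regress(X_train, f_train, np.array([1.5]), k=2))  # 2.5
```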
Curse of Dimensionality
One approach:
• Stretch jth axis by weight zj , where z1 , . . . , zn chosen to minimize
prediction error
• Use cross-validation to automatically choose weights z1 , . . . , zn
• Note setting zj to zero eliminates this dimension altogether
See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross Validation Error”
Edited NN
Uncertainty
Bayes Theorem
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
where
• $P(h)$ = prior probability of hypothesis h
• $P(D)$ = prior probability of training data D
• $P(h|D)$ = probability of h given D
• $P(D|h)$ = probability of D given h
Choosing Hypotheses
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$
Choosing Hypotheses
If we assume $P(h_i) = P(h_j)$ for all i and j, then we can simplify further and choose the maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$$
Example: a patient takes a lab test for a form of cancer and the result is positive (⊕). The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result (⊖) in 97% of the cases in which it is not. Furthermore, 0.8% of the entire population have this cancer.

P(cancer) = 0.008, P(¬cancer) = 0.992
P(⊕ | cancer) = 0.98, P(⊖ | cancer) = 0.02
P(⊕ | ¬cancer) = 0.03, P(⊖ | ¬cancer) = 0.97

P(⊕ | cancer)P(cancer) = 0.98 × 0.008 = 0.0078
P(⊕ | ¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.0298

Thus $h_{MAP} = \lnot cancer$.
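A quick numerical check of this example (assuming the standard figures above):

```python
# Checking the MAP computation for the positive test result.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_cancer, p_pos_not = 0.98, 0.03

score_cancer = p_pos_cancer * p_cancer    # P(+|cancer)P(cancer)   = 0.0078
score_not    = p_pos_not * p_not_cancer   # P(+|~cancer)P(~cancer) = 0.0298
print('hMAP =', 'cancer' if score_cancer > score_not else 'not cancer')
# Normalising gives the posterior P(cancer|+) ~ 0.21:
print(score_cancer / (score_cancer + score_not))
```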
$$P(h|D) = \frac{P(D|h)\,P(h)}{\sum_{h_i \in H} P(D|h_i)\,P(h_i)}$$
P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then:

$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$
$$P(A|B) = \frac{P(A \land B)}{P(B)}$$
• Rearrange sum rule to get:
P (A ∧ B) = P (A) + P (B) − P (A ∨ B)
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
Choose P (D|h):
• P (D|h) = 1 if h consistent with D
• P (D|h) = 0 otherwise
Then:

$$P(h|D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\[6pt] 0 & \text{otherwise} \end{cases}$$
For h consistent with D:

$$P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$$
$$\begin{aligned} P(D) &= \sum_{h_i \in H} P(D|h_i)\,P(h_i) \\ &= \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} + \sum_{h_i \notin VS_{H,D}} 0 \cdot \frac{1}{|H|} \\ &= \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} \\ &= \frac{|VS_{H,D}|}{|H|} \end{aligned}$$
[Figure: evolution of the posterior probabilities: $P(h)$, $P(h|D_1)$, $P(h|D_1, D_2)$]
For a real-valued target with Gaussian noise, the maximum likelihood hypothesis $h_{ML}$ is:

$$\begin{aligned} h_{ML} &= \arg\max_{h \in H} \sum_{i=1}^{m} \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\ &= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\ &= \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2 \end{aligned}$$
Which is equivalent to minimising the sum of squared errors:

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$
Or:

$$h_{MAP} = \arg\min_{h \in H} \left( -\log_2 P(D|h) - \log_2 P(h) \right) \qquad (1)$$
So interpret (1):
• $-\log_2 P(h)$ is the length of h under the optimal code
• $-\log_2 P(D|h)$ is the length of D given h under the optimal code
Note well: this assumes optimal encodings, which require the priors and likelihoods to be known. In practice this is difficult, and it makes a difference.
So far we’ve sought the most probable hypothesis given the data D (i.e.,
hM AP )
Consider:
• Three possible hypotheses:
P (h1 |D) = .4, P (h2 |D) = .3, P (h3 |D) = .3
• Given new instance x,
h1 (x) = +, h2 (x) = −, h3 (x) = −
• What’s most probable classification of x?
Example: since each hypothesis predicts a single class, $P(v_j|h_i)$ is 1 if $h_i(x) = v_j$ and 0 otherwise; therefore

$$\sum_{h_i \in H} P(+|h_i)\,P(h_i|D) = .4 \qquad \sum_{h_i \in H} P(-|h_i)\,P(h_i|D) = .6$$

and

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) = -$$
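The same computation in a short Python sketch (our own illustration): because $P(v_j|h_i)$ is 1 when $h_i$ predicts $v_j$ and 0 otherwise, the sum reduces to adding up the posteriors of the agreeing hypotheses.

```python
# Bayes optimal classification for the three-hypothesis example.
posterior = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}   # P(h_i | D)
predicts  = {'h1': '+', 'h2': '-', 'h3': '-'}   # h_i(x)

score = {v: sum(p for h, p in posterior.items() if predicts[h] == v)
         for v in ('+', '-')}
print(score)                      # {'+': 0.4, '-': 0.6}
print(max(score, key=score.get))  # '-'
```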
No other classification method using the same hypothesis space and same
prior knowledge can outperform this method on average
Naive Bayes classifier

Along with decision trees, neural networks and nearest neighbour, one of the most practical learning methods.
When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given
classification
Successful applications:
• Classifying text documents
• Gaussian Naive Bayes for real-valued data
Outlook    Yes  No    Temperature  Yes  No    Humidity  Yes  No    Windy  Yes  No
Sunny      2/9  3/5   Hot          2/9  2/5   High      3/9  4/5   False  6/9  2/5
Overcast   4/9  0/5   Mild         4/9  2/5   Normal    6/9  1/5   True   3/9  3/5
Rainy      3/9  2/5   Cool         3/9  1/5

Play       Yes   No
           9     5
           9/14  5/14
We want to compute:
$$v_{NB} = \arg\max_{v_j \in \{\text{yes}, \text{no}\}} P(v_j) \prod_i P(a_i|v_j)$$
Consider a new day with Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True. We first calculate the likelihood of the two classes, "yes" and "no":

likelihood("yes") = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
likelihood("no") = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Normalising into probabilities:

$$P(\text{"yes"}) = \frac{0.0053}{0.0053 + 0.0206} = 0.205 \qquad P(\text{"no"}) = \frac{0.0206}{0.0053 + 0.0206} = 0.795$$
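A quick check of the arithmetic, assuming the new day described above:

```python
# Naive Bayes arithmetic for the new day (Sunny, Cool, High, Windy=True).
like_yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14   # P(a_i|yes) terms times P(yes)
like_no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14   # P(a_i|no) terms times P(no)
print(round(like_yes, 4), round(like_no, 4))      # 0.0053 0.0206
print(round(like_yes / (like_yes + like_no), 3))  # 0.205
print(round(like_no  / (like_yes + like_no), 3))  # 0.795
```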
What if none of the training instances with target value vj have attribute
value ai ? Then
$\hat{P}(a_i|v_j) = 0$, and...

$$\hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = 0$$
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Example: continuous attribute temperature with mean = 73 and standard
deviation = 6.2. Density value
$$f(\text{temperature} = 66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi} \cdot 6.2}\, e^{-\frac{(66-73)^2}{2 \times 6.2^2}} = 0.0340$$
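A one-function check of this density value (illustrative, using only the standard library):

```python
import math

def normal_density(x, mu, sigma):
    """f(x) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))"""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

print(round(normal_density(66, 73, 6.2), 4))  # 0.034
```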
Missing values during training are not included in calculation of mean and
standard deviation.
Learning to classify text

Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic
Representation: "bag of words" (word counts, ignoring word order).
Application: 20 Newsgroups
Given: 1000 training documents from each group
Learning task: classify each new document by newsgroup it came from
comp.graphics misc.forsale
comp.os.ms-windows.misc rec.autos
comp.sys.ibm.pc.hardware rec.motorcycles
comp.sys.mac.hardware rec.sport.baseball
comp.windows.x rec.sport.hockey
alt.atheism sci.space
soc.religion.christian sci.crypt
talk.religion.misc sci.electronics
talk.politics.mideast sci.med
talk.politics.misc
talk.politics.guns
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: [email protected] (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate
for pleasant surprise is Alex Zhitnik. He came highly touted as
a defensive defenseman, but he’s clearly much more than that.
Great skater and hard shot (though wish he were more accurate).
In fact, he pretty much allowed the Kings to trade away that
huge defensive liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any good to begin
with. But, at best, he’s only a mediocre goaltender. A better
choice would be Tomas Sandstrom, though not through any fault
of his own, but because some thugs in Toronto decided ...
[Figure: learning curves on 20News: classification accuracy (%) versus number of training examples (100 to 10000, log scale) for Bayes, TFIDF and PRTFIDF]
This means, for example, that the presence of b is twice as likely in spam
(+), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). The training data from which the likelihoods are estimated is shown below: first as word counts, then reduced to binary 'word present?' features.
E-mail #a #b #c Class
e1 0 3 0 +
e2 0 3 3 +
e3 3 0 0 +
e4 2 3 0 +
e5 4 3 0 −
e6 4 0 3 −
e7 3 0 0 −
e8 0 0 0 −
E-mail a? b? c? Class
e1 0 1 0 +
e2 0 1 1 +
e3 1 0 0 +
e4 1 1 0 +
e5 1 1 0 −
e6 1 0 1 −
e7 1 0 0 −
e8 0 0 0 −
Linear discriminants
Logistic regression
$$\ln \frac{P(Y=1|x)}{1 - P(Y=1|x)} = w^{\mathrm{T}} x$$
The quantity on the l.h.s. is called the logit, and all we are doing is fitting a linear model for the logit.
Note: fitting this model is more complex than fitting linear regression (there is no closed-form solution; iterative methods are needed), so we omit the details.
Generalises to multiple class versions (Y can have more than two values).
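To make the link concrete, here is a minimal sketch (not from the slides) showing how a fitted weight vector would be turned into a probability by inverting the logit; the weights shown are hypothetical.

```python
import numpy as np

def p_y1_given_x(w, x):
    """Invert the logit: ln(p / (1 - p)) = w.x  =>  p = 1 / (1 + exp(-w.x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([0.5, -1.2, 2.0])  # hypothetical fitted weights
x = np.array([1.0, 0.3, 0.7])   # leading 1 so w[0] acts as the intercept
print(p_y1_given_x(w, x))       # ~0.82
```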
Perceptron
[Figure: a perceptron unit: inputs $x_1, \ldots, x_n$ with weights $w_1, \ldots, w_n$, plus a fixed input $x_0 = 1$ with bias weight $w_0$; the unit computes $S = \sum_{i=0}^{n} w_i x_i$ and outputs 1 if $S > 0$, $-1$ otherwise]

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$
Perceptron
[Figure: two 2-D data sets of + and − examples: (a) linearly separable, so a perceptron can represent the boundary; (b) not linearly separable]
$$w_i \leftarrow w_i + \Delta w_i \qquad \text{where} \qquad \Delta w_i = \eta\,(t - o)\,x_i$$

Where:
• $t = c(x)$ is the target value
• $o$ is the perceptron output
• $\eta$ is a small constant called the learning rate
• between 0 and 1
• to simplify things you may assume $\eta = 1$
• but in practice usually set at less than 0.2, e.g. 0.1
• $\eta$ can be varied during learning
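A compact sketch of this training rule in Python; the epoch loop, the stopping test and the OR data set are our own illustrative choices, not part of the slides.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until no example is misclassified.
    Each row of X starts with a constant 1 so that w[0] is the bias w_0."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != target:
                w += eta * (target - o) * x
                errors += 1
        if errors == 0:  # converged; guaranteed only for linearly separable data
            break
    return w

# Learning logical OR, which is linearly separable.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1])
w = train_perceptron(X, t)
print(w, [1 if np.dot(w, x) > 0 else -1 for x in X])
```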
Perceptron learning

On encountering a misclassified example $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, the weight vector is updated as

$$w' = w + \eta\, y_i x_i$$
Perceptron training
[Figure: three snapshots of perceptron training in 2-D (axes from −3 to 3), showing the linear decision boundary being updated]
Summary