
Lecture 3

In this lecture we begin the study of statistical learning theory in the case of binary
classification. We will characterize the best possible classifier in the binary case, and
relate notions of classification error to each other.

3.1 Binary Classification


A binary classifier is a function

h : X → Y = {0, 1},

where X is a space of features. We consider the loss function

L(h(x), y) = 1{h(x) ≠ y},

which equals 1 if h(x) ≠ y and 0 if h(x) = y.
We would like to learn a classifier from a set of observations
{(x1 , y1 ), . . . , (xn , yn )} ⊂ X × Y. (3.1)
However, the classifier should not only match the data, but also generalize in order to be
able to classify unseen data. For this, we assume that the points in (3.1) are drawn
from a probability distribution on X × Y, and replace each data point (xi , yi ) in (3.1)
with a copy (Xi , Yi ) of a random variable (X, Y ) on X × Y. We are now after a
classifier h such that the expected value of the empirical risk
R̂(h) = (1/n) ∑_{i=1}^n 1{h(Xi) ≠ Yi}    (3.2)

is small. We can write this expectation as


E[R̂(h)] = (1/n) ∑_{i=1}^n E[1{h(Xi) ≠ Yi}]    (1)
         = (1/n) ∑_{i=1}^n P(h(Xi) ≠ Yi)    (2)
         = P(h(X) ≠ Y) =: R(h),    (3)


where (1) uses the linearity of expectation, (2) expresses the expectation of an indicator
function as a probability, and (3) uses the fact that all the (Xi , Yi ) are identically
distributed. The function R(h) is the risk: it is the probability that the classifier gets
something wrong.
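
To make these definitions concrete, here is a minimal Python/NumPy sketch for an assumed toy distribution and an arbitrary threshold classifier (neither is taken from the lecture): it evaluates the empirical risk (3.2) on a small sample and compares it with a Monte Carlo approximation of the risk R(h).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution (an assumption for illustration, not from the lecture):
# X is uniform on [0, 1], Y = 1{X > 0.5} with the label flipped w.p. 0.1.
def sample(n):
    X = rng.uniform(0.0, 1.0, size=n)
    Y = ((X > 0.5).astype(int) + (rng.uniform(size=n) < 0.1).astype(int)) % 2
    return X, Y

def h(x):
    # A fixed candidate classifier: threshold the feature at 0.4.
    return (x > 0.4).astype(int)

# Empirical risk (3.2): the average 0-1 loss over the n observed samples.
X, Y = sample(200)
empirical_risk = np.mean(h(X) != Y)

# The risk R(h) = P(h(X) != Y), approximated by a large fresh Monte Carlo sample.
X_big, Y_big = sample(1_000_000)
risk_estimate = np.mean(h(X_big) != Y_big)

print(f"empirical risk (n = 200):      {empirical_risk:.3f}")
print(f"Monte Carlo estimate of R(h):  {risk_estimate:.3f}")
```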

Example 3.1. Assume that the distribution is such that Y is completely determined
by X, that is, Y = f (X). Then

R(h) = P(h(X) ≠ f(X)),

and R(h) = 0 if h = f almost everywhere. If X is a finite or compact set with the
uniform distribution, then R(h) simply measures the proportion of the input space on
which h fails to classify inputs correctly.

While for certain tasks such as image classification there may be a unique label
for each input, in general this need not be the case. In many applications, the input
does not carry enough information to completely determine the output. Consider, for
example, the case where X consists of whole genome sequences and the task is to
predict hypertension (or any other condition) from it. The genome clearly does not
carry enough information to make an accurate prediction, as other factors also play a
role. To account for this lack of information, define the regression function

f (X) = E[Y |X] = 1 · P(Y = 1|X) + 0 · P(Y = 0|X) = P(Y = 1|X).

Note that if we write


Y = f(X) + ε,

then E[ε|X] = 0, because

f(X) = E[Y|X] = E[f(X)|X] + E[ε|X] = f(X) + E[ε|X].
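
As a quick numerical illustration of the decomposition Y = f(X) + ε, here is a small simulation sketch with an arbitrarily chosen f (an assumption, not part of the lecture): averaging Y over samples with X near a point x0 recovers f(x0), and the corresponding average of ε is close to 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed regression function for illustration (not from the lecture):
# f(x) = P(Y = 1 | X = x).
def f(x):
    return 0.2 + 0.6 * x

n = 500_000
X = rng.uniform(0.0, 1.0, size=n)
Y = (rng.uniform(size=n) < f(X)).astype(float)  # Y | X ~ Bernoulli(f(X))
eps = Y - f(X)                                  # the noise term in Y = f(X) + eps

# Conditioning on X lying in a small bin around x0: the sample mean of Y
# approximates E[Y | X = x0] = f(x0), and the sample mean of eps is close to 0.
x0 = 0.7
in_bin = np.abs(X - x0) < 0.01
print(f"f(x0)               = {f(x0):.3f}")
print(f"mean of Y   near x0 = {Y[in_bin].mean():.3f}")
print(f"mean of eps near x0 = {eps[in_bin].mean():.3f}")
```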

While in Example 3.1 we could choose (at least in principle) h(x) = f(x) and
get R(h) = 0, in the presence of noise this is not possible. However, we could define
a classifier h∗ by setting

h∗(x) = 1 if f(x) > 1/2, and h∗(x) = 0 if f(x) ≤ 1/2.

We call this the Bayes classifier. The following result shows that this is the best
possible classifier.
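
In code, the Bayes classifier is simply a plug-in rule on the regression function. The sketch below writes it out, reusing the assumed f from the previous snippet (again an illustration, not part of the lecture).

```python
import numpy as np

# Plug-in form of the Bayes classifier: h*(x) = 1 if f(x) > 1/2, else 0.
def bayes_classifier(f, x):
    return (f(x) > 0.5).astype(int)

# With the illustrative regression function f(x) = 0.2 + 0.6x (an assumption),
# f(x) > 1/2 exactly when x > 0.5, so h* reduces to a threshold at 0.5.
f = lambda x: 0.2 + 0.6 * x
x = np.linspace(0.0, 1.0, 5)
print(list(zip(np.round(x, 2), bayes_classifier(f, x))))
```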

Theorem 3.2. The Bayes classifier h∗ satisfies

R(h∗) = inf_h R(h),

where the infimum is over all measurable h. Moreover, R(h∗) ≤ 1/2.



Proof. Let h be any classifier. To compute the risk R(h), we first condition on X and
then average over X:
R(h) = E[1{h(X) ≠ Y}] = E[E[1{h(X) ≠ Y}|X]].    (3.3)

For the inner expectation, we have

E[1{h(X) ≠ Y}|X] = E[1{h(X) = 1, Y = 0} + 1{h(X) = 0, Y = 1}|X]
                 = E[(1 − Y)1{h(X) = 1}|X] + E[Y 1{h(X) = 0}|X]
                 = 1{h(X) = 1} E[(1 − Y)|X] + 1{h(X) = 0} E[Y|X]    (1)
                 = 1{h(X) = 1}(1 − f(X)) + 1{h(X) = 0} f(X).
To see why (1) holds, recall that the random variable E[Y 1{h(X) = 0}|X] takes
values E[Y 1{h(x) = 0}|X = x]; conditional on X = x, the indicator 1{h(x) = 0} is a
constant, and we can therefore pull it out of the expectation.
Hence, using (3.3),

R(h) = E[1{h(X) = 1}(1 − f(X)) + 1{h(X) = 0} f(X)].    (3.4)

We refer to the first term inside the expectation as (1) and to the second as (2).

For (1), we decompose

1{h(X) = 1}(1 − f(X)) = 1{h(X) = 1, f(X) > 1/2}(1 − f(X))
                          + 1{h(X) = 1, f(X) ≤ 1/2}(1 − f(X))
                        ≥ 1{h(X) = 1, f(X) > 1/2}(1 − f(X))
                          + 1{h(X) = 1, f(X) ≤ 1/2} f(X),    (3.5)

where the inequality follows since 1 − f(X) ≥ f(X) if f(X) ≤ 1/2. By the same
reasoning, for (2) we get

1{h(X) = 0} f(X) ≥ 1{h(X) = 0, f(X) > 1/2}(1 − f(X))
                     + 1{h(X) = 0, f(X) ≤ 1/2} f(X).    (3.6)
Combining the inequalities (3.5) and (3.6) within the bound (3.4) and collecting the
terms that are multiplied with f(X) and those that are multiplied with (1 − f(X)),
we arrive at

R(h) ≥ E[1{f(X) ≤ 1/2} f(X) + 1{f(X) > 1/2}(1 − f(X))]
     = E[1{h∗(X) = 0} f(X) + 1{h∗(X) = 1}(1 − f(X))] = R(h∗),

where the last equality follows from (3.4) applied to h = h∗. The characterization (3.4)
also shows that

R(h∗) = E[1{f(X) > 1/2}(1 − f(X)) + 1{f(X) ≤ 1/2} f(X)]
      = E[min{f(X), 1 − f(X)}] ≤ 1/2,

which completes the proof.
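
The identity R(h∗) = E[min{f(X), 1 − f(X)}] ≤ 1/2 is easy to check numerically. The sketch below (for an assumed distribution, not one from the lecture) estimates R(h∗) by simulation, compares it with E[min{f(X), 1 − f(X)}], and verifies that a few competing threshold classifiers have larger risk, in line with Theorem 3.2.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup for illustration (not from the lecture): X uniform on [0, 1],
# f(x) = P(Y = 1 | X = x) = 0.2 + 0.6x, so the Bayes classifier thresholds at 0.5.
f = lambda x: 0.2 + 0.6 * x

n = 1_000_000
X = rng.uniform(0.0, 1.0, size=n)
Y = (rng.uniform(size=n) < f(X)).astype(int)

def risk(h):
    # Monte Carlo estimate of R(h) = P(h(X) != Y).
    return np.mean(h(X) != Y)

bayes = lambda x: (f(x) > 0.5).astype(int)
print(f"R(h*) by simulation:     {risk(bayes):.4f}")
print(f"E[min(f(X), 1 - f(X))]:  {np.mean(np.minimum(f(X), 1 - f(X))):.4f}")

# No competing threshold rule does better, consistent with Theorem 3.2.
for t in [0.3, 0.5, 0.7]:
    h_t = lambda x, t=t: (x > t).astype(int)
    print(f"R(threshold at {t}):     {risk(h_t):.4f}")
```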

We have seen in Example 3.1 that the Bayes risk is 0 if Y is completely determined
by X. At the other extreme, if f(X) = 1/2 almost surely, so that Y behaves like a fair
coin flip regardless of X, then the Bayes risk is 1/2. In that case, the best possible
classifier can do no better than “guessing” without any prior information, and it is
correct only half of the time!
The error
E(h) = R(h) − R(h∗ )
is called the excess risk or error of h with respect to the best possible classifier.

Approximation and Estimation


We conclude this lecture by relating notions of risk. In what follows, we assume that
a class of classifiers H is given, from which we are allowed to choose. We denote by
ĥn the classifier obtained by minimizing the empirical risk R̂(h) over H, that is
R̂(ĥn) = inf_{h∈H} R̂(h) = inf_{h∈H} (1/n) ∑_{i=1}^n 1{h(Xi) ≠ Yi},

where the (Xi , Yi ) are i.i.d. copies of a random variable (X, Y ) on X × Y. Note
that ĥn is what we will typically obtain by computation from samples (xi , yi ). The
way it is defined, it depends on n, the class of functions H, and the random variables
(Xi , Yi ), and as such is itself a random variable.
We want ĥn to generalize well, that is, we want

R(ĥn) = P(ĥn(X) ≠ Y)

to be small. We know that the smallest possible value this risk can attain is given by
R(h∗ ), where h∗ is the Bayes classifier. We can decompose the difference between
the risk of ĥn and that of the Bayes classifier as follows:

R(ĥn) − R(h∗) = [R(ĥn) − inf_{h∈H} R(h)] + [inf_{h∈H} R(h) − R(h∗)],

where the first term is the estimation error and the second is the approximation error.

The first compares the performance of ĥn against the best possible classifier within
the class H, while the second is a statement about the power of the class H. We can
reduce the estimation error by making the class H smaller, but then the approximation
error increases. Ideally, we would like to find bounds on the estimation error that
converge to 0 as the number of samples n increases.
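
The decomposition can be made concrete with a small experiment. The sketch below is purely illustrative: the distribution, the grid of thresholds, and the sample sizes are all assumptions chosen for the example. It performs empirical risk minimization over a small class of threshold classifiers and estimates the estimation and approximation errors by Monte Carlo; since the grid does not contain the Bayes threshold, the approximation error is strictly positive.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed setup for illustration (not from the lecture): X uniform on [0, 1],
# f(x) = P(Y = 1 | X = x) = 0.2 + 0.6x, so the Bayes classifier thresholds at 0.5.
f = lambda x: 0.2 + 0.6 * x

def sample(n):
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (rng.uniform(size=n) < f(X)).astype(int)
    return X, Y

# Hypothesis class H: threshold classifiers h_t(x) = 1{x > t} on a coarse grid
# that deliberately excludes the Bayes threshold 0.5.
thresholds = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def empirical_risk(t, X, Y):
    return np.mean((X > t).astype(int) != Y)

# Empirical risk minimization over H on a small training sample.
X_train, Y_train = sample(100)
t_hat = min(thresholds, key=lambda t: empirical_risk(t, X_train, Y_train))

# Approximate the population risks with a large independent sample.
X_test, Y_test = sample(1_000_000)
risk_erm = empirical_risk(t_hat, X_test, Y_test)                    # ~ R(h_hat_n)
risk_best_in_H = min(empirical_risk(t, X_test, Y_test) for t in thresholds)
risk_bayes = np.mean((X_test > 0.5).astype(int) != Y_test)          # ~ R(h*)

print(f"estimation error    ~ {risk_erm - risk_best_in_H:.4f}")
print(f"approximation error ~ {risk_best_in_H - risk_bayes:.4f}")
print(f"excess risk         ~ {risk_erm - risk_bayes:.4f}")
```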