3.1 Binary Classification
In this lecture we begin the study of statistical learning theory in the case of binary
classification. We will characterize the best possible classifier in the binary case, and
relate notions of classification error to each other.
Let (X, Y) be a pair of random variables, where X takes values in a set X of inputs and Y in the label set Y = {0, 1}, and let (X1, Y1), . . . , (Xn, Yn) be i.i.d. copies of (X, Y). For a classifier h : X → {0, 1}, the expected proportion of samples that h labels incorrectly is

E[(1/n) ∑_{i=1}^n 1{h(Xi) ≠ Yi}] (1)= (1/n) ∑_{i=1}^n E[1{h(Xi) ≠ Yi}]
                                 (2)= (1/n) ∑_{i=1}^n P(h(Xi) ≠ Yi)
                                 (3)= P(h(X) ≠ Y) =: R(h),
where (1) uses the linearity of expectation, (2) expresses the expectation of an indic-
ator function as probability, and (3) uses the fact that all the (Xi , Yi ) are identically
distributed. The function R(h) is the risk: it is the probability that the classifier gets
something wrong.
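As a quick illustration (a sketch, not part of the notes; the distribution and the classifier below are arbitrary, hypothetical choices), the risk of a classifier can be approximated by the fraction of misclassified samples, which is exactly the average of the indicators appearing above.

import numpy as np

rng = np.random.default_rng(0)

# an arbitrary toy distribution on [0, 1] x {0, 1}: the "clean" label is 1{x > 0.5},
# and it is flipped with probability 0.1 (purely illustrative)
n = 100_000
X = rng.uniform(0.0, 1.0, size=n)
clean = (X > 0.5).astype(int)
flip = rng.random(n) < 0.1
Y = np.where(flip, 1 - clean, clean)

def h(x):
    # an arbitrary classifier to evaluate
    return (x > 0.3).astype(int)

risk_estimate = np.mean(h(X) != Y)   # fraction of misclassified samples
print(risk_estimate)                 # approximates R(h) = P(h(X) != Y)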
Example 3.1. Assume that the distribution is such that Y is completely determined
by X, that is, Y = f(X) for some function f. Then R(h) = P(h(X) ≠ f(X)), and the classifier h = f (if it is available to us) satisfies R(f) = 0.
While for certain tasks such as image classification there may be a unique label
to each input, in general this need not be the case. In many applications, the input
does not carry enough information to completely determine the output. Consider, for
example, the case where X consists of whole genome sequences and the task is to
predict hypertension (or any other condition) from it. The genome clearly does not
carry enough information to make an accurate prediction, as other factors also play a
role. To account for this lack of information, define the regression function

f(x) = P(Y = 1 | X = x) = E[Y | X = x],

the probability that the label is 1 given the input x.
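To make the regression function concrete, here is a small simulation sketch (not part of the original notes; the particular f below is an arbitrary, hypothetical choice): labels are drawn so that Y = 1 with probability f(X), and the same input can plausibly receive either label.

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical regression function: P(Y = 1 | X = x)
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))

n = 10
X = rng.uniform(0.0, 1.0, size=n)     # inputs
Y = rng.binomial(1, f(X))             # labels: Y = 1 with probability f(X)
print(np.column_stack([X, f(X), Y]))  # input, P(Y = 1 | X), sampled label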
While in Example 3.1 we could choose (at least in principle) h(x) = f (x) and
get R(h) = 0, in the presence of noise this is not possible. However, we could define
a classifier h∗ by setting

h∗(x) = 1 if f(x) > 1/2,   and   h∗(x) = 0 if f(x) ≤ 1/2.
We call this the Bayes classifier. The following result shows that it is the best
possible classifier: for every classifier h, we have R(h∗) ≤ R(h).
Proof. Let h be any classifier. To compute the risk R(h), we first condition on X and
then average over X:
R(h) = E[1{h(X) ≠ Y}] = E[E[1{h(X) ≠ Y} | X]].    (3.3)
For the inner expectation, we have
E[1{h(X) ≠ Y} | X] = E[1{h(X) = 1, Y = 0} + 1{h(X) = 0, Y = 1} | X]
                   = E[(1 − Y)1{h(X) = 1} | X] + E[Y 1{h(X) = 0} | X]
                (1)= 1{h(X) = 1} E[(1 − Y) | X] + 1{h(X) = 0} E[Y | X]
                   = 1{h(X) = 1}(1 − f(X)) + 1{h(X) = 0} f(X).
To see why (1) holds, note that, conditionally on X = x, the indicator 1{h(x) = 0} is a fixed number (either 0 or 1), so E[Y 1{h(X) = 0} | X = x] = 1{h(x) = 0} E[Y | X = x]. We can therefore pull the indicator function out of the conditional expectation, and the same argument applies to the other term.
Hence, using (3.3),
R(h) = E[1{h(X) = 1}(1 − f(X)) + 1{h(X) = 0}f(X)].    (3.4)

The same expression, with h replaced by h∗, holds for R(h∗). For every x, the term inside the expectation equals 1 − f(x) if h(x) = 1 and f(x) if h(x) = 0, so among the two possible values of h(x) it is minimized by taking h(x) = 1 when f(x) > 1/2 and h(x) = 0 when f(x) < 1/2 (when f(x) = 1/2 both choices give the same value). This is exactly the choice made by the Bayes classifier h∗, so the term inside the expectation in (3.4) is, for every x, at least as large for h as for h∗. Taking expectations gives R(h) ≥ R(h∗), as claimed.
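To see the conclusion of the proof in action, here is a small simulation sketch (not from the notes; the regression function f and the competing classifiers are arbitrary, hypothetical choices). It forms the plug-in Bayes classifier h∗(x) = 1{f(x) > 1/2} and checks numerically that its estimated risk is no larger than that of some other classifiers.

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical regression function P(Y = 1 | X = x); arbitrary choice
    return 1.0 / (1.0 + np.exp(-6.0 * (x - 0.4)))

n = 200_000
X = rng.uniform(0.0, 1.0, size=n)
Y = rng.binomial(1, f(X))

def risk(h):
    # Monte Carlo estimate of R(h) = P(h(X) != Y)
    return np.mean(h(X) != Y)

def h_star(x):
    # Bayes classifier: predict 1 exactly when f(x) > 1/2
    return (f(x) > 0.5).astype(int)

other_classifiers = [
    lambda x: (x > 0.2).astype(int),
    lambda x: (x > 0.8).astype(int),
    lambda x: np.ones_like(x, dtype=int),   # always predict 1
]

print("estimated Bayes risk:", risk(h_star))
for g in other_classifiers:
    print("estimated risk of another classifier:", risk(g))  # should be >= Bayes risk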
We have seen in Example 3.1 that the Bayes risk R(h∗) is 0 if Y is completely determined
by X. At the other extreme, if the response Y does not depend on X at all and both labels
are equally likely, that is, f(x) = 1/2 for every x, then the Bayes risk is 1/2. In this case
the best possible classifier amounts to "guessing" without any prior information, and for
every input it has only a 50% chance of being correct!
The error
E(h) = R(h) − R(h∗ )
is called the excess risk or error of h with respect to the best possible classifier.
Given a class H of classifiers, let ĥn denote a minimizer of the empirical risk over H,

ĥn ∈ argmin_{h∈H} (1/n) ∑_{i=1}^n 1{h(Xi) ≠ Yi},

where the (Xi, Yi) are i.i.d. copies of a random variable (X, Y) on X × Y. Note
that ĥn is what we will typically obtain by computation from samples (xi , yi ). The
way it is defined, it depends on n, the class of functions H, and the random variables
(Xi , Yi ), and as such is itself a random variable.
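As an illustration of how such an ĥn could be computed (a sketch only; the class H of threshold classifiers and the regression function f are arbitrary, hypothetical choices), the following snippet picks the empirical risk minimizer from a small finite class.

import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # hypothetical regression function P(Y = 1 | X = x); arbitrary choice
    return np.clip(x, 0.05, 0.95)

n = 200
X = rng.uniform(0.0, 1.0, size=n)
Y = rng.binomial(1, f(X))

# H: threshold classifiers h_t(x) = 1{x > t} for finitely many thresholds t
thresholds = np.linspace(0.0, 1.0, 21)

def empirical_risk(t):
    return np.mean((X > t).astype(int) != Y)

risks = [empirical_risk(t) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]   # empirical risk minimizer within H
print("selected threshold:", t_hat, "with empirical risk:", min(risks))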
We want ĥn to generalize well, that is, we want the risk R(ĥn)
to be small. We know that the smallest possible value this risk can attain is given by
R(h∗ ), where h∗ is the Bayes classifier. We can decompose the difference between
the risk of ĥn and that of the Bayes classifier as follows:

R(ĥn) − R(h∗) = (R(ĥn) − inf_{h∈H} R(h)) + (inf_{h∈H} R(h) − R(h∗)),

where the first term is called the estimation error and the second the approximation error.
The first compares the performance of ĥn against the best possible classifier within
the class H, while the second is a statement about the power of the class H. We can
reduce the estimation error by making the class H smaller, but then the approximation
error increases. Ideally, we would like to find bounds on the estimation error that
converge to 0 as the number of samples n increases.
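To illustrate the two error terms numerically (a simulation sketch, not part of the notes; f and the two threshold classes are arbitrary, hypothetical choices), the snippet below runs empirical risk minimization over a coarse and a fine class of threshold classifiers and evaluates the resulting true risks via the representation (3.4): the estimation error tends to shrink as n grows, while the coarser class has a larger approximation error.

import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # hypothetical regression function P(Y = 1 | X = x); arbitrary choice
    return np.clip(x, 0.05, 0.95)

# large sample used to approximate true risks by Monte Carlo over X
X_big = rng.uniform(0.0, 1.0, size=200_000)
bayes_risk = np.mean(np.minimum(f(X_big), 1.0 - f(X_big)))   # approximates R(h*)

def true_risk(t):
    # approximate R(h_t) = E[1{X > t}(1 - f(X)) + 1{X <= t} f(X)], as in (3.4)
    return np.mean(np.where(X_big > t, 1.0 - f(X_big), f(X_big)))

for n in (50, 500, 5000):
    for H in (np.linspace(0.0, 1.0, 4), np.linspace(0.0, 1.0, 41)):  # coarse vs. fine class
        X = rng.uniform(0.0, 1.0, size=n)
        Y = rng.binomial(1, f(X))
        emp = [np.mean((X > t).astype(int) != Y) for t in H]
        t_hat = H[int(np.argmin(emp))]                 # ERM over H
        best_in_H = min(true_risk(t) for t in H)
        estimation_error = true_risk(t_hat) - best_in_H
        approximation_error = best_in_H - bayes_risk
        print(f"n={n:5d}  |H|={len(H):2d}  "
              f"estimation={estimation_error:.3f}  approximation={approximation_error:.3f}")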