Solutions To Exercises

Download as pdf or txt
Download as pdf or txt
You are on page 1of 61

Solutions to Exercises

Chapter 2

2.1 Two-oracle variant of the PAC model

• Assume that C is efficiently PAC-learnable using H in the standard PAC model using
algorithm A. Consider the distribution D = 21 (D− + D+ ). Let h ∈ H be the hypothesis
output by A. Choose δ such that:
P[RD (h) ≤ /2] ≥ 1 − δ.
From
RD (h) = P [h(x) 6= c(x)]
x∼D
1
= ( P [h(x) 6= c(x)] + P [h(x) 6= c(x)])
2 x∼D− x∼D+
1
= (RD− (h) + RD+ (h)),
2
it follows that:
P[RD− (h) ≤ ] ≥ 1 − δ and P[RD+ (h) ≤ ] ≥ 1 − δ.
This implies two-oracle PAC-learning with the same computational complexity.
• Assume now that C is efficiently PAC-learnable in the two-oracle PAC model. Thus, there
exists a learning algorithm A such that for c ∈ C,  > 0, and δ > 0, there exist m− and m+
polynomial in 1/, 1/δ, and size(c), such that if we draw m− negative examples or more
and m+ positive examples or more, with confidence 1 − δ, the hypothesis h output by A
verifies:
P[RD− (h)] ≤  and P[RD+ (h)] ≤ .
Now, let D be a probability distribution over negative and positive examples. If we could
draw m examples according to D such that m ≥ max{m− , m+ }, m polynomial in 1/, 1/δ,
and size(c), then two-oracle PAC-learning would imply standard PAC-learning:

P[RD (h)]
≤ P[RD (h)|c(x) = 0] P[c(x) = 0] + P[RD (h)|c(x) = 1] P[c(x) = 1]
≤ (P[c(x) = 0] + P[c(x) = 1]) = .
If D is not too biased, that is, if the probability of drawing a positive example, or that of
drawing a negative example is more than , it is not hard to show, using Chernoff bounds
or just Chebyshev’s inequality, that drawing a polynomial number of examples in 1/ and
1/δ suffices to guarantee that m ≥ max{m− , m+ } with high confidence.
Otherwise, D is biased toward negative (or positive examples), in which case returning
h = h0 (respectively h = h1 ) guarantees that P[RD (h)] ≤ .
460 Solutions Manual

To show the claim about the not-too-biased case, let Sm denote the number of positive
examples obtained when drawing m examples when the probability of a positive example
is . By Chernoff bounds,
2
P[Sm ≤ (1 − α)m] ≤ e−mα /2 .
1 2m+
We want to ensure that at least m+ examples are found. With α = 2
and m = 
,
P[Sm > m+ ] ≤ e−m+ /4 .
Setting the bound to be less than or equal to δ/2, leads to the following condition on m:
2m+ 8 2
m ≥ min{ , log }
  δ
A similar analysis can be done in the case of negative examples. Thus, when D is not too
biased, with confidence 1 − δ, we will find at least m− negative and m+ positive examples
if we draw m examples, with
2m+ 2m− 8 2
m ≥ min{ , , log }.
   δ
In both solutions, our training data is the set T and our learned concept L(T ) is the tightest
circle (with minimal radius) which is consistent with the data.
2.2 PAC learning of hyper-rectangles
The proof in the case of hyper-rectangles is similar to the one given presented within the
chapter. The algorithm selects the tightest axis-aligned hyper-rectangle containing all the
sample points. For i ∈ [2n], select a region ri such that PD [ri ] = /(2n) for each edge of the
hyper-rectangle R. Assuming that PD [R − R0 ] > , argue that R0 cannot meet all ri s, so it
must miss at least one. The probability that none of the m sample points falls into region ri
is (1 − /2n)m . By the union bound, this shows that
 m 
P[R(R0 ) > ] ≤ 2n(1 − /2n)m ≤ 2n exp − . (E.35)
2n
Setting δ to the right-hand side shows that for
2n 2n
m≥ log , (E.36)
 δ
with probability at least 1 − δ, RD (R0 ) ≤ .
2.3 Concentric circles
Suppose our target concept c is the circle around the origin with radius r. We will choose a
slightly smaller radius s by
s := inf{s0 : P (s0 ≤ ||x|| ≤ r) < }.
Let A denote the annulus between radii s and r; that is, A := {x : s ≤ ||x|| ≤ r}. By definition
of s,
P (A) ≥ . (E.37)
In addition, our generalization error, P (c ∆ L(T )), must be small if T intersects A. We can
state this as
P (c ∆ L(T )) >  =⇒ T ∩ A = ∅. (E.38)
Using (E.37), we know that any point in T chosen according to P will “miss” region A with
probability at most 1 − . Defining error := P (c ∆ L(T )), we can combine this with (E.38) to
see that
P (error > ) ≤ P (T ∩ A = ∅) ≤ (1 − )m ≤ e−m .
Setting δ to be greater than or equal to the right-hand side leads to m ≥ 1 log( 1δ ).
2.4 Non-concentric circles
As in the previous example, it is natural to assume the learning algorithm operates by return-
ing the smallest circle which is consistent with the data. Gertrude is relying on the logical
implication
error >  =⇒ T ∩ ri = ∅ for some i, (E.39)
Solutions Manual 461

r1 r1

r3 r3

r2 r2

Figure E.5
Counter-example shows error of tightest circle in gray.

which is not necessarily true here. Figure E.5 illustrates a counterexample. In the figure, we
have one training point in each region ri . The points in r1 and r2 are very close together,
and the point in r3 is very close to region r1 . On this training data (some other points may
be included outside the three regions ri ), our learned circle is the “tightest” circle including
these points, and hence one diameter approximately traverses the corners of r1 . In the figure,
the gray regions are the error of this learned hypotheses versus the target circle, which has a
thick border. Clearly, the error may be greater than  even while T ∩ ri 6= ∅ for any i; this
contradicts (E.39) and invalidates poor Gertrude’s proof.
2.5 Triangles
As in the case of axis-aligned rectangles, consider three regions r1 , r2 , r3 , along the sides of
the target concept as indicated in figure E.6. Note that the triangle formed by the points
A”, B”, C” is similar to ABC (same angles) since A”B” must be parallel to AB, and similarly
for the other sides.
Assume that P[ABC] > , otherwise the statement would be trivial. Consider a triangle
A0 B 0 C 0 similar to ABC and consistent with the training sample and such that it meets all
three regions r1 , r2 , r3 .
Since it meets r1 , the line A0 B 0 must be below A”B”. Since it meets r2 and r3 , A0 must be
in r2 and B 0 in r3 (see figure E.6). Now, since the angle A\ 0 B 0 C 0 is equal to A”B”C”,
\ C0
must be necessarily above C”. This implies that triangle A0 B 0 C 0 contains A”B”C”, and thus
error(A0 B 0 C 0 ) ≤ .
error(A0 B 0 C 0 ) >  =⇒ ∃i ∈ {1, 2, 3} : A0 B 0 C 0 ∩ ri = ∅.
Thus, by the union bound,
3
X
P[error(A0 B 0 C 0 ) > ] ≤ P[A0 B 0 C 0 ∩ ri = ∅] ≤ 3(1 − /3)m ≤ 3e−3m .
i=1
3
Setting δ to match the right-hand side gives the sample complexity m ≥ 
log 3δ .
2.8 Learning intervals
462 Solutions Manual

C”

A” B”

A’ B’
A B

Figure E.6
Rectangle triangles.

Given a sample S, one algorithm consists of returning the tightest closed interval IS containing
positive points. Let I = [a, b] be the target concept. If P[I] < , then clearly R(IS ) < .
Assume that P[I] ≥ . Consider two intervals IL and IR defined as follows:
IL = [a, x] with x = inf{x : P[a, x] ≥ /2}
IR = [x0 , b] with x0 = sup{x0 : P[x0 , b] ≥ /2}.
By the definition of x, the probability of [a, x[ is less than or equal to /2, similarly the
probability of ]x0 , b] is less than or equal to /2. Thus, if IS overlaps both with IL and IR ,
then its error region has probability at most . Thus, R(IS ) >  implies that IS does not
overlap with either IL or IR , that is either none of the training points falls in IL or none falls
in IR . Thus, by the union bound,
P[R(IS ) > ] ≤ P[S ∩ IL = ∅] + P[S ∩ IR = ∅]
≤ 2(1 − /2)m ≤ 2e−m/2 .
2 2
Setting δ to match the right-hand side gives the sample complexity m = 
log δ
and proves
the PAC-learning of closed intervals. 
2.9 Learning union of intervals
Given a sample S, our algorithm consists of the following steps:
(a) Sort S in ascending order.
(b) Loop through sorted S, marking where intervals of consecutive positively labeled points
begin and end.
(c) Return the union of intervals found on the previous step. This union is represented by a
list of tuples that indicate start and end points of the intervals.
This algorithms works both for p = 2 and for a general p. We will now consider the problem
for C2 . To show that this is a PAC-learning algorithm we need to distinguish between two
cases.
The first case is when our target concept is a disjoint union of two closed intervals: I =
[a, b] ∪ [c, d]. Note, there are two sources of error: false negatives in [a, b] and [c, d] and also
false positives in (b, c). False positives may occur if no sample is drawn from (b, c). By
linearity of expectation and since these two error regions are disjoint, we have that R(hS ) =
RFP (hS ) + RFN,1 (hS ) + RFN,2 (hS ), where
RFP (hS ) = P [x ∈ hS , x 6∈ I],
x∼D
RFN,1 (hS ) = P [x 6∈ hS , x ∈ [a, b]],
x∼D
RFN,2 (hS ) = P [x 6∈ hS , x ∈ [c, d]].
x∼D
Solutions Manual 463

Since we need to have that at least one of RFP (hS ), RFN,1 (hS ), RFN,2 (hS ) exceeds /3 in
order for R(hS ) > , by union bound
P(R(hS ) > ) ≤ P(RFP (hS ) > /3 or RFN(hS ),1 > /3 or RFN(hS ),2 > /3)
2
X
≤ P(RFP (hS ) > /3) + P(RFN(hS ),i > /3) (E.40)
i=1
We first bound P(RFP (hS ) > /3). Note that if RFP (hS ) > /3, then P((b, c)) > /3 and
hence
P(RFP (hS ) > /3) ≤ (1 − /3)m ≤ e−m/3 .
Now we can bound P(RFN(hS ),i > /3) by 2e−m/6 using the same argument as in the previous
question. Therefore,
P(R(hS ) > ) ≤ e−m/3 + 4e−m/6 ≤ 5e−m/6 .
Setting, the right-hand side to δ and solving for m yields that m ≥ 6 log 5δ .
The second case that we need to consider is when I = [a, d], that is, [a, b] ∩ [c, d] 6= ∅. In
that case, our algorithm reduces to the one from exercise 2.8 and it was already shown that
only m ≥ 2 log 2δ samples is required to learn this concept. Therefore, we conclude that our
algorithm is indeed a PAC-learning algorithm.
Extension of this result to the case of Cp is straightforward. The only difference is that in
(E.40), one has two summations for p − 1 regions of false positives and 2p regions of false
2(2p−1)
negatives. In that case sample complexity is m ≥ 
log 3p−1
δ
.
Sorting step of our algorithm takes O(m log m time and steps (b) and (c) are linear in m,
which leads to overall time complexity O(m log m).

2.10 Consistent hypotheses


Since PAC-learning with L is possible for any distribution, let D be the uniform distribution
over Z. Note that, in that case, the cost of an error of a hypothesis h on any point z ∈ Z is
PD [z] = 1/m. Thus, if RD (h) < 1/m, we must have RD (h) = 0 and h is consistent. Thus,
choose  = 1/(m + 1). Then, for any δ > 0, with probability at least 1 − δ over samples S with
|S| ≥ P ((m + 1), 1/δ) points (where P is some fixed polynomial) the hypothesis hS returned
by L is consistent with Z since RD (hS ) ≤ 1/(m + 1).

2.11 Senate laws

(a) The true error in the consistent case is bounded as follows:


1 1
RD (h) ≤ (log |H| + log ). (E.41)
m δ
For δ = .05, m = 200 and |H| = 2800, RD (h) ≤ 5.5%.

(b) The true error in the inconsistent case is bounded as:


r
1 1
RD (h) ≤ R
bD (h) + (log 2|H| + log ). (E.42)
2m δ
bD (h) = m0 /m = .1, m = 200 and |H| = 2800, RD (h) ≤ 27.05%.
For δ = .05, R

2.12 Bayesian bound. For any fixed h ∈ H, by Hoeffding’s inequality, for any δ > 0,
 s 
1
log p(h)δ
P R(h) − R bS (h) ≥  ≤ p(h)δ. (E.43)
2m
464 Solutions Manual

By the union bound,


 s   s 
1 1
log p(h)δ X log p(h)δ
P ∃h : R(h) − R
bS (h) ≥ ≤ P R(h) − R
bS (h) ≥ 
2m h∈H
2m
X
≤ p(h)δ = δ.
h∈H
In the case of a finite hypothesis set and a uniform prior p(h) = 1/|H|, the bound coincides
with the one presented in the chapter.
2.13 Learning with an unknown parameter.
(a) By definition of acceptance,
bS (h) ≤ 3/4]
P[h is accepted] = P[R
≤ P[RS (h) ≤ 3/4 R(h)]
b (R(h) ≥ )
 n 
≤ exp − R(h)(1/4)2 (Chernoff bound)
2
 R(h) 2i+1 
= exp − log (def. of n)
 δ
 2i+1  δ
= exp − log = i+1 . (R(h) ≥ )
δ 2

(b) By definition, P[h is rejected] = P[R bS (h) ≥ 3 ]. Since R(h) ≤ /2, P[h is rejected] ≤
4
3
P[RS (h) ≥ 4  | R(h) = /2]. By the Chernoff bounds, we can thus write
b
 n 
P[h is rejected] ≤ exp − (1/2)2 (Chernoff bound)
32
 4 2 i+1 
= exp − log (def. of n)
3 δ
 2 i+1  δ
≤ exp − log = i+1 .
δ 2
(c) The estimate se is then an upper bound on s and thus, by definition of algorithm B,
P[R(hi ) ≤ /2] ≥ 1/2. If a hypothesis h has error at least /2 it is rejected with probability
at most δ/2i+1 ≤ δ/4 ≤ 1/4, therefore, it is accepted with probability at 3/4. Thus, for
se ≥ s, P[hi is accepted] ≥ 1/2 × 1/4 = 3/8.
(d) By the previous question, the probability that algorithm B fails to halt while se ≥ s is at
most 1 − 3/8 = 5/8. Thus, the probability that it does not halt after j iterations is at
2 8

most (5/8)j ≤ (5/8)log δ / log 5 = exp log 2δ / log 85 log 85 = δ/2.
(e) By definition,
2
se ≥ s ⇐⇒ b2(i−1)/ log δ c ≥ s
2
⇐⇒ 2(i−1)/ log ≥ s δ

i−1
⇐⇒ ≥ log2 s
log 2δ
2
⇐⇒ i ≥ 1 + (log2 s) log
δ
2
⇐⇒ i ≥ d1 + (log2 s) log e.
δ
(f) In view of the two previous questions, with probability at least 1 − δ/2, algorithm B halts
after at most j 0 iterations. The probability that the hypothesis it returns be accepted
0
while its error is greater than  is at most δ/2j +1 ≤ δ/2. Thus, with probability 1 − δ,
the algorithm halts and the hypothesis it returns has error at most . 
Solutions Manual 465

Chapter 3

3.1 Growth function of intervals in R


Let {x1 , . . . , xm } be a set of m distinct ordered real numbers and let I be an interval. If
I ∩ {x1 , . . . , xm } contains k numbers, their indices are of the form j, j + 1, . . . , j + k − 1. How
many different set of indices are possible? The answer is m − k + 1 since j can take values in
{1, . . . , m − k + 1}. Thus, the total number of dichotomies for a set of size m is:
m
X m(m + 1) m m m
1+ (m − k + 1) = 1 + = + + ,
k=1
2 0 1 2
which matches the general bound on shattering coefficients.
3.2 Growth function and Rademacher complexity of thresholds in R
Given m distinct points on the line, there are at most m + 1 choices of the threshold θ leading
to distinct classifications: between two points or before/after all points. Since there are two
choices (classifying those to right as positive or those to the left), there are 2(m + 1) choices.
q ≤ 2m and, by the bound on the Rademacher complexity shown in the chapter,
Thus, Πm (H)
2 log(2m)
Rm (H) ≤ m
.
3.3 Growth function of linear combinations
(a) {X+ ∪ {xm+1 }, X− } and {X+ , X− ∪ {xm+1 }} are linearly separable by a hyperplane going
through the origin if and only if there exists w1 ∈ Rd such that
∀x ∈ X+ , w1 · x > 0 ∀x ∈ X− , w1 · x < 0, and w1 · xm+1 > 0 (E.44)
and there exists w2 ∈ Rd such that
∀x ∈ X+ , w2 · x > 0 ∀x ∈ X− , w2 · x < 0, and w2 · xm+1 < 0. (E.45)
For any w1 , w2 , the function f : (t 7→ tw1 + (1 − t)w2 ) · xm+1 is continuous over [0, 1].
(E.44) and (E.45) hold iff f (0) < 0 and f (1) > 0, that is iff there exists w = t0 w1 + (1 −
t0 )w2 linearly separating {X+ , X− } and such at w · xm+1 = 0.
m−1
(b) Repeating the formula, we obtain C(m, d) = m−1
P
k=0 k
C(1, d − k). Since, C(1, n) = 2
if n ≥ 1 and C(1, n) = 0 otherwise, the result follows.
(c) This is a direct application of the result of the previous question.
3.4 Lower bound on growth function
Let X be an arbitrary set. Consider the set of all subsets of X of size less than or equal to d.
The indicator functions of these sets forms a hypothesis class whose shattering coefficient is
equal to the general upper bound.
3.5 Finer Rademacher upper bound
Following the proof given in the chapter and using Jensen’s inequality (at the last inequality),
we can write:
"  σ1  " g(z1 ) ##
b m (G) = E sup 1 .. ..
R
S,σ g∈G m σm
. · .
g(zm )
"√ p #
m 2 log |{(g(z1 ), . . . , g(zm )) : g ∈ G}|
≤E (Massart’s Lemma)
S m
"√ p #
m 2 log Π(G, S)
=E
S m
√ p r
m 2 log ES [Π(G, S)] 2 log ES [Π(G, S)]
≤ = .
m m p
Note that in the application of Jensen’s inequality, we are using the fact that log(x) is
concave, which is true because it is the composition of functions that are concave functions
466 Solutions Manual

and the outer function is non-decreasing. It is not true in general that the composition of any
two concave functions is concave.
3.6 Singleton hypothesis class

(a)
m m
" # " #
1 X 1 X
Rm (H) = E sup σi h(zi ) = E σi h0 (zi )
S∼Dm ,σ h∈H m S∼Dm ,σ m
i=1 i=1
m
" #
1 X
= Em E[σi ]h0 (zi ) = 0 ,
S∼D m i=1 σ
where the final equality follows from the fact E[σ] = 0 for a Rademacher random variable
σ.
(b) Consider the set A = x0 , where x0 is any vector in Rm with kx0 k = r. Then,
 m  m
1 X 1 X
E sup σi xi = E[σi ]x0i = 0
σ m x∈A m i=1 σ
i=1
and, p
r 2 log |A|
= 0.
m
3.7 Two function hypothesis class

(a) VCdim(H) = 1 since H can shatter one point and clearly at most one. Observe that
m
X m
X  Xm
sup σi h(xi ) = sup σi h(x1 ) = σi . (E.46)

h∈H i=1 h∈H i=1 i=1
Thus, by Jensen’s inequality,
m
b S (H) = 1 E
h X i
R σi

m σ i=1
m
1h h X ii1/2
≤ E ( σi )2
m σ
i=1
m
1 h h X 2 ii1/2
= E σ (E[σi σj ] = 0 for i 6= j)
m σ i=1 i
1
= √ .
m √
By the Khintchine inequality,
pthe upper bound is tight modulo the constant 1/ 2. The
upper bound coincides with d/m.
(b) VCdim(H) = 1 since H can shatter x1 and clearly at most one point. By definition,
m
b S (H) = 1 E sup
h X i
R σi h(xi )
m σ h∈H i=1
m
1 h X i
= E sup σ1 h(x1 ) − σi
m σ h∈H i=2
1 h i
= E sup σ1 h(x1 ) (E[σi ] = 0)
m σ1 h∈H
1 h i 1
= E 1 = .
m σ1 m p p
Here R
b S (H) is a clearly more favorable quantity than d/m = 1/m.
Solutions Manual 467

3.8 Rademacher identities

(a) If α ≥ 0, then
m
X m
X m
X
sup σi h(xi ) = sup α σi h(xi ) = α sup σi h(xi ),
h∈αH i=1 h∈H i=1 h∈H i=1
otherwise if α < 0, then
Xm m
X m
X
sup σi h(xi ) = sup α σi h(xi ) = (−α) sup (−σi )h(xi ).
h∈αH i=1 h∈αH i=1 h∈H i=1
Since σi and −σi have the same distribution, this shows that Rm (αH) = |α|Rm (H).
(b) The second equality follows from:
Rm (H + H0 )
m
" #
1 X
0
= E sup σi (h(xi ) + h (xi ))
m σ,S h∈H,h0 ∈H0 i=1
m m
" #
1 X X
= E sup σi h(xi ) + sup σi h0 (xi )
m σ,S h∈H,h0 ∈H0 i=1 h∈H,h0 ∈H0 i=1
m m
" # " #
1 X 1 X
= E sup σi h(xi ) + E sup σi h0 (xi ) .
m σ,S h∈H i=1 m σ,S h0 ∈H0 i=1
1
(c) For the inequality, using the identity max(a, b) = 2
[a + b + |a − b|] valid for all a, b ∈ R,
and the sub-additivity of sup we can write:
Rm ({max(h, h0 ) : h ∈ H, h0 ∈ H0 })
m
" #
1 X
= E sup σi [h(xi ) + h0 (xi ) + |h(xi ) − h0 (xi )|]
2m σ,S h∈H,h0 ∈H0 i=1
m
" #
1 0 1 X
0
≤ [Rm (H) + Rm (H )] + E sup σi |h(xi ) − h (xi )| .
2 2m σ,S h∈H,h0 ∈H0 i=1
Since the absolute value function is 1-Lipschitz, by the contraction lemma, the third term
can be bounded as follows
m
" #
1 X
E sup σi |h(xi ) − h0 (xi )|
2m σ,S h∈H,h0 ∈H0 i=1
m
" #
1 X
≤ E sup σi [h(xi ) − h0 (xi )]
2m σ,S h∈H,h0 ∈H0 i=1
m m
" #
1 X X
0
≤ E sup σi h(xi ) + sup −σi h (xi )]
2m σ,S h∈H,h0 ∈H0 i=1 h∈H,h0 ∈H0 i=1
m m
" # " #
1 X 1 X
0
= E sup σi h(xi ) + E sup −σi h (xi )]
2m σ,S h∈H,h0 ∈H0 i=1 2m σ,S h∈H,h0 ∈H0 i=1
1
= [Rm (H) + Rm (H0 )],
2
using the fact that σi and −σi follow the same distribution.

3.9 Rademacher complexity of intersection of concepts


Observe that for any h1 ∈ H1 and h2 ∈ H2 , we can write h1 h2 = (h1 + h2 − 1)1h1 +h2 −1≥0 =
(h1 + h2 − 1)+ . Since x 7→ (x − 1)+ is 1-Lipschitz over [0, 2], by Talagrand’s contraction
b S (H) ≤ R
lemma, the following holds: R b S (H1 + H2 ) ≤ R
b S (H1 ) + R
b S (H2 ).
468 Solutions Manual

Since concepts can be identified with indicator functions, the intersection of two concepts can
be identified with the product two such indicator functions. In view of that, by the result just
proven and after taking expectations, the following holds:
Rm (C) ≤ Rm (C1 ) + Rm (C2 ).

3.10 Rademacher complexity of prediction vector

(a) The following proves the result:


m
" #
1 X
RS (H) =
b E sup σi f (xi )
m σ f ∈{h,−h} i=1
1
= E[max{σ · u, −σ · u}]

1
= E[|σ · u|]

1q
≤ E[|σ · u|2 ] (by Jensen’s inequality)
m vσ
u  
m
1u X
u
= tE σi σj ui uj 
m σ i,j=1
v
uX m
1u
= t E[σi σj ]ui uj
m i,j=1 σ
v
um
1u X
= t u2 (E[σi σj ] = E[σi ] E[σj ] = 0 for i 6= j)
m i=1 i σ σ σ

kuk
= .
m

n 1
Thus, RS (H) ≤ m . For n = 1, R
b b S (H) ≤ b S (H) ≤
while for n = m, R √1 .
m m

(b) The empirical Rademacher complexity of F + h can be expressed as follows:


m
" #
b S (F + h) = 1 E sup
X
R σi f (xi ) + σi h(xi )
m σ f ∈F i=1
m
" # "m #
1 X 1 X
= E sup σi f (xi ) + E σi h(xi )
m σ f ∈F i=1 m σ i=1
m
1 X
=R
b S (F) + E[σi ]h(xi ) = R
b S (F).
m i=1 σ
The empirical Rademacher complexity of F ± h can be upper bounded as follows using
the first question:
m
" #
1 X
RS (F + h) =
b E sup σi f (xi ) + sup sσi h(xi )
m σ f ∈F i=1 s∈{−1,+1}
m m
" # " #
1 X 1 X
= E sup σi f (xi ) + E sup s σi h(xi )
m σ f ∈F i=1 m σ s∈{−1,+1} i=1

b S (F) + kuk .
≤R
m

3.11 Rademacher complexity of regularized neural networks


Solutions Manual 469

(a)
 
m n2
1  X X
RS (H) =
b E sup σi wj σ(uj · xi )
m σ kwk1 ≤Λ0 ,kuj k2 ≤Λ i=1 j=1
 
n2 m
1  X X
= E sup wj σi σ(uj · xi )
m σ kwk1 ≤Λ0 ,kuj k2 ≤Λ j=1 i=1
" m #
Λ0 X
= sup max σi σ(uj · xi ) (all the weight put on largest term)

E
m σ kuj k2 ≤Λ j∈[n2 ] i=1
" m #
Λ 0 X
= sup σi σ(uj · xi )

E
m σ kuj k2 ≤Λ,j∈[n2 ] i=1

" m #
Λ0 X
= sup σi σ(u · xi ) .

E
m σ kuk2 ≤Λ i=1

(b) By Talagrand’s lemma, since σ is L-Lipschitz, the following inequality holds:


" m #
0
b S (H) ≤ Λ L E sup
X
R σi u · xi

m σ h∈H i=1
m
" #
0
ΛL X
= E sup sup s σi u · xi (def. of abs. value)
m σ h∈H s∈{−1,+1} i=1
= Λ0 L R
b S (H0 ).

(c)
m
" #
b S (H0 ) = 1 E
X
R sup σi su · xi
mσ kuk2 ≤Λ,s∈{−1,+1} i=1
" m #
1 X
= sup σi u · xi (def. of abs. val.)

E
mσ kuk2 ≤Λ i=1

" m
#
1 X
= sup u · σi xi

E
m σ kuk ≤Λ
2 i=1

" m #
Λ X
= σi xi (Cauchy-Schwarz eq. case).

E
m σ
i=1

2
Λ m
P
σ x
The last equality holds by setting u = P i=1 i i .
k mi1=1 σi xi k
470 Solutions Manual

(d)
" m #
b S (H0 ) = Λ E
X
R σ x

i i
m σ i=1
2
v 
u m 2 
Λ u 
u
X
≤ σi xi  (Jensen’s ineq.)

tE
m σ i=1

2
v
uX m
1u
= t E [σi σj ] (xi · xj )
m i,j=1 σ
v
uX m
Λu
= t 1i=j (xi · xj ) (independence of σi s)
m i,j=1
v
um
Λu X
= t kxi k22 .
m i=1
(e) In view of the previous questions,
v
um
0 0 Λ 0 ΛL uX Λ0 ΛL √ 2 Λ0 ΛLr
b S (H) ≤ Λ L R
R b S (H ) ≤ t kxi k22 ≤ mr = √ .
m i=1
m m

3.12 Rademacher complexity


Consider the simple case where H is reduced to the constant hypothesis h1 : x 7→ 1 and
h−1 : x 7→ −1. Then, by definition of the empirical Rademacher complexity,
m m m
b S (H) = 1 E[max{ 1
X X X
R σi , −σi }] = E[| σi |]
mσ i=1 i=1
m σ i=1

Pm 2
Pm
Let X = i=1 σi . Note that E[X ] = E[ i,j=1 σi σj ]. For any i 6= j, since σi and σj are
independent, E[σi σj ] = E[σi ] E[σj ] = 0. Thus,
m
X m
X
E[X 2 ] = E[σi σi ] = E[σi2 ] = m.
i=1 i=1
Now, by Hölder’s inequality,
m = E[X 2 ] = E[|X|2/3 |X|4/3 ] ≤ E[|X|]2/3 E[X 4 ]1/3 .
Thus,
m3/2 m3/2 m3/2
E[|X|] ≥ = q P = p
E[X 4 ]1/2 m 4
P 2 2
E[ i=1 σi + 3 i6=j σi σj ] m + 3m(m − 1)
3/2 3/2 √
m m m
= p ≥ p = √ .
m(3m − 2) m(3m) 3
This shows that √
Rb S (H) ≥ √m .
3
Since Rm (H) ≥ R b S (H) + O( √1 ), it implies Rm (H) ≥ O( √1 ), which contradicts Rm (H) ≤
m m
1
O( m ).
Note that for the lower bound, we could have used instead a more general result (Khinchine’s
inequality) which states that for any a ∈ Rm ,
kak2
E[| σ · a|] ≥ √ .
2
Solutions Manual 471

3.13 VC-dimension of union of k intervals


The VC-dimension of this class is 2k. It is not hard to see that any 2k distinct points on
the real line can be shattered using k intervals; it suffices to shatter each of the k pairs of
consecutive points with an interval. Assume now that 2k + 1 distinct points x1 < · · · < x2k+1
are given. For any i ∈ [2k + 1], label xi with (−1)i+1 , that is alternatively label points with
1 or −1. This leads to k + 1 points labeled positively and requires 2k + 1 intervals to shatter
the set, since no interval can contain two consecutive points. Thus, no set of 2k + 1 points
can be shattered by k intervals, and the VC-dimension of the union of k intervals is 2k.
3.14 VC-dimension of finite hypothesis sets
With a finite set H, at most 2|H| dichotomies can be defined.
3.15 VC-dimension of subsets
The set of three points {0, 3/4, 3/2} can be fully shattered as follows:
+++ α = −2
++− α=0
+−+ α = −1
+−− α = 3/2 − 2 + 
−++ α = 3/4 − 2
−+− α=
−−+ α = 3/2
−−− α = 3/2 + ,
where e is a small number, e.g.,  = .1. No set of four points x1 < x2 < x3 < x4 can be
labeled by + − +−. This is because the three leftmost labels + − + imply that α + 2 ≤ x3
and thus also α + 2 < x4 . Thus, the VC-dimension of the set of subsets Iα is 3. Note that
this does not coincide with the number of parameters used to describe the class.
3.16 VC-dimension of axis-aligned squares and triangles

(a) It is not hard to see that the set of 3 points with coordinates (1, 0), (0, 1), and (−1, 0)
can shattered by axis-aligned squares: e.g., to label positively two of these points, use a
square defined by the axes and with those to points as corners. Thus, the VC-dimension
is at least 3. No set of 4 points can be fully shattered. To see this, let PT be the highest
point, PB the lowest, PL the leftmost, and PR the rightmost, assuming for now that
these can be defined in a unique way (no tie) – the cases where there are ties can be
treated in a simpler fashion. Assume without loss of generality that the difference dBT
of y-coordinates between PT and PB is greater than the difference dLR of x-coordinates
between PL and PR . Then, PT and PB cannot be labeled positively while PL and PR are
labeled negatively. Thus, the VC-dimension of axis-aligned squares in the plane is 3.
(b) Check that the set of 4 points with coordinates (1, 0), (0, 1), (−1, 0), and (0, −1) can be
shattered by such triangles. This is essentially the same as the case with axis aligned
rectangles. To see that no five points can be shattered, the same example or argument
as for axis-aligned rectangles can be used: labeling all points positively except from the
one within the interior of the convex hull is not possible (for the degnerate cases where no
point is in the interior of the convex hull is simpler, this is even easier to see). Thus, the
VC-dimension of this family of triangles is 4.

3.17 VC-dimension of closed balls in Rn .


Let B(a, r) be the ball of radius r centered at a ∈ Rn . Then x ∈ B(a, r) iff
Xn n
X n
X
kxi k2 − 2 ai xi + a2i − r ≤ 0, (E.47)
i=1 i=1 i=1
472 Solutions Manual

which is equivalent to
 Pn hW, Xi+ B ≤ 0, (E.48)
kxi k2
 
1 i=1
 −2a   x1 
1
, and B = n 2
P
with W =  , X =  i=1 ai − r. The VC-dimension of
   
 ...   ... 
−2an xn
closed balls in Rn is thus at most equal to the VC-dimension of hyperplanes in Rn+1 , that is,
n + 2.
3.18 VC-dimension of ellipsoids
The general equation of ellipsoids in Rn is
(X − X0 )> A(X − X0 ) ≤ 1, (E.49)
where X, X0 ∈ Rn and A = (aij ) ∈ Sn
+ is a positive semidefinite symmetric matrix. This can
be rewritten as
>
XP AX − 2X> AX0 + X> 0 AX0 ≤ 1, (E.50)
or n n >
P 
i,j=1 aij (xi xj + xj xi ) − i=1 2(AX0 )i xi + X0 AX0 − 1 ≤ 0 using the fact that A is
symmetric. Let ai = −2(AX0 )i for i ∈ [n] and let b = X> 0 AX0 − 1. Then this can be viewed
in terms of the following equations of hyperplanes in Rn(n+1)/2+n
W> Z + b ≤ 0, (E.51)
with    
a1 x1 

 ...   ... 

   


 an  xn
   
 

   

 a11   x1 x1 + x1 x1 
W=   Z=    n(n + 1)/2 + n (E.52)
 ...  ...
 
 


 aij   xi xj + xj xi 
   

   

 ...   ...





ann xn xn + xn xn .
Since the VC-dimension of hyperplanes in Rn(n+1)/2+n is n(n+1)/2+n+1 = (n+1)(n/2+1),
the VC-dimension of ellipsoids in Rn is bounded by (n + 1)(n + 2)/2.
3.19 VC-dimension of a vector space of real functions
Show that no set of size m = r + 1 can be shattered by H. Let x1 , . . . , xm be m arbitrary
points. Define the linear mapping l : F → Rm defined by:
l(f ) = (f (x1 ), . . . , f (xm ))
Since the dimension of dim(F ) = m − 1, the rank of l is at most m − 1, and there exists
α ∈ Rm orthogonal to l(F ):
Xm
∀f ∈ F, αi f (xi ) = 0
i=1
We can assume that at least one αi is negative. Then,
Xm m
X
∀f ∈ F, αi f (xi ) = − αi f (xi )
i : αi ≥0 i : αi <0
Now, assume that there exists a set {x : f (x) ≥ 0} selecting exactly the xi s on the left-hand
side. Then all the terms on the left-hand side are non-negative, while those on the right-hand
side are negative, which cannot be. Thus, {x1 , . . . , xm } cannot be shattered.
3.20 VC-dimension of sine functions
(a) Fix x ∈ R and suppose there exists an ω that realizes the labeling −−+−. Thus sin(ωx) <
0, sin(2ωx) < 0, sin(3ωx) ≥ 0 and sin(4ωx) < 0. We will show that this implies sin2 (ωx) <
1
2
and sin2 (ωx) ≥ 34 , a contradiction.
Solutions Manual 473

Using the identity sin(2θ) = 2 sin(θ) cos(θ) and the fact that sin(4ωx) < 0 we have
2 sin(2ωx) cos(2ωx) = sin(4ωx) < 0.
Since sin(2ωx) < 0 we can divide both sides of this inequality by 2 sin(2ωx) to conclude
cos(2ωx) > 0. Applying the identity cos(2θ) = 1 − 2 sin2 (θ) yields 1 − 2 sin2 (ωx) > 0, or
sin2 (ωx) < 21 .
Using the identity sin(3θ) = 3 sin(θ) − 4 sin3 (θ) and the fact that sin(3ωx) ≥ 0 we have
3 sin(ωx) − 4 sin3 (ωx) = sin(3ωx) ≥ 0
Since sin(ωx) < 0 we can divide both sides of this inequality by sin(ωx) to conclude
3 − 4 sin2 (ωx) ≤ 0, or sin2 (ωx) ≥ 34 .
(b) For any m > 0, consider the set of points (x1 , . . . , xm
P) withi arbitrary labels (y1 , . . . , ym ) ∈
1−yi
{−1, +1}m . Now, choose the parameter ω = π(1 + m 0 0
i=1 2 yi ) where yi = 2
. We show
that this single parameter will always correctly classify the entire sample for any m > 0
and choice of labels. For any j ∈ [m] we have,
m
X j−1
X m−j
X
ωxj = ω2−j = π(2−j + 2i−j yi0 ) = π(2−j + ( 2i−j yi0 ) + yj0 + ( 2i yi0 )) .
i=1 i=1 i=1
The last term can be dropped from the sum,
P sincei−j
it contributes only multiples of 2π. Since
yi0 ∈ {0, 1} the remaining term π(2−j + ( j−1
Pj−1
i=1 2 yi0 ) + yj0 ) = π( i=1 2−i yi0 + 2−j + yj0 )
can be upper and lower bounded as follows:
j−1
X j
X
π( 2−i yi0 + 2−j + yj0 ) ≤ π( 2−i + yj0 ) < π(1 + yj0 ) ,
i=1 i=1
j−1
X
π( 2−i yi0 + 2−j + yj0 ) > πyj0 .
i=1
Thus, if yj = 1 we have yj0 = 0 and 0 < ωxj < π, which implies sgn(ωxj ) = 1. Similarly,
for yj = −1 we have sgn(ωxj ) = −1.

3.21 VC-dimension of union of halfspaces


3.22 VC-dimension of intersection of halfspaces
Let m ≥ 0. Note the general fact that for any concept class C = {c1 ∩ c2 : c1 ∈ C1 , c2 ∈ C2 },
ΠC (m) ≤ ΠC1 (m) ΠC2 (m). (E.53)
Indeed, fix a set X of m points. Let Y1 , . . . , Yk be the traces of C1 on X. By definition of
ΠC1 (X), k ≤ ΠC1 (X) ≤ ΠC1 (m). By definition of ΠC2 (Yi ), the traces of C2 on a subset Yi are
at most ΠC2 (Yi ) ≤ ΠC2 (m). Thus, the traces of C on X are at most
kΠC2 (Yi ) ≤ ΠC1 (m) ΠC2 (m). (E.54)
For the particular case of Ck , using Sauer’s lemma, this implies that
em k(n+1)
 
ΠCk (m) ≤ (ΠC1 (m))k ≤ . (E.55)
n+1
k(n+1)
If (em/(n + 1)) < 2m , then the VC-dimension of Ck is less than m. If the VC-dimension
of Ck is m, then ΠCk (m) = 2m ≤ (em/(n + 1))k(n+1) . These inequalities give an upper
bound and a lower bound on VCdim(Ck ). As an example, using the inequality: ∀x ∈ N −
{3}, log2 (x) ≤ x/2, one can verify that:
VCdim(Ck ) ≤ 2(n + 1)k log(3k). (E.56)
3.23 VC-dimension of intersection concepts

(a) Fix a set X of m points. Let Y1 , . . . , Yk be the set of intersections of the concepts of C1
with X. By definition of ΠC1 (X), k ≤ ΠC1 (X) ≤ ΠC1 (m). By definition of ΠC2 (Yi ), the
intersection of the concepts of C2 with Yi are at most ΠC2 (Yi ) ≤ ΠC2 (m). Thus, the
474 Solutions Manual

number of sets intersections of concepts of C with X is at most


kΠC2 (Yi ) ≤ ΠC1 (m) ΠC2 (m). (E.57)
(b) In view of the result proved in the previous question, ΠCs (m) ≤ (ΠC1 (m))s . By Sauer’s
lemma, this implies  em sd
ΠCs (m) ≤ . (E.58)
sd d
If emd
< 2m , then the VC-dimension of C is less than m. Thus, it suffices to show
s
this inequality holds with m = 2ds log2 (3s). Plugging in that value for m and taking the
log2 yield:
ds log2 (2es log2 (3s)) < 2ds log2 (3s) (E.59)
⇔ log2 (2es log2 (3s)) < 2 log2 (3s) = log2 (9s2 ) (E.60)
⇔ 2es log2 (3s) 9s2< (E.61)
9s
⇔ log2 (3s) < . (E.62)
2e
This last inequality holds for s = 2: log2 (6) ≈ 2.6 < 9/(2e) ≈ 3.3. Since the functions
corresponding to the left-hand-side grows more slowly than the one corresponding to the
right-hand-side (compare derivatives for example), this implies that the inequality holds
for all s ≥ 2.
3.24 VC-dimension of union of concepts

(a) When C = A ∪ B, ΠC (X) ≤ ΠA (X) + ΠB (X) for any set X, since dichotomies in ΠC (X) can
be generated by A or by B. Thus, for all m, ΠC (m) ≤ ΠA (m) + ΠB (m).
(b) For m ≥ dA + dB + 2, by Sauer’s lemma,
dA   dB   dA   dB 
X m X m X m X m − i
ΠC (m) ≤ + = +
i=0
i i=0
i i=0
i i=0
i
dA dB
X m X m
= + (E.63)
i=0
i i=m−d
i
B
dA   dB
X m X m
≤ + (E.64)
i=0
i i=dA +2
i
m  
X m
< = 2m . (E.65)
i=0
i
Thus, the VC-dimension of C is strictly less than dA + dB + 2:
VCdim(C) ≤ dA + dB + 1. (E.66)

3.25 VC-dimension of symmetric difference of concepts


Fix a set S. We can show that the number of classifications of S using H is the same as when
using H∆A. The set of classifications obtained using H can be identified with {S ∩ h : h ∈ H}
and the set of classifications using H∆A can be identified with {S ∩ (h∆A) : h ∈ H}. Observe
that for any h ∈ H,
S ∩ (h∆A) = (S ∩ h)∆(S ∩ A). (E.67)
Figure E.7 helps illustrate this equality in a special case. Now, in view of this inequality, if
S ∩ (h∆A) = S ∩ (h0 ∆A) for h, h0 ∈ H, then
(S ∩ h)∆B = (S ∩ h0 )∆B, (E.68)
with B = S ∩ A. Since two sets that have the same symmetric differences with respect to a
set B must be equal, this implies
S ∩ h = S ∩ h0 . (E.69)
Solutions Manual 475

Figure E.7
Illustration of (h∆A) ∩ S = (h ∩ S)∆(A ∩ S) shown in gray.

This shows that φ defined by


φ : S ∩ H → S ∩ (H∆A)
S ∩ h 7→ S ∩ (h∆A)
is a bijection, and thus that the sets S ∩ H and S ∩ (H∆A) have the same cardinality.
3.26 Symmetric functions

(a) For i = 0, . . . , n, let xi ∈ {0, 1}n be defined by xi = (1, . . . , 1, 0, . . . , 0). Then, {x0 , . . . , xn }
| {z }
i 1’s
can be shattered by C. Indeed, let y0 , . . . , yn ∈ 0, 1 be an arbitrary labeling of these points.
Then, the function h defined by:
h(x) = yi (E.70)
for all x with i 1’s is symmetric and h(xi ) = yi . Thus, VCdim(C) ≥ n + 1. Conversely,
a set of n + 2 points cannot be shattered by C, since at least two points would then have
the same number of 1’s and will not be distinguishable by C. Thus,
VCdim(C) = n + 1. (E.71)
(b) Thus, in view of the theorems presented in class, a lower bound on the number of training
examples needed to learn symmetric functions with accuracy 1 −  and confidence 1 − δ is
1 1 n
Ω( log + ), (E.72)
 δ 
and an upper bound is:
1 1 n 1
O( log + log ), (E.73)
 δ  
which is only within a factor 1 of the lower bound.
(c) For a training data (z0 , t0 ), . . . , (zm , tm ) ∈ {0, 1}n × {0, 1} define h as the symmetric
function such that h(zi ) = ti for all i = 0, . . . , m.

3.27 VC-dimension of neural networks

(a) Let Πu (m) denote the growth function at a node u in the intermediate layer. For a fixed
Q class C the output node can
set of values at the intermediate layer, using the concept
generate at most ΠC (m) distinct labelings. There are u Πu (m) possible sets of values
at the intermediate layer since, by definition, for a sample Q
of size m, at most Πu (m)
(m) × u Πu (m) labelings can be
distinct values are possible at each u. Thus, at most ΠCQ
generated by the neural network and ΠH (m) ≤ ΠC (m) u Πu (m).
(b) For any intermediate node u, Πu (m) = ΠC (m). Thus, for k̃ = k + 1, ΠH (m) ≤ ΠC (m)k̃ .
d dk̃
By Sauer’s lemma, ΠC (m) ≤ em d
, thus ΠH (m) ≤ em
d
. Let m = 2k̃d log2 (ek̃). In
476 Solutions Manual

em

view of the inequality given by the hint and ek̃ > 4, this implies m > dk̃ log2 d
, that
dk̃
is 2m > emd
. Thus, the VC-dimension of H is less than
2k̃d log2 (ek̃) = 2(k + 1)d log2 (e(k + 1)).
(c) For threshold functions, the VC-dimension of C is r, thus, the VC-dimension of H is upper
bounded by
2(k + 1)r log2 (e(k + 1)).
3.28 VC-dimension of convex combinations
Following the hint, we can think of this family of functions as a one hidden layer neural
network, where the hidden layer is represented by the functions ht ∈ H, and the top layer is
a threshold function characterized by (α1 , . . . , αT ). Denote this class of threshold functions
by ∆T . From the solution of exercise 3.27(a) we can bound the growth function of FT by:
ΠFT (m) ≤ Π∆T (m) (ΠH (m))T .
From the solution to exercise 3.27(c), the VC dimension of ∆T is at most T , and we may
further denote the VC dimension of H by d. Applying Sauer’s lemma to the growth function
yields:  em T  em d
Π∆T (m) ≤ , ΠH (m) ≤ .
T d
Thus, we have that  em T  em T d
ΠFT (m) ≤ .
T d
Finally, we may apply the hint in exercise 3.27(b) with m = max{4T log2 (2e), 2T d log2 (eT )}
to see that  em T  em T d
< 24T log2 (2e)+2T d log2 (eT ) ,
T d
so that the VC Dimension of FT is bounded by:
2T (2 log2 (2e) + d log2 (eT )).
Note that a coarser but relatively simpler bound would be to write:
 em T  em T d
< (em)T (d+1) ,
T d
and to apply the hint in exercise 3.27(b) with m = 2T (d + 1) log2 (eT (d + 1)). Notice that this
is actually asymptotically optimal in T and d up to log terms.
3.29 Infinite VC-dimension
d
(a) Theorem 3.20 shows that there exists a distribution that can force an error of Ω( m ).
Thus, for an infinite VC-dimension, this lower bound requires an infinite number of points
to achieve a bounded error and thus implies that PAC-learning is not possible.
(b) Here is a description of the algorithm. Let M be the maximum value observed after
drawing m points and let p be the probability that a point greater than M be drawn. The
probability that all points drawn be smaller than or equal to M is
(1 − p)m ≤ e−pm . (E.74)
Setting δ/2 to match the upper bound, yields δ/2 = e−pm , that is
1 2
p= log . (E.75)
m δ
To bound p by /2, we can impose the following
1 2 
log ≤ . (E.76)
m δ 2
Thus, with confidence at least 1 − δ/2, the probability that a point greater than M be
drawn is at most /2 if L draws m ≥ 2 log 2δ points.
In the second stage, the problem is reduced to a finite VC-dimension M . Since PAC-
learning with (/2, δ/2) is possible for a finite dimension, this guarantees the (, δ)-PAC-
learning of the full algorithm.
Solutions Manual 477

3.30 VC-dimension generalization bound – realizable case

(a) Let h0 ∈ HS , then we have the following set of inequalities:


bS 0 (h)| >  ≥ P |R bS 0 (h0 )| > 
h i h i
P sup |R bS (h) − R bS (h0 ) − R
h∈HS 2 2
h i
= P RS 0 (h0 ) >
b
2
b 0 ) >  | R(h0 ) >  P[R(h0 ) > ]
h i
≥ P R(h
2
h m i
> P B(m, ) > P[R(h0 ) > ] .
2
The second inequality follows from the fact that for any two random events A and B,
P[A] ≥ P[A∧B] = P[A|B] P[B]. The final equality follows, since the event we are concerned
with is the probability that we get at least a fraction of /2 errors on a sample of size m
when the true probability of error is at least . In the case the true error rate equals ,
this exactly describes the probability that B(m, ) ≥ m/2.
(b) We apply Chebyshev’s inequality to the binomial random variable B(m, ), which has
mean m variance m(1 − ).
h m i h m i
P B(m, ) ≤ = P m − B(m, ) ≥
2 2
m(1 − ) 4(1 − ) 4 1
≤ = ≤ ≤
(m/2)2 m m 2
where the last inequality uses the assumption m ≥ 8. Thus, this shows that P[B(m, ) >
m/2] > 1 − 1/2 = 1/2. Plugging the bound into part (a) completes the question.
(c) There are 2m total ways to distribute the l error over the sample T and m
 
l l
way to
0
distribute the error such that the only hit S . Thus, the probability of all error falling
only into S 0 is bounded as
m
m−i m−i 1
l
2m
= Πl−1
i=0 ≤ Πl−1
i=0 ≤ l.
l
2m − i 2m − 2i 2
(d) This follows from
bS 0 (h) > 
h i
P RbS (h) = 0 ∧ R
T ∼D2m : 2
T →(S,S 0 )

bS 0 (h) >  ∧ R
bT (h) > 
h i
= P bS (h) = 0 ∧ R
R
T ∼D2m : 2 2
T →(S,S 0 )

bT (h) >  P[R


bS 0 (h) >  R bT (h) >  ]
h i
= P bS (h) = 0 ∧ R
R
T ∼D2m : 2 2 2
T →(S,S 0 )

bS 0 (h) >  R
bT (h) > 
h i
≤ P bS (h) = 0 ∧ R
R
T ∼D2m : 2 2
T →(S,S 0 )
m
≤ 2−l ≤ 2− 2 .
(e) Using the definition of the growth function, we can provide the following union bound
that is then in turn bounded using corollary 3.18:
bS 0 (h) >  ≤ ΠH (2m)2− m
h i  2em d
P ∃h ∈ H : R
bS (h) = 0 ∧ R 2 ≤ 2−m/2 .
T ∼D 2m
: 2 d
T →(S,S 0 )
Combining part (a) through (e), we finally have,
 2em d
P[R(h) > ] ≤ 2 2−m/2 . (E.77)
d
478 Solutions Manual

Setting the right-hand side equal to δ and solving for  show that with probability at least
1−δ
1 2em 1  2
≤ d log + log + log 2 .
m d δ log 2
3.31 Covering number generalization bound.

(a) First split the term into two separate terms:

|LS (h1 ) − LS (h2 )| ≤ |R(h1 ) − R(h2 )| + |R


bS (h1 ) − RbS (h2 )|
1 X m
= E [(h1 (x) − y)2 − (h2 (x) − y)2 ] + (h1 (xi ) − yi )2 − (h2 (xi ) − yi )2 .

x,y m i=1
Then, expanding the term

(h1 (x) − y)2 − (h2 (x) − y)2 = (h1 (x) − h2 (x))(h1 + h2 − 2y)

= (h1 (x) − h2 (x)) (h1 − y) + (h2 − y) ≤ kh1 − h2 k∞ 2M ,
allows us to bound both the empirical and true error, resulting in a total bound of 4M kh1 −
h2 k∞ .
(b) This follows by splitting the event into the union of several smaller events and then using
the sum rule,
h i
P sup |LS (h)| ≥ 
S h∈H
k
h_ i Xk h i
=P sup |LS (h)| ≥  ≤ P sup |LS (h)| ≥  .
S S
i=1 h∈Bi i=1 h∈Bi

(c) For any i let hi be the center of ball Bi with radius . Note that for any h ∈ H we have
8M
|LS (h) − LS (hi )| ≤ 4M kh − hi k∞ ≤ /2. Thus, if for any h ∈ Bi we have |LS (h)| ≥  it
must be the case that |LS (hi ) ≥ 2 |, which shows the inequality.
To complete the bound, we use Hoeffding’s inequality applied to the random variables
(h(xi ) − yi )2 /m ≤ M 2 /m, which guarantees
h i  −m2 
P |LS (hi )| ≥ ≤ 2 exp .
S 2 2M 4

Chapter 5

5.1 Soft margin hyperplanes

(a) The corresponding dual problem is:


m m m
X 1 X X (αi + βi )k/(k−1) 1
max αi − αi αj yi yj (xi · xj ) − 1/(k−1)
(1 − )
α,β
i=1
2 i,j=1 i=1
(kC) k
subject to:
m
X
α i yi = 0 α≥0 β ≥ 0.
i=1
(b) Here we see that the objective function is more complex requiring an optimization over
both α and β and there is the additional constraint β ≥ 0.
For k = 2 the additional term of interest is − m 2
P
i=1 (αi + βi ) (to see this, note that the
Hessian is negative semidefinite), which is jointly concave in αi and βi , which allows for
convex optimization techniques.

5.2 Tighter Rademacher bound


Solutions Manual 479

Proceed as in the proof of theorem 5.9, but choose ρk = 1/γ k . For any ρ ∈ (0, 1), there exists
k ≥ 1 such
q that ρ ∈ (ρk , ρk−1 ], with ρ0 = 1. For that k, ρ ≤ ρk−1 = γρk , thus 1/ρk ≤ γ/ρ and
p
log k = log logγ (1/ρk ) ≤ log log2 (γ/ρ). Furthermore, for any h ∈ H, R bS,ρ (h) ≤ R
k
bS,ρ (h).
Thus,
 s 
2γ log logγ (γ/ρ)
P ∃k : R(h) − RS,ρ (h) >
 b Rm (H) + +  ≤ 2 exp(−2m2 ),
ρ m
which proves the statement.
5.3 Importance weighted SVM
The modified primal optimization problem can be written as

1 Pm
minimize 2
||w||2 +C i=1 ξi pi
subject to yi [w · xi + b] ≥ 1 − ξi .
The Lagrangian holding for all w, b, αi ≥ 0, βi ≥ 0 is then

m
1 X
L(w, b, α) = ||w||2 + C ξ i pi (E.78)
2 i=1
m
X m
X
− αi [yi (w · xi + b) − 1 + ξi ] − βi ξ i .
i=1 i=1
∂L
Then ∂w and ∂L
∂b
are the same as for the regular non-separable SVM optimization problem.
∂L
We also have ∂ξ = Cpi −αi −βi . Thus, to satisfy the KKT conditions we have for all i ∈ [m],
i
m
X
w = α i yi xi (E.79)
i=1
m
X
α i yi = 0 (E.80)
i=1
αi + βi = Cpi (E.81)
αi [yi (w · xi + b) − 1 + ξi ] = 0 (E.82)
βi ξi = 0. (E.83)
Plugging equation E.79 into equation E.78, we get

m m m
1 X X X
L = || αi yi xi ||2 + C ξi pi − αi αj yi yj (xi · xj ) (E.84)
2 i=1 i=1 i,j=1
m
X m
X X m
X
− αi yi b + αi − αi ξi − βi ξi .
i=1 i=1 i=1
Using equation E.81, we can simplify:
m m
X 1 X
L= || αi yi xi ||2 ,
αi −
i=1
2 i=1
meaning that the objective function is the same as in the regular SVM problem. The difference
is in the constraints on the optimization. Recall that our dual form holds for βi ≥ 0. Using
again equation E.81, our optimization problem is to maximize L subject to the constraints:
m
X
∀i ∈ [m], 0 ≤ αi ≤ Cpi ∧ αi yi = 0.
i=1
480 Solutions Manual

5.4 Sequential Minimal Optimization (SMO).

(a) Starting from equation (5.33) and removing all terms that are constant with respect to
α1 and α2 yields the desired result.
(b) Substituting into Ψ1 , we have:
1 1
Ψ2 = γ − sα2 + α2 − K11 (γ − sα2 )2 − K22 α22 − sK12 (γ − sα2 )α2
2 2
− y1 (γ − sα2 )v1 − y2 α2 v2 .
We take the derivative to find the equation for the stationary point as follows:
dΨ2
= −s+1+sK11 (γ −sα2 )−K22 α2 −sK12 (γ −sα2 )+s2 K12 α2 +y1 sv1 −y2 v2 = 0 .
dα2
Noting that s2 = 1 and rearranging terms yields the statement of interest.
(c) By definition of f (·) we see that v1 = f (x1 ) − y1 α∗1 K11 − y2 α∗2 K12 and similarly, v2 =
f (x2 ) − y1 α∗1 K12 − y2 α∗2 K22 . Using these equations along with the identities α∗1 = γ − sα∗2
and y1 = sy2 we have
v1 − v2 = f (x1 ) − f (x2 ) − y1 α∗1 (K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − y1 (γ − sα∗2 )(K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − sy2 (γ − sα∗2 )(K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − sy2 γ(K11 − K12 ) + y2 α∗2 (K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − sy2 γ(K11 − K12 ) + y2 α∗2 η .
(d) Combining the results from (b) and (c), we have
h i
ηα2 = s(K11 − K12 )γ + y2 f (x1 ) − f (x2 ) − sy2 γ(K11 − K12 ) + y2 α∗2 η − s + 1
h i
= y2 f (x1 ) − f (x2 ) + y2 α∗2 η − s + 1
h
= α∗2 η + y2 f (x1 ) − f (x2 ) − y1 + y2

h
= α∗2 η + y2 (y2 − f (x2 )) − (y1 − f (x1 )) .


Dividing both sides by η yields the desired result.


(e) Clipping is required to ensure that the new values of α1 and α2 satisfy the inequality
constraints 0 ≤ α1 , α2 ≤ C. The lower bound of 0 follows directly from this inequality
constraint, as does the upper bound of C. Moreover, when s = +1, the lower bound of
γ − C ensures that αclip
2 is large enough such that αclip
2 + α1 = γ while respecting the
constraint α1 ≤ C. Similarly, the upper bound γ ensures that αclip
2 is small enough such
that αclip
2 + α1 = γ while respecting the constraint α1 ≥ 0.

5.5 SVM hands-on

(a) Download and install the libsvm software library from:


http://www.csie.ntu.edu.tw/~cjlin/libsvm/
(b) Concatenate and scale data.

cat \
satimage/satimage.scale.t \
satimage/satimage.scale.val > data/train
libsvm-2.88/svm-scale \
data/train > data/train.scaled
Solutions Manual 481

Polinomial kernel, degree = 1 Polinomial kernel, degree = 2


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-8 -6 -4 -2 0 2 4 6 -8 -6 -4 -2 0 2 4 6
log2(C) log2(C)

Polinomial kernel, degree = 3 Polinomial kernel, degree = 4


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-8 -6 -4 -2 0 2 4 6 -8 -6 -4 -2 0 2 4 6
log2(C) log2(C)

Figure E.8
Average cross-validation error plus or minus one standar deviation for different values of the
trade-off constant C and of the degree of the polinomial kernel

(c) Run 10-fold cross-validation, for different values of the degree d of the polynomial kernel
and of the trade-off constant C. We test d = 1, 2, 3, 4 and log2 (C) = −8, −7.5, · · · , 5.5, 6.
We step the value of the trade-off constant logarithmically as suggested by:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Which gives the cross-validation error plots shown in figure E.8.
The best values of trade-off constant C are:

d = 1 \colon C* = 2^(+5.0) = 32 cv-err = 10.4% +/- 1.4%


d = 2 \colon C* = 2^(+0.0) = 1 cv-err = 8.3% +/- 2.2%
d = 3 \colon C* = 2^(-2.0) = .25 cv-err = 7.5% +/- 2.1%
d = 4 \colon C* = 2^(-4.5) = .0442 cv-err = 6.4% +/- 1.5%

The best C measured on the validation set is C ∗ = 2−4.5 , with degree d∗ = 4, which gives
an average error rate of 6.4% ± 1.5%.
(d) The trade-off constant is fixed to C ∗ = 2−4.5 , and 10-fold cross-validation is run for
degrees 1 through 4. In figure E.9 we plot the resulting cross-validation training and test
errors and the average number of support vectors (nSV is the number of support vectors,
nBSV is the number of bounded support vectors, i.e. whose dual variable is equal to the
trade-off constant).
(e) Support vectors always lie on the margin hyperplanes when their dual variable is smaller
than C. This happens for all the support vectors (SV) that are not bounded (BSV). Our
measurement gives the following averages:
d = 1 \colon nSV - nBSV = 8.5
482 Solutions Manual

CV error, log2(C) = -4.5 CV error, log2(C) = -4.5


1100
Test error nSV
0.14 Train error nBSV
1000

0.12
900

number of support vectors


0.1 800

700
error rate

0.08

600
0.06

500

0.04
400

0.02
300

0 200
0 1 2 3 4 5 0 1 2 3 4 5
polynomial kernel degree polynomial kernel degree

Figure E.9
Average cross-validation error rates and average number of support vectors (nSV) and of bounded
support vectors(nSV) as a function of the degree of the polynomial kernel.

d = 2 \colon nSV - nBSV = 41.8


d = 3 \colon nSV - nBSV = 118.9
d = 4 \colon nSV - nBSV = 223.8
(f) We can consider the more general problem of assigning a weight ki to every sample that
will multiply its misclassification penalty. The optimization problem becomes:
m
1 X
minimize ||w||2 + ki ξi
2 i=1
subject to yi (wxi + b) ≤ 1 − ξi , ξi ≥ 0, ∀i ∈ 1, . . . , m .

Moving to the dual exactly as shown in the chapter we obtain:


m
X 1X
maximize αi − αi αj yi yj xi · xj
i=1
2 i,j
subject to 0 ≤ αi ≤ Cki , ∀i ∈ 1, . . . , m .

Setting ki = k for yi = −1 and ki = 1 for yi = 1 gives the desired problem.


(g) If k is an integer we can repeat every negative training point in the data k times. False
positives will thus get penalized k times as much as false negatives.
(h) Repeating the training for k = 1, 2, 4, 8, 16, we find the following results:
k = 1 \colon (d*, C*) = (4, 2^(-4.5)), cv-err = 6.4% +/- 1.5%
k = 2 \colon (d*, C*) = (4, 2^(-5.0)), cv-err = 6.4% +/- 0.9%
Solutions Manual 483

Polinomial kernel, degree = 1 Polinomial kernel, degree = 2


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6
log2(C) log2(C)

Polinomial kernel, degree = 3 Polinomial kernel, degree = 4


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6
log2(C) log2(C)

Figure E.10
As in figure E.8, but false positives are penalized twice as much as false negatives.

k = 4 \colon (d*, C*) = (4, 2^(-5.5)), cv-err = 6.1% +/- 1.7%


k = 8 \colon (d*, C*) = (4, 2^(-3.5)), cv-err = 6.2% +/- 1.0%
k = 16 \colon (d*, C*) = (4, 2^(-3.5)), cv-err = 6.2% +/- 1.4%

Plots equivalent to the ones in figure E.8 are given in figure E.10, figure E.11, figure E.12,
and figure E.13. We obtain the best average accuracy for k = 4.

5.6 Sparse SVM

(a) Let
x0i = (y1 xi · x1 , . . . , ym xi · xm ) .
Then the optimization problem becomes
m
1 X
min ||α||2 + C ξi
α,b,ξ 2
i=1
subject to yi α · x0i + b ≥ 1 − ξ


ξi , αi ≥ 0, i ∈ [m] ,
which is the standard formulation of the primal SVM optimization problem on samples
x0i , modulo the non-negativity constraints on αi .

(b) The Lagrangian of (1) for all αi ≥ 0, ξi ≥ 0, b, α0i ≥ 0, βi ≥ 0, γi ≥ 0, i ∈ [m] is


m m m m
1 X X X X
L= ||α||2 + C ξi − α0i (yi (α · x0i + b) − 1 + ξi ) − βi ξ i − γi αi ,
2 i=1 i=1 i=1 i=1
484 Solutions Manual

Polinomial kernel, degree = 1 Polinomial kernel, degree = 2


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-6 -4 -2 0 2 4 -6 -4 -2 0 2 4
log2(C) log2(C)

Polinomial kernel, degree = 3 Polinomial kernel, degree = 4


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-6 -4 -2 0 2 4 -6 -4 -2 0 2 4
log2(C) log2(C)

Figure E.11
As in figure E.8, but false positives are penalized four times as much as false negatives.

and the KKT conditions are


m
X
∇α L = 0 ⇔ α= α0i yi x0i + γ
i=1
m
X
∇b L = 0 ⇔ α0i yi = 0
i=1
∇ ξi L = 0 ⇔ α0i + βi = C
and
α0i (yi (α · x0i + b) − 1 − ξi ) = 0
βi ξi = 0
γi αi = 0.
Solutions Manual 485

Polinomial kernel, degree = 1 Polinomial kernel, degree = 2


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-8 -6 -4 -2 0 2 4 -8 -6 -4 -2 0 2 4
log2(C) log2(C)

Polinomial kernel, degree = 3 Polinomial kernel, degree = 4


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-8 -6 -4 -2 0 2 4 -8 -6 -4 -2 0 2 4
log2(C) log2(C)

Figure E.12
As in figure E.8, but false positives are penalized eight times as much as false negatives.

Using the KKT conditions on L we get


m
! m 
m
1 X X X
L = α0i yi x0i +γ · 0 0
αj yj xj + γ  + C ξi
2 i=1 j=1 i=1
    
m
X m
X
− α0i yi  α0j yj x0j + γ  · x0i + b − 1 + ξi 
i=1 j=1

m
X m
X
− βi ξi − γi αi
i=1 i=1
 
m m
1X 0 0 
X
0 0 1 
= − α yi x i · αj yj xj + γ  + γ·α
2 i=1 i j=1
2

m
X
+ Cξi − α0i (yi b − 1 + ξi )
i=1
m m
X 1 X 0 0
α0i α α yi yj x0> x0j + γ

= −
i=1
2 i,j=1 i j i

m
X  X
 m 
+ (C α0i )ξi −
− 0
α
i yi b.
 
i=1
 i=1

486 Solutions Manual

Polinomial kernel, degree = 1 Polinomial kernel, degree = 2


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-8 -6 -4 -2 0 2 -8 -6 -4 -2 0 2
log2(C) log2(C)

Polinomial kernel, degree = 3 Polinomial kernel, degree = 4


0.14 0.14
cross validation error

cross validation error


0.12 0.12

0.1 0.1

0.08 0.08

0.06 0.06

0.04 0.04
-8 -6 -4 -2 0 2 -8 -6 -4 -2 0 2
log2(C) log2(C)

Figure E.13
As in figure E.8, but false positives are penalized sixteen times as much as false-negatives.

Thus the dual optimization problem is


m m
X 1 X 0 0
α0i − αi αj yi yj x0i · x0j + γ

max
0
α ,γ 2
i=1 i,j=1
m
X
subject to α0i yi = 0
i=1
0 ≤ α0i ≤ C, γi ≥ 0, i ∈ [m] .

5.7 VC-dimension of canonical hyperplanes

(a) By definition of {x1 , . . . , xd }, for all y = (y1 , . . . , yd ) ∈ {−1, +1}d , there exists w such
that,
∀i ∈ [d], 1 ≤ yi (w · xi ) .
Summing up these inequalities yields
d
d d
X X X
d≤w· yi xi ≤ kwk yi x i ≤ Λ yi x i .


i=1 i=1 i=1

(b) Since this inequality holds for all y ∈ {−1, +1}d , it also holds on expectation over
y1 , . . . , yd drawn i.i.d. according to a uniform distribution over {−1, +1}. In view of
the independence assumption, for i 6= j we have E[yi yj ] = E[yi ] E[yj ]. Thus, since the
Solutions Manual 487

distribution is uniform, E[yi yj ] = 0 if i 6= j, E[yi yj ] = 1 otherwise. This gives


" d #
X
d ≤ ΛE yi x i (taking expectations)

y
i=1
 
d
2  12
X
≤ Λ E  yi x i   (Jensen’s inequality)

y
i=1
 1
d 2
X
= Λ E [yi yj ] (xi · xj )
y
i,j=1
" d
# 12
X
=Λ (xi · xi ) .
√ i=1
Thus, d ≤ Λr, which completes the proof.
1 √
(c) In view of the previous inequality, we can write d ≤ Λ dr2 2 = Λr d.


Chapter 6

6.1 This follows directly the Cauchy-Schwarz inequality:


|Φ(x)> Φ(y)| ≤ kΦ(x)kkΦ(y)k, (E.85)
where Φ is a feature mapping associated to K.
6.2 (a) Since cos(x − y) = cos x cos y + sin x sin y, K(x, y) can be written as the dot product of
the vectors " # " #
cos x cos y
Φ(x) = and Φ(y) = ; (E.86)
sin x sin y
thus it is PDS.
(b) This
P is a consequence ofPthe fact that the kernel of the previous question is PDS since
2 2 0 0 0 2
ij ci cj cos(xi − xj ) = ij ci cj cos(xi − xj ), with xi = xi for all i.
The solution for this question is similar to that of the previous question.
(c) Since the product and sum of PDS kernels is PDS, it suffices to show that k : x 7→ cos(x2 −
y 2 ) is PDS over R × R, which was proven in part (b).
(d) For any x1 , . . . , xm ∈ R, c1 , . . . , cm ∈ R, and a ≥ 0, let f (a) be defined by
X
f (a) = ci cj K(xi , xj )axi +xj . (E.87)
i,j
xi +xj −1 1
Then, f 0 (a) = k(ci axi )i k2 ≥ 0. Therefore f is monotonically
P
i,j ci cj a =P
a
increasing. Thus, f (1) ≥ f (0), that is i,j ci cj K(xi , xj ) ≥ 0.
(e) Rewriting the cosine in terms of the dot product, we have

x · x0
K(x, x0 ) = cos ∠(x, x0 ) =
.
kxkkx0 k
Thus, the cosine kernel is just a scaling of the standard dot product, which is a PDS
kernel. Hence, the cosine kernel is also PDS.
(f) For all x, x0 ∈ R,
[sin(x0 − x)]2 = 1 − [cos(x0 − x)]2
= 1 − [cos x0 cos x + sin x0 sin x]2
= 1 − (u(x0 ) · u(x))2 ,
488 Solutions Manual

where u(x) = (cos x, sin x)> for all x ∈ X. Observe that ku(x)k = 1 for all x ∈ X. Thus,
λ 0 2
[sin(x0 − x)]2 = 21 ku(x0 ) − u(x)k2 and K(x, x0 ) = e− 2 ku(x )−u(x)k . Since the Gaussian
kernel is known to be PDS, K is also PDS (the fact that Gaussian kernels are PDS can be
shown easily by observing that they are the normalized kernels associated to the kernel
obtained by applying exp to standard inner products).

(g) It suffices to show that K is the normalized kernel associated to the kernel K 0 defined by
∀(x, y) ∈ RN × RN , K 0 (x, y) = eφ(x,y)
1
where φ(x, y) = σ
[kxk + kyk − kx − yk], and to show that K 0 is PDS. For the first part,
observe that
K 0 (x, y) 1 1 kx−yk
p = eφ(x,y)− 2 φ(x,x)− 2 φ(y,y) = e− σ .
0 0
K (x, x)K (y, y)
To show that K 0 is PDS, it suffices to show that φ is PDS, since composition with a power
series with non-negative coefficients (here exp) preserve the PDS property. Now, for any
c1 , . . . , cn ∈ R, let c0 = − n
P
i=1 ci , then, we can write
n n
X 1 X
ci cj φ(xi , xj ) = ci cj [kxi k + kxj k − kxi − xj k]
i,j=1
σ i,j=1
n n n
1h X X X i
= − c0 ci kxi k + − c0 cj kxj k − ci cj kxi − xj k
σ i=1 i=1 i,j=1
n
1 X
=− ci cj kxi − xj k,
σ i,j=0
with x0 = 0. Now, for any z ∈ R, the following equality holds:
1 − e−tz
Z +∞
1 1
z2 = 1 3 dt.
2Γ( 2 ) 0 t2
Thus,
n n 2
1 − e−tkxi −xj k
Z +∞
1 X 1 1 X
− ci cj kxi − xj k = − c i c j dt
σ i,j=0 2Γ( 21 ) 0 σ i,j=0 t2
3

Z +∞ Pn −tkxi −xj k2
1 1 i,j=0 ci cj e
= 1 3 dt.
2Γ( 2 ) 0 σ t2
Pn 2
Since a Gaussian kernel is PDS, the inequality i,j=0 ci cj e−tkxi −xj k ≥ 0 holds and the
Pn
right-hand side is non-negative. Thus, the inequality − σ1 i,j=0 ci cj kxi − xj k ≥ 0 holds,
which shows that φ is PDS. 

Alternatively, one can also apply the theorem on page 43 of the lecture slides on kernel
methods to reduce the problem to showing that the norm G(x, y) = kx − yk is a NDS
function. This can be shown through a direct application of the definition of NDS together
with the representation of the norm given in the hint.

(h) Let h·, ·i denote the inner product over the set of all measurable functions on [0, 1]:
Z 1
hf, gi = f (t)g(t)dt.
0
Note that k1 (x, y) = 0 1t∈[0,x] 1t∈[0,y] dt = h1[0,x] , 1[0,y] i, and k2 (x, y) = 01 1t∈[x,1] 1t∈[y,1] dt =
R1 R
h1[x,1] , 1[y,1] i. Since they are defined as inner products, both k1 and k2 are PDS. Note
that k1 (x, y) = min(x, y) and k2 (x, y) = 1 − max(x, y). Observe that K = k1 k2 , thus K
is PDS as a product of two PDS kernels.
Solutions Manual 489

(i) f : x 7→ √1 admits the Taylor series expansion


1−x
∞ 
X 1/2
f (x) = (−1)n xn
n=0
n
1 ( 1 −1)···( 1 −n+1)
for |x| < 1, where 1/2 Observe that 1/2 (−1)n > 0 for all
 2 2 2

n
= n! n
.
n ≥ 0, thus, the coefficients in the power series expansion are all positive. Since the
radius of the convergence of the series is one and that by the Cauchy-Schwarz inequality
|x0 · x| ≤ kx0 kkxk < 1 for x, x0 ∈ X, by the closure property theorem presented in class,
(x, x0 ) 7→ f (x0 · x) is a PDS kernel.

1/2
Now, let an = n
(−1)n . Then, for x, x0 ∈ X,

X N
X n
f (x0 · x) = an xi x0i
n=0 i=1

X X n  
= an (xi1 x0i1 )s1 · · · (xiN x0iN )sN
n=0 s1 +···+sN =n
s 1 , . . . , sN
s + · · · + s 
1 N
X
= as1 +···+sN (xi1 x0i1 )s1 · · · (xiN x0iN )sN ,
s ,...,s ≥0
s1 , . . . , sN
1 N
where the sums can be permuted since the series is absolutely summable. Thus, we can
write f (x0 · x) = Φ(x) · Φ(x0 ) with
s s + · · · + s  
1 N s
Φ(x) = as1 +···+sN xsi11 · · · xi N .
s1 , . . . , sN N
s1 ,...,sN ≥0
Φ is a mapping to an infinite-dimensional space.

(j) For x ≥ 0, the integral 0+∞ e−sx e−s ds is well defined and
R
" #+∞
e−s(1+x)
Z +∞ Z +∞
−sx −s −s(1+x) 1
e e ds = e ds = − = . (E.88)
0 0 1+x 1+x
0
kx−yk2

Since the Gaussian kernel, (x, y) 7→ e is PDS for any σ 6= 0, the kernel (x, y) 7→
σ2
R +∞ − skx−yk2 −s
0 e σ2 e ds is also PDS since for any x1 , . . . , xm in RN and c1 , . . . , cm in R,
m −x k 2 m skxi −xj k2
skx i j
Z +∞ X
− −
X
ci cj e σ2 ≥0⇒ ci cj e σ2 e−s ds ≥ 0. (E.89)
i,j=1 0 i,j=1
skx−yk2
R +∞ − 1
By (E.88), 0 e σ2 e−s ds = kx−yk2
for all x, y in RN , which concludes the
1+
σ2
proof.

(k) Observe that for all x, y ∈ R,


Z +∞
1t∈[0,|x|] 1t∈[0,|y|] dt = min{|x|, |y|}. (E.90)
0
490 Solutions Manual

Thus, (x, y) 7→ min(|x|, |y|) is a PDS kernel over R × R since for any x1 , . . . , xm in RN
and c1 , . . . , cm in R,
m
X
ci cj min{|xi |, |xj |}
i,j=1
Z +∞ m
X N
X
= ci cj 1t∈[0,|xi |] 1t∈[0,|xj |] dt
0 i,j=1 k=1
 c1 1
t∈[0,|x1 |]
 2
+∞
Z
..

=   dt ≥ 0.
c 1 .

0
m t∈[0,|xm |]
PN N × RN as a sum of PDS
Thus, (x, y) 7→ i=1 min(|xi |, |yi |) xis a PDS kernel over R
kernels. Its composition with x 7→  σ with admits a power series with an infinite radius
2

of convergence and non-negative coefficients is thus also PDS, which concludes the proof.
6.3 Graph kernel For any i, j ∈ V, let E[i, j] denote the set of edges between i and j which is either
reduced to one edge or is empty. For convenience, let V = {1, . . . , n} and define the matrix
W = (Wij ) ∈ Rn×n by Wij = 0 if E[i, j] = ∅, Wij = w[e] if e ∈ E[i, j]. Then, we can write
X X
K(p, q) = w[e]w[e0 ] = Wpr Wrq = Wpq 2
.
e∈E[p,r],e0 ∈E[r,q] r
Let K = (Kpq ) denote the kernel matrix. Since W is symmetric For any vector X ∈ Rn ,
X> KX = X> W2 X = X> W> WX = kWXk2 ≥ 0. Thus, the eigenvalues of K are non-
negative. The same holds similarly for the kernel matrix restricted to any subset of V, thus
K is PDS.
6.4 Symmetric difference kernel
k : (A, B) 7→ |A∩B| is a PDS kernel over 2X since k(A, B) = x∈X 1A (x)1B (x) = Φ(A)·Φ(B),
P

where, for any subset A, Φ(A) ∈ {0, 1} |X| is the vector whose coordinate indexed by x ∈ X is
1 if x ∈ A, 0 otherwise. Since k is PDS, K 0 = exp(k) is also PDS (PDS property preserved
by composition with power series). Since K is the result of the normalization of K 0 , it is also
PDS.
6.5 Set kernel
Let Φ0 be a feature mapping associated to k0 , then K 0 (A, B) = x∈A,x0 ∈B φ0 (x) · φ0 (x0 ) =
P
P  P 
0
P
x∈A φ0 (x) · x0 ∈B φ0 (x ) = Ψ(A)·Ψ(B), where for any subset A, Ψ(A) = x∈A φ0 (x).

K(x, y) = 1 − cos2 (x − y). Now, for any x1 , . . . , xm ∈ R and c1 , . . . , cm ∈ R


6.6 (a) Note that P
such that m i=1 ci = 0,
m
X m
X m
X
ci cj K(xi , xj ) = ci cj − ci cj cos2 (xi − xj ) (E.91)
i,j=1 i,j=1 i,j=1
Xm
=− ci cj cos2 (xi − xj ), (E.92)
i,j=1
Pm Pm 2 = 0. It was already shown in a previous question
since i,j=1 ci cj = ( i=1 ci )
0
that K (x, y) = cos(x − y) defines a PDS kernel. P Since the product of two PDS ker-
m
nels is PDS, K 02 is also PDS. This implies that 2
i,j=1 ci cj cos (xi − xj ) ≥ 0, and
Pm 2
i,j=1 ci cj K(xi , xj ) ≤ 0, which can also be verified directly by expanding cos (xi − xj ).
(b) As presented in this chapter, K is NDS iff exp(−tK) is PDS for all t > 0. Here, for any
t > 0, exp(−tK) = (x + y)−t . It was already shown in a previous question that (x + y)−1 ,
which corresponds to the case t = 1, is PDS. Since a product of PDS kernels is PDS,
(x + y)−n also defines a PDS kernel for any positive integer n. To see the general case, it
suffices to show that (x + y)−t is PDS for any 0 < t < 1.
Solutions Manual 491

6.7 Consider the Gram matrix defined as Kij = K(xi , xj ). It is clear that K will have all zeros on
the diagonal. Hence tr(K) = 0. When K 6= 0, this means it must have at least one negative
eigenvalue. Hence K is not PDS.
6.8 The kernel K is NDS. In fact, more generally, kx − ykp for 0 < p ≤ 2 defines an NDS kernel.
This follows a general fact: if a kernel K is NDS, then K α is NDS for any 0 < α < 1.
Indeed, if K is NDS, then exp(−tK) is PDS for all t > 0. Then − exp(−tK) is NDS and so
is 1 − exp(−tK). Now, the following formula holds for all 0 < α < 1:
Z +∞
dt
K α = cα [1 − exp(−tK)] α+1 , (E.93)
0 t
where cα is a positive constant. Thus, K α is NDS for all 0 < α < 1.
6.9 Let n ≥ 1, {c1 , . . . , cn } ⊆ R and {x1 , . . . , xn } ⊆ H with n
P
i=1 ci = 0. Then,
n n n n n
* +
X X X X X
ci cj K(xi , xj ) = ( ci )2 − c i xi , cj xj = −k ci xi k ≤ 0. (E.94)
i,j=1 i=1 i=1 j=1 i=1
6.10 It is straightforward to see that the kernel K defined over R+ × R+ by K(x, y) = x + y
is negative definite symmetric (NDS) and that K(x, x) ≥ 0 for all x ≥ 0. Thus, for any
0 < α ≤ 1, K α is also NDS. Thus, exp(−tK α ) is PDS for all t > 0, in particular t = 1.
Assume now that Kp is PDS for some p > 1. Then, for all {c1 , . . . , cn } ⊆ R, {x1 , . . . , xn } ⊆ X,
and t > 0,
n n
X p X 1/p 1/p p
ci cj e−t(xi +xj ) = ci cj e−(t xi +t xj ) ≥ 0. (E.95)
i,j=1 i,j=1
Thus, Ψ(x, y) = (x + y)p is NDS. This contradicts
2
X
ci cj (xi + xj )p = 2p − 2 > 0 (E.96)
i,j=1
for x1 = 0, xj = 1, and c1 + c2 = 0 with c1 = 1.
6.11 Explicit mappings
(a) Because K is positive semidefinite, it can be diagonalized as K = SΛS> where Λ is
a diagonal matrix of K’s eigenvalues and S is the matrix of K’s eigenvectors. Further
decomposing, we get K = SΛ1/2 Λ1/2 S> . We then have

K(xi , xj ) = Kij = (SΛS> )ij = (Λ1/2 Si ) · (Λ1/2 Sj )


where Si is the ith eigenvector of K. Thus the kernel map Φ(xi ) = Λ1/2 Si clearly satisfies
the desired condition.
(b) For any α1 , . . . αm ∈ R, we have

m
m 2
X X
αi αj Kij = αi Φ(xi ) ≥ 0


i,j=1 i=1

6.12 Explicit polynomial kernel mapping


k
It is clear that K can be written as a linear combinations of all monomials xk1 1 · · · xNN with
PN
j=1 lk ≤ d. The dimension of the feature space is thus the number of such monomials,
f (N, d), that is the number of ways of adding N non-negative integers to obtain a sum of at
most d. Note that any sum of N −1 integers less than or equal to d can be uniquely completely
to be equal to d by adding one more term. This defines in fact a bijection between sums of
N − 1 integers with value at most d and sums of N integers with value equal to d. Thus, the
number of sums of N integers exactly equal to d is f (N − 1, d).
Now, since a sum of N terms less than or equal to d is either equal to d or less than or equal
to d − 1,
f (N, d) = f (N − 1, d) + f (N, d − 1).
492 Solutions Manual

The result then follows by induction on N + d, using f (1, 0) = f (0, 1) = 1.


By the binomial identity, K(x, x0 ) = (x·x0 +c)d = di=0 di cd−i (x·x0 )i . The weight assigned
P 
d d−i
to each ki , i ≤ d, is thus i c . Increasing c decreases the weight of ki , particularly that of
ki s with larger i.
6.13 High-dimensional mapping
(a) By definition of K 0 , for all x, x0 ∈ X, ED [K 0 (x, x0 )] = K(x, x0 ). Replacing one index i ∈ I
2
1
by i0 affects K 0 (x, x0 ) by at most n (|[Φ(x)]i [Φ(x0 )]i | + |[Φ(x)]i0 [Φ(x0 )]i0 |) ≤ 2R
n
. The
result thus follows directly McDiarmid’s inequality.
(b) Since K and K0 are symmetric, it suffices to prove the statement for the entries of these
matrices that are above the diagonal. By the union bound and the previous question, the
following holds:
m(m + 1) −n42 −n2
P n [∃i, j ∈ [m] : |K0ij − Kij | > ] ≤ 2 e 2R = m(m + 1)e 2R4 .
I∼D 2
Setting δ > 0 to match the upper bound leads directly to the inequality claimed.
6.14 Classifier based kernel.
(a) By definition, given the distribution D, h∗ is defined as:
h∗ = argmin errorD (h). (E.97)
h : X→{−1,+1}
K ∗ is clearly positive definite symmetric since K ∗ (x, x0 ) is defined as the dot product (in
dimension one) of the features vectors h∗ (x) and h∗ (x0 ).
(b) The general expression of the solution is
Xm
h(x) = sgn( αi K ∗ (x, xi ) + b). (E.98)
i=1
Here, it is easy to see both in the separable and non-separable case that the solution is
simply:
h(x) = sgn(K ∗ (x, x+ )), (E.99)

where x+ is such that h (x+ ) = +1. One support vector is enough. The solution can be
rewritten as
h(x) = h∗ (x). (E.100)
The generalization error of the solution is thus that of the Bayes classifier (it is optimal).
The data is separable iff the Bayes error is zero.
(c) A kernel of this type is always positive definite symmetric since K(x, x0 ) is defined as a
dot product of the feature vectors h(x) and h(x0 ).
6.15 Image classification kernel
(a) Observe that min(|u|α , |u0 |α ) = 0+∞ 1t∈[0,|u0 |α ] 1t∈[0,|u0 |α ] dt, which shows that (u, u0 ) 7→
R
α 0 α
min(|u| , |u | ) is PDS.
(b) Since Kα (x, x0 ) = N α 0 α
P
k=1 min(|xk | , |xk | ), Kα is PDS as a sum of N PDS kernels.

6.16 Fraud detection


Let h·, ·i be the bilinear function defined over the space of all functions defined by:
hf, gi = E (f (x) − E f (x))(g(x) − E g(x). (E.101)
Since the covariance matrix of n random variables is positive semidefinite and symmetric, this
defines a dot product. Observe that P[U ] = E 1U and P[U ∧ V ] = E 1U 1V , thus
K(U, V ) = h1U , 1V i . (E.102)
This shows that K is a positive definite symmetric kernel.
6.17 Relationship between NDS and PDS kernels
Solutions Manual 493

• Assume exp(−tK) is PDS. Then it is easy to see that − exp(−tK) is NDS. It is also
straightforward to show that shifting by a constant and scaling by a positive constant t > 0
1−exp(−tK)
preserves the NDS property, thus, t
is NDS. Finally, we examine the one-sided
limit of the expression, which is also NDS,

1 − exp(−tK) ∂ exp(−xK)
lim =− =K,
t→0+ t ∂x
x=0
showing that K is NDS.
• Assume K is NDS. Theorem 6.16 then implies for some PDS function K 0 and x0 :
−K(x, x0 ) = K 0 (x, x0 ) − K(x, x0 ) − K(x0 , x0 ) + K(x0 , x0 )
0 0
(x,x0 ) −K(x,x0 )
⇐⇒ e−K(x,x )
= eK e e−K(x0 , x0 ) eK(x0 ,x0 ) ,
| {z }
φ(x)φ(x0 )
where K 0 is PDS. Since K 0 is PDS exp(K 0 ) is also PDS, it is easy to show that any func-
tion of the form K(x, x0 ) = φ(x)φ(x0 ) is PDS and finally the constant function K(x, x0 ) =
exp(K(x0 , x0 )) is also PDS. Using the closure of PDS functions with respect to multiplica-
tion completes the proof.
6.18 Metrics and kernels

(a) If K is an NDS kernel, then by theorem 6.16 the kernel K 0 defined for any x0 ∈ X by:
1
K 0 (x, x0 ) = [K(x, x0 ) + K(x0 , x0 ) − K(x, x0 )]
2
is a PDS kernel (K(x0 , x0 ) = 0). Let H be the reproducing Hilbert space associated to K 0 .
There exists a mapping Φ(x) from X to H such that ∀x, x0 ∈ X, K 0 (x, x0 ) = Φ(x) · Φ(x0 ).
Then,
||Φ(x) − Φ(x0 )||2 = K 0 (x, x) + K 0 (x0 , x0 ) − 2K 0 (x, x0 )
1
= [2K(x, x0 ) − K(x, x)] +
2
1
[2K(x0 , x0 ) − K(x0 , x0 )] −
2
[K(x, x0 ) + K(x0 , x0 ) − K(x, x0 )]
= √K(x, x0 )
It is then straightforward to show that K is a metric.
(b) Suppose that K(x, x0 ) = exp(−|x − x0 |p ), x, x0 ∈ R, is positive definite for p > 2. Then,
for any t > 0, {x1 , . . . , xn } ⊆ X, {c1 , . . . , cn } ⊆ R,
Xn X n
ci cj exp(−t|xj − xk |p ) = ci cj exp(−|t1/p xj − t1/p xk |p ) ≥ 0
i,j=1 i,j=1

Thus, by theorem 6.17, K 0 (x, x0 ) = |x − x0 |p is an NDS kernel. But, K 0 is not a metric
for p > 2 since it does not verify the triangle inequality (take x = 1, x0 = 2, x00 = 3),
which contradicts part (a).
(c) If a < 0 or b < 0, a||x||2 + b < 0 for some non-null vectors x. For such values, K(x, x) =
tanh(a||x||2 + b) < 0. The kernel is thus not PDS and the SVM training may not converge
to an optimal value. The equivalent neural network may also converge to a local minimum.

6.19 Sequence kernels


(a) X ∗ − I is a regular language and can be represented by a finite automaton. K can thus
be defined by
∀x, y ∈ X∗ , K(x, y) = [[T ◦ T −1 ]](x, y), (E.103)
where T is the weighted transducer shown in figure E.14. Thus, K is a rational kernel and
in view of the theorem 6.21, it is positive definite symmetric.
494 Solutions Manual

t:e/1
t:e/1
g:e/1
g:e/1
c:e/1
c:e/1
a:e/1
a:e/1

X* - I: X* - I/r
0 1/1

Figure E.14
Weighted transducer T . e represents the empty string, and r = ρ. X ∗ − I stands for a finite
automaton accepting X ∗ − I.

(b) Let MX ∗ −I be the minimal automaton representing X ∗ − I. The transducer T of fig-


ure E.14 can be constructed using MX ∗ −I . Then, |T | = |MX ∗ −I | + 8. Using composition
of weighted transducers, the running time complexity of the computation of the algorithm
is:
O(|x||y||T ◦ T −1 |) = O(|x||y||T |2 ) = O(|x||y||MX ∗ −I |2 ). (E.104)

(c) The set of strings Y over the alphabet X of length less than n form a regular language
since they can be described by:
n−1
[
Y = Xi. (E.105)
i=0
Thus, Y1 = Y ∩ (X ∗ − I) and Y2 = (X ∗ − I) − Y1 are also regular languages. It suffices to
replace in the transducer T of figure E.14 the transition labeled with X ∗ − I : X ∗ − I/ρ
with two transitions:

• Y1 : Y1 /ρ1 , and
• Y2 : Y2 /ρ2 ,

with the same origin and destination states and with Y1 and Y2 denoting finite automata
representing them. The kernel is thus still rational and PDS since it is of the form T 0 ◦T 0−1 .

6.20 n-gram kernel

6.21 Mercer’s condition


We prove the contrapositive, that if K is not PDS then Mercer’s Pm condition necessarily does
not hold. Consider c and a sample (x1 , . . . , xm ) such that i,j=1 ci cj K(xi , xj ) < 0, which
exists
Pm by the non-PDS assumption. Then define a function cσ that is the sum of m Gaussians
i=1 ci N (xi , σ). For sufficiently small σ and ∀i ∈ [m], cσ (xi ) ≈ ci pi , where pi is equal to the
number of occurrences of xi in the R R sample, and0 approximately zeroP elsewhere. Thus, for some
0 0 m
σ , where |σ | → 0 as σ → 0, X×X c(x)c(x )K(x, x ) dx dx ≤ i,j=1 ci cj K(xi , xj ) + σ ,
which is less than 0 for sufficiently small σ.

6.22 Anomaly detection

(a) The Lagrangian for the optimization problem is:


m
X
L(c, r, α) = r2 + αi (kΦ(xi ) − ck2 − r2 ) .
i=1
Solutions Manual 495

By the KKT conditions, the optimal solution must satisfy


m
X Xm
∇r L = 2r − αi 2r = 0 ⇐⇒ αi = 1
i=1 i=1
m
X m
X m
X 
∇c L = αi (2c − 2Φ(xi )) = 0 ⇐⇒ c = αi Φ(xi )/ αi .
i=1 i=1 i=1
Combining the
Pmtwo statements above, we see that at the optimum the solution must
Pm
satisfy c = i=1 αi Φ(xi ). Finally, adding the explicit constraint i=1 αi = 1 to the
optimization problem and substituting the expression for c into the Lagrangian results in
the dual optimization problem.

(b) Since we know c = m


P Pm
i=1 αi Φ(xi ), α ≥ 0 and i=1 αi = 1, we have
Xm
kck = αi Φ(xi ) ≤ max kΦ(xi )k ≤ M

i
i=1
r = max kc − Φ(xi )k ≤ 2 max kΦ(xi )k ≤ 2M .
i i

(c) Note that the losses over D coincide with the losses in standard binary classification
problems in the case when all points in the source domain are labeled with the positive
class. Thus, we may directly apply theorem 5.9. We simply need to bound the empirical
Rademacher complexity of H:
h Xm i
mR(H)
b = E sup σi h(xi )
σ h∈H i=1
h m
X i
= E sup σi (r2 − kΦ(xi ) − ck2 )
σ r,c
i=1
h m
X i m
hX i
≤ E sup σi r2 + E kΦ(xi )k2
σ r σ
i=1 i=1
h m
X ih m
X i
+ E sup kck2 + E sup 2c · Φ(xi )) .
σ c σ c
i=1 i=1
We bound each of these four terms separately. First we have
hXm i hXm i Xm
kΦ(xi )k2 = E
 
E σi K(xi , xi ) = E σi K(xi , xi ) = 0 .
σ σ σ
i=1 i=1 i=1
Next we bound
h m
X i h m
X i
E sup σi kck2 = E C 2 σi 1Pm σi >0
σ c σ i=1
i=1 i=1
h Xm i
≤ C2 E | σi |
σ
i=1
v
u
u X m
≤ C 2 tE[| σi | 2 ] (Jensen’s ineq.)
σ
i=1
v
u X m
X
2t
σi2 ]
u
=C E[ σi σj +
σ
i6=j i=1
v
u m
u X √
= C 2 tE[ σi2 ] = C 2 m.
σ
i=1
496 Solutions Manual

The term containing r is bound similarly,


m
h X i √
sup σi r2 ≤ R2 m .
r
i=1
The final term is bounded as follows
h m
X i h m
X i
2 E sup σi c · Φ(xi ) ≤ E sup kckk σi Φ(xi )k (Cauchy-Schwarz ineq.)
σ c σ c
i=1 i=1
h Xm i p
=CE k σi Φ(xi )k ≤ C Tr[K] .
σ
i=1
Collecting terms and plugging into theorem 5.9 completes the generalization guarantee.

(d) The solution follows very similar steps as in part (a), but using the same modifications as
seen in the analysis of soft-margin SVMs.

Chapter 7

7.1 VC-dimension of the hypothesis set of AdaBoost.

7.2 Alternative objective functions.

(a) The direction eu taken by coordinate descent after T − 1 rounds is the argminu of:
m
dL(α + βeu ) X
= − yi hu (xi )Φ0 (−yi f (xi ))

β=0 i=1
m
X Φ0 (−yi f (xi ))
∝ − yi hu (xi ) Pm 0
i=1 i=1 Φ (−yi f (xi ))
m
X
∝ − yi hu (xi )DT −1 (i)
i=1
= −(1 − 2u ),
1 P Φ0 (−y f (x ))
i i
where DT −1 (i) = m m Φ0 (−y f (x )) . Thus, the base classifier hu selected at each round
i=1 i i
is the one with the minimal error rate over the training data.

(b) Only the logistic loss (Φ4 ) verifies the hypotheses (assuming log with base 2).

(c) The step size β is the solution of:


m
dL(α + βeu ) X
=− yi hu (xi )Φ04 (−yi f (xi ) − βyi hu (xi )) = 0
dβ i=1
m
X 1
⇔ − yi hu (xi ) = 0.
i=1
1 + exp(y i f (x i ) + βyi hu (xi ))
Using β as the step function, the algorithm differs from AdaBoost, only by the choice of
the distribution update at each boosting round, here:
1 1
Dt+1 (i) = .
Zt 1 + exp(yi f (xi ))

7.3 Update guarantee


By the weak learning assumption, there exists a hypothesis h ∈ H whose Dt+1 -errorp
is less than
half. Examine the empirical error of ht for the distribution Dt+1 . Since Zt = 2 t (1 − t )
Solutions Manual 497

1 1−t
and αt = 2
log t
,
m
X Dt (i)e−αt yi ht (xi )
R
bD
t+1
(ht ) = 1yi ht (xi )<0
i=1
Zt
m
X Dt (i)eαt
=
Zt
yi ht (xi )<0
m
eαt X
= Dt (i)
Zt
yi ht (xi )<0
q
1−t
t 1
= p t = .
2 t (1 − t ) 2
This shows that ht cannot be selected at round t + 1.

7.4 Weighted instances


For all α, F (α) = m −yi T
−yi f (xi ) =
Pm P
t=1 αt ht (xi ) , with wi ≥ 0. F is convex
P
i=1 wi e i=1 wi e
as a non-negative linear combination of convex functions and is clearly differentiable. Applying
coordinate descent to this function leads to the same algorithm as AdaBoost with the only
difference that wi
D1 (i) ← Pm . (E.106)
i=1 wi

7.5 By definition the unnormalized correlation is given by


m m
X X Dt (i)e−αt yi ht (xi ) yi ht (xi ) 1 dZt
Dt+1 (i)yi ht (xi ) = = , (E.107)
i=1 i=1
Z t Zt dαt
Pm dZt
since Zt = i=1 Dt (i)e−αt yi ht (xi ) . Recall that αt minimizes Zt , thus dα = 0. This shows
t
that the distribution at round t + 1 and the vector of margins at round t are uncorrelated.

7.6 It is not hard to see that the base hypotheses in this problem can be defined to be thresh-
old functions based on the first or second axis, or constant functions (horizontal or vertical
thresholds outside the convex hull of all the points).
The hypotheses selected by AdaBoost are therefore chosen from this set. It can be shown that
the hypotheses selected in two consecutive rounds of AdaBoost are distinct (see exercise 7.3).
Furthermore, ht and −ht cannot be selected in consecutive rounds, since misclassified and
correctly classified points by ht are assigned the same distribution mass. Thus, at each round
a distinct hypothesis is chosen. The points at coordinate (−1, −1) are misclassified by all
these base hypotheses.
The algorithm should be stopped when the best t found is 1/2. It can be shown, then, that
the error of the final classifier returned on the training set is 14 (1 − ), since it misclassifies
exactly the points at at coordinate (+1, −1).

7.7 Noise-tolerant AdaBoost.

(a) Let G1 (x) = ex and G2 (x) = x + 1. G1 and G2 are continuously differentiable over R and
G01 (0) = G02 (0). Thus, G is differentiable over R. Note that G0 ≥ 0.
Both G1 and G2 are convex, thus
G(y) − G(x) ≥ G0 (x)(y − x) (E.108)
for x, y ≤ 0 or x, y ≥ 0. Assume now that y ≤ 0 and x ≥ 0, then
G(y) − G(x) = ey − (x + 1) ≥ (y + 1) − (x + 1) = G0 (x)(y − x), (E.109)
since G0 (x) = 1. Thus G is convex.
498 Solutions Manual

(b) The direction eu taken by coordinate descent after T − 1 rounds is the argminu of:
m
dG(α + βeu ) X
= − yi hu (xi )G0 (−yi f (xi )) (E.110)

β=0 i=1
m
X G0 (−yi f (xi ))
(since G0 ≥ 0) ∝ − yi hu (xi ) Pm 0
(E.111)
i=1 i=1 G (−yi f (xi ))
m
X
∝ − yi hu (xi )DT −1 (i) (E.112)
i=1
= −(1 − 2u ), (E.113)
1 P G0 (−y f (x ))
i i
with DT −1 (i) = m m G0 (−y f (x )) . Thus, the base classifier hu selected at each round
i=1 i i
is the one with the minimal error rate over the training data.
The step size β is the solution of:
m
dF (α + βeu ) X
=− yi hu (xi )G0 (−yi f (xi ) − βyi hu (xi )) = 0, (E.114)
dβ i=1
which can be solved numerically. A closed-form solution can be given under certain con-
ditions, e.g., if
β ≤ ρ = min |f (xi )|. (E.115)
i∈[m]

(c) By definition of the objective function, this algorithm is less aggressively reducing the
empirical error rate than AdaBoost.

7.8 Simplified AdaBoost

(a) As in the standard case, we can show that


T
Y
R(h)
b ≤ Zt , (E.116)
t=1
and that
Zt = (1 − t )e−α + t eα . (E.117)
By definition of γ and the fact that eα − e−α > 0 for all α > 0,
Zt = t (eα − e−α ) + e−α (E.118)
≤ (1 − γ)(eα − e−α ) + e−α (E.119)
1 1
= ( − γ)eα + ( + γ)e−α = u(α). (E.120)
2 2
u(α) is minimized for
1 1
( − γ)eα = ( + γ)e−α , (E.121)
2 2
that is, for
1
1 2

α= log 1
. (E.122)
2 2
−γ
Tighter bounds on the product of the Zt s can lead to better values for α.

(b) As in the standard case, at round t, the probability mass assigned to correctly classified
points is p+ = (1 − t )e−α and the probability mass assigned to the misclassified points
is p− = t eα . Thus,
1 1 1
p− t 2
+γ 2
−γ 2

= ≤ = 1. (E.123)
p+ 1 − t 12 − γ 1
2
+γ 1
2
−γ
This contrasts with AdaBoost’s property.
Solutions Manual 499

(c)
1 1
Zt ≤ − γ)eα + ( + γ)e−α
( (E.124)
2 v 2 v
u1 u 1
1 u
2
+ γ 1 u
2
−γ
= ( − γ)t 1 + ( + γ)t 1
(E.125)
2 2
−γ 2 2

r
1 1
= 2 ( + γ)( − γ). (E.126)
2 2
Thus, the empirical error can be bounded as follows:
T
Y
R
bS (h) ≤ Zt (E.127)
t=1
r
1 1
≤ [2 (+ γ)( − γ)]T (E.128)
2 2
= (1 − 4γ 2 )T /2 (E.129)
−2γ 2 T
≤ e . (E.130)
bS (h) = 1 Pm 1y f (x )≤0 ≤ 1 , then clearly R
(d) If R bS (h) = 0. Using the bound obtained
m i=1 i i m
2 1
in the previous question, if e−2γ T < m , the empirical error is zero. This can be rewritten
as
log m
T > . (E.131)
2γ 2
(e) Using the bound for the consistent case,
m 2em d − m
P[R(h) > ] ≤ 2ΠC (2m)2− 2 ≤ 2(
) 2 2 . (E.132)
d
Setting the right-hand side to δ, with probability at least 1 − δ, the following bound holds
for that consistent hypothesis:
2 2em 2
errorD (H) ≤ d log2 + log2 , (E.133)
j m k d δ
log m
with d = 2(s + 1)T log2 (eT ) and T = 2γ 2 + 1.
q
The bound is vacuous for γ(m) = O( log m
m
). This could suggest overfitting.

7.9 AdaBoost example.

(a) At t = 1 we have:
√ √ √ √ √ √ 
5−1 3 − 5 3 5 − 1 3 5 − 1 3 5 − 1 1 11 − 3 5
d>1 M= , 0, , , , , , ,
2 2 12 12 12 2 12
so we pick weak classifier 1. Now, the distribution at round two is:
 √ √ √ √ √ >
1 1 5−1 5−1 5−1 3− 5 3− 5
d2 = , , , , , , ,0 ,
4 4 12 12 12 8 8
and the edges at round 2 are:
 √ √ √ √ √ √ √ 
3− 5 5−1 4− 5 4− 5 4− 5 5−1 5+ 5
d>2 M = 0, , , , , , , ,
2 2 6 6 6 4 12
so we pick weak classifier 3. Continuing this process, we then pick weak classifier 2 in
round 3. However, now we observe that d4 = d1 ; hence, we have found a cycle, in which
we repeatedly select classifiers 1, 3, 2, 1, 3, 2, . . .

(b) rt = 5−1 2
, t ∈ 1, 2, 3. Thus, the coefficients used to combine classifiers in our example
are: [ 13 , 31 , 31 , 0, 0, 0, 0, 0] and the margin equals the minimum value in the following vector:
M[ 31 , 13 , 31 , 0, 0, 0, 0, 0]> , which is 13 .
500 Solutions Manual

1
(c) 16
M[2, 3, 4, 1, 2, 2, 1, 1]> = 83 for all training points. This margin is greater than the one
generated by AdaBoost. Therefore AdaBoost does NOT always maximize the L1 norm
margin.
7.10 Boosting in the presence of unknown labels
(a) Say a ‘boosting-style algorithm’ is just AdaBoost with a possibly different step size αt .
Recall
P these definitions from the description of AdaBoost: The Pfinal hypothesis is f (x) =
t αt ht (x) and the normalization constant in round t is Zt = i Dt (i) exp(−αt yi ht (xi )).
We proved in the chapter that
1 X 1 X Y
1yi f (xi )<0 ≤ exp(−yi f (xi )) = Zt
m i m i t
and that AdaBoost’s step size can be derived by minimizing this objective in each round
t. Taking that same approach, observe that
Dt (i) exp(−αt yi ht (xi )) = 0t + −
X
+
Zt = t exp(αt ) + t exp(−αt ).
i
Differentiating the right-hand side with
 respect
 to αt and setting equal to zero shows that
1 +
t
Zt is minimized by letting αt = 2 log − .
t

+
t −t
(b) One possible assumption is q ≥ γ > 0. Informally, this assumption says that the
1−0t
difference between the accuracy and error of each weak hypothesis is non-negligible relative
to the fraction of examples on which the hypothesis makes any prediction at all. In part
(d) we will prove that this assumption suffices to drive the training error to zero.
(c) 1. Given: Training examples ((x1 , y1 ), . . . , (xm , ym )).
2. Initialize D1 to the uniform distribution on training examples.
3. for t = 1, . . . , T :
a. ht ← base classifier
  in H.
1 +
b. αt ← log t
.
2 −
t
Dt (i) exp(−αt yi ht (xi )) P
c. For each i = 1, . . . , m: Dt+1 (i) ← Zt
, where Zt ← i Dt (i) exp(−αt yi ht (xi ))
is the normalization constant.
4. f ← T
P
t=1 αt ht .
5. Return: sgn(f ).
(d) Plug in the qvalue of αt from part (a) into Zt = 0t + − +
t exp(αt ) + t exp(−αt ) to obtain
− +
Zt = 0t + 2 t t . Therefore
1 X Y q 
0t + 2 −
Y
+
1yi f (xi )<0 ≤ Zt = 
t t .
m i t t
Moreover, if the weak learning assumption from part (b) is satisfied then
q q
0t + 2 − + 0
t t = t + (1 − 0t )2 − (+ − 2
t − t )
s
(+ − −t )
2
= 0t + (1 − 0t ) 1 − t 0 2
(1 − t )
s
+ − 2
( − t )
≤ 1− t
1 − 0t
p
≤ 1 − γ2.
The first equality follows from (+ t + − 2 + − 2 + −
t ) − (t − t ) = 4t t (just multiply and gather

terms) and +t + t = 1 − 0 . The first inequality follows from the fact that square root is
t
Solutions Manual 501

√ √ p
concave on [0, ∞), and thus λ x + (1 − λ) y ≤ λx + (1 − λ)y for λ ∈ [0, 1]. The last
inequality follows from the weak learning assumption.
p T  2

Therefore we have m 1 P
i 1yi f (xi )<0 ≤ 1 − γ2 ≤ exp − γ 2T , where we used 1+x ≤
exp(x).

7.11 HingeBoost

(a) Since the hinge loss is convex, its composition with affine function of α is also convex and
F is convex as as sum of convex functions.
For the existence of one-sided directional derivatives, one can use the fact that any convex
function has one-sided directional derivatives or alternatively, that our specific function is
the sum of piecewise affine functions, which are also known to have one-sided directional
derivatives (think of one-dimensional hinge loss).
(b) Distinguishing different cases depending on the value of yi ft−1 (xi ) = 1, it is straightfor-
ward to derive the following expressions for all j ∈ [N ]:
m
X
0
F+ (αt−1 , ej ) = −yi hj (xi )[1yi ft−1 (xi )<1 + 1(yi hj (xi )<0)∧(yi ft−1 (xi )=1) ]
i=1
m
X
0
F− (αt−1 , ej ) = −yi hj (xi )[1yi ft−1 (xi )<1 + 1(yi hj (xi )>0)∧(yi ft−1 (xi )=1) ].
i=1
The key here is that when yi ft−1 (xi ) 6= 1, each term in the sum will be either 0 or the
affine function independent of yi hj (xi ). On the other hand, when yi ft−1 (xi ) = 1, the
sign of yi hj (xi ) determines whether the finite differences will extend into the 0 portion of
the affine portion of the term.

(c)

HingeBoost(S = ((x1 , y1 ), . . . , (xm , ym )))


1 f ←0
2 for j ← 1 to N do
r← m
P
3 i=1 −yi hj (xi )[1yi f (xi )<1 + 1(yi hj (xi )<0)∧(yi f (xi )=1) ]
Pm
4 l ← i=1 −yi hj (xi )[1yi f (xi )<1 + 1(yi hj (xi )>0)∧(yi f (xi )=1) ]
5 if (l ≤ 0) ∧ (r ≥ 0) then
6 d[j] ← 0
7 elseif (l ≤ r) then
8 d[j] ← r
9 else d[j] ← l
10 for t ← 1 to T do
11 k ← argmin |d[j]|
j∈[N ]
12 η ← argminη≥0 G(f + ηhk ) . line search
13 f ← f + ηhk
14 return f

7.12 Empirical margin loss boosting


502 Solutions Manual

(a) The following shows how R


bS,ρ (f ) can be upper-bounded:
m m
bS,ρ (f ) = 1 1 X
X
R 1yi f (xi )≤ρ = 1 PT PT
m i=1 m i=1 yi t=1 αt ht (xi )−ρ t=1 αt ≤0
m T T
!
1 X X X
≤ exp −yi αt ht (xi ) + ρ αt .
m i=1 t=1 t=1

(b) Since exp is convex and that composition with an affine function (of α) preserves convexity,
each term of the objective is a convex function and Gρ is convex as a sum of convex
functions. The differentiability follows directly that of exp.

(c) By definition of Gρ , for any direction et we can write


m t−1 t−1
1 X  X X 
Gρ (αt−1 + ηet ) = exp − yi αs hs (xi ) + ρ αs e−yi ηht (xi )+ρη .
m i=1 s=1 s=1
1
Let D1 be the uniform distribution, that is D1 (i) = m for all i ∈ [m] and for any
t ∈ {2, . . . , T }, define Dt by
Dt−1 (i) exp(−yi αt−1 ht−1 (xi ))
Dt (i) = ,
Zt−1  
exp −yi t−1
P
Pm s=1 αs hs (xi )
with Zt−1 = i=1 Dt−1 (i) exp(−yi αt−1 ht−1 (xi )). Observe that Dt (i) = Qt−1 .
m s=1 Zt
Thus,

dGρ (αt−1 + ηet )

0
m t−1 t−1
1 X  X X 
= [−yi ht (xi ) + ρ] exp − yi αs hs (xi ) + ρ αs
m i=1 s=1 s=1
m t−1
X Y  Pt−1
= [−yi ht (xi ) + ρ]Dt (i) Zt eρ s=1 αs
i=1 s=1
m
X  t−1
Y  Pt−1
= −yi ht (xi )Dt (i) + ρ Zt eρ s=1 αs
i=1 s=1
t−1
Y  Pt−1
= (−(1 − t ) + t + ρ) Zt eρ s=1 αs
s=1
t−1
Y Pt−1
Zt eρ αs

= (2t − 1 + ρ) s=1 ,
Pm s=1
where t = i=1 Dt (i)1yi ht (xi )>0 . The direction selected is the one minimizing (2t −
1 + ρ) that is t . Thus, the algorithm selects at each round the base classifier with the
smallest weighted error, as in the case of AdaBoost.
Solutions Manual 503

To determine the step η selected at round t, we solve the following equation


dGρ (αt−1 + ηet )
=0

Xm
⇔ Dt (i)Zt e−yi ηht (xi )+ρη [−yi ht (xi ) + ρ] = 0
i=1
X X
⇔ Dt (i)Zt eη(−1+ρ) (−1 + ρ) + Dt (i)Zt eη(1+ρ) (1 + ρ) = 0
yi ht (xi )>0 yi ht (xi )<0

⇔ (1 − t )eη(−1+ρ) (−1 + ρ) + t eη(1+ρ) (1 + ρ) = 0


⇔ (1 − t )e−η (−1 + ρ) + t eη (1 + ρ) = 0
(1 − ρ) (1 − t )
⇔ e2η =
(1 + ρ) t
1 (1 − t ) 1 (1 + ρ)
⇔ η = log − log .
2 t 2 (1 − ρ)
Thus, the algorithm differs from AdaBoost only by the choice of the step, which it chooses
more conservatively: the step size is smaller than that of AdaBoost by the additive con-
(1+ρ)
stant 12 log (1−ρ) .

(d) For the coordinate descent algorithm to make progress at each round, the step size selected
along the descent direction must be non-negative, that is
(1 − ρ) (1 − t )
> 1 ⇔ (1 − ρ)(1 − t ) > ρt + t
(1 + ρ) t
1−ρ
⇔ t < .
2
Thus, the error of the base classifier chosen must be at least ρ/2 better than one half.

(e) The normalization factor Zt can be expressed in terms of t and ρ using its definition:
Xm
Zt = Dt (i) exp(−yi αt ht (xi ))
i=1
= e−αt (1 − t ) + eαt t
s s
1+ρ 1−ρ
= (1 − t )t + (1 − t )t
1−ρ 1+ρ
"s s #
p 1+ρ 1−ρ
= t (1 − t ) +
1−ρ 1+ρ
" #
p 2
= t (1 − t ) p
1 − ρ2
s
t (1 − t )
=2 .
1 − ρ2
504 Solutions Manual

Aρ (S = ((x1 , y1 ), . . . , (xm , ym )))


1 for i ← 1 to m do
1
2 D1 (i) ← m
3 for t ← 1 to T do
4 ht ← base classifier in H with small error t = Pi∼Dt [ht (xi ) 6= yi ]
(1− ) (1+ρ)
5 αt ← 12 log  t − 12 log (1−ρ)
 t (1−tt)  1
6 Zt ← 2 1−ρ2 2 . normalization factor
7 for i ← 1 to m do
D (i) exp(−αt yi ht (xi ))
8 Dt+1 (i) ← t Zt
PT
9 f ← t=1 αt ht
10 return h = sgn(f )

Note that A0 coincides exactly with AdaBoost.

i. In view of the definition of Dt and the bound derived in the first part of this exercise,
we can write
m T T
!
bS,ρ (f ) ≤ 1
X X X
R exp −yi αt ht (xi ) + ρ αt
m i=1 t=1 t=1
m T T
! !
1 X Y X
= m Zt Dt (i) exp ρ αt
m i=1 t=1 t=1
T
! T !
X Y
= exp ρ αt Zt .
t=1 t=1

ii. The expression of Zt was already given above. Plugging in that expression in the
bound of the previous question and using the expression of αt gives
T p
!
Y Y 1 1
bS,ρ (f ) ≤ (
R eαt )ρ t (1 − t )(u 2 + u− 2 )
t t=1
s !ρT s !ρ T p
!
1−ρ Y 1 − t Y 1 −1
= t (1 − t )(u + u
2 2 )
1+ρ t
t t=1
T q
 1+ρ 1−ρ T Y
= u 2 + u− 2 1−ρ
t (1 − t )1+ρ .
t=1
Solutions Manual 505

iii. Using the hint and the result of the previous question, we can write
  2 
1−ρ
Y 2
− t
bS,ρ (f ) ≤
R 1 − 2

1 − ρ2 
t
   2 
1−ρ
Y 2
− t
≤ exp −2
 
1 − ρ2

t
  2 
1−ρ
2
− t
= exp −2 T
 
1 − ρ2

2γ 2 T
 
≤ exp − 2
.
1−ρ
Thus, if the upper bound is less that 1/m, then R
bρ (f ) = 0 and every training point
2
 
has margin at least ρ. The inequality exp − 2γ 1−ρ2
T
< 1/m is equivalent to T >
(log m)(1−ρ2 )
2γ 2
.

Chapter 8

8.1 Perceptron lower bound

Let w be the weight vector. Since each update is of the form w ← w + yi xi and since the
components of the sample points are integers, the components of w are also integers.

Let n1 , . . . , nN ∈ Z denote the components of w. w correctly classifies all points iff yi (w ·xi ) >
0 for i = 1, . . . , m, that is,

 

 n1 > 0 
 n1 > 0
 n1 − n2 < 0
 
 n2 > n1

 

−n1 − n2 + n3 > 0 ⇔ n3 > n1 + n2
 
. . . ...

 


 

(−1)N (n1 + n2 + . . . + nN −1 − nN ) < 0
 
nN > n1 + n2 + . . . + nN −1 .
These last inequalities show that the data is linearly separable with w = (1, 2, . . . , 2N −1 ).
They also imply that n1 ≥ 1, n2 ≥ 2, n3 ≥ 4, . . . , nN ≥ 2N −1 . Since each update can at most
increment nN by 1, the number of updates is at least 2N −1 = Ω(2N ).

8.2 Generalized mistake bound


506 Solutions Manual

The bound is unaffected, as shown by the following, using the same definitions and steps as
in this chapter:
P
v · t∈I yt xt
Mρ ≤
kvk
P
v · t∈I (wt+1 − wt )/η
= (definition of updates)
kvk
v · wT +1
=
ηkvk
≤ kwT +1 k/η (Cauchy-Schwarz ineq.)
= kwtm + ηytm xtm k/η (tm largest t in I)
h i1/2
2 2 2
= kwtm k + η kxtm k + ηytm wtm · xtm ] /η
h i1/2 | {z }
2 2 2 ≤0
≤ kwtm k + η R ] /η
h i1/2 √
≤ M η 2 R2 ] /η = M R. (applying the same to previous ts in I).

8.3 Sparse instances


PT T be a vector of norm
Clearly, it takes T updates and leads to w = t=1 yt xt . Let u ∈ R
1 defining a separating hyperplane, thus yt u · xt = yt ut ≥ 0 for all t ∈ [T ]. To obtain the
maximum margin ρ, we seek a vector u maximizing the minimum of √ yt ut with yt ut ≥ 0 for
√ kuk = 1. By symmetry, all yt ut s2 are
all t and equal, thus ut = yt / T for all t ∈ [T ] and
ρ = 1/ T . Thus, Novikoff’s bound gives R /ρ2 = 1/(1/T ) = T .

8.4 Tightness of lower bound


The lower bound is tight. This follows from the tightness of the Khintchine-Kahane inequality,
which is the only inequality used in the proof.

8.5 On-line SVM algorithm


First we write the optimization in the following equivalent form:
m
1 X
min kwk2 + C

max 0, 1 − yi (w · xi ) .
w 2
i=1
Then, using the general update rule in equation (8.22), we get the update rule,

wt − η(wt − Cyt xt ) if yt (wt · xt ) < 1,

wt+1 ← wt − ηwt if yt (wt · xt ) > 1,

wt otherwise,

which corresponds exactly to the update in the pseudocode.

8.6 Margin Perceptron

y (v·x )
(a) By assumption, there exists v ∈ RN such that for all t ∈ [T ], ρ ≤ t kvk t , where ρ is the
maximum margin achievable on S. Summing up these inequalities gives
P
v · t∈I yt xt X
Mρ ≤ ≤ yt x t (Cauchy-Schwarz inequality)

kvk t∈I
X
= (wt+1 − wt ) (definition of updates)

t∈I
= kwT +1 k (telescoping sum, w0 = 0).
Solutions Manual 507

(b) For any t ∈ I, by definition of the update, wt+1 = wt + yt xt ; thus,


kwt+1 k2 = kwt k2 + kxt k2 + 2yt wt · xt
≤ kwt k2 + kxt k2 + kwt kρ (def. of update condition)
≤ kwt k2 + R2 + kwt kρ + ρ2 /4
= (kwt k + ρ/2)2 + R2 .
(c) In view of the previous result, kwt+1 k2 − (kwt k + ρ/2)2 ≤ R2 , that is
(kwt+1 k − kwt k − ρ/2)(kwt+1 k + kwt k + ρ/2) ≤ R2
R2
=⇒ (kwt+1 k − kwt k − ρ/2) ≤
kwt+1 k + kwt k + ρ/2
R2
=⇒ kwt+1 k ≤ kwt k + ρ/2 + .
kwt+1 k + kwt k + ρ/2
4R2 4R2 4R2
(d) If kwt k ≥ ρ
or kwt+1 k ≥ ρ
, then kwt+1 k + kwt k + ρ/2 ≥ ρ
, thus
R2 ρ R2
≤ = .
kwt+1 k + kwt k + ρ/2 4R2 /ρ 4
In view of this, the inequality of the previous question implies
R2
kwt+1 k ≤ kwt k + ρ/2 +
kwt+1 k + kwt k + ρ/2
ρ 3
=⇒ kwt+1 k ≤ kwt k + ρ/2 + = kwt k + ρ.
4 4
(e) Since w1 = y1 x1 , kw1 k = kx1 k ≤ R. The margin ρ is at most twice the radius R, thus,
ρ ≤ 2R and 2R/ρ ≥ 1. This implies that kw1 k ≤ R ≤ 2R2 /ρ. Since kw1 k ≤ 2R2 /ρ and
2 2
kwT +1 k ≥ 4R
ρ
, there must exist at least one update time t ∈ I at which kwt k ≤ 4R
ρ
and
4R2
kwt+1 k ≥ ρ
. The set of such times t is non empty and thus admits a largest element
t0 .
4R2
(f) By definition of t0 , for any t ≥ t0 , kwt+1 k ≥ ρ
. Thus, by the inequality of part (d),
the following holds for any t ≥ t0 ,
3
kwt+1 k ≤ kwt k + ρ.
4
This implies that
3
kwT +1 k ≤ kwt0 k + {t0 , . . . , T } ∩ I ρ
4
3
≤ kwt0 k + M ρ
4
4R2 3
≤ + M ρ.
ρ 4
By the first question M ρ ≤ kwT +1 k; therefore,
4R2 3
Mρ ≤ + M ρ ⇐⇒ M ρ/4 ≤ 4R2 /ρ ⇐⇒ M ≤ 16R2 /ρ2 .
ρ 4
8.7 Generalized RWM

(a) Observe that:


N
X
Wt+1 = (1 − (1 − β)lt,i )wt,i
i=1
= Wt − Wt (1 − β)lt,i )wt,i /Wt = Wt (1 − (1 − β)Lt ).
508 Solutions Manual

QT
Thus, WT +1 = N t=1 (1 − (1 − β)Lt ) and
T
X
log WT +1 = log N + log(1 − (1 − β)Lt )
t=1
T
X
≤ log N + −(1 − β)Lt
t=1
= log N − (1 − β)LT .
(b) For all i ∈ [N ],
log WT +1 ≥ log wT +1,i
T
Y 
= log (1 − (1 − β)lt,i
t=1
T
X
= log(1 − (1 − β)lt,i
t=1
T
X
≥ −(1 − β)lt,i − (1 − β)2 lt,i
2

t=1
T
X
= −(1 − β)LT,i − (1 − β)2 2
lt,i .
t=1
(c) Comparing the lower and upper bounds gives:
T
X
− (1 − β)LT,i − (1 − β)2 2
lt,i ≤ log N − (1 − β)LT
t=1
T
log N X
2
=⇒ LT ≤ LT,i + + (1 − β) lt,i .
(1 − β) t=1
PT 2
Clearly, for any i ∈ [N ], t=1 lt,i ≤ T . Thus, for all i ∈ [N ],
log N
LT ≤ LT,i + + (1 − β)T,
(1 − β)
log N
in particular, LT ≤ Lmin
T + (1−β) + (1 − β)T . Differentiating with respect to β and setting
log N
the result to zero gives (1−β) 2 − T = 0, as in the case of the RWM algorithm. Thus, for
p √ √
β = max{1/2, 1 − (log N )/T }, LT ≤ Lmin T + 2 T log N , that is RT ≤ 2 T log N .

8.9 General inequality


2
(a) We analyze the function f (x) = log(1 − x) + x + xα and show that it is positive for
x ∈ [0, 1 − α
2
]. First note that f (0) = 0, then note that
2
f 0 (x) ≥ 0 ⇐⇒ −1 + 1 − x x(1 − x) ≥ 0
α
2
⇐⇒ x(1 − x) ≥ x
α
2
⇐⇒ (1 − x) ≥ 1 (for x ≥ 0)
α
α
⇐⇒ x ≤ 1 − .
2
Thus, the derivative is only increasing for x ∈ [1, 1 − α
2
], which implies that the function
is positive for the same interval.
2
In order to apply the inequality, inequality the valid range of β is [ α , 1).
Solutions Manual 509

(b) The bound follows directly using the same steps as in the poriginal proof, but with the
general inequality. The optimal choice of β is max{ α2
, 1 − α(log N )/T }, which gives
r
log(N )T p
RT ≤ + α log(N )T .
α
(c) Setting α close to 2 forces β close to 1, which results in an algorithm that downweights
experts in a very conservative fashion. From the bound in part (b) we see that α = 1, as
is used in the chapter, is the optimal choice.

8.10 On-line to batch — non-convex loss

(a) We use the following series of inequalities:

min (R(hi ) + 2cδ (T − i + 1))


i∈[T ]
T
1 X
≤ (R(hi ) + 2cδ (T − i + 1))
T i=1
T −1
s
T
1 X 2 X 1 T (T + 1)
= R(hi−1 ) + log
T i=1 T i=0 2(T − i) δ
−1
s
T T
1 X 2 X 1  (T + 1) 2
< R(hi−1 ) + log
T i=1 T i=0 2(T − i) δ
−1
s
T T
1 X 2 X 1 (T + 1)
= R(hi−1 ) + log
T i=1 T i=0 (T − i) δ
T r
1 X 1 T +1
≤ R(hi−1 ) + 4 log .
T i=1 T δ
The first inequality follows, since the P
minimump is always lessP thanpor equal√ to the average
T −1
and the final inequality follows from i=0 1/(T − i) = T i=1 1/i ≤ 2 T .
(b) Coupling P
the inequality of part (a) with the high probability statement of lemma 8.14 to
T
bound T1 i=1 R(hi ) shows the desired bound.
q
2(T +1)
(c) The square-root terms in part (b) can be bounded further by 6 T1 log δ
.
Now, note that for two events A and B that each occur with probability at least 1 − δ,
P[¬A ∪ ¬B] ≤ P[¬A] + P[¬B] ≤ 2δ
⇐⇒ P[A ∧ B] ≥ 1 − 2δ .
Thus, the probability that both bounds in (b) and (c) hold simultaneously is at least
1 − 2δ; substituting δ with δ/2 everywhere completes the bound.

8.11 On-line to batch — kernel Perceptron margin bound

(a) Let u ∈ H with kuk = 1. Observe that for any t ∈ [T ], we can write
 
yt (u · Φ(xt )) yt (u · Φ(xt ))
1− 1− ≤ .
ρ + ρ
Let I be the set of t ∈ [T ] at which kernel Perceptron makes an update, and let MT be the
total number of updates made, then, summing up the previous inequalities over all such
ts and using the Cauchy-Schwarz inequality yields
P
X k t∈I yt Φ(xt )k
 X
yt (u · Φ(xt )) yt (u · Φ(xt ))
MT − 1− ≤ ≤ .
t∈I
ρ t∈I
ρ ρ
510 Solutions Manual

In view of the proof for separable case


√P of the perceptron algorithm (theorem 8.8), the
t∈I K(xt ,xt )
right-hand side can be bounded by ρ
. Thus, for any u ∈ H with kuk = 1,
qP
t∈I K(xt , xt )
 
X yt (u · Φ(xt ))
MT ≤ 1− + .
t∈I
ρ + ρ
(b) Plugging in the result from part (a) into (8.31) gives the desired bound. This bound
is with respect to expected error of best in class, while corollary 6.13 is with respect to
empirical error of selected hypothesis.

Chapter 9

9.5 Decision trees. A binary decision tree with n nodes has exactly n + 1 leaves. Each node can be
labeled with an integer from {1, . . . , N } indicating which dimension is queried to make a binary
split and each leaf can be labeled with ±1 to indicate the classification made at that leaf. Fix
an ordering of the nodes and leaves and consider all possible labelings of this sequence. There
can be no more than (N + 2)2n+1 distinct binary trees and, thus, the VC-dimension of this
finite set of hypotheses can be no larger than (2n + 1) log(N + 2) = O(n log N ).

Chapter 11

11.1 Pseudo-dimension and monotonic functions


If for some m > 0, there exists (t1 , . . . , tm ) and a set of points (x1 , . . . , xm ) that H shatters,
then φ ◦ H can also shatter it. To see that, note that if for some h ∈ H,
h(xi ) ≥ ti ,
then by the monotonic property of φ,
φ(h(xi )) ≥ φ(ti ) .
A similar argument holds for the case h(xi ) < ti . Thus, φ ◦ H can shatter the set of points
(x1 , . . . , xm ) with thresholds (φ(t1 ), . . . , φ(tm )), and this proves that Pdim(φ◦H) ≥ Pdim(H).
Since φ is strictly monotonic, it is invertible, and a similar argument with φ−1 can be used to
show Pdim(H) ≥ Pdim(φ ◦ H).
11.2 Pseudo-dimension of linear functions
From equation (11.3) we have that
 
Pdim(H) = VCdim (x, t) 7→ 1(w> x−t)>0 : h ∈ H .
1+sgn(w> x−t)
Note that 1(w> x−t)>0 = 2
is a linear separator with fixed offset. It is easy to
show that the VC-dimension of such a hypothesis class is d (as oppose to d + 1 in the case of
linear separators with a free offset parameter).
11.3 Linear regression
(a) In order for the matrix XX> to be invertible, we need the number of (linearly independent)
examples to outnumber the number of features used to represent each example.
(b) For any v ∈ Rm we can choose w = (XX> )† Xy + (I − (X† )> X> )v. To see this, observe
that
X> (I − (X† )> X> ) = X> − X> (X† )> X>
= X> − (XX† X)>
= X> − X> = 0 ,
Solutions Manual 511

and hence, we have X> w = X† Xy.

11.4 Perturbed kernels

(a) Using the closed form solutions α = (K+λI)−1 y and the fact M0−1 −M−1 = −M0−1 (M0 −
M)M−1 (this can be verified by simply expanding the right-hand side), we have
α0 − α = (K0 + λI)−1 − (K + λI)−1 y


= (K0 + λI)−1 (K0 + λI − K − λI)(K + λI)−1 y




= (K0 + λI)−1 (K0 − K)(K + λI)−1 y .




(b) Using the fact that for any vector v and matrix A, kAvk2 = v> A> Av ≤ kvk2 kA> Ak =
kvk2 kAk2 we have
kα0 − αk ≤ k(K0 + λI)−1 kkK0 − Kkk(K + λI)−1 kkyk .

Since |y| ≤ M , we have kyk ≤ mM and we can use the observation k(K + λI)−1 k =
λmax ((K + λI)−1 ) = λmin (K + λI)−1 ≤ 1/λ, where λmax (A) and λmin (A) are the maxi-
mum and minimum eigenvalues of A, respectively.

11.5 Huber loss


The primal function can be written as:
m
1 X
min kwk2 + C (Lc (ξi ) + Lc (ξi0 ))
w,b 2 i=1
s.t. w · Φ(xi ) + b − yi ≤ ξi , ∀i ∈ [m]
yi − w · Φ(xi ) − b ≤ ξi0 , ∀i ∈ [m]
ξi , ξi0 ≥ 0, ∀i ∈ [m]
The Lagrangian is written as follows,
m
1 X
L(w, b, ξ, ξ0 , α, α0 , β, β 0 ) = kwk2 + C (Lc (ξi ) + Lc (ξi0 ))
2 i=1
m 
X 
+ αi (w · Φ(xi ) + b − yi − ξi ) + α0i (yi − w · Φ(xi ) − b − ξi0 )
i=1
m
X
− (βi ξi + βi0 ξi0 ) ,
i=1
and the associated KKT conditions are:
m
∂L X
=w+ (αi − α0i )Φ(xi ) = 0 ,
∂w i=1
m
∂L X
= (αi − α0i ) = 0 ,
∂b i=1
(
∂L Cξi = αi + βi , if ξi ≤ c
= 0 ⇐⇒ , (E.134)
∂ξi Cc = αi + βi , otherwise
(
∂L Cξi = α0i + βi0 , if ξi ≤ c
= 0 ⇐⇒ , (E.135)
0
∂ξi Cc = α0i + βi0 , otherwise
βi ξi = 0, ∀i ∈ [m] , (E.136)
βi0 ξi0 = 0, ∀i ∈ [m] . (E.137)
The first two conditions are the same as in SVR the standard -insensitive loss, and using
them to simplify the Lagrangian gives several familiar terms. The novel conditions involve
ξ and ξ 0 . Collecting all terms in the Lagrangian that depend on ξi , we have the following
512 Solutions Manual

equality (regardless of which condition holds in (E.134))


(αi + βi )2
CLc (ξi ) − ξi (αi + βi ) = − . (E.138)
2C
If it is the case that ξi > 0, then βi = 0 by condition (E.136). In the case ξi = 0, then
αi + βi = 0 by the first case in condition (E.134). However, this also implies αi = βi = 0,
since both dual variables are constrained to be positive. Thus, in either case, we can simplify
2
(E.138) to −α2C
. Similar arguments can be used to simplify the ξi0 terms, resulting in the final
dual problem:
1 0 1 1 
max y(α0 − α) − (α − α)> K(α0 − α) + α> 1 + α0> 1
α 2 C C
s.t. (α0 − α)> 1 = 0
0 ≤ αi , α0i ≤ cC, ∀i ∈ [m] .
11.8 Optimal kernel matrix

(a) Using the closed-form solution for the inner maximization problem α = (K + λI)−1 y,
simplifies the joint optimization to a simpler minimization:
min y> (K + λI)−1 y , s.t. kKk2 ≤ 1 .
K0
Note that for any invertible matrix A, y> A−1 y ≥ kyk2 λmin (A−1 ) = kyk2 λmax (A)−1 .
kyk2
Thus, it is easy to see that minK0 y> (K + λI)−1 y ≥ 1+λ since kKk2 = λmax (K) ≤ 1.
1 > achieves this lower bound. First, note that ( 1 yy> +
We now show K = kyk 2 yy kyk2
λI)y = (1 + λ)y, so y is an eigenvector of the matrix with eigenvalue (1 + λ). Since the
1 > + λI)−1
matrix is invertible, it can be shown that y is also an eigenvector of ( kyk 2 yy
1
with eigenvalue 1+λ
(for example, consider the eigen decomposition of the matrix).
(b) The kernel matrix alone is not useful for classifying future unseen points x, which requires
computing m
P
i=1 K(xi , x) and needs access to an underlying kernel function that in con-
sistent with the kernel matrix. Finding such a kernel function may be difficult in general,
and furthermore the choice of function may not be unique.

11.9 Leave-one-out error

(a) Note that the hypothesis hSi will make zero error on the ith point of Si0 and is defined as
the minimizer with respect to the remainder of the points. Thus, hSi is also the minimizer
with respect to the set Si0 .
(b) Using part (a) and the definition of the KRR hypothesis with respect to the dual variables
we have hSi (xi ) = hS 0 (xi ) = αS 0 Ki , where αS 0 is the optimal set of dual variable for
i i i
KRR trained with Si0 . Noting that the closed-form solution is α = (K + λI)−1 yi proves
the equality.
(c) Using part (b) we can write
hSi (xi ) − yi = yi> (K + λI)−1 Kei − yi
= (y − yi ei + hSi (xi )ei )> (K + λI)−1 Kei − yi
= hS (xi ) − yi + (hSi (xi ) − yi )e>i (K + λI)
−1
Kei ,
which implies hSi (xi ) − yi = (hS (xi ) − yi )/(e>
i (K + λI) −1 Ke ). Thus, we can write
i
m 
1 X hS (xi ) − yi 2
R
bLOO (A) =
> −1
.
m i=1 ei (K + λI) Kei

(d) In this case the two losses differ only by the factor γ12 . Thus, if γ = m, the two
performance measures coincide.
Solutions Manual 513

Chapter 14

14.1 Tighter stability bounds


(a) No, even as βr→ 0 the generalization bound of theorem 14.2 only guarantees R(hS ) −
bS (hS ) ≤ M log δ = O(1/√m).
1
R 2m
r
√ 1
log δ
(b) In this case, M = C/ m and M 2m
= O(1/m); thus, it would suffice to have β =
O(1/m3/2 ) in order to guarantee an O(1/m) generalization bound.
14.2 Quadratic hinge loss stability
We first show that the loss function is σ-admissible. Consider three cases:
• Both h(x) and h0 (x) are correct with margin greater than 1, then
|L(h(x), y) − L(h0 (x), y)| = 0.
• Only one hypothesis is correct with large enough margin. Without loss of generality assume
h0 (x) is correct, then

|L(h(x), y) − L(h0 (x), y)| = (1 − h(x)y)2


≤ ((1 − h(x)y) − (1 − h0 (x)y))2 = (h0 (x) − h(x))2

≤ 4 M |h(x) − h0 (x)|.
The first inequality follows from the assumption 1 − h0 (x)y ≤ 0, and the second
√ inequality
follows from the bounded loss assumption, which implies ∀h ∈ H, |h(x)| ≤ M + 1 ≤ 2M .
• Finally, we consider the case where both h(x) and h0 (x) incur a loss. Without loss of
generality assume (1 − h(x)y) ≥ (1 − h0 (x)y), then

|L(h(x), y) − L(h0 (x), y)| = (1 − h(x)y)2 − (1 − h0 (x)y)2


= (1 − h(x)y) + (1 − h0 (x)y) (1 − h(x)y) − (1 − h0 (x)y)
 

≤ |2 − y(h(x) + h0 (x))||y(h(x) − h0 (x))| ≤ 6 M |h(x) − h0 (x)| .

Thus, the quadratic hinge loss is σ-admissible with σ = 6 M . By proposition 14.4, SVM with
36r 2 M
quadratic hinge loss is stable with β = mλ , and using theorem 14.2 gives the following
bound: s
36r2 M  72r2 M  log 1
δ
R(hS ) ≤ RS (hS ) +
b + +M .
mλ λ m
14.3 Stability of linear regression
(a) Observing corollary 14.6, if λ → 0 then the bound becomes vacuous. No generalization is
guaranteed.
(b) Observe the following simple counter example in one dimension with |x| ≤ 1 and |y| ≤ 1.
Let S = ((0, 0), (1, 1)) define one training sample of (x, y) pairs, so the optimal linear
predictor is h(x) = x. Then we can define S 0 = ((0, 0), (α, 1)) for 0 < α ≤ 1 and the
1
optimal linear predictor is then h0 (x) = α x. Thus, the two predictors can vary by an
arbitrary amount as α → 0, which shows linear regression is not stable.
14.4 Kernel stability
First we note,
|h0 (x) − h(x)| = |(α0> − α> )kx |
≤ kα0> − α> kkkx k .
514 Solutions Manual


Using the bound on K, we have kkx k ≤ mr, and using the result from exercise 10.3 completes
the bound.
14.5 Stability of relative-entropy regularization
R R
(a) This result follows from the property | g(x) dx| ≤ |g(x)| dx for integrable function g
and the bound on L:
Z 
|H(g, z) − H(g 0 , z)| ≤ L(hθ (x), y) g(θ) − g 0 (θ) dθ

ΘZ Z

≤ L(hθ (x), y) g(θ) − g 0 (θ) dθ ≤ M |g(θ) − g 0 (θ)| dθ .

Θ Θ
(b) i. We can write
Z 2 1
Z 2 1
Z 2
|g(θ) − g 0 (θ)| dθ = |g(θ) − g 0 (θ)| dθ + |g 0 (θ) − g(θ)| dθ
Θ 2 Θ 2 Θ
≤ K(g, g 0 ) + K(g 0 , g) .
Thus, it suffices to show K(g, g 0 ) = BK(·,f0 ) (gkg 0 ) + Θ g(θ) − g 0 (θ) dθ in order to show
R
g 0 (θ)
the result. Note that [∇g0 K(g 0 , f0 )]θ = log f0 (θ)
+ 1, thus
BK(·,f0 ) (gkg 0 ) = K(g, f0 ) − K(g 0 , f0 ) − g − g 0 , ∇g0 K(g 0 , f0 )

g 0 (θ)
Z
= K(g, f0 ) − K(g 0 , f0 ) − g(θ) − g 0 (θ) log
 
+ 1 dθ
Θ f 0 (θ)
g 0 (θ) 
Z
g(θ) 0
= g(θ) log − − g(θ) + g (θ) dθ
Θ f0 (θ) f0 (θ)
Z
g(θ) 
= g(θ) log 0 − g(θ) + g 0 (θ) dθ
Θ g (θ)
Z
= K(g, g 0 ) + g 0 (θ) − g(θ) dθ .
Θ
ii. To show the first inequality, note that BFS = BR b S + λBK(·,f0 ) , and, since Bregman
divergences are positive, it implies BK(·,f0 ) ≤ BFS . Then, using the definition of g
and g 0 as minimizers,
1
BK(.,f0 ) (gkg 0 ) + BK(.,f0 ) (g 0 kg) ≤ BFS 0 (gkg 0 ) + BFS (g 0 kg)
λ
1b 0 0
= R S 0 (g) − RS 0 (g ) + RS (g ) − RS (g)
b b b
λ
1
H(g 0 , zm ) − H(g, zm ) + H(g, zm 0
) − H(g 0 , zm
0

= ) .

Where the last equality follows from the fact that only the mth point differs in S and
S 0 . Finally, using the M -Lipschitz property from part (a) completes the question.
iii. Combining part (i) and (ii), we have
Z 2 2M
Z
|g(θ) − g 0 (θ)| dθ ≤ |g(θ) − g 0 (θ)| dθ .
Θ mλ Θ
Z
2M
⇐⇒ |g(θ) − g 0 (θ)| dθ ≤
Θ mλ
Thus, using the M -Lipschitz property, we have
2M 2
|H(g, z) − H(g 0 , z)| ≤ .

Chapter 15

15.1 PCA and maximal variance


Solutions Manual 515

(a)
m m
1 X > 1 X 2
variance = (xi u − x̄> u)2 = (xi − x̄)> u
m i=1 m i=1
m
1 X >
(xi − x̄)> u (xi − x̄)> u)

=
m i=1
m
1 X >
= u (xi − x̄)(xi − x̄)> u = u> Cu .
m i=1

(b) We want to maximize u> Cu subject to u> u = 1. We can directly apply the Rayleigh
quotient (section A.2.3) to show that u∗ is the largest eigenvector (or equivalently the
largest singular vector) of C, which proves that PCA with k = 1 projects the data onto
the direction of maximal variance. Alternatively, we can prove this by introducing a
Lagrange multiplier α to enforce the constraint u> u = 1. To maximize the Lagrangian
u> Cu + α(1 − u> u), we set the derivative w.r.t. u to 0, which gives us Cu = αu.
This equality implies that u∗ is an eigenvector of C. Finally, we observe that among all
eigenvectors of C, the eigenvector associated with the largest eigenvalue of C maximizes
u> Cu.

15.2 Double centering

(a) Observe that kxi − xj k2 = (xi − xj )> (xi − xj ) = x> > >
i xi + xj xj − 2xi xj and rearrange
terms.
1
(b) Noting that X∗ = X − m
X11> and plugging into the equation K∗ = X∗> X∗ yields the
result.
(c) Note that the scalar form of the equation in (b) is
m m
1 X 1 X 1 XX
K∗ij = Kij − Kik − Kkj + 2 Kk,l .
m k=1 m k=1 m k l
Substituting with the equation D2ij = Kii + Kjj − 2Kij from (a) and simplifying yields
the result.
(d) We first observe that − 12 HDH = − 12 (D − m 1 1
D11> − m 11> D + m12 11> D11> ). By
inspection, the matrix expression on the RHS corresponds to the scalar expression with
four terms on the RHS of the equation in (c).

15.3 Laplacian eigenmaps

2 2
X X
argmin Wij kyi0 − yj0 k22 = argmin Wij (y 0 i + y 0 j − 2yi0 yj0 )
y0 i,j y0 i,j
2
X X
= argmin 2 Dii y 0 i − 2 Wij yi0 yj0
y0 i i,j
> >
= argmin y0 Dy0 − y0 Wy0
y0
>
= argmin y0 Ly0 .
y0

15.4 Nyström method

(a) For the first part of question, note that W is SPSD if x> Wx ≥ 0 for all x ∈ Rl . This
condition is equivalent to y> Ky ≥ 0 for all y ∈ Rm where yi = 0 for l + 1 ≤ i ≤ m. Since
K is SPSD by assumption, this latter condition holds. For the second part, we write K e
516 Solutions Manual

in block form as
"# " #
W h i W K>
K
e = W† W K> 21 = 21 .
K21 K21 K21 W† K> 21
Comparison with the block form of K then immediately yields the desired result.
(b) Observe that C = X> X0 and W = X0> X0 . Thus,
e = CW† C> = X> X0 X0> X0 † X0 > X = X> UX 0 U> 0 X = X> PU X.

K X X0

(c) Yes. Using the expression for K


e in (b) and the idempotency of orthogonal projection
e = X> PU X = A> A, where A = PU X.
matrices, we can write K X0 X0

(d) Since K = X> X, rank(K) = rank(X) = r. Similarly, W = X0> X0 implies rank(W) =


rank(X0 ) = r. The columns of X0 are columns of X, and they thus span the columns of
X. Hence, UX 0 is an orthonormal basis for X, i.e., IN − PUX 0 ∈ Null(X), and by part
e = X> (IN − PU )X = 0.
(b) of this exercise we have K − K X0
(e) Storage of K requires roughly 3200 TB, i.e.,
1TB
(20 × 106 )2 entries × 8 bytes/entry × = 3200 TB.
1012 bytes
Storage of C requires roughly 160 GB, i.e.,
1GB
(20 × 106 × 103 ) entries × 8 bytes/entry × = 160 GB.
109 bytes
Note that the computed numbers do not account for the symmetry of K (doing so would
change the storage requirements by less than a factor of two).

15.5 Expression for KLLE


We must find a SPSD matrix whose top k singular vectors are identical to the second through
(k + 1)st smallest singular vectors of M, since these singular vectors of M form the solution
to (15.10). If we define σmax as the largest singular value of M, then the largest k + 1
singular vectors of Mc = σmax I − M are identical to the bottom k + 1 singular vectors of
M. Since M is SPSD, note that M c is also SPSD. Moreover, its largest singular vector is
simply the all ones vector properly normalized, i.e., m−1/2 1. Hence, by projecting onto the
subspace orthogonal to 1 LLE coincides with KPCA used with the kernel matrix KLLE =
1 c − 1 11> ).
(I − m 11> )M(I m
15.6 Random projection, PCA and nearest neighbors
(d,e) As shown in figure E.15, the random projection nearest neighbors become more accurate
approximations of the true nearest neighbors as k increases. Moreover, in scenarios where
we care about “near” neighbors, but not necessarily the “nearest” neighbors, the random
projection approximation is quite good, as exhibited by the score50 performance.
(f) As shown in figure E.16, the nearest neighbor approximations generated via PCA are
better than those generated via random projections, which is not surprising given that PCA
aims to minimize reconstruction error, as shown in (15.1).
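For readers who want to reproduce the flavor of these experiments, here is a hedged sketch (not the original exercise code): it compares the 10 nearest neighbors in the original space with those found after a Gaussian random projection to k dimensions. The synthetic data and the overlap statistic used here are stand-ins of my own; the exercise's score10/score50 definitions may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 300, 1000, 100
X = rng.normal(size=(n, d))

def knn(Z, i, num):
    """Indices of the num nearest neighbors of point i (excluding i itself)."""
    dists = np.linalg.norm(Z - Z[i], axis=1)
    dists[i] = np.inf
    return set(np.argsort(dists)[:num])

G = rng.normal(size=(d, k)) / np.sqrt(k)   # Gaussian random projection
Xp = X @ G

# average overlap between the true 10-NN and the 10-NN in the projected space
# (a stand-in for the "score10" statistic referenced above)
score10 = np.mean([len(knn(X, i, 10) & knn(Xp, i, 10)) for i in range(n)])
print(score10)
```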

Chapter C

C.1 For any δ > 0, let t = f⁻¹(δ). Plugging this into P[X > t] ≤ f(t) yields P[X > f⁻¹(δ)] ≤ δ,
that is, P[X ≤ f⁻¹(δ)] ≥ 1 − δ.
C.2 By definition of expectation and using the hint, we can write

    E[X] = Σ_{n≥0} n P[X = n] = Σ_{n≥1} n (P[X ≥ n] − P[X ≥ n + 1]).

Note that in this sum, for n ≥ 1, P[X ≥ n] is added n times and subtracted n − 1 times, thus
E[X] = Σ_{n≥1} P[X ≥ n].
Figure E.15
score10 and score50 as functions of k for random projection nearest neighbors.

Figure E.16
score10 and score50 as functions of k for PCA nearest neighbors.

More generally, by definition of the Lebesgue integral, for any non-negative random variable
X, the following identity holds:
    E[X] = ∫₀^{+∞} P[X ≥ t] dt.
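As a purely illustrative check of this identity (not part of the original solution), the sketch below compares a Monte Carlo estimate of E[X] with a Riemann-sum approximation of ∫₀^∞ P[X ≥ t] dt for an exponential random variable; the distribution, sample size, and truncation point are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
samples = rng.exponential(scale=2.0, size=100_000)   # non-negative X with E[X] = 2

ts = np.linspace(0.0, 30.0, 2001)                    # truncate the integral at t = 30
tail = np.array([(samples >= t).mean() for t in ts]) # empirical P[X >= t]
integral = tail.sum() * (ts[1] - ts[0])              # simple Riemann sum

print(samples.mean(), integral)                      # both are close to E[X] = 2
```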

Chapter D

D.2 Estimating label bias. Let p̂+ be the fraction of positively labeled points in S = (x1, . . . , xm):

    p̂+ = (1/m) Σ_{i=1}^m 1_{f(xi)=+1}.
Since the points are drawn i.i.d.,


    E[p̂+] = (1/m) Σ_{i=1}^m E_{S∼D^m}[1_{f(xi)=+1}] = E_{S∼D^m}[1_{f(x1)=+1}] = E_{x∼D}[1_{f(x)=+1}] = p+.

Thus, by Hoeffding's inequality, for any ε > 0,

    P[|p+ − p̂+| > ε] ≤ 2e^{−2mε²}.

Setting δ to match the right-hand side yields the result.
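Solving 2e^{−2mε²} = δ for m gives m = ⌈log(2/δ)/(2ε²)⌉. The sketch below (an illustrative addition; the label probability p_plus and the number of repetitions are assumptions made for the example) computes this sample size and empirically checks that the failure rate stays below δ.

```python
import numpy as np

def sample_size(eps, delta):
    """m such that |p_hat - p_+| <= eps with prob. >= 1 - delta, from 2*exp(-2*m*eps^2) <= delta."""
    return int(np.ceil(np.log(2.0 / delta) / (2.0 * eps ** 2)))

eps, delta, p_plus = 0.05, 0.05, 0.3
m = sample_size(eps, delta)

# quick empirical check of the guarantee (labels drawn as Bernoulli(p_plus))
rng = np.random.default_rng(7)
p_hat = rng.binomial(m, p_plus, size=2000) / m
print(m, np.mean(np.abs(p_hat - p_plus) > eps))      # failure rate well below delta
```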

D.3 Biased coins

(a) By definition of the error of Oskar’s prediction rule,


    error(fo) = P[fo(S) ≠ x]
              = P[fo(S) = xA ∧ x = xB] + P[fo(S) = xB ∧ x = xA]
              = P[N(S) < m/2 | x = xB] P[x = xB] + P[N(S) ≥ m/2 | x = xA] P[x = xA]
              = (1/2) P[N(S) < m/2 | x = xB] + (1/2) P[N(S) ≥ m/2 | x = xA]
              ≥ (1/2) P[N(S) ≥ m/2 | x = xA].

(b) Note that P[N(S) ≥ m/2 | x = xA] = P[B(m, p) ≥ k], with p = 1/2 − ε/2, k = m/2, and
mp ≤ k ≤ m(1 − p). Thus, by Slud's inequality (section D.5),

    error(fo) ≥ (1/2) P[N ≥ (εm/2) / √((1/4)(1 − ε²)m)] = (1/2) P[N ≥ ε√m / √(1 − ε²)].

Using the second inequality of the appendix, we now obtain

    error(fo) ≥ (1/4) (1 − √(1 − e^{−u²})),

with u = ε√m / √(1 − ε²), which coincides with (D.29).

(c) If m is odd, since P[N(S) ≥ m/2 | x = xA] ≥ P[N(S) ≥ (m + 1)/2 | x = xA], we can use the
lower bound

    error(fo) ≥ (1/2) P[N(S) ≥ (m + 1)/2 | x = xA].

Thus, in both cases we can use the lower bound expression with ⌈m/2⌉ instead of m/2.

(d) If error(fo) is at most δ, then (1/4) [1 − (1 − e^{−2⌈m/2⌉ε²/(1−ε²)})^{1/2}] < δ, which gives

    e^{−2⌈m/2⌉ε²/(1−ε²)} < 1 − (1 − 4δ)² = 4δ(2 − 4δ) = 8δ(1 − 2δ),

and

    m > 2 · (1 − ε²)/(2ε²) · log(1/(8δ(1 − 2δ))).

The lower bound varies as 1/ε².
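The sketch below simply evaluates this lower bound for a few values of ε, to make the 1/ε² growth concrete; it is an illustrative addition, and the helper name and the chosen δ are assumptions made for the example.

```python
import numpy as np

def coin_sample_lower_bound(eps, delta):
    """Lower bound from part (d): 2 * (1 - eps^2)/(2 eps^2) * log(1 / (8 delta (1 - 2 delta)))."""
    assert 0 < delta < 0.25, "the bound requires 8*delta*(1 - 2*delta) < 1"
    return 2 * (1 - eps ** 2) / (2 * eps ** 2) * np.log(1.0 / (8 * delta * (1 - 2 * delta)))

for eps in (0.1, 0.05, 0.01):
    print(eps, coin_sample_lower_bound(eps, delta=0.05))   # grows roughly as 1/eps^2
```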
(e) Let f be an arbitrary rule and denote by FA the set of samples for which f (S) = xA and
by FB the complement. Then, by definition of the error,
    error(f) = Σ_{S∈FA} P[S ∧ xB] + Σ_{S∈FB} P[S ∧ xA]
             = (1/2) Σ_{S∈FA} P[S | xB] + (1/2) Σ_{S∈FB} P[S | xA]
             = (1/2) Σ_{S∈FA, N(S)<m/2} P[S | xB] + (1/2) Σ_{S∈FA, N(S)≥m/2} P[S | xB]
               + (1/2) Σ_{S∈FB, N(S)<m/2} P[S | xA] + (1/2) Σ_{S∈FB, N(S)≥m/2} P[S | xA].

Now, if N(S) ≥ m/2, clearly P[S | xB] ≥ P[S | xA]. Similarly, if N(S) < m/2, clearly
P[S | xA] ≥ P[S | xB]. In view of these inequalities, error(f) can be lower bounded as
follows:

    error(f) ≥ (1/2) Σ_{S∈FA, N(S)<m/2} P[S | xB] + (1/2) Σ_{S∈FA, N(S)≥m/2} P[S | xA]
               + (1/2) Σ_{S∈FB, N(S)<m/2} P[S | xB] + (1/2) Σ_{S∈FB, N(S)≥m/2} P[S | xA]
             = (1/2) Σ_{S : N(S)<m/2} P[S | xB] + (1/2) Σ_{S : N(S)≥m/2} P[S | xA]
             = error(fo).

Oskar's rule is known as the maximum likelihood solution.
