Solutions To Exercises
Chapter 2
• Assume that C is efficiently PAC-learnable using H in the standard PAC model using algorithm A. Consider the distribution D = ½(D− + D+). Let h ∈ H be the hypothesis output by A. Choose δ such that:
P[R_D(h) ≤ ε/2] ≥ 1 − δ.
From
R_D(h) = P_{x∼D}[h(x) ≠ c(x)]
       = ½ (P_{x∼D−}[h(x) ≠ c(x)] + P_{x∼D+}[h(x) ≠ c(x)])
       = ½ (R_{D−}(h) + R_{D+}(h)),
it follows that:
P[R_{D−}(h) ≤ ε] ≥ 1 − δ and P[R_{D+}(h) ≤ ε] ≥ 1 − δ.
This implies two-oracle PAC-learning with the same computational complexity.
• Assume now that C is efficiently PAC-learnable in the two-oracle PAC model. Thus, there exists a learning algorithm A such that for c ∈ C, ε > 0, and δ > 0, there exist m− and m+ polynomial in 1/ε, 1/δ, and size(c), such that if we draw m− negative examples or more and m+ positive examples or more, then, with confidence 1 − δ, the hypothesis h output by A verifies:
R_{D−}(h) ≤ ε and R_{D+}(h) ≤ ε.
Now, let D be a probability distribution over negative and positive examples. If we could draw m examples according to D such that m ≥ max{m−, m+}, with m polynomial in 1/ε, 1/δ, and size(c), then two-oracle PAC-learning would imply standard PAC-learning:
R_D(h) = P[h(x) ≠ c(x) | c(x) = 0] P[c(x) = 0] + P[h(x) ≠ c(x) | c(x) = 1] P[c(x) = 1]
       = R_{D−}(h) P[c(x) = 0] + R_{D+}(h) P[c(x) = 1]
       ≤ ε (P[c(x) = 0] + P[c(x) = 1]) = ε.
If D is not too biased, that is, if the probability of drawing a positive example, or that of drawing a negative example, is more than ε, it is not hard to show, using Chernoff bounds or just Chebyshev's inequality, that drawing a number of examples polynomial in 1/ε and 1/δ suffices to guarantee that m ≥ max{m−, m+} with high confidence.
Otherwise, D is biased toward negative (or positive) examples, in which case returning h = h0 (respectively h = h1) guarantees that R_D(h) ≤ ε.
To show the claim about the not-too-biased case, let S_m denote the number of positive examples obtained when drawing m examples when the probability of a positive example is ε. By Chernoff bounds,
P[S_m ≤ (1 − α)εm] ≤ e^{−mεα²/2}.
We want to ensure that at least m+ positive examples are found. With α = 1/2 and m = 2m+/ε,
P[S_m ≤ m+] ≤ e^{−m+/4}.
Setting the bound to be less than or equal to δ/2 leads to the following condition on m:
m ≥ max{2m+/ε, (8/ε) log(2/δ)}.
A similar analysis can be done in the case of negative examples. Thus, when D is not too biased, with confidence 1 − δ, we will find at least m− negative and m+ positive examples if we draw m examples, with
m ≥ max{2m+/ε, 2m−/ε, (8/ε) log(2/δ)}.
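The sample-size condition above can be sanity-checked numerically. The sketch below (helper names are ours; the values of ε, δ, m+, m− are illustrative) computes m = max{2m+/ε, 2m−/ε, (8/ε) log(2/δ)} and verifies with an exact binomial tail that m draws yield at least m+ positives with probability at least 1 − δ/2 when a positive example is drawn with probability ε:

```python
import math

def draws_needed(m_pos, m_neg, eps, delta):
    # m >= max{2 m_+/eps, 2 m_-/eps, (8/eps) log(2/delta)}
    return math.ceil(max(2 * m_pos / eps,
                         2 * m_neg / eps,
                         (8 / eps) * math.log(2 / delta)))

def binom_cdf(k, n, p):
    # exact P[Bin(n, p) <= k]
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

eps, delta, m_pos, m_neg = 0.2, 0.1, 5, 5
m = draws_needed(m_pos, m_neg, eps, delta)
# probability of seeing fewer than m_+ positives when each draw is positive w.p. eps
p_fail = binom_cdf(m_pos - 1, m, eps)
print(m, p_fail)
```

For these illustrative values, the failure probability is far below δ/2, as the Chernoff argument predicts.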
In both solutions, our training data is the set T and our learned concept L(T ) is the tightest
circle (with minimal radius) which is consistent with the data.
2.2 PAC learning of hyper-rectangles
The proof in the case of hyper-rectangles is similar to the one given within the chapter. The algorithm selects the tightest axis-aligned hyper-rectangle R′ containing all the positive sample points. For i ∈ [2n], select a region r_i such that P_D[r_i] = ε/(2n) along each face of the target hyper-rectangle R. Assuming that P_D[R − R′] > ε, argue that R′ cannot meet all the r_i s, so it must miss at least one. The probability that none of the m sample points falls into a fixed region r_i is (1 − ε/(2n))^m. By the union bound, this shows that
P[R_D(R′) > ε] ≤ 2n (1 − ε/(2n))^m ≤ 2n exp(−εm/(2n)). (E.35)
Setting δ to the right-hand side shows that for
m ≥ (2n/ε) log(2n/δ), (E.36)
with probability at least 1 − δ, R_D(R′) ≤ ε.
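The tightest-rectangle learner described above is straightforward to implement; here is a minimal sketch in Python (function names are ours, not from the text):

```python
def tightest_rectangle(points):
    """Smallest axis-aligned hyper-rectangle containing all given (positive) points.

    Returns (lows, highs), one pair of bounds per coordinate."""
    dims = range(len(points[0]))
    lows = tuple(min(p[d] for p in points) for d in dims)
    highs = tuple(max(p[d] for p in points) for d in dims)
    return lows, highs

def contains(rect, x):
    lows, highs = rect
    return all(lo <= xi <= hi for lo, xi, hi in zip(lows, x, highs))

# positive points drawn from some target rectangle
sample = [(0.1, 0.5), (0.9, 1.8), (0.4, 0.2)]
r = tightest_rectangle(sample)
print(r)   # ((0.1, 0.2), (0.9, 1.8))
```

By construction the returned rectangle is consistent with the sample and contained in the target, which is exactly what the error analysis above needs.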
2.3 Concentric circles
Suppose our target concept c is the circle around the origin with radius r. We will choose a slightly smaller radius s by
s := inf{s′ : P(s′ ≤ ‖x‖ ≤ r) < ε}.
Let A denote the annulus between radii s and r; that is, A := {x : s ≤ ‖x‖ ≤ r}. By definition of s,
P(A) ≥ ε. (E.37)
In addition, our generalization error, P(c ∆ L(T)), must be small if T intersects A. We can state this as
P(c ∆ L(T)) > ε ⟹ T ∩ A = ∅. (E.38)
Using (E.37), we know that any point in T chosen according to P will "miss" region A with probability at most 1 − ε. Defining error := P(c ∆ L(T)), we can combine this with (E.38) to see that
P(error > ε) ≤ P(T ∩ A = ∅) ≤ (1 − ε)^m ≤ e^{−εm}.
Setting δ to be greater than or equal to the right-hand side leads to m ≥ (1/ε) log(1/δ).
2.4 Non-concentric circles
As in the previous example, it is natural to assume the learning algorithm operates by return-
ing the smallest circle which is consistent with the data. Gertrude is relying on the logical
implication
error > =⇒ T ∩ ri = ∅ for some i, (E.39)
Figure E.5
Counter-example shows error of tightest circle in gray.
which is not necessarily true here. Figure E.5 illustrates a counterexample. In the figure, we have one training point in each region r_i. The points in r1 and r2 are very close together, and the point in r3 is very close to region r1. On this training data (some other points may be included outside the three regions r_i), our learned circle is the "tightest" circle including these points, and hence one diameter approximately traverses the corners of r1. In the figure, the gray regions are the error of this learned hypothesis versus the target circle, which has a thick border. Clearly, the error may be greater than ε even while T ∩ r_i ≠ ∅ for every i; this contradicts (E.39) and invalidates poor Gertrude's proof.
2.5 Triangles
As in the case of axis-aligned rectangles, consider three regions r1, r2, r3, each of probability ε/3, along the sides of the target concept, as indicated in figure E.6. Note that the triangle formed by the points A″, B″, C″ is similar to ABC (same angles), since A″B″ must be parallel to AB, and similarly for the other sides.
Assume that P[ABC] > ε, otherwise the statement is trivial. Consider a triangle A′B′C′ similar to ABC, consistent with the training sample, and such that it meets all three regions r1, r2, r3.
Since it meets r1, the line A′B′ must be below A″B″. Since it meets r2 and r3, A′ must be in r2 and B′ in r3 (see figure E.6). Now, since the angle A′B′C′ is equal to the angle A″B″C″, C′ must necessarily be above C″. This implies that triangle A′B′C′ contains A″B″C″, and thus error(A′B′C′) ≤ ε. By contraposition,
error(A′B′C′) > ε ⟹ ∃i ∈ {1, 2, 3} : A′B′C′ ∩ r_i = ∅.
Thus, by the union bound,
P[error(A′B′C′) > ε] ≤ Σ_{i=1}^3 P[A′B′C′ ∩ r_i = ∅] ≤ 3(1 − ε/3)^m ≤ 3e^{−εm/3}.
Setting δ to match the right-hand side gives the sample complexity m ≥ (3/ε) log(3/δ).
2.8 Learning intervals
Figure E.6
Right triangles.
Given a sample S, one algorithm consists of returning the tightest closed interval I_S containing the positive points. Let I = [a, b] be the target concept. If P[I] < ε, then clearly R(I_S) < ε. Assume that P[I] ≥ ε. Consider two intervals I_L and I_R defined as follows:
I_L = [a, x] with x = inf{x′ : P[[a, x′]] ≥ ε/2}
I_R = [x′, b] with x′ = sup{x″ : P[[x″, b]] ≥ ε/2}.
By the definition of x, the probability of [a, x) is less than or equal to ε/2; similarly, the probability of (x′, b] is less than or equal to ε/2. Thus, if I_S overlaps both with I_L and I_R, then its error region has probability at most ε. Thus, R(I_S) > ε implies that I_S does not overlap with either I_L or I_R, that is, either none of the training points falls in I_L or none falls in I_R. Thus, by the union bound,
P[R(I_S) > ε] ≤ P[S ∩ I_L = ∅] + P[S ∩ I_R = ∅] ≤ 2(1 − ε/2)^m ≤ 2e^{−εm/2}.
Setting δ to match the right-hand side gives the sample complexity m = (2/ε) log(2/δ) and proves the PAC-learning of closed intervals.
2.9 Learning union of intervals
Given a sample S, our algorithm consists of the following steps:
(a) Sort S in ascending order.
(b) Loop through sorted S, marking where intervals of consecutive positively labeled points
begin and end.
(c) Return the union of intervals found on the previous step. This union is represented by a
list of tuples that indicate start and end points of the intervals.
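Steps (a) through (c) above can be sketched as follows (a minimal implementation; function and variable names are ours):

```python
def learn_union_of_intervals(sample):
    """Sort the labeled points, then group maximal runs of consecutive
    positively labeled points into closed intervals.

    `sample` is a list of (x, label) pairs with label in {0, 1};
    returns a list of (start, end) tuples."""
    sample = sorted(sample)                      # (a) sort by x
    intervals, start = [], None
    for x, label in sample:                      # (b) scan for runs of positives
        if label == 1 and start is None:
            start = x                            # a run of positives begins
        if label == 1:
            end = x                              # extend the current run
        if label == 0 and start is not None:
            intervals.append((start, end))       # the run ends
            start = None
    if start is not None:
        intervals.append((start, end))
    return intervals                             # (c) union as a list of tuples

S = [(0.1, 1), (0.2, 1), (0.35, 0), (0.5, 1), (0.6, 1), (0.8, 0)]
print(learn_union_of_intervals(S))   # [(0.1, 0.2), (0.5, 0.6)]
```

The sort dominates the running time, matching the O(m log m) complexity noted at the end of this solution.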
This algorithm works both for p = 2 and for general p. We will now consider the problem for C2. To show that this is a PAC-learning algorithm, we need to distinguish between two cases.
The first case is when our target concept is a disjoint union of two closed intervals: I = [a, b] ∪ [c, d]. Note that there are two sources of error: false negatives in [a, b] and [c, d], and also false positives in (b, c). False positives may occur if no sample is drawn from (b, c). Since these three error regions are disjoint, we have that R(h_S) = R_FP(h_S) + R_FN,1(h_S) + R_FN,2(h_S), where
R_FP(h_S) = P_{x∼D}[x ∈ h_S, x ∉ I],
R_FN,1(h_S) = P_{x∼D}[x ∉ h_S, x ∈ [a, b]],
R_FN,2(h_S) = P_{x∼D}[x ∉ h_S, x ∈ [c, d]].
Since at least one of R_FP(h_S), R_FN,1(h_S), R_FN,2(h_S) must exceed ε/3 in order for R(h_S) > ε, by the union bound
P(R(h_S) > ε) ≤ P(R_FP(h_S) > ε/3) + Σ_{i=1}^2 P(R_FN,i(h_S) > ε/3). (E.40)
We first bound P(R_FP(h_S) > ε/3). Note that if R_FP(h_S) > ε/3, then P((b, c)) > ε/3, and false positives occur only if no sample point falls in (b, c); hence
P(R_FP(h_S) > ε/3) ≤ (1 − ε/3)^m ≤ e^{−εm/3}.
Now we can bound P(R_FN,i(h_S) > ε/3) by 2e^{−εm/6} using the same argument as in the previous question. Therefore,
P(R(h_S) > ε) ≤ e^{−εm/3} + 4e^{−εm/6} ≤ 5e^{−εm/6}.
Setting the right-hand side to δ and solving for m yields that m ≥ (6/ε) log(5/δ).
The second case that we need to consider is when I = [a, d], that is, [a, b] ∩ [c, d] ≠ ∅. In that case, our algorithm reduces to the one from exercise 2.8, and it was already shown that only m ≥ (2/ε) log(2/δ) samples are required to learn this concept. Therefore, we conclude that our algorithm is indeed a PAC-learning algorithm.
The extension of this result to the case of C_p is straightforward. The only difference is that in (E.40), one has two summations, for the p − 1 regions of false positives and the 2p regions of false negatives. In that case the sample complexity is
m ≥ (2(2p − 1)/ε) log((3p − 1)/δ).
The sorting step of our algorithm takes O(m log m) time and steps (b) and (c) are linear in m, which leads to an overall time complexity of O(m log m).
2.12 Bayesian bound. For any fixed h ∈ H, by Hoeffding's inequality, for any δ > 0,
P[ R(h) − R̂_S(h) ≥ √( log(1/(p(h)δ)) / (2m) ) ] ≤ p(h)δ. (E.43)
(b) By definition, P[h is rejected] = P[R̂_S(h) ≥ 3ε/4]. Since R(h) ≤ ε/2, P[h is rejected] ≤ P[R̂_S(h) ≥ 3ε/4 | R(h) = ε/2]. By the Chernoff bounds, we can thus write
P[h is rejected] ≤ exp(−(n(ε/2)/3)(1/2)²) (Chernoff bound)
= exp(−(4/3) log(2^{i+1}/δ)) (def. of n)
≤ exp(−log(2^{i+1}/δ)) = δ/2^{i+1}.
(c) The estimate s̃ is then an upper bound on s and thus, by definition of algorithm B, P[R(h_i) ≤ ε/2] ≥ 1/2. If a hypothesis h has error at most ε/2, it is rejected with probability at most δ/2^{i+1} ≤ δ/4 ≤ 1/4, and therefore it is accepted with probability at least 3/4. Thus, for s̃ ≥ s, P[h_i is accepted] ≥ 1/2 × 3/4 = 3/8.
(d) By the previous question, the probability that algorithm B fails to halt in a given iteration while s̃ ≥ s is at most 1 − 3/8 = 5/8. Thus, the probability that it does not halt after j = ⌈log(2/δ)/log(8/5)⌉ iterations is at most
(5/8)^j ≤ (5/8)^{log(2/δ)/log(8/5)} = exp(−(log(2/δ)/log(8/5)) log(8/5)) = δ/2.
(e) By definition, since s is an integer,
s̃ ≥ s ⟺ ⌊2^{(i−1)/log(2/δ)}⌋ ≥ s
⟺ 2^{(i−1)/log(2/δ)} ≥ s
⟺ (i − 1)/log(2/δ) ≥ log₂ s
⟺ i ≥ 1 + (log₂ s) log(2/δ)
⟺ i ≥ ⌈1 + (log₂ s) log(2/δ)⌉.
(f) In view of the two previous questions, with probability at least 1 − δ/2, algorithm B halts after at most j′ iterations. The probability that the hypothesis it returns is accepted while its error is greater than ε is at most δ/2^{j′+1} ≤ δ/2. Thus, with probability 1 − δ, the algorithm halts and the hypothesis it returns has error at most ε.
Chapter 3
and the outer function is non-decreasing. It is not true in general that the composition of any
two concave functions is concave.
3.6 Singleton hypothesis class
(a)
R_m(H) = E_{S∼D^m,σ}[ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i h(z_i) ] = E_{S∼D^m,σ}[ (1/m) Σ_{i=1}^m σ_i h_0(z_i) ]
= E_{S∼D^m}[ (1/m) Σ_{i=1}^m E_σ[σ_i] h_0(z_i) ] = 0,
where the final equality follows from the fact that E[σ] = 0 for a Rademacher random variable σ.
(b) Consider the set A = {x_0}, where x_0 is any vector in R^m with ‖x_0‖ = r. Then,
E_σ[ (1/m) sup_{x∈A} Σ_{i=1}^m σ_i x_i ] = (1/m) Σ_{i=1}^m E_σ[σ_i] x_{0,i} = 0
and
r √(2 log |A|) / m = 0.
3.7 Two function hypothesis class
(a) VCdim(H) = 1, since H can shatter one point and clearly at most one. Observe that
sup_{h∈H} Σ_{i=1}^m σ_i h(x_i) = sup_{h∈H} h(x_1) Σ_{i=1}^m σ_i = |Σ_{i=1}^m σ_i|. (E.46)
Thus, by Jensen's inequality,
R̂_S(H) = (1/m) E_σ[ |Σ_{i=1}^m σ_i| ]
≤ (1/m) ( E_σ[ (Σ_{i=1}^m σ_i)² ] )^{1/2}
= (1/m) ( E_σ[ Σ_{i=1}^m σ_i² ] )^{1/2} (E[σ_i σ_j] = 0 for i ≠ j)
= 1/√m.
By the Khintchine inequality, the upper bound is tight modulo the constant 1/√2. The upper bound coincides with √(d/m).
(b) VCdim(H) = 1, since H can shatter x_1 and clearly at most one point. By definition,
R̂_S(H) = (1/m) E_σ[ sup_{h∈H} Σ_{i=1}^m σ_i h(x_i) ]
= (1/m) E_σ[ sup_{h∈H} ( σ_1 h(x_1) − Σ_{i=2}^m σ_i ) ]
= (1/m) E_{σ_1}[ sup_{h∈H} σ_1 h(x_1) ] (E[σ_i] = 0)
= (1/m) E_{σ_1}[ 1 ] = 1/m.
Here R̂_S(H) is a clearly more favorable quantity than √(d/m) = √(1/m).
(a) If α ≥ 0, then
sup_{h∈αH} Σ_{i=1}^m σ_i h(x_i) = sup_{h∈H} α Σ_{i=1}^m σ_i h(x_i) = α sup_{h∈H} Σ_{i=1}^m σ_i h(x_i),
otherwise if α < 0, then
sup_{h∈αH} Σ_{i=1}^m σ_i h(x_i) = sup_{h∈H} α Σ_{i=1}^m σ_i h(x_i) = (−α) sup_{h∈H} Σ_{i=1}^m (−σ_i) h(x_i).
Since σ_i and −σ_i have the same distribution, this shows that R_m(αH) = |α| R_m(H).
(b) The second equality follows from:
R_m(H + H′) = E_{σ,S}[ (1/m) sup_{h∈H, h′∈H′} Σ_{i=1}^m σ_i (h(x_i) + h′(x_i)) ]
= E_{σ,S}[ (1/m) ( sup_{h∈H} Σ_{i=1}^m σ_i h(x_i) + sup_{h′∈H′} Σ_{i=1}^m σ_i h′(x_i) ) ]
= E_{σ,S}[ (1/m) sup_{h∈H} Σ_{i=1}^m σ_i h(x_i) ] + E_{σ,S}[ (1/m) sup_{h′∈H′} Σ_{i=1}^m σ_i h′(x_i) ]
= R_m(H) + R_m(H′).
(c) For the inequality, using the identity max(a, b) = ½[a + b + |a − b|], valid for all a, b ∈ R, and the sub-additivity of sup, we can write:
R_m({max(h, h′) : h ∈ H, h′ ∈ H′})
= (1/2m) E_{σ,S}[ sup_{h∈H, h′∈H′} Σ_{i=1}^m σ_i ( h(x_i) + h′(x_i) + |h(x_i) − h′(x_i)| ) ]
≤ ½ [R_m(H) + R_m(H′)] + (1/2m) E_{σ,S}[ sup_{h∈H, h′∈H′} Σ_{i=1}^m σ_i |h(x_i) − h′(x_i)| ].
Since the absolute value function is 1-Lipschitz, by the contraction lemma, the third term can be bounded as follows:
(1/2m) E_{σ,S}[ sup_{h∈H, h′∈H′} Σ_{i=1}^m σ_i |h(x_i) − h′(x_i)| ]
≤ (1/2m) E_{σ,S}[ sup_{h∈H, h′∈H′} Σ_{i=1}^m σ_i ( h(x_i) − h′(x_i) ) ]
≤ (1/2m) E_{σ,S}[ sup_{h∈H} Σ_{i=1}^m σ_i h(x_i) + sup_{h′∈H′} Σ_{i=1}^m (−σ_i) h′(x_i) ]
= (1/2m) E_{σ,S}[ sup_{h∈H} Σ_{i=1}^m σ_i h(x_i) ] + (1/2m) E_{σ,S}[ sup_{h′∈H′} Σ_{i=1}^m (−σ_i) h′(x_i) ]
= ½ [R_m(H) + R_m(H′)],
using the fact that σ_i and −σ_i follow the same distribution.
Since concepts can be identified with indicator functions, the intersection of two concepts can be identified with the product of two such indicator functions. In view of that, by the result just proven and after taking expectations, the following holds:
R_m(C) ≤ R_m(C1) + R_m(C2).
= ‖u‖/m.
Thus, R̂_S(H) ≤ √n/m. For n = 1, R̂_S(H) ≤ 1/m, while for n = m, R̂_S(H) ≤ 1/√m.
≤ R̂_S(F) + ‖u‖/m.
(a)
R̂_S(H) = (1/m) E_σ[ sup_{‖w‖₁≤Λ′, ‖u_j‖₂≤Λ} Σ_{i=1}^m σ_i Σ_{j=1}^{n₂} w_j σ(u_j · x_i) ]
= (1/m) E_σ[ sup_{‖w‖₁≤Λ′, ‖u_j‖₂≤Λ} Σ_{j=1}^{n₂} w_j Σ_{i=1}^m σ_i σ(u_j · x_i) ]
= (Λ′/m) E_σ[ sup_{‖u_j‖₂≤Λ} max_{j∈[n₂]} |Σ_{i=1}^m σ_i σ(u_j · x_i)| ] (all the weight put on the largest term)
= (Λ′/m) E_σ[ sup_{‖u_j‖₂≤Λ, j∈[n₂]} |Σ_{i=1}^m σ_i σ(u_j · x_i)| ]
= (Λ′/m) E_σ[ sup_{‖u‖₂≤Λ} |Σ_{i=1}^m σ_i σ(u · x_i)| ].
(c)
R̂_S(H′) = (1/m) E_σ[ sup_{‖u‖₂≤Λ, s∈{−1,+1}} Σ_{i=1}^m σ_i s u · x_i ]
= (1/m) E_σ[ sup_{‖u‖₂≤Λ} |Σ_{i=1}^m σ_i u · x_i| ] (def. of abs. val.)
= (1/m) E_σ[ sup_{‖u‖₂≤Λ} u · Σ_{i=1}^m σ_i x_i ]
= (Λ/m) E_σ[ ‖Σ_{i=1}^m σ_i x_i‖₂ ] (Cauchy-Schwarz, equality case).
The last equality holds by setting u = Λ (Σ_{i=1}^m σ_i x_i) / ‖Σ_{i=1}^m σ_i x_i‖₂.
(d)
R̂_S(H′) = (Λ/m) E_σ[ ‖Σ_{i=1}^m σ_i x_i‖₂ ]
≤ (Λ/m) ( E_σ[ ‖Σ_{i=1}^m σ_i x_i‖₂² ] )^{1/2} (Jensen's ineq.)
= (Λ/m) ( Σ_{i,j=1}^m E_σ[σ_i σ_j] (x_i · x_j) )^{1/2}
= (Λ/m) ( Σ_{i,j=1}^m 1_{i=j} (x_i · x_j) )^{1/2} (independence of the σ_i s)
= (Λ/m) ( Σ_{i=1}^m ‖x_i‖₂² )^{1/2}.
(e) In view of the previous questions,
R̂_S(H) ≤ Λ′L R̂_S(H′) ≤ (Λ′ΛL/m) ( Σ_{i=1}^m ‖x_i‖₂² )^{1/2} ≤ (Λ′ΛL/m) √(m r²) = Λ′ΛL r / √m.
Let X = Σ_{i=1}^m σ_i. Note that E[X²] = E[Σ_{i,j=1}^m σ_i σ_j]. For any i ≠ j, since σ_i and σ_j are independent, E[σ_i σ_j] = E[σ_i] E[σ_j] = 0. Thus,
E[X²] = Σ_{i=1}^m E[σ_i σ_i] = Σ_{i=1}^m E[σ_i²] = m.
Now, by Hölder's inequality,
m = E[X²] = E[|X|^{2/3} |X|^{4/3}] ≤ E[|X|]^{2/3} E[X⁴]^{1/3}.
Thus,
E[|X|] ≥ m^{3/2} / E[X⁴]^{1/2} = m^{3/2} / ( E[Σ_{i=1}^m σ_i⁴ + 3 Σ_{i≠j} σ_i² σ_j²] )^{1/2} = m^{3/2} / √(m + 3m(m − 1))
= m^{3/2} / √(m(3m − 2)) ≥ m^{3/2} / √(3m²) = √m / √3.
This shows that
R̂_S(H) ≥ (1/m)(√m/√3) = 1/√(3m).
Since R̂_S(H) here does not depend on the sample, taking expectations gives R_m(H) ≥ 1/√(3m) = Ω(1/√m), which contradicts R_m(H) ≤ O(1/m).
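The Hölder-based lower bound above can be checked exactly for small m by enumerating all 2^m sign assignments (a quick sanity check, not part of the proof):

```python
import math
from itertools import product

def exact_abs_mean(m):
    """E[|sigma_1 + ... + sigma_m|], computed exactly over all 2^m sign vectors."""
    total = sum(abs(sum(signs)) for signs in product((-1, 1), repeat=m))
    return total / 2 ** m

# the derived bound: E[|X|] >= m^{3/2} / sqrt(m (3m - 2))
for m in (3, 4, 8, 10):
    assert exact_abs_mean(m) >= m ** 1.5 / math.sqrt(m * (3 * m - 2))
print(exact_abs_mean(4))   # 1.5
```

For m = 4 the exact value 1.5 comfortably exceeds the bound 8/√40 ≈ 1.265.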
Note that for the lower bound, we could have used instead a more general result (Khintchine's inequality), which states that for any a ∈ R^m,
E[|σ · a|] ≥ ‖a‖₂ / √2.
(a) It is not hard to see that the set of 3 points with coordinates (1, 0), (0, 1), and (−1, 0) can be shattered by axis-aligned squares: e.g., to label positively two of these points, use a square defined by the axes and with those two points as corners. Thus, the VC-dimension is at least 3. No set of 4 points can be fully shattered. To see this, let P_T be the highest point, P_B the lowest, P_L the leftmost, and P_R the rightmost, assuming for now that these can be defined in a unique way (no tie); the cases where there are ties can be treated in a simpler fashion. Assume without loss of generality that the difference d_BT of y-coordinates between P_T and P_B is greater than the difference d_LR of x-coordinates between P_L and P_R. Then, P_T and P_B cannot be labeled positively while P_L and P_R are labeled negatively. Thus, the VC-dimension of axis-aligned squares in the plane is 3.
(b) Check that the set of 4 points with coordinates (1, 0), (0, 1), (−1, 0), and (0, −1) can be shattered by such triangles. This is essentially the same as the case of axis-aligned rectangles. To see that no five points can be shattered, the same argument as for axis-aligned rectangles can be used: labeling all points positively except the one within the interior of the convex hull is not possible (the degenerate cases where no point is in the interior of the convex hull are even simpler). Thus, the VC-dimension of this family of triangles is 4.
which is equivalent to
⟨W, X⟩ + B ≤ 0, (E.48)
with
W = (1, −2a₁, …, −2a_n)ᵀ, X = (‖x‖₂², x₁, …, x_n)ᵀ, and B = Σ_{i=1}^n a_i² − r².
The VC-dimension of closed balls in R^n is thus at most equal to the VC-dimension of hyperplanes in R^{n+1}, that is, n + 2.
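The lifting argument above can be checked concretely: after mapping x to (‖x‖², x), membership in a ball becomes a linear threshold. A small sketch (helper names are ours):

```python
def ball_as_halfspace(center, radius):
    """Lift of a closed ball in R^n to a halfspace in R^(n+1):
    ||x - a||^2 <= r^2  <=>  <W, X> + B <= 0 with
    W = (1, -2a_1, ..., -2a_n), X = (||x||^2, x_1, ..., x_n),
    B = sum_i a_i^2 - r^2."""
    W = (1.0,) + tuple(-2 * a for a in center)
    B = sum(a * a for a in center) - radius ** 2
    return W, B

def in_halfspace(W, B, x):
    X = (sum(v * v for v in x),) + tuple(x)
    return sum(w * xi for w, xi in zip(W, X)) + B <= 0

W, B = ball_as_halfspace(center=(1.0, 2.0), radius=1.5)
print(in_halfspace(W, B, (1.0, 2.0)))   # True: the center is in the ball
print(in_halfspace(W, B, (3.0, 2.0)))   # False: distance 2 > 1.5
```

Because the map does not depend on the ball, any labeling realized by balls is realized by hyperplanes on the lifted points, which is exactly the VC-dimension comparison used above.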
3.18 VC-dimension of ellipsoids
The general equation of ellipsoids in R^n is
(X − X₀)ᵀ A (X − X₀) ≤ 1, (E.49)
where X, X₀ ∈ R^n and A = (a_ij) ∈ S_n^+ is a positive semidefinite symmetric matrix. This can be rewritten as
Xᵀ A X − 2 Xᵀ A X₀ + X₀ᵀ A X₀ ≤ 1, (E.50)
or Σ_{i≤j} a_ij (x_i x_j + x_j x_i) − Σ_{i=1}^n 2(A X₀)_i x_i + X₀ᵀ A X₀ − 1 ≤ 0, using the fact that A is symmetric. Let a_i = −2(A X₀)_i for i ∈ [n] and let b = X₀ᵀ A X₀ − 1. Then this can be viewed in terms of the following equations of hyperplanes in R^{n(n+1)/2+n}:
Wᵀ Z + b ≤ 0, (E.51)
with
W = (a₁, …, a_n, a₁₁, …, a_ij, …, a_nn)ᵀ and Z = (x₁, …, x_n, …, x_i x_j + x_j x_i, …)ᵀ, for i ≤ j, both in R^{n(n+1)/2+n}. (E.52)
Since the VC-dimension of hyperplanes in R^{n(n+1)/2+n} is n(n+1)/2 + n + 1 = (n+1)(n/2+1), the VC-dimension of ellipsoids in R^n is bounded by (n+1)(n+2)/2.
3.19 VC-dimension of a vector space of real functions
Show that no set of size m = r + 1 can be shattered by H. Let x₁, …, x_m be m arbitrary points. Define the linear mapping l : F → R^m by:
l(f) = (f(x₁), …, f(x_m)).
Since dim(F) = r = m − 1, the rank of l is at most m − 1, and there exists a non-zero α ∈ R^m orthogonal to l(F):
∀f ∈ F, Σ_{i=1}^m α_i f(x_i) = 0.
We can assume that at least one α_i is negative. Then,
∀f ∈ F, Σ_{i: α_i≥0} α_i f(x_i) = − Σ_{i: α_i<0} α_i f(x_i).
Now, assume that there exists a set {x : f(x) ≥ 0} selecting exactly the x_i s on the left-hand side. Then all the terms on the left-hand side are non-negative, while those on the right-hand side are negative, which cannot be. Thus, {x₁, …, x_m} cannot be shattered.
3.20 VC-dimension of sine functions
(a) Fix x ∈ R and suppose there exists an ω that realizes the labeling (−, −, +, −). Thus sin(ωx) < 0, sin(2ωx) < 0, sin(3ωx) ≥ 0 and sin(4ωx) < 0. We will show that this implies sin²(ωx) < 1/2 and sin²(ωx) ≥ 3/4, a contradiction.
Using the identity sin(2θ) = 2 sin(θ) cos(θ) and the fact that sin(4ωx) < 0, we have
2 sin(2ωx) cos(2ωx) = sin(4ωx) < 0.
Since sin(2ωx) < 0, we can divide both sides of this inequality by 2 sin(2ωx) to conclude cos(2ωx) > 0. Applying the identity cos(2θ) = 1 − 2 sin²(θ) yields 1 − 2 sin²(ωx) > 0, or sin²(ωx) < 1/2.
Using the identity sin(3θ) = 3 sin(θ) − 4 sin³(θ) and the fact that sin(3ωx) ≥ 0, we have
3 sin(ωx) − 4 sin³(ωx) = sin(3ωx) ≥ 0.
Since sin(ωx) < 0, we can divide both sides of this inequality by sin(ωx) to conclude 3 − 4 sin²(ωx) ≤ 0, or sin²(ωx) ≥ 3/4.
(b) For any m > 0, consider the set of points (x₁, …, x_m) with x_j = 2^{−j}, together with an arbitrary labeling (y₁, …, y_m) ∈ {−1, +1}^m. Now, choose the parameter ω = π(1 + Σ_{i=1}^m 2^i y′_i), where y′_i = (1 − y_i)/2. We show that this single parameter will always correctly classify the entire sample for any m > 0 and choice of labels. For any j ∈ [m] we have
ω x_j = ω 2^{−j} = π(2^{−j} + Σ_{i=1}^m 2^{i−j} y′_i) = π(2^{−j} + Σ_{i=1}^{j−1} 2^{i−j} y′_i + y′_j + Σ_{i=1}^{m−j} 2^i y′_{i+j}).
The last term can be dropped from the sum, since it contributes only multiples of 2π. Since y′_i ∈ {0, 1}, the remaining term π(2^{−j} + Σ_{i=1}^{j−1} 2^{i−j} y′_i + y′_j) = π(Σ_{i=1}^{j−1} 2^{−i} y′_{j−i} + 2^{−j} + y′_j) can be upper and lower bounded as follows:
π(Σ_{i=1}^{j−1} 2^{−i} y′_{j−i} + 2^{−j} + y′_j) ≤ π(Σ_{i=1}^{j} 2^{−i} + y′_j) < π(1 + y′_j),
π(Σ_{i=1}^{j−1} 2^{−i} y′_{j−i} + 2^{−j} + y′_j) > π y′_j.
Thus, if y_j = 1 we have y′_j = 0 and, modulo 2π, 0 < ω x_j < π, which implies sgn(sin(ω x_j)) = 1. Similarly, for y_j = −1 we have sgn(sin(ω x_j)) = −1.
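The construction in (b) can be verified exhaustively for small m, taking x_j = 2^{−j} as above (a sanity check, not part of the proof):

```python
import math
from itertools import product

def omega_for_labels(labels):
    # labels y_j in {-1, +1} for the points x_j = 2^{-j}, j = 1..m
    yprime = [(1 - y) // 2 for y in labels]             # y'_i = (1 - y_i)/2
    return math.pi * (1 + sum(2 ** (i + 1) * yp for i, yp in enumerate(yprime)))

def check_shattering(m):
    xs = [2.0 ** -(j + 1) for j in range(m)]
    for labels in product((-1, 1), repeat=m):
        w = omega_for_labels(labels)
        for x, y in zip(xs, labels):
            # sign of sin(w x_j) must equal the requested label y_j
            assert (1 if math.sin(w * x) > 0 else -1) == y
    return True

print(check_shattering(6))   # True: all 2^6 labelings of x_1..x_6 are realized
```

Since this works for every m, the single-parameter family {x ↦ sgn(sin(ωx))} has infinite VC-dimension.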
(a) Fix a set X of m points. Let Y₁, …, Y_k be the sets of intersections of the concepts of C1 with X. By definition of Π_{C1}(X), k ≤ Π_{C1}(X) ≤ Π_{C1}(m). By definition of Π_{C2}(Y_i), the number of intersections of the concepts of C2 with Y_i is at most Π_{C2}(Y_i) ≤ Π_{C2}(m). Thus, the total number of possible intersections is at most Π_{C1}(m) Π_{C2}(m).
(a) When C = A ∪ B, ΠC (X) ≤ ΠA (X) + ΠB (X) for any set X, since dichotomies in ΠC (X) can
be generated by A or by B. Thus, for all m, ΠC (m) ≤ ΠA (m) + ΠB (m).
(b) For m ≥ dA + dB + 2, by Sauer's lemma,
ΠC(m) ≤ Σ_{i=0}^{dA} (m choose i) + Σ_{i=0}^{dB} (m choose i) = Σ_{i=0}^{dA} (m choose i) + Σ_{i=0}^{dB} (m choose m − i)
= Σ_{i=0}^{dA} (m choose i) + Σ_{i=m−dB}^{m} (m choose i) (E.63)
≤ Σ_{i=0}^{dA} (m choose i) + Σ_{i=dA+2}^{m} (m choose i) (E.64)
< Σ_{i=0}^{m} (m choose i) = 2^m. (E.65)
Thus, the VC-dimension of C is strictly less than dA + dB + 2:
VCdim(C) ≤ dA + dB + 1. (E.66)
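The strict inequality (E.65) can be confirmed numerically for small dA, dB at the critical size m = dA + dB + 2 (a quick check of the counting argument, not a proof):

```python
from math import comb

def sauer_sum(m, d):
    # Sauer's lemma bound: sum_{i=0}^{d} (m choose i)
    return sum(comb(m, i) for i in range(d + 1))

# at m = dA + dB + 2 the two Sauer sums together fall short of 2^m,
# so C = A ∪ B cannot shatter m points: VCdim(C) <= dA + dB + 1
for dA in range(6):
    for dB in range(6):
        m = dA + dB + 2
        assert sauer_sum(m, dA) + sauer_sum(m, dB) < 2 ** m
print("bound verified for dA, dB < 6")
```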
Figure E.7
Illustration of (h∆A) ∩ S = (h ∩ S)∆(A ∩ S) shown in gray.
(a) For i = 0, …, n, let x_i ∈ {0, 1}^n be the point whose first i coordinates are 1 and whose remaining coordinates are 0. Then, {x₀, …, x_n} can be shattered by C. Indeed, let y₀, …, y_n ∈ {0, 1} be an arbitrary labeling of these points. Then, the function h defined by
h(x) = y_i (E.70)
for all x with i 1's is symmetric and h(x_i) = y_i. Thus, VCdim(C) ≥ n + 1. Conversely, a set of n + 2 points cannot be shattered by C, since at least two of the points would then have the same number of 1's and would not be distinguishable by C. Thus,
VCdim(C) = n + 1. (E.71)
(b) Thus, in view of the theorems presented in class, a lower bound on the number of training examples needed to learn symmetric functions with accuracy 1 − ε and confidence 1 − δ is
Ω((1/ε) log(1/δ) + n/ε), (E.72)
and an upper bound is:
O((1/ε) log(1/δ) + (n/ε) log(1/ε)), (E.73)
which is only within a factor log(1/ε) of the lower bound.
(c) For training data (z₀, t₀), …, (z_m, t_m) ∈ {0, 1}^n × {0, 1}, define h as a symmetric function such that h(z_i) = t_i for all i = 0, …, m; since the label depends only on the number of 1's, such an h is well defined whenever the training labels are consistent.
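The consistent learner of part (c) reduces to a lookup table keyed on the number of 1's; a minimal sketch (counts never seen in training are left undefined here, whereas the solution would label them arbitrarily):

```python
def learn_symmetric(sample):
    """Consistent learner for symmetric boolean functions: the label can
    only depend on the number of 1's in the input."""
    table = {sum(z): t for z, t in sample}
    return lambda x: table[sum(x)]

data = [((0, 0, 0), 1), ((1, 0, 0), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
h = learn_symmetric(data)
print(h((0, 1, 0)))   # 0: one 1, same class as (1, 0, 0)
```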
(a) Let Π_u(m) denote the growth function at a node u in the intermediate layer. For a fixed set of values at the intermediate layer, using the concept class C, the output node can generate at most ΠC(m) distinct labelings. There are ∏_u Π_u(m) possible sets of values at the intermediate layer since, by definition, for a sample of size m, at most Π_u(m) distinct values are possible at each u. Thus, at most ΠC(m) × ∏_u Π_u(m) labelings can be generated by the neural network, and ΠH(m) ≤ ΠC(m) ∏_u Π_u(m).
(b) For any intermediate node u, Π_u(m) = ΠC(m). Thus, for k̃ = k + 1, ΠH(m) ≤ ΠC(m)^k̃. By Sauer's lemma, ΠC(m) ≤ (em/d)^d, thus ΠH(m) ≤ (em/d)^{dk̃}. Let m = 2k̃d log₂(ek̃). In view of the inequality given by the hint and ek̃ > 4, this implies m > dk̃ log₂(em/d), that is, 2^m > (em/d)^{dk̃}. Thus, the VC-dimension of H is less than
2k̃d log₂(ek̃) = 2(k + 1)d log₂(e(k + 1)).
(c) For threshold functions, the VC-dimension of C is r, thus, the VC-dimension of H is upper
bounded by
2(k + 1)r log2 (e(k + 1)).
3.28 VC-dimension of convex combinations
Following the hint, we can think of this family of functions as a one hidden layer neural
network, where the hidden layer is represented by the functions ht ∈ H, and the top layer is
a threshold function characterized by (α1 , . . . , αT ). Denote this class of threshold functions
by ∆T . From the solution of exercise 3.27(a) we can bound the growth function of FT by:
ΠFT (m) ≤ Π∆T (m) (ΠH (m))T .
From the solution to exercise 3.27(c), the VC dimension of ∆T is at most T , and we may
further denote the VC dimension of H by d. Applying Sauer’s lemma to the growth function
yields:
Π_{ΔT}(m) ≤ (em/T)^T, ΠH(m) ≤ (em/d)^d.
Thus, we have that
Π_{FT}(m) ≤ (em/T)^T (em/d)^{Td}.
Finally, we may apply the hint in exercise 3.27(b) with m = max{4T log₂(2e), 2Td log₂(eT)} to see that
(em/T)^T (em/d)^{Td} < 2^{4T log₂(2e) + 2Td log₂(eT)},
so that the VC-dimension of FT is bounded by:
2T(2 log₂(2e) + d log₂(eT)).
Note that a coarser but relatively simpler bound would be to write:
(em/T)^T (em/d)^{Td} < (em)^{T(d+1)},
and to apply the hint in exercise 3.27(b) with m = 2T(d + 1) log₂(eT(d + 1)). Notice that this is actually asymptotically optimal in T and d up to log terms.
3.29 Infinite VC-dimension
d
(a) Theorem 3.20 shows that there exists a distribution that can force an error of Ω( m ).
Thus, for an infinite VC-dimension, this lower bound requires an infinite number of points
to achieve a bounded error and thus implies that PAC-learning is not possible.
(b) Here is a description of the algorithm. Let M be the maximum value observed after drawing m points and let p be the probability that a point greater than M is drawn. The probability that all points drawn are smaller than or equal to M is
(1 − p)^m ≤ e^{−pm}. (E.74)
Setting δ/2 to match the upper bound yields δ/2 = e^{−pm}, that is
p = (1/m) log(2/δ). (E.75)
To bound p by ε/2, we can impose the following:
(1/m) log(2/δ) ≤ ε/2. (E.76)
Thus, with confidence at least 1 − δ/2, the probability that a point greater than M is drawn is at most ε/2 if L draws m ≥ (2/ε) log(2/δ) points.
In the second stage, the problem is reduced to one with finite VC-dimension M. Since PAC-learning with (ε/2, δ/2) is possible for a finite dimension, this guarantees the (ε, δ)-PAC-learning of the full algorithm.
= P_{T∼D^{2m}: T→(S,S′)}[R̂_S(h) = 0 ∧ R̂_{S′}(h) > ε/2 ∧ R̂_T(h) > ε/4]
≤ P_{T∼D^{2m}: T→(S,S′)}[R̂_S(h) = 0 ∧ R̂_{S′}(h) > ε/2 | R̂_T(h) > ε/4]
≤ 2^{−l} ≤ 2^{−εm/2},
where l > εm/2 is the number of errors of h on T.
(e) Using the definition of the growth function, we can give the following union bound, which is then in turn bounded using corollary 3.18:
P_{T∼D^{2m}: T→(S,S′)}[∃h ∈ H : R̂_S(h) = 0 ∧ R̂_{S′}(h) > ε/2] ≤ ΠH(2m) 2^{−εm/2} ≤ (2em/d)^d 2^{−εm/2}.
Combining parts (a) through (e), we finally have
P[R(h) > ε] ≤ 2 (2em/d)^d 2^{−εm/2}. (E.77)
Setting the right-hand side equal to δ and solving for ε shows that with probability at least 1 − δ,
ε ≤ (2/(m log 2)) (d log(2em/d) + log(1/δ) + log 2).
3.31 Covering number generalization bound.
(h₁(x) − y)² − (h₂(x) − y)² = (h₁(x) − h₂(x))(h₁(x) + h₂(x) − 2y)
= (h₁(x) − h₂(x))((h₁(x) − y) + (h₂(x) − y)) ≤ ‖h₁ − h₂‖_∞ · 2M,
which allows us to bound both the empirical and true error, resulting in a total bound of 4M ‖h₁ − h₂‖_∞.
(b) This follows by splitting the event into the union of several smaller events and then using the union bound:
P_S[ sup_{h∈H} |L_S(h)| ≥ ε ] = P_S[ ⋁_{i=1}^k sup_{h∈B_i} |L_S(h)| ≥ ε ] ≤ Σ_{i=1}^k P_S[ sup_{h∈B_i} |L_S(h)| ≥ ε ].
(c) For any i, let h_i be the center of the ball B_i, of radius ε/(8M). Note that for any h ∈ B_i we have |L_S(h) − L_S(h_i)| ≤ 4M ‖h − h_i‖_∞ ≤ ε/2. Thus, if for some h ∈ B_i we have |L_S(h)| ≥ ε, it must be the case that |L_S(h_i)| ≥ ε/2, which shows the inequality.
To complete the bound, we use Hoeffding's inequality applied to the random variables (h(x_i) − y_i)²/m ∈ [0, M²/m], which guarantees
P_S[ |L_S(h_i)| ≥ ε/2 ] ≤ 2 exp(−mε²/(2M⁴)).
Chapter 5
Proceed as in the proof of theorem 5.9, but choose ρ_k = 1/γ^k. For any ρ ∈ (0, 1), there exists k ≥ 1 such that ρ ∈ (ρ_k, ρ_{k−1}], with ρ₀ = 1. For that k, ρ ≤ ρ_{k−1} = γρ_k, thus 1/ρ_k ≤ γ/ρ and √(log k) = √(log log_γ(1/ρ_k)) ≤ √(log log_γ(γ/ρ)). Furthermore, for any h ∈ H, R̂_{S,ρ_k}(h) ≤ R̂_{S,ρ}(h). Thus,
P[ ∃k : R(h) − R̂_{S,ρ}(h) > (2γ/ρ) R_m(H) + √( log log_γ(γ/ρ) / m ) + ε ] ≤ 2 exp(−2mε²),
which proves the statement.
5.3 Importance weighted SVM
The modified primal optimization problem can be written as
minimize (1/2)‖w‖² + C Σ_{i=1}^m p_i ξ_i
subject to y_i(w · x_i + b) ≥ 1 − ξ_i ∧ ξ_i ≥ 0, for all i ∈ [m].
The Lagrangian, holding for all w, b, ξ_i, α_i ≥ 0, β_i ≥ 0, is then
L(w, b, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^m ξ_i p_i − Σ_{i=1}^m α_i [y_i(w · x_i + b) − 1 + ξ_i] − Σ_{i=1}^m β_i ξ_i. (E.78)
Then ∂L/∂w and ∂L/∂b are the same as for the regular non-separable SVM optimization problem. We also have ∂L/∂ξ_i = C p_i − α_i − β_i. Thus, to satisfy the KKT conditions, we have for all i ∈ [m]:
w = Σ_{i=1}^m α_i y_i x_i (E.79)
Σ_{i=1}^m α_i y_i = 0 (E.80)
α_i + β_i = C p_i (E.81)
α_i [y_i(w · x_i + b) − 1 + ξ_i] = 0 (E.82)
β_i ξ_i = 0. (E.83)
Plugging equation E.79 into equation E.78, we get
L = (1/2)‖Σ_{i=1}^m α_i y_i x_i‖² + C Σ_{i=1}^m ξ_i p_i − Σ_{i,j=1}^m α_i α_j y_i y_j (x_i · x_j) − Σ_{i=1}^m α_i y_i b + Σ_{i=1}^m α_i − Σ_{i=1}^m α_i ξ_i − Σ_{i=1}^m β_i ξ_i. (E.84)
Using equation E.81, we can simplify:
L = Σ_{i=1}^m α_i − (1/2)‖Σ_{i=1}^m α_i y_i x_i‖²,
meaning that the objective function is the same as in the regular SVM problem. The difference is in the constraints on the optimization. Recall that our dual form holds for β_i ≥ 0. Using again equation E.81, our optimization problem is to maximize L subject to the constraints:
∀i ∈ [m], 0 ≤ α_i ≤ C p_i ∧ Σ_{i=1}^m α_i y_i = 0.
(a) Starting from equation (5.33) and removing all terms that are constant with respect to
α1 and α2 yields the desired result.
(b) Substituting α₁ = γ − sα₂ into Ψ₁, we have:
Ψ₂ = γ − sα₂ + α₂ − (1/2)K₁₁(γ − sα₂)² − (1/2)K₂₂α₂² − sK₁₂(γ − sα₂)α₂ − y₁(γ − sα₂)v₁ − y₂α₂v₂.
We take the derivative to find the equation for the stationary point as follows:
dΨ₂/dα₂ = −s + 1 + sK₁₁(γ − sα₂) − K₂₂α₂ − sK₁₂(γ − sα₂) + s²K₁₂α₂ + y₁sv₁ − y₂v₂ = 0.
Noting that s² = 1 and rearranging terms yields the statement of interest.
(c) By definition of f (·) we see that v1 = f (x1 ) − y1 α∗1 K11 − y2 α∗2 K12 and similarly, v2 =
f (x2 ) − y1 α∗1 K12 − y2 α∗2 K22 . Using these equations along with the identities α∗1 = γ − sα∗2
and y1 = sy2 we have
v1 − v2 = f (x1 ) − f (x2 ) − y1 α∗1 (K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − y1 (γ − sα∗2 )(K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − sy2 (γ − sα∗2 )(K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − sy2 γ(K11 − K12 ) + y2 α∗2 (K11 − K12 ) − y2 α∗2 (K12 − K22 )
= f (x1 ) − f (x2 ) − sy2 γ(K11 − K12 ) + y2 α∗2 η .
(d) Combining the results from (b) and (c), we have
ηα₂ = s(K₁₁ − K₁₂)γ + y₂[f(x₁) − f(x₂) − sy₂γ(K₁₁ − K₁₂) + y₂α∗₂η] − s + 1
= y₂[f(x₁) − f(x₂)] + α∗₂η − s + 1
= α∗₂η + y₂[f(x₁) − f(x₂) − y₁ + y₂]
= α∗₂η + y₂[(y₂ − f(x₂)) − (y₁ − f(x₁))].
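The chain of identities above can be sanity-checked by plugging in arbitrary numbers (the values below are illustrative; we take η = K₁₁ + K₂₂ − 2K₁₂, consistent with the y₂α∗₂η term obtained in (c)) and comparing the expression from (b) and (c) with the closed form in (d):

```python
# arbitrary illustrative values
K11, K22, K12 = 3.0, 2.0, 0.5
y1, y2 = 1.0, -1.0
a1, a2 = 0.7, 0.4          # alpha_1^*, alpha_2^*
f1, f2 = 0.3, -0.2         # f(x_1), f(x_2)

s = y1 * y2
gamma = a1 + s * a2
eta = K11 + K22 - 2 * K12

v1 = f1 - y1 * a1 * K11 - y2 * a2 * K12
v2 = f2 - y1 * a1 * K12 - y2 * a2 * K22

# eta * alpha_2 from part (b), via v_1 - v_2
lhs = s * (K11 - K12) * gamma + y2 * (v1 - v2) - s + 1
# closed form from part (d)
rhs = a2 * eta + y2 * ((y2 - f2) - (y1 - f1))

print(abs(lhs - rhs) < 1e-12)   # True
```

This is the update at the heart of SMO: the new α₂ is the old one plus y₂ times the difference of the two prediction errors, divided by η.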
cat \
satimage/satimage.scale.t \
satimage/satimage.scale.val > data/train
libsvm-2.88/svm-scale \
data/train > data/train.scaled
Figure E.8
Average cross-validation error plus or minus one standard deviation for different values of the trade-off constant C and of the degree of the polynomial kernel.
(c) Run 10-fold cross-validation, for different values of the degree d of the polynomial kernel
and of the trade-off constant C. We test d = 1, 2, 3, 4 and log2 (C) = −8, −7.5, · · · , 5.5, 6.
We step the value of the trade-off constant logarithmically, as suggested by
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf,
which gives the cross-validation error plots shown in figure E.8.
The best values of the trade-off constant C are as follows: the best C measured on the validation set is C∗ = 2^{−4.5}, with degree d∗ = 4, which gives an average error rate of 6.4% ± 1.5%.
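The sweep described above can be scripted. The following is a hedged sketch (helper names and the train-file path are our assumptions): it enumerates the (degree, log2(C)) grid and builds, for each pair, a libsvm `svm-train` command that runs 10-fold cross-validation (`-t 1` polynomial kernel, `-d` degree, `-c` cost, `-v 10`).

```python
import itertools

def param_grid():
    """(degree, log2(C)) pairs matching the sweep described above."""
    degrees = [1, 2, 3, 4]
    log2cs = [x / 2 for x in range(-16, 13)]   # -8, -7.5, ..., 6
    return list(itertools.product(degrees, log2cs))

def svm_train_command(d, log2c, train_file="data/train.scaled"):
    """Command line for one 10-fold CV run with libsvm's svm-train;
    it prints a line of the form 'Cross Validation Accuracy = ...'."""
    return ["svm-train", "-t", "1", "-d", str(d),
            "-c", str(2.0 ** log2c), "-v", "10", train_file]
```

Each command would be run (e.g., via `subprocess`) and the reported accuracies collected to reproduce plots like figure E.8.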
(d) The trade-off constant is fixed to C ∗ = 2−4.5 , and 10-fold cross-validation is run for
degrees 1 through 4. In figure E.9 we plot the resulting cross-validation training and test
errors and the average number of support vectors (nSV is the number of support vectors,
nBSV is the number of bounded support vectors, i.e., those whose dual variable is equal to the trade-off constant).
(e) Support vectors always lie on the margin hyperplanes when their dual variable is smaller
than C. This happens for all the support vectors (SV) that are not bounded (BSV). Our
measurement gives the following averages:
d = 1: nSV − nBSV = 8.5
Figure E.9
Average cross-validation error rates and average number of support vectors (nSV) and of bounded support vectors (nBSV) as a function of the degree of the polynomial kernel.
Figure E.10
As in figure E.8, but false positives are penalized twice as much as false negatives.
Plots equivalent to the ones in figure E.8 are given in figure E.10, figure E.11, figure E.12,
and figure E.13. We obtain the best average accuracy for k = 4.
(a) Let
x′i = (y1 xi · x1, . . . , ym xi · xm).
Then the optimization problem becomes
min_{α,b,ξ} (1/2)‖α‖² + C Σ_{i=1}^m ξi
subject to yi(α · x′i + b) ≥ 1 − ξi  and  ξi, αi ≥ 0, i ∈ [m],
which is the standard formulation of the primal SVM optimization problem on the samples x′i, modulo the non-negativity constraints on αi.
Figure E.11
As in figure E.8, but false positives are penalized four times as much as false negatives.
Figure E.12
As in figure E.8, but false positives are penalized eight times as much as false negatives.
− Σ_{i=1}^m βiξi − Σ_{i=1}^m γiαi
= −(1/2) Σ_{i=1}^m α′i yi x′i · (Σ_{j=1}^m α′j yj x′j + γ) + (1/2) γ · α + Σ_{i=1}^m [Cξi − α′i(yi b − 1 + ξi)]
= Σ_{i=1}^m α′i − (1/2) Σ_{i,j=1}^m α′i α′j yi yj x′i⊤x′j + (1/2)‖γ‖² + Σ_{i=1}^m (C − α′i)ξi − Σ_{i=1}^m α′i yi b.
Figure E.13
As in figure E.8, but false positives are penalized sixteen times as much as false negatives.
(a) By definition of {x1 , . . . , xd }, for all y = (y1 , . . . , yd ) ∈ {−1, +1}d , there exists w such
that,
∀i ∈ [d], 1 ≤ yi (w · xi ) .
Summing up these inequalities yields
d ≤ w · Σ_{i=1}^d yi xi ≤ ‖w‖ ‖Σ_{i=1}^d yi xi‖ ≤ Λ ‖Σ_{i=1}^d yi xi‖.
(b) Since this inequality holds for all y ∈ {−1, +1}d , it also holds on expectation over
y1 , . . . , yd drawn i.i.d. according to a uniform distribution over {−1, +1}. In view of
the independence assumption, for i 6= j we have E[yi yj ] = E[yi ] E[yj ]. Thus, since the
Chapter 6
K(x, x′) = cos ∠(x, x′) = (x · x′)/(‖x‖‖x′‖).
Thus, the cosine kernel is just a scaling of the standard dot product, which is a PDS
kernel. Hence, the cosine kernel is also PDS.
(f) For all x, x′ ∈ R,
[sin(x′ − x)]² = (1 − cos(2(x′ − x)))/2 = (1/4)‖u(x′) − u(x)‖²,
where u(x) = (cos 2x, sin 2x)⊤ for all x ∈ X: indeed, ‖u(x)‖ = 1 for all x ∈ X, so ‖u(x′) − u(x)‖² = 2 − 2 cos(2(x′ − x)). Thus, K(x, x′) = e^{−(λ/4)‖u(x′)−u(x)‖²}. Since the Gaussian kernel is known to be PDS, K is also PDS (the fact that Gaussian kernels are PDS can be shown easily by observing that they are the normalized kernels associated to the kernel obtained by applying exp to standard inner products).
(g) It suffices to show that K is the normalized kernel associated to the kernel K′ defined by
∀(x, y) ∈ R^N × R^N, K′(x, y) = e^{φ(x,y)},
where φ(x, y) = (1/σ)[‖x‖ + ‖y‖ − ‖x − y‖], and to show that K′ is PDS. For the first part, observe that
K′(x, y)/√(K′(x, x)K′(y, y)) = e^{φ(x,y) − (1/2)φ(x,x) − (1/2)φ(y,y)} = e^{−‖x−y‖/σ}.
To show that K′ is PDS, it suffices to show that φ is PDS, since composition with a power series with non-negative coefficients (here exp) preserves the PDS property. Now, for any c1, . . . , cn ∈ R, let c0 = −Σ_{i=1}^n ci; then we can write
Σ_{i,j=1}^n ci cj φ(xi, xj) = (1/σ) Σ_{i,j=1}^n ci cj [‖xi‖ + ‖xj‖ − ‖xi − xj‖]
= (1/σ)[−c0 Σ_{i=1}^n ci‖xi‖ − c0 Σ_{j=1}^n cj‖xj‖ − Σ_{i,j=1}^n ci cj ‖xi − xj‖]
= −(1/σ) Σ_{i,j=0}^n ci cj ‖xi − xj‖,
with x0 = 0. Now, for any z ∈ R, the following equality holds:
z^{1/2} = (1/(2Γ(1/2))) ∫₀^{+∞} (1 − e^{−tz})/t^{3/2} dt.
Thus,
−(1/σ) Σ_{i,j=0}^n ci cj ‖xi − xj‖ = −(1/(2Γ(1/2))) ∫₀^{+∞} (1/σ) Σ_{i,j=0}^n ci cj (1 − e^{−t‖xi−xj‖²})/t^{3/2} dt
= (1/(2Γ(1/2))) ∫₀^{+∞} (Σ_{i,j=0}^n ci cj e^{−t‖xi−xj‖²})/(σ t^{3/2}) dt.
Since a Gaussian kernel is PDS, the inequality Σ_{i,j=0}^n ci cj e^{−t‖xi−xj‖²} ≥ 0 holds and the right-hand side is non-negative. Thus, the inequality −(1/σ) Σ_{i,j=0}^n ci cj ‖xi − xj‖ ≥ 0 holds, which shows that φ is PDS.
Alternatively, one can also apply the theorem on page 43 of the lecture slides on kernel
methods to reduce the problem to showing that the norm G(x, y) = ‖x − y‖ is an NDS
function. This can be shown through a direct application of the definition of NDS together
with the representation of the norm given in the hint.
(h) Let h·, ·i denote the inner product over the set of all measurable functions on [0, 1]:
⟨f, g⟩ = ∫₀¹ f(t)g(t) dt.
Note that k1(x, y) = ∫₀¹ 1_{t∈[0,x]} 1_{t∈[0,y]} dt = ⟨1_{[0,x]}, 1_{[0,y]}⟩ and k2(x, y) = ∫₀¹ 1_{t∈[x,1]} 1_{t∈[y,1]} dt = ⟨1_{[x,1]}, 1_{[y,1]}⟩. Since they are defined as inner products, both k1 and k2 are PDS. Note that k1(x, y) = min(x, y) and k2(x, y) = 1 − max(x, y). Observe that K = k1k2; thus K is PDS as a product of two PDS kernels.
Now, let an = (−1)^n (−1/2 choose n). Then, for x, x′ ∈ X,
f(x′ · x) = Σ_{n=0}^∞ an (Σ_{i=1}^N xi x′i)^n
= Σ_{n=0}^∞ an Σ_{s1+···+sN=n} n!/(s1! · · · sN!) (x1 x′1)^{s1} · · · (xN x′N)^{sN}
= Σ_{s1,...,sN≥0} a_{s1+···+sN} (s1 + · · · + sN)!/(s1! · · · sN!) (x1 x′1)^{s1} · · · (xN x′N)^{sN},
where the sums can be permuted since the series is absolutely summable. Thus, we can write f(x′ · x) = Φ(x) · Φ(x′) with
Φ(x) = (√(a_{s1+···+sN} (s1 + · · · + sN)!/(s1! · · · sN!)) x1^{s1} · · · xN^{sN})_{s1,...,sN≥0}.
Φ is a mapping to an infinite-dimensional space.
(j) For x ≥ 0, the integral ∫₀^{+∞} e^{−sx}e^{−s} ds is well defined and
∫₀^{+∞} e^{−sx}e^{−s} ds = ∫₀^{+∞} e^{−s(1+x)} ds = [−e^{−s(1+x)}/(1 + x)]₀^{+∞} = 1/(1 + x).  (E.88)
Since the Gaussian kernel (x, y) ↦ e^{−‖x−y‖²/σ²} is PDS for any σ ≠ 0, the kernel (x, y) ↦ ∫₀^{+∞} e^{−s‖x−y‖²/σ²} e^{−s} ds is also PDS since for any x1, . . . , xm in R^N and c1, . . . , cm in R,
Σ_{i,j=1}^m ci cj e^{−s‖xi−xj‖²/σ²} ≥ 0 ⇒ ∫₀^{+∞} Σ_{i,j=1}^m ci cj e^{−s‖xi−xj‖²/σ²} e^{−s} ds ≥ 0.  (E.89)
By (E.88), ∫₀^{+∞} e^{−s‖x−y‖²/σ²} e^{−s} ds = 1/(1 + ‖x−y‖²/σ²) for all x, y in R^N, which concludes the proof.
Thus, (x, y) ↦ min(|x|, |y|) is a PDS kernel over R × R since for any x1, . . . , xm in R and c1, . . . , cm in R,
Σ_{i,j=1}^m ci cj min{|xi|, |xj|} = ∫₀^{+∞} Σ_{i,j=1}^m ci cj 1_{t∈[0,|xi|]} 1_{t∈[0,|xj|]} dt = ∫₀^{+∞} (Σ_{i=1}^m ci 1_{t∈[0,|xi|]})² dt ≥ 0.
Thus, (x, y) ↦ Σ_{i=1}^N min(|xi|, |yi|) is a PDS kernel over R^N × R^N as a sum of PDS kernels. Its composition with a function that admits a power series expansion with an infinite radius of convergence and non-negative coefficients is thus also PDS, which concludes the proof.
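The PDS claims in this chapter can be sanity-checked numerically. The sketch below (our own illustration, not part of the solutions) evaluates the quadratic form Σ_{i,j} ci cj k(xi, xj) for many random coefficient vectors c; for a PDS kernel every such value must be non-negative. This is only a necessary-condition check, but it catches sign errors quickly. The three kernels tried are the rational quadratic kernel 1/(1 + (x − y)²), the min kernel min(|x|, |y|), and e^{−sin²(x−y)}, all with unit parameters (an assumption for the demo).

```python
import math
import random

def min_quadratic_form(k, xs, trials=200, seed=0):
    """Smallest observed value of sum_ij c_i c_j k(x_i, x_j) over random c."""
    rng = random.Random(seed)
    K = [[k(a, b) for b in xs] for a in xs]   # Gram matrix on the sample
    n = len(xs)
    worst = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        q = sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))
        worst = min(worst, q)
    return worst

xs = [-2.0, -0.5, 0.1, 0.7, 1.5, 3.0]
kernels = {
    "rational": lambda x, y: 1.0 / (1.0 + (x - y) ** 2),       # sigma = 1
    "min_abs":  lambda x, y: min(abs(x), abs(y)),
    "sin2":     lambda x, y: math.exp(-math.sin(x - y) ** 2),  # lambda = 1
}
```

For a genuinely PDS kernel the returned minimum stays ≥ 0 up to floating-point noise, whatever random trial is drawn.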
6.3 Graph kernel For any i, j ∈ V, let E[i, j] denote the set of edges between i and j which is either
reduced to one edge or is empty. For convenience, let V = {1, . . . , n} and define the matrix
W = (Wij ) ∈ Rn×n by Wij = 0 if E[i, j] = ∅, Wij = w[e] if e ∈ E[i, j]. Then, we can write
K(p, q) = Σ_r Σ_{e∈E[p,r], e′∈E[r,q]} w[e] w[e′] = Σ_r W_{pr} W_{rq} = W²_{pq}.
Let K = (K_{pq}) denote the kernel matrix. Since W is symmetric, for any vector X ∈ R^n,
X⊤KX = X⊤W²X = X⊤W⊤WX = ‖WX‖² ≥ 0.
Thus, the eigenvalues of K are non-negative. The same holds similarly for the kernel matrix restricted to any subset of V; thus K is PDS.
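The K = W² construction is easy to compute and check directly. A minimal sketch (our illustration; the edge-weight matrix below is an arbitrary example):

```python
def graph_kernel_matrix(W):
    """K = W^2: K[p][q] sums w[e] * w[e'] over all two-edge paths p - r - q."""
    n = len(W)
    return [[sum(W[p][r] * W[r][q] for r in range(n)) for q in range(n)]
            for p in range(n)]
```

For symmetric W, any quadratic form X⊤KX equals ‖WX‖² and is therefore non-negative, as in the proof above.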
6.4 Symmetric difference kernel
k : (A, B) ↦ |A ∩ B| is a PDS kernel over 2^X since k(A, B) = Σ_{x∈X} 1_A(x)1_B(x) = Φ(A) · Φ(B), where, for any subset A, Φ(A) ∈ {0, 1}^{|X|} is the vector whose coordinate indexed by x ∈ X is 1 if x ∈ A, 0 otherwise. Since k is PDS, K′ = exp(k) is also PDS (the PDS property is preserved by composition with power series). Since K is the result of the normalization of K′, it is also PDS.
6.5 Set kernel
Let Φ0 be a feature mapping associated to k0; then K′(A, B) = Σ_{x∈A, x′∈B} Φ0(x) · Φ0(x′) = (Σ_{x∈A} Φ0(x)) · (Σ_{x′∈B} Φ0(x′)) = Ψ(A) · Ψ(B), where for any subset A, Ψ(A) = Σ_{x∈A} Φ0(x).
6.7 Consider the Gram matrix defined as Kij = K(xi , xj ). It is clear that K will have all zeros on
the diagonal. Hence tr(K) = 0. When K 6= 0, this means it must have at least one negative
eigenvalue. Hence K is not PDS.
6.8 The kernel K is NDS. In fact, more generally, kx − ykp for 0 < p ≤ 2 defines an NDS kernel.
This follows a general fact: if a kernel K is NDS, then K α is NDS for any 0 < α < 1.
Indeed, if K is NDS, then exp(−tK) is PDS for all t > 0. Then − exp(−tK) is NDS and so
is 1 − exp(−tK). Now, the following formula holds for all 0 < α < 1:
K^α = cα ∫₀^{+∞} [1 − exp(−tK)] dt/t^{α+1},  (E.93)
where cα is a positive constant. Thus, K α is NDS for all 0 < α < 1.
6.9 Let n ≥ 1, {c1, . . . , cn} ⊆ R and {x1, . . . , xn} ⊆ H with Σ_{i=1}^n ci = 0. Then,
Σ_{i,j=1}^n ci cj K(xi, xj) = (Σ_{i=1}^n ci)² − ⟨Σ_{i=1}^n ci xi, Σ_{j=1}^n cj xj⟩ = −‖Σ_{i=1}^n ci xi‖² ≤ 0.  (E.94)
6.10 It is straightforward to see that the kernel K defined over R+ × R+ by K(x, y) = x + y
is negative definite symmetric (NDS) and that K(x, x) ≥ 0 for all x ≥ 0. Thus, for any
0 < α ≤ 1, K α is also NDS. Thus, exp(−tK α ) is PDS for all t > 0, in particular t = 1.
Assume now that K_p is PDS for some p > 1. Then, for all {c1, . . . , cn} ⊆ R, {x1, . . . , xn} ⊆ X, and t > 0,
Σ_{i,j=1}^n ci cj e^{−t(xi+xj)^p} = Σ_{i,j=1}^n ci cj e^{−(t^{1/p}xi + t^{1/p}xj)^p} ≥ 0.  (E.95)
Thus, Ψ(x, y) = (x + y)^p is NDS. This contradicts
Σ_{i,j=1}^2 ci cj (xi + xj)^p = 2^p − 2 > 0  (E.96)
for x1 = 0, x2 = 1, and c1 + c2 = 0 with c1 = 1.
6.11 Explicit mappings
(a) Because K is positive semidefinite, it can be diagonalized as K = SΛS> where Λ is
a diagonal matrix of K’s eigenvalues and S is the matrix of K’s eigenvectors. Further
decomposing, we get K = SΛ1/2 Λ1/2 S> . We then have
Σ_{i,j=1}^m αi αj Kij = ‖Σ_{i=1}^m αi Φ(xi)‖² ≥ 0
• Assume exp(−tK) is PDS. Then it is easy to see that −exp(−tK) is NDS. It is also straightforward to show that shifting by a constant and scaling by a positive constant t > 0 preserves the NDS property; thus, (1 − exp(−tK))/t is NDS. Finally, we examine the one-sided limit of the expression, which is also NDS:
lim_{t→0+} (1 − exp(−tK))/t = −(∂/∂x) exp(−xK)|_{x=0} = K,
showing that K is NDS.
• Assume K is NDS. Theorem 6.16 then implies, for some PDS kernel K′ and some x0:
−K(x, x′) = K′(x, x′) − K(x, x0) − K(x′, x0) + K(x0, x0)
⇔ e^{−K(x,x′)} = e^{K′(x,x′)} · e^{−K(x,x0)} e^{−K(x′,x0)} · e^{K(x0,x0)},
where the middle factor e^{−K(x,x0)} e^{−K(x′,x0)} is of the form φ(x)φ(x′). Since K′ is PDS, exp(K′) is also PDS; it is easy to show that any function of the form (x, x′) ↦ φ(x)φ(x′) is PDS; and finally the constant function (x, x′) ↦ e^{K(x0,x0)} is also PDS. Using the closure of PDS kernels with respect to multiplication completes the proof.
6.18 Metrics and kernels
(a) If K is an NDS kernel, then by theorem 6.16 the kernel K′ defined for any x0 ∈ X by
K′(x, x′) = (1/2)[K(x, x0) + K(x′, x0) − K(x, x′)]
is a PDS kernel (using K(x0, x0) = 0). Let H be the reproducing kernel Hilbert space associated to K′. There exists a mapping Φ from X to H such that ∀x, x′ ∈ X, K′(x, x′) = Φ(x) · Φ(x′). Then,
‖Φ(x) − Φ(x′)‖² = K′(x, x) + K′(x′, x′) − 2K′(x, x′)
= (1/2)[2K(x, x0) − K(x, x)] + (1/2)[2K(x′, x0) − K(x′, x′)] − [K(x, x0) + K(x′, x0) − K(x, x′)]
= K(x, x′).
It is then straightforward to show that √K is a metric.
(b) Suppose that K(x, x′) = exp(−|x − x′|^p), x, x′ ∈ R, is positive definite for p > 2. Then, for any t > 0, {x1, . . . , xn} ⊆ X, {c1, . . . , cn} ⊆ R,
Σ_{i,j=1}^n ci cj exp(−t|xi − xj|^p) = Σ_{i,j=1}^n ci cj exp(−|t^{1/p}xi − t^{1/p}xj|^p) ≥ 0.
Thus, by theorem 6.17, K′(x, x′) = |x − x′|^p is an NDS kernel. But √K′ is not a metric for p > 2 since it does not verify the triangle inequality (take x = 1, x′ = 2, x″ = 3), which contradicts part (a).
(c) If a < 0 or b < 0, a||x||2 + b < 0 for some non-null vectors x. For such values, K(x, x) =
tanh(a||x||2 + b) < 0. The kernel is thus not PDS and the SVM training may not converge
to an optimal value. The equivalent neural network may also converge to a local minimum.
Figure E.14
Weighted transducer T. ε represents the empty string, and r = ρ. X∗ − I stands for a finite automaton accepting X∗ − I.
(c) The set of strings Y over the alphabet X of length less than n form a regular language
since they can be described by:
Y = ∪_{i=0}^{n−1} X^i.  (E.105)
Thus, Y1 = Y ∩ (X ∗ − I) and Y2 = (X ∗ − I) − Y1 are also regular languages. It suffices to
replace in the transducer T of figure E.14 the transition labeled with X ∗ − I : X ∗ − I/ρ
with two transitions:
• Y1 : Y1 /ρ1 , and
• Y2 : Y2 /ρ2 ,
with the same origin and destination states and with Y1 and Y2 denoting finite automata
representing them. The kernel is thus still rational and PDS since it is of the form T 0 ◦T 0−1 .
(c) Note that the losses over D coincide with the losses in standard binary classification
problems in the case when all points in the source domain are labeled with the positive
class. Thus, we may directly apply theorem 5.9. We simply need to bound the empirical
Rademacher complexity of H:
m R̂(H) = E_σ[ sup_{h∈H} Σ_{i=1}^m σi h(xi) ]
= E_σ[ sup_{r,c} Σ_{i=1}^m σi (r² − ‖Φ(xi) − c‖²) ]
≤ E_σ[ sup_r Σ_{i=1}^m σi r² ] + E_σ[ Σ_{i=1}^m σi ‖Φ(xi)‖² ] + E_σ[ sup_c Σ_{i=1}^m σi ‖c‖² ] + E_σ[ sup_c Σ_{i=1}^m 2σi c · Φ(xi) ].
We bound each of these four terms separately. First we have
E_σ[ Σ_{i=1}^m σi ‖Φ(xi)‖² ] = E_σ[ Σ_{i=1}^m σi K(xi, xi) ] = Σ_{i=1}^m E_σ[σi] K(xi, xi) = 0.
Next we bound
E_σ[ sup_c Σ_{i=1}^m σi ‖c‖² ] = E_σ[ C² Σ_{i=1}^m σi 1_{Σ_{i=1}^m σi > 0} ]
≤ C² E_σ[ |Σ_{i=1}^m σi| ]
≤ C² √(E_σ[ |Σ_{i=1}^m σi|² ])   (Jensen's ineq.)
= C² √(E_σ[ Σ_{i≠j} σi σj + Σ_{i=1}^m σi² ])
= C² √(E_σ[ Σ_{i=1}^m σi² ]) = C² √m.
(d) The solution follows very similar steps as in part (a), but using the same modifications as
seen in the analysis of soft-margin SVMs.
Chapter 7
(a) The direction eu taken by coordinate descent after T − 1 rounds is the argmin over u of:
dL(α + βeu)/dβ |_{β=0} = −Σ_{i=1}^m yi hu(xi) Φ′(−yi f(xi))
∝ −Σ_{i=1}^m yi hu(xi) Φ′(−yi f(xi))/Σ_{i=1}^m Φ′(−yi f(xi))
∝ −Σ_{i=1}^m yi hu(xi) D_{T−1}(i)
= −(1 − 2εu),
where D_{T−1}(i) = Φ′(−yi f(xi))/Σ_{i=1}^m Φ′(−yi f(xi)). Thus, the base classifier hu selected at each round is the one with the minimal error rate over the training data.
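The key identity used here, −Σi yi hu(xi) D(i) = −(1 − 2εu) for any distribution D, is quick to verify numerically. A small sketch (our illustration, with random labels, predictions, and weights):

```python
import random

def direction_value_and_error(D, y, h):
    """Return (-sum_i y_i h_i D_i, -(1 - 2*eps)) for predictions h_i in {-1,+1}
    and any distribution D; the two quantities coincide."""
    deriv = -sum(yi * hi * di for yi, hi, di in zip(y, h, D))
    eps = sum(di for yi, hi, di in zip(y, h, D) if yi * hi < 0)
    return deriv, -(1.0 - 2.0 * eps)

rng = random.Random(0)
m = 50
y = [rng.choice([-1, 1]) for _ in range(m)]
h = [rng.choice([-1, 1]) for _ in range(m)]
w = [rng.uniform(0.1, 1.0) for _ in range(m)]
D = [wi / sum(w) for wi in w]
```

Since yi hi = +1 on correctly classified points and −1 on errors, the weighted sum splits into (1 − εu) − εu, giving the identity.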
(b) Only the logistic loss (Φ4 ) verifies the hypotheses (assuming log with base 2).
Since D_{t+1}(i) = Dt(i) e^{−αt yi ht(xi)}/Zt and αt = (1/2) log((1 − εt)/εt),
R̂_{D_{t+1}}(ht) = Σ_{i=1}^m (Dt(i) e^{−αt yi ht(xi)}/Zt) 1_{yi ht(xi)<0}
= Σ_{yi ht(xi)<0} Dt(i) e^{αt}/Zt
= (e^{αt}/Zt) Σ_{yi ht(xi)<0} Dt(i)
= (√((1 − εt)/εt) εt)/(2√(εt(1 − εt))) = 1/2.
This shows that ht cannot be selected at round t + 1.
7.6 It is not hard to see that the base hypotheses in this problem can be defined to be thresh-
old functions based on the first or second axis, or constant functions (horizontal or vertical
thresholds outside the convex hull of all the points).
The hypotheses selected by AdaBoost are therefore chosen from this set. It can be shown that
the hypotheses selected in two consecutive rounds of AdaBoost are distinct (see exercise 7.3).
Furthermore, ht and −ht cannot be selected in consecutive rounds, since misclassified and
correctly classified points by ht are assigned the same distribution mass. Thus, at each round
a distinct hypothesis is chosen. The points at coordinate (−1, −1) are misclassified by all
these base hypotheses.
The algorithm should be stopped when the best εt found is 1/2. It can be shown, then, that the error of the final classifier returned on the training set is (1/4)(1 − ε), since it misclassifies exactly the points at coordinate (+1, −1).
(a) Let G1(x) = e^x and G2(x) = x + 1. G1 and G2 are continuously differentiable over R and G′1(0) = G′2(0). Thus, G is differentiable over R. Note that G′ ≥ 0.
Both G1 and G2 are convex, thus
G(y) − G(x) ≥ G′(x)(y − x)  (E.108)
for x, y ≤ 0 or x, y ≥ 0. Assume now that y ≤ 0 and x ≥ 0; then
G(y) − G(x) = e^y − (x + 1) ≥ (y + 1) − (x + 1) = G′(x)(y − x),  (E.109)
since G′(x) = 1. Thus G is convex.
(b) The direction eu taken by coordinate descent after T − 1 rounds is the argmin over u of:
dG(α + βeu)/dβ |_{β=0} = −Σ_{i=1}^m yi hu(xi) G′(−yi f(xi))  (E.110)
∝ −Σ_{i=1}^m yi hu(xi) G′(−yi f(xi))/Σ_{i=1}^m G′(−yi f(xi))   (since G′ ≥ 0)  (E.111)
∝ −Σ_{i=1}^m yi hu(xi) D_{T−1}(i)  (E.112)
= −(1 − 2εu),  (E.113)
with D_{T−1}(i) = G′(−yi f(xi))/Σ_{i=1}^m G′(−yi f(xi)). Thus, the base classifier hu selected at each round is the one with the minimal error rate over the training data.
The step size β is the solution of:
dF(α + βeu)/dβ = −Σ_{i=1}^m yi hu(xi) G′(−yi f(xi) − β yi hu(xi)) = 0,  (E.114)
which can be solved numerically. A closed-form solution can be given under certain conditions, e.g., if
β ≤ ρ = min_{i∈[m]} |f(xi)|.  (E.115)
(c) By definition of the objective function, this algorithm is less aggressively reducing the
empirical error rate than AdaBoost.
(b) As in the standard case, at round t, the probability mass assigned to correctly classified points is p+ = (1 − εt)e^{−α} and the probability mass assigned to the misclassified points is p− = εt e^α. Thus,
p−/p+ = (εt/(1 − εt)) · ((1/2 + γ)/(1/2 − γ)) ≤ ((1/2 − γ)/(1/2 + γ)) · ((1/2 + γ)/(1/2 − γ)) = 1.  (E.123)
This contrasts with AdaBoost's property.
(c)
Zt ≤ (1/2 − γ)e^α + (1/2 + γ)e^{−α}  (E.124)
= (1/2 − γ)√((1/2 + γ)/(1/2 − γ)) + (1/2 + γ)√((1/2 − γ)/(1/2 + γ))  (E.125)
= 2√((1/2 + γ)(1/2 − γ)).  (E.126)
Thus, the empirical error can be bounded as follows:
R̂_S(h) ≤ Π_{t=1}^T Zt  (E.127)
≤ [2√((1/2 + γ)(1/2 − γ))]^T  (E.128)
= (1 − 4γ²)^{T/2}  (E.129)
≤ e^{−2γ²T}.  (E.130)
(d) If R̂_S(h) = (1/m) Σ_{i=1}^m 1_{yi f(xi)≤0} < 1/m, then clearly R̂_S(h) = 0. Using the bound obtained in the previous question, if e^{−2γ²T} < 1/m, the empirical error is zero. This can be rewritten as
T > (log m)/(2γ²).  (E.131)
(e) Using the bound for the consistent case,
P[R(h) > ε] ≤ 2Π_C(2m) 2^{−mε/2} ≤ 2(2em/d)^d 2^{−mε/2}.  (E.132)
Setting the right-hand side to δ, with probability at least 1 − δ, the following bound holds for that consistent hypothesis:
error_D(H) ≤ (2/m)(d log2(2em/d) + log2(2/δ)),  (E.133)
with d = 2(s + 1)T log2(eT) and T = ⌊(log m)/(2γ²)⌋ + 1.
The bound is vacuous for γ(m) = O(√((log m)/m)). This could suggest overfitting.
(a) At t = 1 we have:
d1⊤M = ((√5 − 1)/2, 0, (3 − √5)/2, (3√5 − 1)/12, (3√5 − 1)/12, (3√5 − 1)/12, 1/2, (11 − 3√5)/12),
so we pick weak classifier 1. Now, the distribution at round two is:
d2 = (1/4, 1/4, (√5 − 1)/12, (√5 − 1)/12, (√5 − 1)/12, (3 − √5)/8, (3 − √5)/8, 0)⊤,
and the edges at round 2 are:
d2⊤M = (0, (3 − √5)/2, (√5 − 1)/2, (4 − √5)/6, (4 − √5)/6, (4 − √5)/6, (√5 − 1)/4, (5 + √5)/12),
so we pick weak classifier 3. Continuing this process, we then pick weak classifier 2 in round 3. However, now we observe that d4 = d1; hence, we have found a cycle, in which we repeatedly select classifiers 1, 3, 2, 1, 3, 2, . . .
(b) rt = (√5 − 1)/2 for t ∈ {1, 2, 3}. Thus, the coefficients used to combine classifiers in our example are [1/3, 1/3, 1/3, 0, 0, 0, 0, 0] and the margin equals the minimum value in the vector M[1/3, 1/3, 1/3, 0, 0, 0, 0, 0]⊤, which is 1/3.
(c) (1/16)M[2, 3, 4, 1, 2, 2, 1, 1]⊤ = 3/8 for all training points. This margin is greater than the one generated by AdaBoost. Therefore AdaBoost does NOT always maximize the L1 norm margin.
7.10 Boosting in the presence of unknown labels
(a) Say a 'boosting-style algorithm' is just AdaBoost with a possibly different step size αt. Recall these definitions from the description of AdaBoost: the final hypothesis is f(x) = Σt αt ht(x) and the normalization constant in round t is Zt = Σi Dt(i) exp(−αt yi ht(xi)). We proved in the chapter that
(1/m) Σi 1_{yi f(xi)<0} ≤ (1/m) Σi exp(−yi f(xi)) = Πt Zt
and that AdaBoost's step size can be derived by minimizing this objective in each round t. Taking that same approach, observe that
Zt = Σi Dt(i) exp(−αt yi ht(xi)) = ε⁰t + ε⁻t exp(αt) + ε⁺t exp(−αt).
Differentiating the right-hand side with respect to αt and setting it equal to zero shows that Zt is minimized by letting αt = (1/2) log(ε⁺t/ε⁻t).
(b) One possible assumption is (ε⁺t − ε⁻t)/√(1 − ε⁰t) ≥ γ > 0. Informally, this assumption says that the difference between the accuracy and error of each weak hypothesis is non-negligible relative to the fraction of examples on which the hypothesis makes any prediction at all. In part (d) we will prove that this assumption suffices to drive the training error to zero.
(c) 1. Given: Training examples ((x1, y1), . . . , (xm, ym)).
2. Initialize D1 to the uniform distribution on training examples.
3. For t = 1, . . . , T:
a. ht ← base classifier in H.
b. αt ← (1/2) log(ε⁺t/ε⁻t).
c. For each i = 1, . . . , m: D_{t+1}(i) ← Dt(i) exp(−αt yi ht(xi))/Zt, where Zt ← Σi Dt(i) exp(−αt yi ht(xi)) is the normalization constant.
4. f ← Σ_{t=1}^T αt ht.
5. Return: sgn(f).
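The pseudocode above can be made runnable. The following is a hedged sketch (our own, not the book's): weak hypotheses return values in {−1, 0, +1} (0 meaning abstention); the selection rule (largest absolute edge) and the small-epsilon guards are assumptions added to keep the demo self-contained.

```python
import math

def boost_with_abstentions(X, y, weak_hyps, T=10):
    """AdaBoost-style loop of part (c); hypotheses may output 0 (abstain).
    Returns the final scoring function f and the per-round Z_t values."""
    m = len(X)
    D = [1.0 / m] * m
    ensemble, Zs = [], []
    for _ in range(T):
        # choose the hypothesis with the largest absolute edge under D
        h = max(weak_hyps,
                key=lambda g: abs(sum(D[i] * y[i] * g(X[i]) for i in range(m))))
        ep = sum(D[i] for i in range(m) if y[i] * h(X[i]) > 0)  # eps_plus
        em = sum(D[i] for i in range(m) if y[i] * h(X[i]) < 0)  # eps_minus
        # guard against ep or em being exactly 0 (perfect/terrible stump)
        alpha = 0.5 * math.log(max(ep, 1e-12) / max(em, 1e-12))
        Z = sum(D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m))
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) / Z for i in range(m)]
        ensemble.append((alpha, h))
        Zs.append(Z)
    return (lambda x: sum(a * g(x) for a, g in ensemble)), Zs
```

As in part (a), the identity (1/m) Σi exp(−yi f(xi)) = Πt Zt holds for any run, so the product of the Zt values upper-bounds the training error.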
(d) Plug the value of αt from part (a) into Zt = ε⁰t + ε⁻t exp(αt) + ε⁺t exp(−αt) to obtain Zt = ε⁰t + 2√(ε⁻t ε⁺t). Therefore
(1/m) Σi 1_{yi f(xi)<0} ≤ Πt Zt = Πt (ε⁰t + 2√(ε⁻t ε⁺t)).
Moreover, if the weak learning assumption from part (b) is satisfied then
ε⁰t + 2√(ε⁻t ε⁺t) = ε⁰t + √((1 − ε⁰t)² − (ε⁺t − ε⁻t)²)
= ε⁰t + (1 − ε⁰t)√(1 − (ε⁺t − ε⁻t)²/(1 − ε⁰t)²)
≤ √(1 − (ε⁺t − ε⁻t)²/(1 − ε⁰t))
≤ √(1 − γ²).
The first equality follows from (ε⁺t + ε⁻t)² − (ε⁺t − ε⁻t)² = 4ε⁺t ε⁻t (just multiply and gather terms) and ε⁺t + ε⁻t = 1 − ε⁰t. The first inequality follows from the fact that the square root is concave on [0, ∞), and thus λ√x + (1 − λ)√y ≤ √(λx + (1 − λ)y) for λ ∈ [0, 1]. The last inequality follows from the weak learning assumption.
Therefore we have (1/m) Σi 1_{yi f(xi)<0} ≤ (1 − γ²)^{T/2} ≤ exp(−γ²T/2), where we used 1 + x ≤ exp(x).
7.11 HingeBoost
(a) Since the hinge loss is convex, its composition with an affine function of α is also convex, and F is convex as a sum of convex functions.
For the existence of one-sided directional derivatives, one can use the fact that any convex function has one-sided directional derivatives or, alternatively, that our specific function is a sum of piecewise affine functions, which are also known to have one-sided directional derivatives (think of the one-dimensional hinge loss).
(b) Distinguishing different cases depending on the value of yi f_{t−1}(xi), it is straightforward to derive the following expressions for all j ∈ [N]:
F′₊(α_{t−1}, ej) = −Σ_{i=1}^m yi hj(xi)[1_{yi f_{t−1}(xi)<1} + 1_{(yi hj(xi)<0)∧(yi f_{t−1}(xi)=1)}]
F′₋(α_{t−1}, ej) = −Σ_{i=1}^m yi hj(xi)[1_{yi f_{t−1}(xi)<1} + 1_{(yi hj(xi)>0)∧(yi f_{t−1}(xi)=1)}].
The key here is that when yi f_{t−1}(xi) ≠ 1, each term in the sum will be either 0 or the affine function, independent of yi hj(xi). On the other hand, when yi f_{t−1}(xi) = 1, the sign of yi hj(xi) determines whether the finite differences will extend into the 0 portion or the affine portion of the term.
(c)
(b) Since exp is convex and composition with an affine function (of α) preserves convexity, each term of the objective is a convex function and Gρ is convex as a sum of convex functions. The differentiability follows directly from that of exp.
(d) For the coordinate descent algorithm to make progress at each round, the step size selected along the descent direction must be non-negative, that is,
((1 − ρ)(1 − εt))/((1 + ρ)εt) > 1 ⇔ (1 − ρ)(1 − εt) > ρεt + εt ⇔ εt < (1 − ρ)/2.
Thus, the error of the base classifier chosen must be at least ρ/2 better than one half.
(e) The normalization factor Zt can be expressed in terms of εt and ρ using its definition:
Zt = Σ_{i=1}^m Dt(i) exp(−yi αt ht(xi))
= e^{−αt}(1 − εt) + e^{αt}εt
= √((1 + ρ)/(1 − ρ)) √((1 − εt)εt) + √((1 − ρ)/(1 + ρ)) √((1 − εt)εt)
= √(εt(1 − εt)) [√((1 + ρ)/(1 − ρ)) + √((1 − ρ)/(1 + ρ))]
= √(εt(1 − εt)) · 2/√(1 − ρ²)
= 2√(εt(1 − εt)/(1 − ρ²)).
i. In view of the definition of Dt and the bound derived in the first part of this exercise, we can write
R̂_{S,ρ}(f) ≤ (1/m) Σ_{i=1}^m exp(−yi Σ_{t=1}^T αt ht(xi) + ρ Σ_{t=1}^T αt)
= (1/m) Σ_{i=1}^m (m Π_{t=1}^T Zt) D_{T+1}(i) exp(ρ Σ_{t=1}^T αt)
= exp(ρ Σ_{t=1}^T αt) Π_{t=1}^T Zt.
ii. The expression of Zt was already given above. Plugging that expression into the bound of the previous question and using the expression of αt gives
R̂_{S,ρ}(f) ≤ (Π_{t=1}^T e^{αt})^ρ Π_{t=1}^T √(εt(1 − εt)) (u^{1/2} + u^{−1/2})
= (√((1 − ρ)/(1 + ρ)))^{ρT} (Π_t √((1 − εt)/εt))^ρ Π_{t=1}^T √(εt(1 − εt)) (u^{1/2} + u^{−1/2})
= (u^{(1+ρ)/2} + u^{−(1−ρ)/2})^T Π_{t=1}^T √(εt^{1−ρ}(1 − εt)^{1+ρ}).
iii. Using the hint and the result of the previous question, we can write
R̂_{S,ρ}(f) ≤ Π_t (1 − 2((1 − ρ)/2 − εt)²/(1 − ρ²))
≤ Π_t exp(−2((1 − ρ)/2 − εt)²/(1 − ρ²))
= exp(−2 Σ_t ((1 − ρ)/2 − εt)²/(1 − ρ²))
≤ exp(−2γ²T/(1 − ρ²)).
Thus, if the upper bound is less than 1/m, then R̂_{S,ρ}(f) = 0 and every training point has margin at least ρ. The inequality exp(−2γ²T/(1 − ρ²)) < 1/m is equivalent to T > (log m)(1 − ρ²)/(2γ²).
Chapter 8
Let w be the weight vector. Since each update is of the form w ← w + yi xi and since the
components of the sample points are integers, the components of w are also integers.
Let n1, . . . , nN ∈ Z denote the components of w. Then w correctly classifies all points iff yi(w · xi) > 0 for i = 1, . . . , m, that is,
n1 > 0
n1 − n2 < 0 ⇔ n2 > n1
−n1 − n2 + n3 > 0 ⇔ n3 > n1 + n2
. . .
(−1)^N (n1 + n2 + · · · + n_{N−1} − nN) < 0 ⇔ nN > n1 + n2 + · · · + n_{N−1}.
These last inequalities show that the data is linearly separable with w = (1, 2, . . . , 2N −1 ).
They also imply that n1 ≥ 1, n2 ≥ 2, n3 ≥ 4, . . . , nN ≥ 2N −1 . Since each update can at most
increment nN by 1, the number of updates is at least 2N −1 = Ω(2N ).
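The exponential lower bound can be checked with a short simulation. A sketch (assuming the construction xk = ek − Σ_{i<k} ei with all labels +1, which yields exactly the inequalities above):

```python
def perceptron_updates(N, max_passes=100000):
    """Run the Perceptron on x_k = e_k - sum_{i<k} e_i (labels all +1).

    Returns (number_of_updates, final_weight_vector)."""
    X = [[-1] * k + [1] + [0] * (N - k - 1) for k in range(N)]
    w = [0] * N
    updates = 0
    for _ in range(max_passes):
        changed = False
        for x in X:
            if sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake: update
                w = [wi + xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:
            break
    return updates, w
```

Since only updates on xN can increase the last coordinate of w, and that coordinate must reach at least 2^{N−1}, the update count grows at least as 2^{N−1}, matching the argument above.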
The bound is unaffected, as shown by the following, using the same definitions and steps as in this chapter:
Mρ ≤ (v · Σ_{t∈I} yt xt)/‖v‖
= (v · Σ_{t∈I} (w_{t+1} − wt)/η)/‖v‖   (definition of updates)
= (v · w_{T+1})/(η‖v‖)
≤ ‖w_{T+1}‖/η   (Cauchy-Schwarz ineq.)
= ‖w_{tm} + η y_{tm} x_{tm}‖/η   (tm largest t in I)
= [‖w_{tm}‖² + η²‖x_{tm}‖² + 2η y_{tm} w_{tm} · x_{tm}]^{1/2}/η   (with y_{tm} w_{tm} · x_{tm} ≤ 0)
≤ [‖w_{tm}‖² + η²R²]^{1/2}/η
≤ [M η²R²]^{1/2}/η = √M R   (applying the same argument to the previous ts in I).
(a) By assumption, there exists v ∈ R^N such that for all t ∈ [T], ρ ≤ yt(v · xt)/‖v‖, where ρ is the maximum margin achievable on S. Summing up these inequalities gives
Mρ ≤ (v · Σ_{t∈I} yt xt)/‖v‖ ≤ ‖Σ_{t∈I} yt xt‖   (Cauchy-Schwarz inequality)
= ‖Σ_{t∈I} (w_{t+1} − wt)‖   (definition of updates)
= ‖w_{T+1}‖   (telescoping sum, w0 = 0).
Thus, W_{T+1} = N Π_{t=1}^T (1 − (1 − β)Lt) and
log W_{T+1} = log N + Σ_{t=1}^T log(1 − (1 − β)Lt)
≤ log N − (1 − β) Σ_{t=1}^T Lt
= log N − (1 − β)LT.
(b) For all i ∈ [N],
log W_{T+1} ≥ log w_{T+1,i}
= log Π_{t=1}^T (1 − (1 − β)l_{t,i})
= Σ_{t=1}^T log(1 − (1 − β)l_{t,i})
≥ Σ_{t=1}^T [−(1 − β)l_{t,i} − (1 − β)² l²_{t,i}]
= −(1 − β)L_{T,i} − (1 − β)² Σ_{t=1}^T l²_{t,i}.
(c) Comparing the lower and upper bounds gives:
−(1 − β)L_{T,i} − (1 − β)² Σ_{t=1}^T l²_{t,i} ≤ log N − (1 − β)LT
⟹ LT ≤ L_{T,i} + (log N)/(1 − β) + (1 − β) Σ_{t=1}^T l²_{t,i}.
Clearly, for any i ∈ [N], Σ_{t=1}^T l²_{t,i} ≤ T. Thus, for all i ∈ [N],
LT ≤ L_{T,i} + (log N)/(1 − β) + (1 − β)T,
in particular, LT ≤ L^min_T + (log N)/(1 − β) + (1 − β)T. Differentiating with respect to β and setting the result to zero gives (log N)/(1 − β)² − T = 0, as in the case of the RWM algorithm. Thus, for β = max{1/2, 1 − √((log N)/T)}, LT ≤ L^min_T + 2√(T log N), that is, RT ≤ 2√(T log N).
(b) The bound follows directly using the same steps as in the original proof, but with the general inequality. The optimal choice of β is max{α/2, 1 − √(α(log N)/T)}, which gives
RT ≤ √((log N)T/α) + √(α(log N)T).
(c) Setting α close to 2 forces β close to 1, which results in an algorithm that downweights
experts in a very conservative fashion. From the bound in part (b) we see that α = 1, as
is used in the chapter, is the optimal choice.
(a) Let u ∈ H with ‖u‖ = 1. Observe that for any t ∈ [T], we can write
1 − [1 − yt(u · Φ(xt))/ρ]₊ ≤ yt(u · Φ(xt))/ρ.
Let I be the set of t ∈ [T] at which kernel Perceptron makes an update, and let MT be the total number of updates made; then, summing up the previous inequalities over all such ts and using the Cauchy-Schwarz inequality yields
MT − Σ_{t∈I} [1 − yt(u · Φ(xt))/ρ]₊ ≤ Σ_{t∈I} yt(u · Φ(xt))/ρ ≤ ‖Σ_{t∈I} yt Φ(xt)‖/ρ.
Chapter 9
9.5 Decision trees. A binary decision tree with n nodes has exactly n + 1 leaves. Each node can be
labeled with an integer from {1, . . . , N } indicating which dimension is queried to make a binary
split and each leaf can be labeled with ±1 to indicate the classification made at that leaf. Fix
an ordering of the nodes and leaves and consider all possible labelings of this sequence. There
can be no more than (N + 2)2n+1 distinct binary trees and, thus, the VC-dimension of this
finite set of hypotheses can be no larger than (2n + 1) log(N + 2) = O(n log N ).
Chapter 11
(a) Using the closed-form solution α = (K + λI)⁻¹y and the fact M′⁻¹ − M⁻¹ = −M′⁻¹(M′ − M)M⁻¹ (this can be verified by simply expanding the right-hand side), we have
α′ − α = [(K′ + λI)⁻¹ − (K + λI)⁻¹]y = −(K′ + λI)⁻¹(K′ − K)(K + λI)⁻¹y.
(b) Using the fact that for any vector v and matrix A, ‖Av‖² = v⊤A⊤Av ≤ ‖v‖²‖A⊤A‖ = ‖v‖²‖A‖², we have
‖α′ − α‖ ≤ ‖(K′ + λI)⁻¹‖ ‖K′ − K‖ ‖(K + λI)⁻¹‖ ‖y‖.
Since |yi| ≤ M, we have ‖y‖ ≤ √m M, and we can use the observation ‖(K + λI)⁻¹‖ = λmax((K + λI)⁻¹) = λmin(K + λI)⁻¹ ≤ 1/λ, where λmax(A) and λmin(A) are the maximum and minimum eigenvalues of A, respectively.
(a) Using the closed-form solution for the inner maximization problem, α = (K + λI)⁻¹y, simplifies the joint optimization to a simpler minimization:
min_{K⪰0} y⊤(K + λI)⁻¹y,  s.t. ‖K‖₂ ≤ 1.
Note that for any invertible matrix A, y⊤A⁻¹y ≥ ‖y‖²λmin(A⁻¹) = ‖y‖²λmax(A)⁻¹. Thus, it is easy to see that min_{K⪰0} y⊤(K + λI)⁻¹y ≥ ‖y‖²/(1 + λ) since ‖K‖₂ = λmax(K) ≤ 1.
We now show that K = (1/‖y‖²)yy⊤ achieves this lower bound. First, note that ((1/‖y‖²)yy⊤ + λI)y = (1 + λ)y, so y is an eigenvector of the matrix with eigenvalue (1 + λ). Since the matrix is invertible, it can be shown that y is also an eigenvector of ((1/‖y‖²)yy⊤ + λI)⁻¹ with eigenvalue 1/(1 + λ) (for example, consider the eigendecomposition of the matrix).
(b) The kernel matrix alone is not useful for classifying future unseen points x, which requires computing Σ_{i=1}^m αi K(xi, x) and needs access to an underlying kernel function that is consistent with the kernel matrix. Finding such a kernel function may be difficult in general, and furthermore the choice of function may not be unique.
(a) Note that the hypothesis h_{S_i} makes zero error on the ith point of S'_i and is defined as the minimizer with respect to the remaining points. Thus, h_{S_i} is also the minimizer with respect to the set S'_i.
(b) Using part (a) and the definition of the KRR hypothesis in terms of the dual variables, we have h_{S_i}(x_i) = h_{S'_i}(x_i) = α_{S'_i}^⊤ K e_i, where α_{S'_i} is the optimal set of dual variables for KRR trained on S'_i. Noting that the closed-form solution is α_{S'_i} = (K + λI)^{-1} y_i, where y_i denotes the label vector of S'_i, proves the equality.
(c) Using part (b), we can write

  h_{S_i}(x_i) − y_i = y_i^⊤(K + λI)^{-1} K e_i − y_i
    = (y − y_i e_i + h_{S_i}(x_i) e_i)^⊤ (K + λI)^{-1} K e_i − y_i
    = h_S(x_i) − y_i + (h_{S_i}(x_i) − y_i) e_i^⊤(K + λI)^{-1} K e_i,

which implies h_{S_i}(x_i) − y_i = (h_S(x_i) − y_i)/(1 − e_i^⊤(K + λI)^{-1} K e_i). Thus, we can write

  R̂_LOO(A) = (1/m) Σ_{i=1}^m ( (h_S(x_i) − y_i) / (1 − e_i^⊤(K + λI)^{-1} K e_i) )².
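The closed-form leave-one-out expression can be compared against explicit retraining on each S_i. A minimal numpy sketch on synthetic data, assuming the unnormalized KRR objective Σ_i (h(x_i) − y_i)² + λ‖h‖² so that α = (K + λI)^{-1} y:

```python
import numpy as np

rng = np.random.default_rng(2)
m, lam = 10, 0.7
X = rng.normal(size=(m, 3))
y = rng.normal(size=m)
K = X @ X.T
I = np.eye(m)

Kinv = np.linalg.inv(K + lam * I)
yhat = K @ Kinv @ y                  # full-sample predictions h_S(x_i)
H_diag = np.diag(K @ Kinv)           # e_i^T (K + lam I)^{-1} K e_i

# closed-form leave-one-out residuals
loo_closed = (yhat - y) / (1.0 - H_diag)

# explicit retraining on S \ {i} for each i
loo_explicit = np.empty(m)
for i in range(m):
    keep = [j for j in range(m) if j != i]
    Ki = K[np.ix_(keep, keep)]
    ai = np.linalg.solve(Ki + lam * np.eye(m - 1), y[keep])
    loo_explicit[i] = K[i, keep] @ ai - y[i]
```

The two residual vectors should agree to numerical precision, which also confirms the 1 − e_i^⊤(K + λI)^{-1}Ke_i denominator.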
(d) In this case, the two losses differ only by the factor 1/γ². Thus, if γ = √m, the two performance measures coincide.
Chapter 14
Using the bound on K, we have ‖k_x‖ ≤ √m r, and using the result from exercise 10.3 completes the bound.
14.5 Stability of relative-entropy regularization
(a) This result follows from the property |∫ g(x) dx| ≤ ∫ |g(x)| dx for any integrable function g and the bound on L:

  |H(g, z) − H(g', z)| = |∫_Θ L(h_θ(x), y)(g(θ) − g'(θ)) dθ|
    ≤ ∫_Θ L(h_θ(x), y) |g(θ) − g'(θ)| dθ ≤ M ∫_Θ |g(θ) − g'(θ)| dθ.
(b) i. We can write

  (∫_Θ |g(θ) − g'(θ)| dθ)² = (½ ∫_Θ |g(θ) − g'(θ)| dθ + ½ ∫_Θ |g'(θ) − g(θ)| dθ)²
    ≤ K(g, g') + K(g', g),

using Pinsker's inequality on each term and the convexity of t ↦ t². Thus, it suffices to show K(g, g') = B_{K(·,f₀)}(g‖g') + ∫_Θ (g(θ) − g'(θ)) dθ in order to show the result. Note that [∇_{g'} K(g', f₀)]_θ = log(g'(θ)/f₀(θ)) + 1, thus

  B_{K(·,f₀)}(g‖g') = K(g, f₀) − K(g', f₀) − ⟨g − g', ∇_{g'} K(g', f₀)⟩
    = K(g, f₀) − K(g', f₀) − ∫_Θ (g(θ) − g'(θ)) (log(g'(θ)/f₀(θ)) + 1) dθ
    = ∫_Θ ( g(θ) log(g(θ)/f₀(θ)) − g(θ) log(g'(θ)/f₀(θ)) − g(θ) + g'(θ) ) dθ
    = ∫_Θ ( g(θ) log(g(θ)/g'(θ)) − g(θ) + g'(θ) ) dθ
    = K(g, g') + ∫_Θ ( g'(θ) − g(θ) ) dθ.

Since g and g' are probability densities, the last integral is zero, so K(g, g') = B_{K(·,f₀)}(g‖g') and, similarly, K(g', g) = B_{K(·,f₀)}(g'‖g), which gives the result.
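The inequality in part (i) can be sanity-checked on discrete distributions, where the integrals become sums; a small illustrative sketch with random 8-point distributions (all parameters hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
g = rng.uniform(0.1, 1.0, size=8)
gp = rng.uniform(0.1, 1.0, size=8)
g, gp = g / g.sum(), gp / gp.sum()        # two strictly positive probability vectors

l1 = np.abs(g - gp).sum()                 # discrete analogue of the L1 distance

def kl(p, q):                             # discrete relative entropy K(p, q)
    return float(np.sum(p * np.log(p / q)))

# claim: (L1 distance)^2 <= K(g, g') + K(g', g)
slack = kl(g, gp) + kl(gp, g) - l1**2
```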
ii. To show the first inequality, note that B_{F_S} = B_{R̂_S} + λ B_{K(·,f₀)} and, since Bregman divergences are non-negative, λ B_{K(·,f₀)} ≤ B_{F_S}. Then, using the definition of g and g' as the minimizers of F_S and F_{S'}, respectively,

  B_{K(·,f₀)}(g‖g') + B_{K(·,f₀)}(g'‖g) ≤ (1/λ) ( B_{F_{S'}}(g‖g') + B_{F_S}(g'‖g) )
    = (1/λ) ( R̂_{S'}(g) − R̂_{S'}(g') + R̂_S(g') − R̂_S(g) )
    = (1/(mλ)) ( H(g', z_m) − H(g, z_m) + H(g, z'_m) − H(g', z'_m) ),

where the last equality follows from the fact that only the mth point differs in S and S'. Finally, using the bound from part (a) completes the question.
iii. Combining parts (i) and (ii), we have

  (∫_Θ |g(θ) − g'(θ)| dθ)² ≤ (2M/(mλ)) ∫_Θ |g(θ) − g'(θ)| dθ
    ⟺ ∫_Θ |g(θ) − g'(θ)| dθ ≤ 2M/(mλ).

Thus, using the bound from part (a), we have

  |H(g, z) − H(g', z)| ≤ 2M²/(mλ).
Chapter 15
(a)

  variance = (1/m) Σ_{i=1}^m (x_i^⊤u − x̄^⊤u)² = (1/m) Σ_{i=1}^m ((x_i − x̄)^⊤u)²
    = (1/m) Σ_{i=1}^m ((x_i − x̄)^⊤u)^⊤((x_i − x̄)^⊤u)
    = (1/m) Σ_{i=1}^m u^⊤(x_i − x̄)(x_i − x̄)^⊤u = u^⊤Cu.
(b) We want to maximize u^⊤Cu subject to u^⊤u = 1. We can directly apply the Rayleigh quotient (section A.2.3) to show that u* is the eigenvector of C associated with its largest eigenvalue (or, equivalently, the largest singular vector of C), which proves that PCA with k = 1 projects the data onto the direction of maximal variance. Alternatively, we can prove this by introducing a Lagrange multiplier α to enforce the constraint u^⊤u = 1. To maximize the Lagrangian u^⊤Cu + α(1 − u^⊤u), we set its gradient with respect to u to zero, which gives Cu = αu. This equality implies that u* is an eigenvector of C. Finally, we observe that, among all eigenvectors of C, the one associated with the largest eigenvalue of C maximizes u^⊤Cu.
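Both arguments can be illustrated numerically: the top eigenvector of the empirical covariance attains the maximal projected variance. A short numpy sketch on random correlated data (all names and sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 200, 5
X = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))  # rows are the points x_i
xbar = X.mean(axis=0)
C = (X - xbar).T @ (X - xbar) / m                      # covariance matrix C

evals, evecs = np.linalg.eigh(C)                       # ascending eigenvalues
u_star = evecs[:, -1]                                  # top eigenvector of C

def variance(u):
    return float(np.mean(((X - xbar) @ u) ** 2))       # (1/m) sum ((x_i - xbar)^T u)^2

top_var = variance(u_star)
dirs = rng.normal(size=(100, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)    # random unit directions
beats_random = all(variance(v) <= top_var + 1e-9 for v in dirs)
```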
(a) Observe that ‖x_i − x_j‖² = (x_i − x_j)^⊤(x_i − x_j) = x_i^⊤x_i + x_j^⊤x_j − 2x_i^⊤x_j and rearrange terms.
(b) Noting that X* = X − (1/m) X11^⊤ and plugging into the equation K* = X*^⊤X* yields the result.
(c) Note that the scalar form of the equation in (b) is

  K*_{ij} = K_{ij} − (1/m) Σ_{k=1}^m K_{ik} − (1/m) Σ_{k=1}^m K_{kj} + (1/m²) Σ_{k,l} K_{kl}.

Substituting with the equation D²_{ij} = K_{ii} + K_{jj} − 2K_{ij} from (a) and simplifying yields the result.
(d) We first observe that −½ HDH = −½ (D − (1/m) D11^⊤ − (1/m) 11^⊤D + (1/m²) 11^⊤D11^⊤). By inspection, the matrix expression on the right-hand side corresponds to the scalar expression with four terms on the right-hand side of the equation in (c).
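The double-centering identity −½HDH = K* is straightforward to verify numerically; a small sketch with random points, taking H = I − (1/m)11^⊤ as in (d):

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 15, 4
X = rng.normal(size=(d, m))                   # columns are the points x_i
Xc = X - X.mean(axis=1, keepdims=True)        # X* = X - (1/m) X 1 1^T
K_star = Xc.T @ Xc                            # centered Gram matrix K*

sq = (X * X).sum(axis=0)
D = sq[:, None] + sq[None, :] - 2 * X.T @ X   # D_ij = ||x_i - x_j||^2
H = np.eye(m) - np.ones((m, m)) / m           # centering matrix

max_err = float(np.abs(-0.5 * H @ D @ H - K_star).max())
```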
  argmin_{y'} Σ_{i,j} W_{ij} ‖y'_i − y'_j‖²₂ = argmin_{y'} Σ_{i,j} W_{ij} (y'_i² + y'_j² − 2 y'_i y'_j)
    = argmin_{y'} ( 2 Σ_i D_{ii} y'_i² − 2 Σ_{i,j} W_{ij} y'_i y'_j )
    = argmin_{y'} ( y'^⊤Dy' − y'^⊤Wy' )
    = argmin_{y'} y'^⊤Ly'.
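The identity behind this chain, Σ_{i,j} W_{ij}(y_i − y_j)² = 2 y^⊤Ly (the factor 2 is immaterial inside the argmin), can be checked directly on a random symmetric weight matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
m = 10
A = rng.uniform(size=(m, m))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)                 # symmetric weights, no self-loops
D = np.diag(W.sum(axis=1))               # degree matrix
L = D - W                                # unnormalized graph Laplacian

y = rng.normal(size=m)
lhs = float(sum(W[i, j] * (y[i] - y[j]) ** 2
                for i in range(m) for j in range(m)))
rhs = 2 * float(y @ L @ y)
```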
(a) For the first part of the question, note that W is SPSD if x^⊤Wx ≥ 0 for all x ∈ R^l. This condition is equivalent to y^⊤Ky ≥ 0 for all y ∈ R^m with y_i = 0 for l + 1 ≤ i ≤ m. Since K is SPSD by assumption, this latter condition holds. For the second part, we write K̃ in block form as

  K̃ = [ W     ] W† [ W   K_{21}^⊤ ] = [ W       K_{21}^⊤           ]
      [ K_{21} ]                       [ K_{21}  K_{21} W† K_{21}^⊤ ].

Comparison with the block form of K then immediately yields the desired result.
(b) Observe that C = X^⊤X' and W = X'^⊤X'. Thus,

  K̃ = CW†C^⊤ = X^⊤X'(X'^⊤X')†X'^⊤X = X^⊤U_{X'}U_{X'}^⊤X = X^⊤P_{U_{X'}}X,

where U_{X'} is an orthonormal basis of the column space of X' and P_{U_{X'}} = U_{X'}U_{X'}^⊤ is the orthogonal projection onto it.
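A quick numerical check of the Nyström expression K̃ = CW†C^⊤: for an SPSD K, the first l rows and columns are reproduced exactly, and when rank(K) ≤ l the reconstruction is exact. A sketch with hypothetical low-rank data:

```python
import numpy as np

rng = np.random.default_rng(7)
m, l, r = 12, 5, 4
F = rng.normal(size=(m, r))
K = F @ F.T                                # SPSD kernel matrix, rank r <= l
C = K[:, :l]                               # first l columns of K
W = K[:l, :l]                              # top-left l x l block

K_tilde = C @ np.linalg.pinv(W) @ C.T      # Nystrom approximation

block_exact = np.allclose(K_tilde[:l, :], K[:l, :])   # first l rows match K
fully_exact = np.allclose(K_tilde, K)                 # exact since rank(K) <= l
```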
Chapter C
C.1 For any δ > 0, let t = f^{-1}(δ). Plugging this into P[X > t] ≤ f(t) yields P[X > f^{-1}(δ)] ≤ δ, that is, P[X ≤ f^{-1}(δ)] ≥ 1 − δ.
C.2 By definition of expectation and using the hint, we can write

  E[X] = Σ_{n≥0} n P[X = n] = Σ_{n≥1} n (P[X ≥ n] − P[X ≥ n + 1]).

Note that in this sum, for n ≥ 1, P[X ≥ n] is added n times and subtracted n − 1 times; thus E[X] = Σ_{n≥1} P[X ≥ n].
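The tail-sum identity E[X] = Σ_{n≥1} P[X ≥ n] can be illustrated with a geometric random variable, for which both sides equal 1/p. A truncated plain-Python sketch (the truncation point is chosen so the neglected tail is negligible):

```python
# geometric distribution on {1, 2, ...}: P[X = n] = p * (1 - p)^(n - 1)
p = 0.3
N = 2000                                  # truncation; remaining tail mass is negligible

E_direct = sum(n * p * (1 - p) ** (n - 1) for n in range(1, N))
E_tail = sum((1 - p) ** (n - 1) for n in range(1, N))   # P[X >= n] = (1-p)^(n-1)
```

Both sums should agree with each other and with the known mean 1/p.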
[Figure E.15: score10 and score50 as functions of k for random projection nearest neighbors (panels (a) and (b)).]
[Figure E.16: score10 and score50 as functions of k for PCA nearest neighbors (panels (a) and (b)).]
More generally, by definition of the Lebesgue integral, for any non-negative random variable
X, the following identity holds:
  E[X] = ∫₀^{+∞} P[X ≥ t] dt.
Chapter D
D.2 Estimating label bias. Let p̂₊ be the fraction of positively labeled points in S = (x₁, ..., x_m):

  p̂₊ = (1/m) Σ_{i=1}^m 1_{f(x_i)=+1}.
(c) If m is odd, since P[N(S) ≥ m/2 | x = x_A] ≥ P[N(S) ≥ (m+1)/2 | x = x_A], we can use the lower bound

  error(f_o) ≥ ½ P[N(S) ≥ (m+1)/2 | x = x_A].

Thus, in both cases we can use the lower bound expression with ⌈m/2⌉ instead of m/2.
(d) If error(f_o) is at most δ, then

  ¼ [1 − (1 − e^{−2⌈m/2⌉ε²/(1−2ε)²})^{1/2}] < δ,

which gives

  e^{−2⌈m/2⌉ε²/(1−2ε)²} < 1 − (1 − 4δ)² = 4δ(2 − 4δ) = 8δ(1 − 2δ),

and

  m > 2 · ((1 − 2ε)²/(2ε²)) log(1/(8δ(1 − 2δ))).

The lower bound varies as 1/ε².
(e) Let f be an arbitrary rule, and denote by F_A the set of samples for which f(S) = x_A and by F_B its complement. Then, by definition of the error,

  error(f) = Σ_{S ∈ F_A} P[S ∧ x_B] + Σ_{S ∈ F_B} P[S ∧ x_A]
    = ½ Σ_{S ∈ F_A} P[S | x_B] + ½ Σ_{S ∈ F_B} P[S | x_A]
    = ½ Σ_{S ∈ F_A, N(S) < m/2} P[S | x_B] + ½ Σ_{S ∈ F_A, N(S) ≥ m/2} P[S | x_B]
      + ½ Σ_{S ∈ F_B, N(S) < m/2} P[S | x_A] + ½ Σ_{S ∈ F_B, N(S) ≥ m/2} P[S | x_A].

Now, if N(S) ≥ m/2, clearly P[S | x_B] ≥ P[S | x_A]. Similarly, if N(S) < m/2, clearly P[S | x_A] ≥ P[S | x_B]. In view of these inequalities, error(f) can be lower bounded as follows:

  error(f) ≥ ½ Σ_{S ∈ F_A, N(S) < m/2} P[S | x_B] + ½ Σ_{S ∈ F_A, N(S) ≥ m/2} P[S | x_A]
      + ½ Σ_{S ∈ F_B, N(S) < m/2} P[S | x_B] + ½ Σ_{S ∈ F_B, N(S) ≥ m/2} P[S | x_A]
    = ½ Σ_{S : N(S) < m/2} P[S | x_B] + ½ Σ_{S : N(S) ≥ m/2} P[S | x_A]
    = error(f_o).
Oskar’s rule is known as the maximum likelihood solution.
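The optimality argument in (e) can be checked by exhaustive enumeration in a small hypothetical instantiation: assume that under x_B each toss is 1 with probability ½ + ε, under x_A with probability ½ − ε, that N(S) counts the 1s (so that the majority rule picks x_B exactly when N(S) ≥ m/2), and that, by exchangeability, a rule only needs to depend on N(S). All parameter values below are illustrative:

```python
from math import comb

eps, m = 0.1, 9                       # hypothetical bias and sample size

def pmf(k, q):                        # P[N(S) = k] when each toss is 1 w.p. q
    return comb(m, k) * q**k * (1 - q) ** (m - k)

def error(rule):                      # rule(k) in {'A', 'B'}; uniform prior on the coins
    return sum(0.5 * pmf(k, 0.5 - eps) * (rule(k) == 'B')
               + 0.5 * pmf(k, 0.5 + eps) * (rule(k) == 'A')
               for k in range(m + 1))

ml_rule = lambda k: 'B' if k >= m / 2 else 'A'    # Oskar's majority rule
ml_error = error(ml_rule)

# compare against every deterministic rule on N(S)
ml_optimal = all(
    ml_error <= error(lambda k, mk=mask: 'B' if (mk >> k) & 1 else 'A') + 1e-15
    for mask in range(2 ** (m + 1))
)
```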