Foundations Computational Mathematics: Online Learning Algorithms
DOI: 10.1007/s10208-004-0160-z
Found. Comput. Math. 145–170 (2006)
The Journal of the Society for the Foundations of Computational Mathematics
Online Learning Algorithms
Steve Smale (1) and Yuan Yao (2)
(1) Toyota Technological Institute at Chicago, 1427 East 60th Street, Chicago, IL 60637, USA
(2) Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720, USA. Current address: Toyota Technological Institute at Chicago, 1427 East 60th Street, Chicago, IL 60637, USA
Abstract. In this paper, we study an online learning algorithm in Reproducing
Kernel Hilbert Spaces (RKHSs) and general Hilbert spaces. We present a general form
of the stochastic gradient method to minimize a quadratic potential function by an
independent identically distributed (i.i.d.) sample sequence, and show a probabilistic
upper bound for its convergence.
1. Introduction
Consider learning from examples $(x_t, y_t) \in X \times \mathbb{R}$ ($t \in \mathbb{N}$), drawn at random
from a probability measure $\rho$ on $X \times \mathbb{R}$. For $\lambda > 0$, one wants to approximate the
Date received: October 26, 2004. Final version received: April 25, 2005. Date accepted: April 28, 2005.
Communicated by Filipe Cucker. Online publication: September 23, 2005.
AMS classification: 62L20, 68Q32, 68T05.
Key words and phrases: Online learning, Stochastic approximation, Regularization, Reproducing Kernel Hilbert Spaces.
function $f_\rho : X \to Y$ by
$$f_\rho(x) = \int_Y y \, d\rho_{Y|x},$$
the regression function of $\rho$. In other words, for each $x \in X$, $f$
X
(X) be the Hilbert space of square integrable
functions with respect to
X
, and denoted by L
2
(X) and | |
= ess sup
X
[ f (x)[). We assume that | f
< and
f
L
2
n
i =m
x
i
= 0 and
n
i =m
x
i
= 1, for any summand x
i
.
2. An Online Learning Algorithm in RKHS
Let $K: X \times X \to \mathbb{R}$ be a Mercer kernel, i.e., a continuous symmetric real function
which is positive semidefinite in the sense that $\sum_{i,j=1}^{l} c_i c_j K(x_i, x_j) \ge 0$ for any
$l \in \mathbb{N}$ and any choice of $x_i \in X$ and $c_i \in \mathbb{R}$ ($i = 1, \dots, l$). Note $K(x, x) \ge 0$ for all
$x$. In the following $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ denote the Euclidean norm and the Euclidean inner
product in $\mathbb{R}^n$, respectively. We give two typical examples of Mercer kernels. The
first is the Gaussian kernel $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ defined by $K(x, x') = \exp(-\|x - x'\|^2 / c^2)$ ($c > 0$).
The second is the linear kernel $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ defined by
$K(x, x') = \langle x, x' \rangle + 1$. The restriction of these functions on $X \times X$ will induce
the corresponding kernels on subsets of $\mathbb{R}^n$.
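As a small illustration of the two kernels just described (the Gaussian kernel with width $c$ and the linear kernel with the added constant 1), the following sketch evaluates them on plain Python tuples. The function names and the helper `gram_quadratic_form` are illustrative choices and not part of the original text; the helper evaluates the double sum $\sum_{i,j} c_i c_j K(x_i, x_j)$ whose nonnegativity defines positive semidefiniteness.

```python
import math

def gaussian_kernel(x, xp, c=1.0):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / c^2) with c > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / c ** 2)

def linear_kernel(x, xp):
    """Linear kernel K(x, x') = <x, x'> + 1."""
    return sum(a * b for a, b in zip(x, xp)) + 1.0

def gram_quadratic_form(kernel, points, coeffs):
    """Evaluate sum_{i,j} c_i c_j K(x_i, x_j) (hypothetical helper)."""
    return sum(ci * cj * kernel(xi, xj)
               for ci, xi in zip(coeffs, points)
               for cj, xj in zip(coeffs, points))

if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.5,)]
    print(gram_quadratic_form(gaussian_kernel, pts, [1.0, -2.0, 0.5]))
```

For both kernels the quadratic form should come out nonnegative for any choice of points and coefficients, up to floating-point rounding.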
Let $\mathcal{H}_K$ be the Reproducing Kernel Hilbert Space (RKHS) associated with a
Mercer kernel $K$. Recall the definition as follows. Consider the vector space $V_K$
generated by $\{K_t : t \in X\}$, i.e., all the finite linear combinations of $K_t$, where,
for each $t \in X$, the function $K_t: X \to \mathbb{R}$ is defined by $K_t(x) = K(x, t)$. An
inner product $\langle \cdot, \cdot \rangle_K$ on this vector space can be defined as the unique linear
extension of $\langle K_x, K_{x'} \rangle_K := K(x, x')$ and its induced norm$^{1}$ is $\|f\|_K = \sqrt{\langle f, f \rangle_K}$
for each $f \in V_K$. Let $\mathcal{H}_K$ be the completion of this inner product space $V_K / V_0$.
It follows that for any $f \in \mathcal{H}_K$, $f(x) = \langle f, K_x \rangle_K$ ($x \in X$). This is often called
the reproducing property in the literature. Define a linear map $L_K: \mathscr{L}^2_{\rho_X}(X) \to \mathcal{H}_K$
by $L_K(f)(x) = \int_X K(x, t) f(t) \, d\rho_X$. The operator $L_K + \lambda I: \mathcal{H}_K \to \mathcal{H}_K$ is an
isomorphism if $\lambda > 0$ (endomorphism if $\lambda \ge 0$), where $L_K: \mathcal{H}_K \to \mathcal{H}_K$ is the
restriction of $L_K: \mathscr{L}^2_{\rho_X}(X) \to \mathcal{H}_K$.
Given a sequence of examples $z_t = (x_t, y_t) \in X \times Y$ ($t \in \mathbb{N}$), our online
learning algorithm in RKHS is
$$f_{t+1} = f_t - \gamma_t \big( (f_t(x_t) - y_t) K_{x_t} + \lambda f_t \big) \quad \text{for some } f_1 \in \mathcal{H}_K, \text{ e.g., } f_1 = 0, \tag{1}$$
where:
(1) for each $t \in \mathbb{N}$, $(x_t, y_t)$ is drawn identically and independently according to $\rho$;
(2) the regularization parameter $\lambda \ge 0$; and
(3) the step size $\gamma_t > 0$.
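A minimal sketch of update (1) may help make the recursion concrete. It assumes that $f_t$ is stored through its expansion coefficients on the kernel sections $K_{x_i}$ (which is possible when $f_1 = 0$), so that one step rescales the old coefficients by $1 - \gamma_t \lambda$ and appends the coefficient $-\gamma_t (f_t(x_t) - y_t)$ for $K_{x_t}$. The class name, the `step` callable and the Gaussian kernel choice are illustrative assumptions, not notation from the paper.

```python
import math

def gaussian_kernel(x, xp, c=1.0):
    """Gaussian kernel; for this kernel C_K = 1."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xp)) / c ** 2)

class OnlineKernelLearner:
    """Stores f_t = sum_i coefs[i] * K(., points[i]), starting from f_1 = 0."""

    def __init__(self, kernel, lam, step):
        self.kernel = kernel   # Mercer kernel K
        self.lam = lam         # regularization parameter lambda >= 0
        self.step = step       # callable t -> gamma_t > 0 (assumed interface)
        self.points, self.coefs = [], []
        self.t = 1

    def predict(self, x):
        return sum(c * self.kernel(x, p) for c, p in zip(self.coefs, self.points))

    def update(self, x, y):
        gamma = self.step(self.t)
        residual = self.predict(x) - y          # f_t(x_t) - y_t, evaluated with f_t
        shrink = 1.0 - gamma * self.lam         # factor multiplying the old f_t
        self.coefs = [shrink * c for c in self.coefs]
        self.points.append(x)
        self.coefs.append(-gamma * residual)    # new coefficient on K_{x_t}
        self.t += 1

if __name__ == "__main__":
    lam, theta = 0.1, 0.75
    # Step size of the form 1 / ((lam + C_K^2) * t^theta) with C_K = 1.
    learner = OnlineKernelLearner(gaussian_kernel, lam,
                                  lambda t: 1.0 / ((lam + 1.0) * t ** theta))
    for x, y in [((0.0,), 1.0), ((0.5,), 0.8), ((1.0,), 0.2)]:
        learner.update(x, y)
    print(learner.predict((0.25,)))
```

Storing one coefficient per example keeps each update linear in the number of examples seen so far; this is a design choice of the sketch, not a claim about how the authors implement the method.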
$^{1}$ Note that the zero set $V_0 = \{ f \in V_K : \|f\|_K = 0 \} = \{0\}$, whence $\|\cdot\|_K$ is in fact a norm. To see
this, by the reproducing property and the Cauchy–Schwarz inequality,
$$\|f\|_K = 0 \;\Rightarrow\; |f(t)| = |\langle f, K_t \rangle_K| \le \|f\|_K \|K_t\| = 0, \ \forall t \in X \;\Rightarrow\; f = 0,$$
which implies $V_0 = \{0\}$.
Note that for each $f$, the map $X \times Y \to \mathbb{R}$ given by $(x, y) \mapsto f(x) - y$ is a
real-valued random variable and $K_x: X \to \mathcal{H}_K$ is an $\mathcal{H}_K$-valued random variable.
Thus $f_{t+1}$ is a random variable with values in $\mathcal{H}_K$ depending on $(z_i)_{i=1}^{t}$. Moreover,
we see that $f_{t+1} \in \operatorname{span}\{ f_1, K_{x_i} : 1 \le i \le t \}$, a finite-dimensional subspace of
$\mathcal{H}_K$. The derivation of (1) is given in the next section from a stochastic gradient
algorithm in general Hilbert spaces.
In the sequel we assume that
$$C_K := \sup_{x \in X} \sqrt{K(x, x)} < \infty. \tag{2}$$
For example, the following typical kernels have $C_K = 1$:
(1) Gaussian kernel: $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ such that $K(x, x') = e^{-\|x - x'\|^2 / c^2}$.
(2) Homogeneous polynomial kernel: $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ such that $K(x, x') = \langle x, x' \rangle^d$. By the scaling property, we can restrict $K$ to the sphere $S^{n-1} \times S^{n-1}$.
(3) Translation invariant kernels: Any $K: X \times X \to \mathbb{R}$ such that $K(x, x') = K(x - x')$ and $K(0) = 1$.
In the sequel, we decompose | f
t
f
$$f_\lambda^* = (L_K + \lambda I)^{-1} L_K f_\rho, \tag{3}$$
where f
L
2
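satisfies
$$\mathbb{E}\big[ (f_\lambda^*(x) - y) K_x + \lambda f_\lambda^* \big] = 0. \tag{5}$$
To see this, it is enough to notice that, by $L_K(f)(x) = \int_X K(x, t) f(t) \, d\rho_X$, we
have
$$L_K(f_\lambda^*) = \mathbb{E}_x[ f_\lambda^*(x) K_x ],$$
and
$$L_K(f_\rho) = \mathbb{E}_x\big[ [\mathbb{E}_{y|x}\, y] K_x \big],$$
whence equation (5) turns out to be $L_K(f_\lambda^*) + \lambda f_\lambda^* = L_K(f_\rho)$, which is the relation defining $f_\lambda^*$ in (3).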
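Notice that the map $(x, y) \mapsto (f_\lambda^*(x) - y) K_x + \lambda f_\lambda^*$ is an $\mathcal{H}_K$-valued random
variable, with zero mean. Thus the following variance
$$\sigma^2 = \mathbb{E}_z\big[ \| (f_\lambda^*(x) - y) K_x + \lambda f_\lambda^* \|_K^2 \big], \tag{6}$$
characterizes the fluctuation about the equilibrium caused by the random sample
$z = (x, y)$. If $\sigma^2 = 0$, we have the deterministic gradient method (see Section 3).
If the outputs are bounded by a constant $M_\rho$, i.e., $y \in [-M_\rho, M_\rho]$ almost surely, then Proposition
3.4 in the next section implies
$$\sigma^2 \le \left( \frac{2 C_K M_\rho (\lambda + C_K^2)}{\lambda} \right)^2.$$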
The main purpose in this paper is to obtain a probabilistic upper bound for
| f
t
f
.
By the triangle inequality we may write
| f
t
f
| f
t
f
| f
. (7)
The second part of the right-hand side in (7), | f
|
K
, whence via | f
t
f
C
K
| f
t
f
|
K
we obtain an upper bound
on the sample error. Before the statement of the theorem, we define
$$\alpha = \frac{\lambda}{\lambda + C_K^2}, \tag{8}$$
whose meaning, as the inverse condition number, will be discussed in the next
section.
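Theorem A. Let $\gamma_t = 1 / ((\lambda + C_K^2) t^{\theta})$ ($t \in \mathbb{N}$) for some $\theta \in (\tfrac{1}{2}, 1)$. Then, for
each $t \in \mathbb{N}$, we may write
$$\| f_t - f_\lambda^* \|_K \le E_{\mathrm{init}}(t) + E_{\mathrm{samp}}(t), \tag{9}$$
where
$$E_{\mathrm{init}}(t) \le e^{[\alpha/(1-\theta)](1 - t^{1-\theta})} \| f_1 - f_\lambda^* \|_K;$$
and with probability at least $1 - \delta$ ($\delta \in (0, 1)$) in the space $Z^{t-1}$,
$$E_{\mathrm{samp}}^2(t) \le \frac{C_\theta \sigma^2}{\delta (\lambda + C_K^2)^2} \left( \frac{1}{\alpha} \right)^{\theta/(1-\theta)} \left( \frac{1}{t} \right)^{\theta}.$$
Here $\sigma^2$ is the variance in (6) and the positive constant $C_\theta$ satisfies
$$C_\theta = 8 + \frac{2}{2\theta - 1} \left( \frac{\theta}{e(2 - 2^{\theta})} \right)^{\theta/(1-\theta)}.$$
The proof will be deferred to later sections.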
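To get a feel for the sizes involved in Theorem A, the quantities in the bound can be evaluated numerically. The sketch below assumes the reading of the theorem in which $\alpha = \lambda/(\lambda + C_K^2)$, the initial-error factor is $e^{[\alpha/(1-\theta)](1 - t^{1-\theta})}$, and $C_\theta = 8 + \frac{2}{2\theta-1}\big(\frac{\theta}{e(2-2^\theta)}\big)^{\theta/(1-\theta)}$; the function and argument names (`c_theta`, `theorem_a_bound`, `init_dist`) are hypothetical.

```python
import math

def c_theta(theta):
    """Constant C_theta of Theorem A, defined for theta in (1/2, 1)."""
    if not 0.5 < theta < 1.0:
        raise ValueError("theta must lie in (1/2, 1)")
    base = theta / (math.e * (2.0 - 2.0 ** theta))
    return 8.0 + 2.0 / (2.0 * theta - 1.0) * base ** (theta / (1.0 - theta))

def theorem_a_bound(t, theta, lam, c_k, sigma2, delta, init_dist):
    """Upper bound E_init(t) + E_samp(t), valid with probability >= 1 - delta.

    init_dist stands for ||f_1 - f_lambda^*||_K and sigma2 for the variance (6).
    """
    alpha = lam / (lam + c_k ** 2)
    e_init = math.exp(alpha / (1.0 - theta) * (1.0 - t ** (1.0 - theta))) * init_dist
    e_samp_sq = (c_theta(theta) * sigma2 / (delta * (lam + c_k ** 2) ** 2)
                 * (1.0 / alpha) ** (theta / (1.0 - theta)) * (1.0 / t) ** theta)
    return e_init + math.sqrt(e_samp_sq)

if __name__ == "__main__":
    for t in (1, 10, 100, 1000):
        print(t, theorem_a_bound(t, 0.75, 0.5, 1.0, 1.0, 0.1, 1.0))
```

The printout shows how the sample-error term, which decays like $t^{-\theta/2}$, eventually dominates the initial-error term.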
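Remark 2.1. Assume $\lambda \le 1$ and consider the upper bound
$$\sigma^2 \le \left( 2 C_K M_\rho (\lambda + C_K^2) / \lambda \right)^2.$$
Then the following holds with probability at least $1 - \delta$ ($\delta \in (0, 1)$),
$$\| f_t - f_\lambda^* \|_K \le e^{\lambda C_1 (1 - t^{1-\theta})} \| f_1 - f_\lambda^* \|_K + \frac{C_2}{\sqrt{\delta}} \left( \frac{1}{\lambda} \right)^{(2-\theta)/2(1-\theta)} \left( \frac{1}{t} \right)^{\theta/2}, \tag{10}$$
where
$$C_1 = \frac{1}{(1 - \theta)(1 + C_K^2)} \quad \text{and} \quad C_2 = 2 C_K M_\rho \sqrt{C_\theta} \, (1 + C_K^2)^{\theta/2(1-\theta)}.$$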
Remark 2.2. In decomposition (9) in Theorem A, E
init
(t ) has a deterministic
bound and characterizes the accumulated effect from the initial choice, which is
called the initial error. E
samp
(t ) depends on the random sample and thus has a
probabilistic bound, which is called the sample error. We can also give upper
bounds on the approximation error, | f
.
The approximation error can be bounded if we put some regularity assump-
tions on the regression function f
L
2
r
|L
r
K
f
.
(2) Suppose L
r
K
f
L
2
|
K
r1/2
|L
r
K
f
.
Notice that since L
1/2
K
is an isomorphism, H
K
L
2
H
K
.
3. A Stochastic Gradient Algorithm in Hilbert Spaces
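In this section, we extend the setting in the first section to general Hilbert spaces.
Let $W$ be a Hilbert space with inner product $\langle \cdot, \cdot \rangle$. Consider the quadratic potential
map $V: W \to \mathbb{R}$ given by
$$V(w) = \tfrac{1}{2} \langle A w, w \rangle + \langle B, w \rangle + C, \tag{11}$$
where $A: W \to W$ is a positive definite bounded linear operator whose inverse is
bounded, i.e., $\|A^{-1}\| < \infty$, $B \in W$, and $C \in \mathbb{R}$. Then the gradient $\operatorname{grad} V: W \to W$ is given by
$$\operatorname{grad} V(w) = A w + B,$$
$V$ has a unique minimal point $w^*$ such that $\operatorname{grad} V(w^*) = A w^* + B = 0$,
i.e.,
$$w^* = -A^{-1} B.$$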
Our concern is to find an approximation of this point, when $A$, $B$, and $C$ are random
variables on a space $Z$. We give a sample complexity analysis (i.e., the sample
size sufficient to achieve an approximate minimizer with high probability) of the
so-called stochastic gradient method given by the update formula
$$w_{t+1} = w_t - \gamma_t \operatorname{grad} V(w_t), \quad \text{for } t = 1, 2, 3, \dots, \tag{12}$$
with $\gamma_t$ a positive step size. For each example $z$, the stochastic gradient of $V_z$,
$\operatorname{grad} V_z: W \to W$, is given by the affine map $\operatorname{grad} V_z(w) = A(z) w + B(z)$,
with $A(z)$, $B(z)$ denoting the values of random variables $A$, $B$ at $z \in Z$. Our
analysis will benefit from this affine structure and independent sampling. Thus
(12) becomes:
For $t = 1, 2, 3, \dots$, let $z_t$ be a sample sequence and define an update by
$$w_{t+1} = w_t - \gamma_t (A_t w_t + B_t) \quad \text{for some } w_1 \in W, \tag{13}$$
where:
(1) $z_t \in Z$ ($t \in \mathbb{N}$) are drawn independently and identically according to $\rho$;
(2) the step size $\gamma_t > 0$; and
(3) the map $A: Z \to SL(W)$ is a random variable depending on $z$ with values
in $SL(W)$, the vector space of symmetric bounded linear operators on $W$,
and $B: Z \to W$ is a $W$-valued random variable depending on $z$. For each
$t \in \mathbb{N}$, let $A_t = A(z_t)$ and $B_t = B(z_t)$.
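The update (13) is easy to try out when $W = \mathbb{R}^d$. The sketch below assumes the sign convention $\operatorname{grad} V_z(w) = A(z) w + B(z)$, so that the equilibrium is $w^* = -A^{-1} B$ in the deterministic case, and uses the step size $\gamma_t = 1/(\alpha_{\max} t^{\theta})$ that appears in the later analysis. Matrices are plain nested lists and all function names are illustrative.

```python
def sgd_step(w, a_mat, b_vec, gamma):
    """One update w_{t+1} = w_t - gamma * (A_t w_t + B_t) in R^d."""
    grad = [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) + b_i
            for row, b_i in zip(a_mat, b_vec)]
    return [w_i - gamma * g_i for w_i, g_i in zip(w, grad)]

def run_sgd(w1, samples, alpha_max, theta):
    """Iterate the update over samples (A_t, B_t), gamma_t = 1/(alpha_max * t**theta)."""
    w = list(w1)
    for t, (a_mat, b_vec) in enumerate(samples, start=1):
        w = sgd_step(w, a_mat, b_vec, 1.0 / (alpha_max * t ** theta))
    return w

if __name__ == "__main__":
    # Deterministic toy case: A = I, B = (-1, 2), hence w* = -A^{-1} B = (1, -2).
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(run_sgd([0.0, 0.0], [(identity, [-1.0, 2.0])] * 5, 1.0, 0.75))
```

With noisy samples $(A_t, B_t)$ the same loop applies unchanged; only the list of samples differs.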
From the stochastic gradient method in equation (12), we derive equation (1)
for our online algorithm in RKHSs. Consider the Hilbert space $W = \mathcal{H}_K$. For
fixed $z = (x, y) \in Z$, take the following quadratic potential map $V_z: \mathcal{H}_K \to \mathbb{R}$
defined by
$$V_z(f) = \tfrac{1}{2} \big\{ (f(x) - y)^2 + \lambda \|f\|_K^2 \big\}.$$
Recall that the gradient of $V_z$ is a map $\operatorname{grad} V_z: \mathcal{H}_K \to \mathcal{H}_K$ such that, for all
$g \in \mathcal{H}_K$,
$$\langle \operatorname{grad} V_z(f), g \rangle_K = DV_z(f)(g),$$
where the Fréchet derivative at $f$, $DV_z(f): \mathcal{H}_K \to \mathbb{R}$, is the linear functional
such that, for $g \in \mathcal{H}_K$,
$$\lim_{\|g\| \to 0} \frac{| V_z(f + g) - V_z(f) - DV_z(f)(g) |}{\|g\|} = 0.$$
Hence
$$DV_z(f)(g) = (f(x) - y) g(x) + \lambda \langle f, g \rangle_K = \langle (f(x) - y) K_x + \lambda f, g \rangle_K,$$
where the last step is due to the reproducing property $g(x) = \langle g, K_x \rangle_K$. This gives
the following proposition:
Proposition 3.1. $\operatorname{grad} V_z(f) = (f(x) - y) K_x + \lambda f$.
Taking $f = f_t$ and $(x, y) = (x_t, y_t)$, by $f_{t+1} = f_t - \gamma_t \operatorname{grad} V_{z_t}(f_t)$, we have
$$f_{t+1} = f_t - \gamma_t \big( (f_t(x_t) - y_t) K_{x_t} + \lambda f_t \big),$$
which establishes equation (1).
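The formula for the Fréchet derivative used above can be checked by expanding the potential directly. The following short computation is a sketch added for the reader; the remainder estimate uses the constant $C_K$ from (2) together with $|g(x)| = |\langle g, K_x \rangle_K| \le C_K \|g\|_K$, and is not taken from the original text.

```latex
\begin{aligned}
V_z(f+g) - V_z(f)
  &= \tfrac12\big[(f(x) + g(x) - y)^2 - (f(x) - y)^2\big]
     + \tfrac{\lambda}{2}\big[\|f+g\|_K^2 - \|f\|_K^2\big] \\
  &= (f(x) - y)\, g(x) + \lambda \langle f, g\rangle_K
     + \tfrac12\big(g(x)^2 + \lambda \|g\|_K^2\big), \\
\tfrac12\big(g(x)^2 + \lambda \|g\|_K^2\big)
  &\le \tfrac12\,(C_K^2 + \lambda)\,\|g\|_K^2 .
\end{aligned}
```

The linear part in $g$ is the derivative, and the quadratic remainder divided by $\|g\|_K$ tends to zero as $\|g\|_K \to 0$, which is exactly the defining limit.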
In the sequel we assume that
Finiteness Condition.
(1) For almost all $z \in Z$, $\alpha_{\min} I \le A(z) \le \alpha_{\max} I$ ($0 < \alpha_{\min} \le \alpha_{\max} < \infty$);
and
(2) $\|B(z)\| \le \beta < \infty$ for almost all $z \in Z$.
Consider the following averaging of equation (13), by taking the expectation
over the truncated history $(z_i)_{i=1}^{t}$,
$$\mathbb{E}_{z_1, \dots, z_t}[w_{t+1}] = \mathbb{E}_{z_1, \dots, z_{t-1}}[w_t] - \gamma_t \big( \mathbb{E}_{z_t}[A_t] w_t + \mathbb{E}_{z_t}[B_t] \big), \tag{14}$$
where $w_t$ depends on the truncated sample up to time $t - 1$, $(z_i)_{i=1}^{t-1}$. Then the
equilibrium for this averaged equation (14) will satisfy
$$\mathbb{E}_{z_t}[A_t] w_t + \mathbb{E}_{z_t}[B_t] = 0 \;\Rightarrow\; w_t = -\mathbb{E}_{z_t}[A_t]^{-1} \mathbb{E}_{z_t}[B_t]. \tag{15}$$
This motivates the following definitions:
Definition A.
(1) The equilibrium $w^* = -\hat{A}^{-1} \hat{B}$ where $\hat{A} = \mathbb{E}_z[A(z)]$ and $\hat{B} = \mathbb{E}_z[B(z)]$.
(2) The inverse condition number for the family $\{A(z) : z \in Z\}$, $\alpha = \alpha_{\min} / \alpha_{\max} \in (0, 1]$.
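For each $w \in W$, the stochastic gradient at $w$ as a map $\operatorname{grad} V_z(w): Z \to W$
such that $z \mapsto A(z) w + B(z)$, is a $W$-valued random variable depending on $z$. In
particular, $\operatorname{grad} V_z(w^*)$ has zero mean, with variance
$$\sigma^2 = \mathbb{E}[\| \operatorname{grad} V_z(w^*) \|^2] = \mathbb{E}_z[\| A_z w^* + B_z \|^2],$$
which reflects the fluctuations of $\operatorname{grad} V_z(w^*)$.
Theorem B. Let $\gamma_t = 1/(\alpha_{\max} t^{\theta})$ ($t \in \mathbb{N}$) for some $\theta \in (\tfrac{1}{2}, 1)$. Then, for each $t \in \mathbb{N}$, we have
$$\| w_t - w^* \| \le E_{\mathrm{init}}(t) + E_{\mathrm{samp}}(t) \tag{16}$$
where
$$E_{\mathrm{init}}(t) \le e^{[\alpha/(1-\theta)](1 - t^{1-\theta})} \| w_1 - w^* \|,$$
and with probability at least $1 - \delta$ ($\delta \in (0, 1)$),
$$E_{\mathrm{samp}}^2(t) \le \frac{\sigma^2}{\delta \alpha_{\max}^2} \, \psi_\theta(t, \alpha).$$
Here
$$\psi_\theta(t, \alpha) = \sum_{k=1}^{t-1} \frac{1}{k^{2\theta}} \prod_{i=k+1}^{t-1} \left( 1 - \frac{\alpha}{i^{\theta}} \right)^2.$$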
Remark 3.2. As in the first section, $E_{\mathrm{init}}(t)$ has a deterministic upper bound
and characterizes the accumulated effect from the initial choice, which is called
the initial error, and $E_{\mathrm{samp}}(t)$ depends on the random sample and thus has a
probabilistic bound, which is called the sample error.
Remark 3.3. In summary, $w_t$ in equation (13) satisfies that for arbitrary integer
$t \in \mathbb{N}$, the following holds with probability at least $1 - \delta$ in the space of all samples
of length $t - 1$, i.e., $Z^{t-1}$,
$$\| w_t - w^* \| \le e^{[\alpha/(1-\theta)](1 - t^{1-\theta})} \| w_1 - w^* \| + \frac{\sigma}{\sqrt{\delta} \, \alpha_{\max}} \, \psi_\theta^{1/2}(t, \alpha).$$
When $\sigma^2 = 0$, we have the following convergence rate for the deterministic
gradient algorithm
$$\| w_t - w^* \| \le e^{[\alpha/(1-\theta)](1 - t^{1-\theta})} \| w_1 - w^* \|,$$
which is faster than any polynomial rate.
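Proposition 3.4. Let $\alpha \in (0, 1]$ and $\theta \in (\tfrac{1}{2}, 1)$. The following upper bounds
hold for all $t \in \mathbb{N}$:
(1) $\sigma^2 \le (2\beta/\alpha)^2$; and
(2) $\psi_\theta(t, \alpha) \le C_\theta (1/\alpha)^{\theta/(1-\theta)} (1/t)^{\theta}$, where
$$C_\theta = 8 + \frac{2}{2\theta - 1} \left( \frac{\theta}{e(2 - 2^{\theta})} \right)^{\theta/(1-\theta)}.$$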
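The second bound of Proposition 3.4 can be compared against a direct evaluation of the sum. The sketch below assumes the definition $\psi_\theta(t, \alpha) = \sum_{k=1}^{t-1} k^{-2\theta} \prod_{i=k+1}^{t-1} (1 - \alpha/i^{\theta})^2$ and the bound $C_\theta (1/\alpha)^{\theta/(1-\theta)} (1/t)^{\theta}$; the function names are illustrative only.

```python
import math

def psi(t, alpha, theta):
    """Direct evaluation of psi_theta(t, alpha); an empty sum (t = 1) gives 0."""
    total, tail = 0.0, 1.0          # tail = product over i = k+1 .. t-1
    for k in range(t - 1, 0, -1):
        total += tail / k ** (2.0 * theta)
        tail *= (1.0 - alpha / k ** theta) ** 2   # extend the product down to i = k
    return total

def c_theta(theta):
    """Constant C_theta, defined for theta in (1/2, 1)."""
    base = theta / (math.e * (2.0 - 2.0 ** theta))
    return 8.0 + 2.0 / (2.0 * theta - 1.0) * base ** (theta / (1.0 - theta))

def psi_bound(t, alpha, theta):
    """Right-hand side of the second bound in Proposition 3.4."""
    return c_theta(theta) * (1.0 / alpha) ** (theta / (1.0 - theta)) * (1.0 / t) ** theta

if __name__ == "__main__":
    for t in (2, 10, 100, 1000):
        print(t, psi(t, 0.5, 0.75), psi_bound(t, 0.5, 0.75))
```

The loop accumulates the product from the largest index downwards, so the whole sum costs $O(t)$ operations instead of $O(t^2)$.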
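Remark 3.5. In the setting of equation (1) in RKHS, we have
$$\beta = C_K M_\rho \quad \text{and} \quad \alpha = \frac{\lambda}{\lambda + C_K^2},$$
whence
$$\sigma^2 \le \left( \frac{2 C_K M_\rho (\lambda + C_K^2)}{\lambda} \right)^2.$$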
Remark 3.6. Choose the initialization w
1
= 0 for simplicity. Notice that |w
| =
|
A
1
B| /
min
. Then we have the following bound, with probability at least
1 ,
|w
t
w
min
_
1
t
_
/2
_
t
/2
e
[/(1)](1t
1
)
2
_
C
_
.
Remark 3.7. Consider the case that = 1 and (0,
1
2
). Then, by Lemma
A.2(3), we obtain that
E
init
(t ) t
|w
1
w
|
and
E
samp
(t )
max
1/2
1
(t, )
4
min
(1 2)
t
.
Choosing w
1
= 0 and using |w
| /
min
, we obtain that
|w
t
w
min
_
1
t
_
_
1
4
(1 2)
_
.
The proof of Theorem B and Proposition 3.4 will be given in Section 4. Here
is the proof of Theorem A from Theorem B.
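Proof of Theorem A. In this case $W = \mathcal{H}_K$. Before applying Theorem B, we
need to rewrite equation (1) by the notations used in Theorem B.
For any $f \in \mathcal{H}_K$, let the evaluation functional at $x \in X$ be $E_x: \mathcal{H}_K \to \mathbb{R}$
such that $E_x(f) = f(x)$ ($\forall x \in X$). Denote by $E_x^*: \mathbb{R} \to \mathcal{H}_K$ the adjoint operator
of $E_x$ such that $\langle E_x(f), y \rangle_{\mathbb{R}} = \langle f, E_x^*(y) \rangle_K$ ($y \in \mathbb{R}$). From this definition, we see
that $E_x^*(y) = y K_x$.
Define the linear operator $A_x: \mathcal{H}_K \to \mathcal{H}_K$ by $A_x = E_x^* E_x + \lambda I$, i.e.,
$A_x(f) = f(x) K_x + \lambda f$, whence $A_x$ is a random variable depending on $x$. Taking
the expectation of $A_x$, we have
$$\hat{A} = \mathbb{E}_x[A_x] = L_K + \lambda I.$$
Moreover, define $B_z = E_x^*(-y) = -y K_x \in \mathcal{H}_K$, which is a random variable
depending on $z = (x, y)$. Notice that the expectation of $B_z$ is
$$\hat{B} = \mathbb{E}_z[B_z] = \mathbb{E}_x[\mathbb{E}_y[-y] K_x] = -L_K f_\rho.$$
It follows that $f_\lambda^* = (L_K + \lambda I)^{-1} L_K f_\rho$ satisfies
$$0 = \mathbb{E}_z[A(z) f_\lambda^* + B(z)] = \hat{A} f_\lambda^* + \hat{B}.$$
Thus $f_\lambda^*$ is the equilibrium $w^*$ of Definition A.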
Finally, by identifying w
t
= f
t
and w
= f
|
K
_
1
t
_
_
| f
|
K
( C
2
K
)
1/2
1
(t, )
_
.
By Lemma A.2(3), we have an upper bound for
1
(t, ),
1
(t, )
4
1 2
t
2
.
With this upper bound and
2
(2/)
2
= 4C
2
K
M
2
(C
2
K
)
2
/
2
, we obtain that
| f
t
f
|
K
_
1
t
_
_
| f
|
K
4C
K
M
(1 2)
_
,
which holds with probability at least $1 - \delta$. Notice that this upper bound has a
polynomial decay $O(t^{-\alpha})$.
4. Proof of Theorem B
In this section we shall use $\mathbb{E}_z[\cdot]$ to denote the expectation with respect to $z$. When
the underlying random variable in expectation is clear from the context, we will
simply write $\mathbb{E}[\cdot]$.
Define the remainder vector at time $t$, $r_t = w_t - w^*$, which is a random variable
depending on $(z_i)_{i=1}^{t-1} \in Z^{t-1}$ when $t \ge 2$. The following lemma gives a formula
to compute $r_{t+1}$.
Lemma 4.1.
$$r_{t+1} = \prod_{i=1}^{t} (I - \gamma_i A_i) \, r_1 - \sum_{k=1}^{t} \gamma_k \left( \prod_{i=k+1}^{t} (I - \gamma_i A_i) \right) (A_k w^* + B_k).$$
Proof. Since $w_{t+1} = w_t - \gamma_t (A_t w_t + B_t)$, then
$$r_{t+1} = w_{t+1} - w^* = w_t - \gamma_t (A_t w_t + B_t) - (I - \gamma_t A_t) w^* - \gamma_t A_t w^* = (I - \gamma_t A_t) r_t - \gamma_t (A_t w^* + B_t).$$
The result then follows from induction on $t \in \mathbb{N}$.
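The closed form of Lemma 4.1 can be sanity-checked numerically against the recursion itself. The sketch below works in the scalar case $W = \mathbb{R}$ and assumes the update $w_{t+1} = w_t - \gamma_t (A_t w_t + B_t)$; since the identity is purely algebraic, it holds for any reference point $w^*$, not only for the equilibrium. The function names are illustrative.

```python
def remainder_by_recursion(w1, w_star, a, b, gamma):
    """Run the scalar update t = 1..len(a) and return r_{t+1} = w_{t+1} - w*."""
    w = w1
    for a_t, b_t, g_t in zip(a, b, gamma):
        w = w - g_t * (a_t * w + b_t)
    return w - w_star

def remainder_by_lemma(w1, w_star, a, b, gamma):
    """Evaluate the closed-form expression of Lemma 4.1 (1-based indices)."""
    t = len(a)

    def prod(lo):
        # Product over i = lo .. t of (1 - gamma_i * A_i); empty product is 1.
        p = 1.0
        for i in range(lo, t + 1):
            p *= 1.0 - gamma[i - 1] * a[i - 1]
        return p

    noise = sum(gamma[k - 1] * prod(k + 1) * (a[k - 1] * w_star + b[k - 1])
                for k in range(1, t + 1))
    return prod(1) * (w1 - w_star) - noise

if __name__ == "__main__":
    a, b, gamma = [1.0, 0.5, 2.0], [-1.0, 0.3, -2.5], [0.5, 0.4, 0.3]
    print(remainder_by_recursion(0.2, 1.3, a, b, gamma),
          remainder_by_lemma(0.2, 1.3, a, b, gamma))
```

Both numbers printed should agree up to rounding, mirroring the induction in the proof.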
For simplicity we introduce the following notations, a symmetric linear operator
$X_{k+1}^{t}: W \to W$ which depends on $z_{k+1}, \dots, z_t$,
$$X_{k+1}^{t}(z_{k+1}, \dots, z_t) = \prod_{i=k+1}^{t} (I - \gamma_i A_i) \quad (X_{k+1}^{t} = I \text{ if } k \ge t),$$
and a vector $Y_k \in W$ which depends on $z_k$ only,
$$Y_k(z_k) = A_k w^* + B_k.$$
Clearly, $\mathbb{E}[Y_k] = 0$ and $\mathbb{E}[\|Y_k\|^2] = \sigma^2$ for every $1 \le k \le t$. With this notation
Lemma 4.1 can be written as
$$r_{t+1} = X_1^{t} r_1 - \sum_{k=1}^{t} \gamma_k X_{k+1}^{t} Y_k, \tag{17}$$
where the first term $X_1^{t} r_1$ reflects the accumulated error caused by the initial choice;
the second term $\sum_{k=1}^{t} \gamma_k X_{k+1}^{t} Y_k$ is of zero mean and reflects the fluctuation caused
by the random sample. Based on this observation we define the initial error:
$$E_{\mathrm{init}}(t + 1) = \| X_1^{t} r_1 \|, \tag{18}$$
and the sample error:
$$E_{\mathrm{samp}}(t + 1) = \left\| \sum_{k=1}^{t} \gamma_k X_{k+1}^{t} Y_k \right\|. \tag{19}$$
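The main concern in this section is to obtain upper bounds on the initial error and
the sample error. The following estimates are crucial in the proofs of Theorem B
and Proposition 3.4.
Proposition 4.2. Let $\gamma_t = 1/(\alpha_{\max} t^{\theta})$ for some $\theta \in (\tfrac{1}{2}, 1]$. For all $\alpha = \alpha_{\min}/\alpha_{\max} \in (0, 1]$, the following holds.
(1) $$\| X_1^{t} r_1 \| \le \begin{cases} e^{[\alpha/(1-\theta)](1 - (t+1)^{1-\theta})} \| r_1 \|, & \theta \in (\tfrac{1}{2}, 1); \\ (t + 1)^{-\alpha} \| r_1 \|, & \theta = 1. \end{cases}$$
(2) $\| Y_k \| \le 2\beta/\alpha$,
(3) $$\mathbb{E}\left[ \left\| \sum_{k=1}^{t} \gamma_k X_{k+1}^{t} Y_k \right\|^2 \right] \le \frac{\sigma^2}{\alpha_{\max}^2} \, \psi_\theta(t + 1, \alpha).$$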
From this proposition and the following Markov's inequality, we give the proof
of Theorem B.
Lemma 4.3 (Markov's Inequality). Let $X$ be a nonnegative random variable.
Then, for any real number $\varepsilon > 0$, we have
$$\operatorname{Prob}\{ X \ge \varepsilon \} \le \frac{\mathbb{E}[X]}{\varepsilon}.$$
Proof of Theorem B. By (18) and the estimation (1) in Proposition 4.2, where
$\theta \in (\tfrac{1}{2}, 1)$, we have
$$E_{\mathrm{init}}(t) \le e^{[\alpha/(1-\theta)](1 - t^{1-\theta})} \| w_1 - w^* \|.$$
By (19) and the estimation (3) in Proposition 4.2 and Markov's inequality with
$X = E_{\mathrm{samp}}^2(t)$, we obtain, for $t \ge 2$,
$$\operatorname{Prob}\{ E_{\mathrm{samp}}^2(t) \ge \varepsilon^2 \} \le \frac{\sigma^2}{\varepsilon^2 \alpha_{\max}^2} \, \psi_\theta(t, \alpha).$$
Setting the right-hand side to be $\delta \in (0, 1)$, we get the probabilistic upper bound
on the sample error. It remains to check that when $t = 1$, $E_{\mathrm{init}}(t) = \| w_1 - w^* \|$
and $E_{\mathrm{samp}}(t) = 0$, whence the bound still holds.
Next we give the proof of Proposition 4.2.
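Proof of Proposition 4.2. (1) By $\alpha_{\min} I \le A \le \alpha_{\max} I$ and $\gamma_t = 1/(\alpha_{\max} t^{\theta})$ ($\theta \in (\tfrac{1}{2}, 1]$), then
$$\| X_{k+1}^{t} r_1 \| \le \prod_{i=k+1}^{t} \| I - \gamma_i A_i \| \, \| r_1 \| \le \prod_{i=k+1}^{t} \left( 1 - \frac{\alpha}{i^{\theta}} \right) \| r_1 \|, \quad \alpha = \alpha_{\min}/\alpha_{\max}. \tag{20}$$
Setting $k = 0$ and by (1) in Lemma A.2, we obtain the result.
(2) Note that $\| w^* \| \le \beta/\alpha_{\min}$. Thus we have
$$\| Y_k \| = \| A_k w^* + B_k \| \le \| A_k \| \| w^* \| + \| B_k \| \le \alpha_{\max} \beta / \alpha_{\min} + \beta = \beta (\alpha^{-1} + 1) \le 2\beta/\alpha,$$
since $\alpha \in (0, 1]$. This gives part (2).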
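(3) Note that
$$\mathbb{E}\left[ \left\| \sum_{k=1}^{t} \gamma_k X_{k+1}^{t} Y_k \right\|^2 \right] = \mathbb{E}\left\langle \sum_{k=1}^{t} \gamma_k X_{k+1}^{t} Y_k, \sum_{k=1}^{t} \gamma_k X_{k+1}^{t} Y_k \right\rangle = \sum_{k,l=1}^{t} \gamma_k \gamma_l \, \mathbb{E}\langle X_{k+1}^{t} Y_k, X_{l+1}^{t} Y_l \rangle,$$
where, if $k \ne l$, say $k < l$,
$$\gamma_k \gamma_l \, \mathbb{E}_{z_k, \dots, z_t} \langle X_{k+1}^{t} Y_k, X_{l+1}^{t} Y_l \rangle = \gamma_k \gamma_l \, \mathbb{E}_{z_{k+1}, \dots, z_t}\big[ \mathbb{E}_{z_k | z_{k+1}, \dots, z_t}[Y_k]^{T} X_{k+1}^{t} X_{l+1}^{t} Y_l \big] = 0,$$
by $\mathbb{E}[Y_k] = 0$. Thus we have
$$\sum_{k,l=1}^{t} \gamma_k \gamma_l \, \mathbb{E}\langle X_{k+1}^{t} Y_k, X_{l+1}^{t} Y_l \rangle = \sum_{k=1}^{t} \gamma_k^2 \, \mathbb{E}\langle X_{k+1}^{t} Y_k, X_{k+1}^{t} Y_k \rangle \le \sum_{k=1}^{t} \gamma_k^2 \, \mathbb{E}[\| X_{k+1}^{t} \|^2 \| Y_k \|^2] \le \frac{\sigma^2}{\alpha_{\max}^2} \, \psi_\theta(t + 1, \alpha),$$
where the last inequality is due to $\mathbb{E}\| Y_k \|^2 = \sigma^2$ for all $k$ and
$$\sum_{k=1}^{t} \gamma_k^2 \| X_{k+1}^{t} \|^2 \le \sum_{k=1}^{t} \frac{1}{\alpha_{\max}^2 k^{2\theta}} \prod_{i=k+1}^{t} \left( 1 - \frac{\alpha}{i^{\theta}} \right)^2 = \frac{1}{\alpha_{\max}^2} \, \psi_\theta(t + 1, \alpha).$$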
Finally, we derive the upper bounds for $\sigma^2$ and $\psi_\theta(t, \alpha)$ as in Proposition 3.4.
Proof of Proposition 3.4. The first upper bound follows from estimation (2) in
Proposition 4.2,
$$\sigma^2 \le (\sup \| Y_k \|)^2 \le \left( \frac{2\beta}{\alpha} \right)^2 \quad \text{for all } 1 \le k \le t.$$
For $t \ge 2$, the second upper bound is an immediate result from Lemma A.1;
for t = 1, note that
i =1
( f (x
i
) y
i
)
2
f, f )
K
, > 0.
The existence and uniqueness of $f_{\lambda, z}$ given as in Section 6 of [7] says
$$f_{\lambda, z}(x) = \sum_{i=1}^{t} a_i K(x, x_i)$$
where $a = (a_1, \dots, a_t)$ is the unique solution of the well-posed linear system in
$\mathbb{R}^t$,
$$(\lambda t I + K_z) a = y,$$
with $(t \times t)$-identity matrix $I$, $(t \times t)$-matrix $K_z$ whose $(i, j)$ entry is $K(x_i, x_j)$
and $y = (y_1, \dots, y_t) \in \mathbb{R}^t$.
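For comparison with the online iteration, the batch solution described above can be computed by solving the linear system directly. The sketch below assumes the system has the form $(\lambda t I + K_z) a = y$ and uses a small Gaussian elimination routine written out in plain Python; the names `solve_linear_system` and `batch_regularized_fit`, and the choice of the Gaussian kernel, are illustrative.

```python
import math

def gaussian_kernel(x, xp, c=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xp)) / c ** 2)

def solve_linear_system(mat, rhs):
    """Solve mat @ sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(mat, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def batch_regularized_fit(kernel, xs, ys, lam):
    """Return x -> sum_i a_i K(x, x_i) with (lam * t * I + K_z) a = y."""
    t = len(xs)
    system = [[kernel(xi, xj) + (lam * t if i == j else 0.0)
               for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    coefs = solve_linear_system(system, ys)
    return lambda x: sum(a * kernel(x, xi) for a, xi in zip(coefs, xs))

if __name__ == "__main__":
    f = batch_regularized_fit(gaussian_kernel, [(0.0,), (1.0,)], [1.0, 0.0], 0.1)
    print(f((0.0,)), f((1.0,)))
```

Unlike the online update, this requires all $t$ examples at once and the solution of a $t \times t$ system, which is the trade-off the online scheme is designed to avoid.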
Aprobabilistic upper bound for | f
,z
f
|
K
C
,K
_
1
t
_
,
where C
,K
= C
2
K
_
3C
2
K
| f
and
=
_
XY
(y f
(x))
2
d.
Remark 5.2. Notice that if 1 without loss of generality, equation (10) in
Remark 2.1 shows the following convergence rate:
| f
t
f
|
K
O
__
1
_
(2)/2(1)
_
1
t
_
/2
_
,
where
_
1
2
, 1
_
. Since the function () = (2 )/2(1 ) = 1/2(1 )
1
2
,
is an increasing function of , then () (
3
4
, ) as (
1
2
, 1). For small ,
when is close to
1
2
, the upper bound is close to O(
3/4
t
1/4
) which is tighter in
but looser in t in comparison with the theorem above; on the other hand, when
increases, the upper bound becomes tighter in t but much looser in .
6. Adaline
Example 6.1 (Adaline or Widrow–Hoff Algorithm). The Adaline or Widrow–Hoff
algorithm [5, p. 23] is a special case of the online learning algorithm (1)
where the step size $\gamma_t$ is a constant $\eta$, the regularization parameter $\lambda = 0$, and the
reproducing kernel is the linear kernel such that $K(x, x') = \langle x, x' \rangle + 1$ for $x, x' \in X = \mathbb{R}^n$.
To see that, define two kernels by $K_0(x, x') = \langle x, x' \rangle$ and $K_1(x, x') = 1$.
Then $\mathcal{H}_K = \mathcal{H}_{K_0} \oplus \mathcal{H}_{K_1}$. Notice that $\mathcal{H}_{K_0} \cong \mathbb{R}^n$ and $\mathcal{H}_{K_1} \cong \mathbb{R}$, whence
$\mathcal{H}_K \cong \mathbb{R}^{n+1}$. In fact, for $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$, a function in $\mathcal{H}_K$ can be written
as $f(x) = \sum_{i=1}^{n} w_i x_i + b$ for $x \in X$. By the use of the Euclidean inner product
in $\mathbb{R}^{n+1}$, we can write $f(x) = \langle (w, b), (x, 1) \rangle$. Therefore, the Adaline update
formula
$$(w_{t+1}, b_{t+1}) = (w_t, b_t) - \eta \big( \langle w_t, x_t \rangle + b_t - y_t \big) (x_t, 1), \quad t \in \mathbb{N},$$
can be written as the following formula, by taking the Euclidean inner product of
both sides with the vector $(x, 1) \in \mathbb{R}^{n+1}$,
$$f_{t+1} = f_t - \eta ( f_t(x_t) - y_t ) K_{x_t}.$$
This is equivalent to setting $\gamma_t = \eta$ and $\lambda = 0$ in the online learning algorithm (1).
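The equivalence between the Adaline update on $(w, b)$ and the kernel form of the update can be checked on a small example. The sketch below assumes the linear kernel $K(x, x') = \langle x, x' \rangle + 1$ and a constant step size called `eta` here; the helper names are illustrative.

```python
def linear_kernel(x, xp):
    """K(x, x') = <x, x'> + 1."""
    return sum(a * c for a, c in zip(x, xp)) + 1.0

def affine(w, b, x):
    """f(x) = <w, x> + b, i.e. <(w, b), (x, 1)> in R^{n+1}."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def adaline_step(w, b, x, y, eta):
    """(w, b) <- (w, b) - eta * (<w, x> + b - y) * (x, 1)."""
    err = affine(w, b, x) - y
    return [wi - eta * err * xi for wi, xi in zip(w, x)], b - eta * err

def kernel_form_value(w, b, x_t, y_t, eta, query):
    """f_{t+1}(query) = f_t(query) - eta * (f_t(x_t) - y_t) * K(x_t, query)."""
    return affine(w, b, query) - eta * (affine(w, b, x_t) - y_t) * linear_kernel(x_t, query)

if __name__ == "__main__":
    w, b = [0.5, -1.0], 0.2
    x_t, y_t, eta, query = [1.0, 2.0], 1.0, 0.1, [3.0, -1.0]
    w_new, b_new = adaline_step(w, b, x_t, y_t, eta)
    print(affine(w_new, b_new, query), kernel_form_value(w, b, x_t, y_t, eta, query))
```

Both printed values coincide, because $\langle (x_t, 1), (x, 1) \rangle = K(x_t, x)$ for this kernel.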
The case for fixed step size and zero regularization parameter is not included
in Theorems A or B. In the case of nonstochastic samples, Cesa-Bianchi et al. [4]
have some worst-case analysis on the upper bounds for the following quantity:
$$\sum_{t=1}^{T} ( \langle w_t, x_t \rangle - y_t )^2 - \min_{\|w\| \le W} \sum_{t=1}^{T} ( \langle w, x_t \rangle - y_t )^2.$$
Adam Kalai has shown us how one might convert these results of Cesa-Bianchi
et al. to a form comparable to Theorem A. Beyond the square loss function above,
some related works include [15] which presents a general gradient descent method
in RKHS for bounded differentiable functions, and [24] which studies the gradient
method with arbitrary differentiable convex loss functions. These works suggest
different schemes on choosing the step size parameter and how these choices might
affect the convergence rate under various conditions.
Appendix A: Some Estimates
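The following lemma gives an upper bound for
$$\psi_\theta(t, \alpha) = \sum_{k=1}^{t-1} \frac{1}{k^{2\theta}} \prod_{i=k+1}^{t-1} \left( 1 - \frac{\alpha}{i^{\theta}} \right)^2.$$
Lemma A.1 (Main Analytic Estimate). Let $\alpha \in (0, 1]$ and $\theta \in (\tfrac{1}{2}, 1)$. Then for
$t \in \mathbb{N}$,
$$\psi_\theta(t + 1, \alpha) \le C_\theta \left( \frac{1}{\alpha} \right)^{\theta/(1-\theta)} \left( \frac{1}{t + 1} \right)^{\theta},$$
where
$$C_\theta = 8 + \frac{2}{2\theta - 1} \left( \frac{\theta}{e(2 - 2^{\theta})} \right)^{\theta/(1-\theta)}.$$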
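Proof. The following fact will be used repeatedly in this section,
$$\ln(1 + x) \le x, \quad \text{for all } x > -1. \tag{A.1}$$
Thus we have
$$\sum_{i=k+1}^{t} \ln \left( 1 - \frac{\alpha}{i^{\theta}} \right)^2 \le -2\alpha \sum_{i=k+1}^{t} \frac{1}{i^{\theta}} \le -2\alpha \int_{k+1}^{t+1} \frac{1}{x^{\theta}} \, dx,$$
which equals
$$\frac{2\alpha}{1 - \theta} \big( (k + 1)^{1-\theta} - (t + 1)^{1-\theta} \big)$$
if $\theta \in (\tfrac{1}{2}, 1)$.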
From this estimate it follows that
(t 1, ) e
2
/
(t 1)
1
t
k=1
1
k
2
e
2
/
(k1)
1
= S
1
S
2
,
where
/
= /(1 ) and
S
1
= e
2
/
(t 1)
1
(t 1)/2
k=1
1
k
2
e
2
/
(k1)
1
,
S
2
= e
2
/
(t 1)
1
t
k=(t 1)/2
1
k
2
e
2
/
(k1)
1
,
where x denotes the largest integer no larger than x.
Next we give upper bounds on S
1
and S
2
. First,
S
1
e
2
/
(12
1
)(t 1)
1
(t 1)/2
k=1
1
k
2
e
2
/
(12
1
)(t 1)
1
_
t /2
1/2
1
x
2
dx
= e
2
/
(12
1
)(t 1)
1 1
1 2
__
t
2
_
12
_
1
2
_
12
_
2
2 1
e
2
/
(12
1
)(t 1)
1
as
_
1
2
, 1
_
. Togive a polynomial upper boundfor exp{2
/
(12
1
)(t 1)
1
],
we use the fact that for any c > 0, a > 0, and x (0, ),
e
cx
_
a
ec
_
a
x
a
.
To see this, it is enough to observe that the function f (x) = x
a
/e
cx
is maximized
at x = a/c. Let a = (1/ 1)
1
, c = 2
/
(1 2
1
), and x = (t 1)
1
=
(t 1)
(1/1)
, then
e
2
/
(12
1
)(t 1)
1
_
e(2 2
)
_
/(1)
(t 1)
,
Thus, for
_
1
2
, 1
_
, (0, 1), and t N,
S
1
2
2 1
_
e(2 2
)
_
/(1)
(t 1)
.
Second, notice that 1/(t 1)/2 2/t 4/(t 1) (for t N), then let
p(t ) = e
2
/
(t 1)
1
/t
and we have
S
2
e
2
/
(t 1)
1 4
(t 1)
_
p(t )
t 1
k=
t 1
2
1
k
e
2
/
(k1)
1
_
2
2
e
2
/
(t 1)
1
(t 1)
_
p(t )
_
t
t /21
1
x
e
2
/
(x1)
1
dx
_
for t 4,
where
_
t
t /21
1
x
e
2
/
(x1)
1
dx
_
t
t /21
2
(x 1)
e
2
/
(x1)
1
dx by
1
x
2
x 1
for t 4
=
2
1
_
(t 1)
1
(t /2)
1
e
2
/
y
dy by y = (x 1)
1
=
2
1
/
(1 )
e
2
/
(t 1)
1
(1 e
2
/
((t /2)
1
(t 1)
1
)
)
1
e
2
/
(t 1)
1
,
whence
S
2
2
2
(t 1)
_
t
_
8
(t 1)
for t 4.
It is easy to check that (t 1, ) 2 (8/)(t 1)
for 1 t 3.
Therefore, for t N,
(t 1, ) S
1
S
2
_
2
2 1
_
e(2 2
)
_
/(1)
_
(t 1)
=
_
2
2 1
_
e(2 2
)
_
/(1)
8
(21)/(1)
_
_
1
_
/(1)
(t 1)
_
2
2 1
_
e(2 2
)
_
/(1)
8
__
1
_
/(1)
(t 1)
,
where the last step is due to
(21)/(1)
< 1 as (0, 1).
The following lemma is also useful in the various upper bound estimations in
Proposition 4.2.
Lemma A.2. (1) For (0, 1] and [0, 1],
t
i =k1
_
1
i
_
_
_
exp
_
2
1
((k 1)
1
(t 1)
1
)
_
, [0, 1),
_
k 1
t 1
_
, = 1.
(2) For (0, 1] and [0, 1],
t
k=1
1
k
i =k1
_
1
i
_
3
.
(3) If = 1, and for (0, 1],
1
(t 1, ) =
t
k=1
1
k
2
t
i =k1
_
1
i
_
2
_
4
1 2
(t 1)
2
,
_
0,
1
2
_
;
4(t 1)
1
ln(t 1), =
1
2
;
6
2 1
(t 1)
1
,
_
1
2
, 1
_
;
6(t 1)
1
, = 1.
Proof. (1) By inequality (A.1), we have, for [0, 1],
ln
_
1
i
_
i
.
Thus
t
i =k1
ln
_
1
i
_
t
i =k1
1
i
_
t 1
k1
1
x
dx (A.2)
which equals
1
((k 1)
1
(t 1)
1
),
if [0, 1), and
ln
_
k 1
t 1
_
,
if = 1. Taking the exponential gives the inequality.
(2) If [0, 1), from (1) we have
1
k
i =k1
_
1
i
_
e
[2/(1)](t 1)
1 1
k
e
[2/(1)](k1)
1
,
whence
t
k=1
1
k
i =k1
_
1
i
_
e
[2/(1)](t 1)
1
_
t 1
k=1
1
k
e
[2/(1)](k1)
1
1
t
e
[2/(1)](t 1)
1
_
where
t 1
k=1
1
k
e
[2/(1)](k1)
1
2
t 1
k=1
_
1
k 1
_
e
[2/(1)](k1)
1
2
_
t 1
2
e
[2/(1)]x
1
x
dx
1
e
[2/(1)](t 1)
1
.
Therefore
e
[2/(1)](t 1)
1
_
t 1
k=1
1
k
e
[2/(1)](k1)
1
1
t
e
[2/(1)](t 1)
1
_
1
<
3
.
If = 1, from inequality (A.2),
t 1
k=1
1
k
t
i =k1
_
1
i
_
t 1
k=1
1
k
_
k 1
t 1
_
2
t
t 1
k=1
(k 1)
k 1
=
2
t
t 1
k=1
(k 1)
1
2
t
_
t
1
x
1
dx,
where, if = 1,
2
t
_
t
1
x
1
dx = 2;
and, if 0 < < 1,
2
t
_
t
1
x
1
dx =
2
_
t
1
t
_
2
.
Therefore
t
k=1
1
k
t
i =k1
_
1
i
_
t
,
which completes the proof of part (2).
(3) If = 1, using inequality (A.1), we have
t
i =k1
ln
_
1
i
_
2
2
t
i =k1
1
i
2
_
t 1
k1
1
x
dx = ln
_
k 1
t 1
_
2
.
Thus
1
(t 1, ) t
2
t 1
k=1
1
k
2
_
k 1
t 1
_
2
4
(t 1)
2
2
2
(t 1)
2
t 1
k=1
k
22
4
(t 1)
2
2
2
(t 1)
2
_
t 1/2
1/2
x
22
dx,
where, if (0,
1
2
),
r.h.s. =
4
(t 1)
2
2
2
1 2
(t 1)
2
(2
12
(t
1
2
)
21
)
_
2
2
1 2
_
(t 1)
2
4
1 2
(t 1)
2
;
if =
1
2
,
r.h.s. =
4
(t 1)
2
2
t 1
(ln(t
1
2
) ln
1
2
)
_
2
t 1
ln 2
_
2
t 1
ln(t 1)
4
t 1
ln(t 1);
if (
1
2
, 1),
r.h.s. =
4
(t 1)
2
2
2
2 1
(t 1)
2
((t
1
2
)
21
(
1
2
)
21
)
_
4
t 1
4
2 1
_
(t 1)
1
6
2 1
(t 1)
1
;
and, if = 1,
r.h.s. =
4
(t 1)
2
4
(t 1)
2
(t 1) 6(t 1)
1
.
This nishes the proof of the fourth part.
Appendix B: Generalized Bennett's Inequality
In the direction of proving an exponential version of the main theorems with $1/\delta$
replaced by $\log 1/\delta$, it has seemed useful for us to consider Bennett's inequality for
random variables in a Hilbert space. In the meantime, such a theorem was found
useful in other work yet to appear. Thus we include Appendix B.
The following theorem might be considered as a generalization of Bennett's
inequality for independent sums in Hilbert spaces, whose counterpart in real random
variables is given in Theorem 3 of Smale and Zhou [20].
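Theorem B.1 (Generalized Bennett). Let $\mathcal{H}$ be a Hilbert space, let $\xi_i \in \mathcal{H}$
($i = 1, \dots, n$) be independent random variables and let $T_i: \mathcal{H} \to \mathcal{H}$ be deterministic
linear operators. Define $\tau_i = \| T_i \|$ and $\tau_\infty = \sup_i \tau_i$. Suppose that for all
$i$ almost surely $\| \xi_i \| \le M < \infty$. Define $\sigma_i^2 = \mathbb{E}\| \xi_i \|^2$ and $\sigma_\tau^2 = \sum_{i=1}^{n} \tau_i \sigma_i^2$. Then
$$P\left\{ \left\| \sum_{i=1}^{n} T_i (\xi_i - \mathbb{E}\xi_i) \right\| \ge \varepsilon \right\} \le 2 \exp\left\{ -\frac{\sigma_\tau^2}{\tau_\infty M^2} \, g\!\left( \frac{M \varepsilon}{\sigma_\tau^2} \right) \right\}$$
where $g(t) = (1 + t) \log(1 + t) - t$ for all $t \ge 0$. Considering that $g(t) \ge (t/2) \log(1 + t)$, then
$$P\left\{ \left\| \sum_{i=1}^{n} T_i (\xi_i - \mathbb{E}\xi_i) \right\| \ge \varepsilon \right\} \le 2 \exp\left\{ -\frac{\varepsilon}{2 \tau_\infty M} \log\left( 1 + \frac{M \varepsilon}{\sigma_\tau^2} \right) \right\}.$$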
The proof needs the following lemma due to Pinelis and Sakhanenko [17]. Its
current form is taken from Theorem 3.3.4(a) in Yurinsky [22].
Lemma B.2 (Pinelis and Sakhanenko, 1985). Let
i
H (i = 1, . . . , n) be a
sequence of independent random variables with values in a Hilbert space H and
E[
i
] = 0. Then, for any t > 0,
E
_
cosh
_
t
_
_
_
_
_
n
i =1
i
_
_
_
_
_
__
n
j =1
E(e
t |
j
|
t |
j
|).
Proof of Theorem B.1. Without loss of generality we assume E[
i
] = 0. For
arbitrary s > 0, by Markovs inequality,
P
__
_
_
_
_
n
i =1
T
i
i
_
_
_
_
_
_
= P
_
exp
_
s
_
_
_
_
_
n
i =1
T
i
i
_
_
_
_
_
_
e
s
_
e
s
Eexp
_
s
_
_
_
_
_
n
i =1
T
i
i
_
_
_
_
_
_
2e
s
Ecosh
_
s
_
_
_
_
_
n
i =1
T
i
i
_
_
_
_
_
_
,
where the last inequalityis due toe
x
e
x
e
x
= 2 cosh(x). Then, byLemma B.1,
P
__
_
_
_
_
n
i =1
T
i
i
_
_
_
_
_
_
2e
s
n
j =1
E(e
s|T
j
j
|
s|T
j
j
|).
Denote
I = 2e
s
n
j =1
E(e
s|T
j
j
|
s|T
j
j
|).
For each 1 j n, considering E|
j
|
2
=
2
j
and |
j
| M almost surely,
E(e
s|T
j
j
|
s|T
j
j
|) = 1
n
k=2
s
k
E|T
j
j
|
k
k
1
n
k=2
s
k
k1
M
k2
k
j
2
j
exp
_
n
k=2
s
k
k1
M
k2
k
j
2
j
_
= exp
_
e
s
M
1 s
M
2
j
2
j
_
,
where the second last inequality is due to 1 x e
x
for all x. Therefore
I exp
_
s
e
s
M
1 s
M
2
n
j =1
2
j
_
,
where the right-hand side is minimized at
s
0
=
1
M
log
_
1
M
n
j =1
j
2
j
_
.
Notice that
2
=
n
j =1
j
2
j
, then with this choice we arrive at
I exp
_
M
2
g
_
M
__
,
where the function $g(t) = (1 + t) \log(1 + t) - t$ for all $t \ge 0$. This is the first
inequality.
Moreover, we can check the lower bound of $g$,
$$g(t) \ge \frac{t}{2} \log(1 + t),$$
which leads to the second inequality.
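The lower bound on $g$ is elementary but easy to mistype, so a quick numerical check on a grid can be reassuring. The sketch below assumes $g(t) = (1+t)\log(1+t) - t$ and compares it with $(t/2)\log(1+t)$ and also with $t^2 / (2(1 + t/3))$, the lower bound that yields Bernstein-type inequalities; the function names are illustrative.

```python
import math

def g(t):
    """Bennett's function g(t) = (1 + t) log(1 + t) - t for t >= 0."""
    return (1.0 + t) * math.log(1.0 + t) - t

def log_lower_bound(t):
    """(t / 2) * log(1 + t)."""
    return 0.5 * t * math.log(1.0 + t)

def bernstein_lower_bound(t):
    """t^2 / (2 * (1 + t / 3))."""
    return t * t / (2.0 * (1.0 + t / 3.0))

if __name__ == "__main__":
    for t in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(t, g(t), log_lower_bound(t), bernstein_lower_bound(t))
```

A grid check is of course not a proof; both inequalities follow by comparing derivatives at $t = 0$ and for $t > 0$.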
By taking $T_i = (1/n) I$, the following corollary gives a form of Bennett's
inequality for random variables in Hilbert spaces.
Corollary B.3 (Bennett). Let $\mathcal{H}$ be a Hilbert space and let $\xi_i \in \mathcal{H}$ ($i = 1, \dots, n$)
be independent random variables such that $\| \xi_i \| \le M$ and $\mathbb{E}\| \xi_i \|^2 \le \sigma^2$
for all $i$. Then
$$P\left\{ \left\| \frac{1}{n} \sum_{i=1}^{n} [\xi_i - \mathbb{E}\xi_i] \right\| \ge \varepsilon \right\} \le 2 \exp\left\{ -\frac{n \sigma^2}{M^2} \, g\!\left( \frac{M \varepsilon}{\sigma^2} \right) \right\}.$$
Noticing that $g(t) \ge t^2 / (2(1 + t/3))$, the corollary leads to the following Bernstein
inequality for independent sums in Hilbert spaces.
Corollary B.4 (Bernstein). Let $\mathcal{H}$ be a Hilbert space and let $\xi_i \in \mathcal{H}$ ($i = 1, \dots, n$)
be independent random variables such that $\| \xi_i \| \le M$ and $\mathbb{E}\| \xi_i \|^2 \le \sigma^2$
for all $i$. Then
$$P\left\{ \left\| \frac{1}{n} \sum_{i=1}^{n} [\xi_i - \mathbb{E}\xi_i] \right\| \ge \varepsilon \right\} \le 2 \exp\left\{ -\frac{n \varepsilon^2}{2(\sigma^2 + M \varepsilon / 3)} \right\}.$$
Yurinsky [22] also gives Bernstein's inequalities for independent sums in Hilbert
spaces and Banach spaces. The following result is a varied form of Theorem
3.3.4(b) in [22]. Note that it is weaker than the form above in that the constant
$\tfrac{1}{3}$ changes to 1.
Theorem B.5. Let $\xi_i$ be independent random variables with values in a Hilbert
space $\mathcal{H}$. Suppose that for all $i$ almost surely $\| \xi_i \| \le M < \infty$ and $\mathbb{E}\| \xi_i \|^2 \le \sigma^2 < \infty$. Then, for $n \ge 0$,
$$P\left\{ \left\| \frac{1}{n} \sum_{i=1}^{n} (\xi_i - \mathbb{E}[\xi_i]) \right\| \ge \varepsilon \right\} \le 2 \exp\left\{ -\frac{n \varepsilon^2}{2(\sigma^2 + M \varepsilon)} \right\}.$$
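The three tail bounds of this appendix can be compared numerically. The sketch below assumes the forms $2\exp\{-(n\sigma^2/M^2)\, g(M\varepsilon/\sigma^2)\}$ (Bennett), $2\exp\{-n\varepsilon^2 / (2(\sigma^2 + M\varepsilon/3))\}$ (Bernstein) and $2\exp\{-n\varepsilon^2 / (2(\sigma^2 + M\varepsilon))\}$ (Theorem B.5), with $g(t) = (1+t)\log(1+t) - t$; the function names are hypothetical.

```python
import math

def bennett_tail(n, eps, sigma2, m):
    """2 * exp(-(n * sigma2 / m^2) * g(m * eps / sigma2))."""
    u = m * eps / sigma2
    g_u = (1.0 + u) * math.log(1.0 + u) - u
    return 2.0 * math.exp(-n * sigma2 / m ** 2 * g_u)

def bernstein_tail(n, eps, sigma2, m):
    """2 * exp(-n * eps^2 / (2 * (sigma2 + m * eps / 3)))."""
    return 2.0 * math.exp(-n * eps ** 2 / (2.0 * (sigma2 + m * eps / 3.0)))

def weaker_bernstein_tail(n, eps, sigma2, m):
    """2 * exp(-n * eps^2 / (2 * (sigma2 + m * eps))); constant 1 instead of 1/3."""
    return 2.0 * math.exp(-n * eps ** 2 / (2.0 * (sigma2 + m * eps)))

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, bennett_tail(n, 0.2, 1.0, 2.0),
              bernstein_tail(n, 0.2, 1.0, 2.0),
              weaker_bernstein_tail(n, 0.2, 1.0, 2.0))
```

For any fixed parameters the three values are ordered from tightest (Bennett) to loosest (the variant with constant 1), which is the sense in which each statement is weaker than the previous one.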
Acknowledgment
The authors were supported by NSF grant 0325113.
The authors would like to acknowledge Peter Bartlett and Pierre Tarrès for
their suggestions on stepsize rate; Yifeng Yu and Krishnaswami Alladi for their
helpful discussions on proving the Main Analytic Estimate (Lemma A.1) and
Lemma A.2; Iosif Pinelis and Yiming Ying for pointing out Lemma B.1 and their
helpful comments. We thank the reviewers for many suggestions. We also thank
Ding-Xuan Zhou, David McAllester, Adam Kalai, Gang Liang, Leon Bottou and,
especially, Tommy Poggio, for many helpful discussions.
References
[1] M. Benaïm, Dynamics of stochastic approximations, in Le Séminaire de Probabilités, Lecture Notes in Mathematics, Vol. 1709, Springer-Verlag, New York, 1999, pp. 1–68.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[3] O. Bousquet and A. Elisseeff, Stability and generalization, J. Mach. Learn. Res. 2 (2002), 499–526.
[4] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth, Worst-case quadratic loss bounds for prediction using linear functions and gradient descent, IEEE Trans. Neural Networks 7(3) (1996), 604–619.
[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[6] F. Cucker and S. Smale, Best choices for regularization parameters in learning theory, Found. Comput. Math. 2(4) (2002), 413–428.
[7] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 29(1) (2002), 1–49.
[8] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5(1), 59–85.
[9] M. Duflo, Cibles atteignables avec une probabilité positive d'après M. Benaïm, Unpublished manuscript, 1997.
[10] M. Duflo, Algorithmes Stochastiques, Springer-Verlag, Berlin, 1996.
[11] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13(1) (1999), 1–50.
[12] L. Györfi, Stochastic approximation from ergodic sample for linear regression, Z. Wahrsch. Verw. Gebiete 54 (1980), 47–55.
[13] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer-Verlag, New York, 2002.
[14] J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist. 23 (1952), 462–466.
[15] J. Kivinen, A. J. Smola, and R. C. Williamson, Online learning with kernels, IEEE Trans. Signal Process. 52(8) (2004), 2165–2176.
[16] H. J. Kushner and G. G. Yin, Stochastic Approximations and Recursive Algorithms and Applications, Springer-Verlag, Berlin, 2003.
[17] I. F. Pinelis and A. I. Sakhanenko, Remarks on inequalities for probabilities of large deviations, Theory Probab. Appl. 30(1) (1985), 143–148.
[18] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist. 22(3) (1951), 400–407.
[19] H. Robbins and D. Siegmund, A convergence theorem for nonnegative almost supermartingales and some applications, in (J. S. Rustagi, editor), Optimizing Methods in Statistics, Academic Press, New York, 1971, pp. 233–257.
[20] S. Smale and D.-X. Zhou, Shannon sampling II. Connections to learning theory, Appl. Comput. Harmonic Anal. (2005), submitted.
[21] V. B. Tadic, On the almost sure rate of convergence of linear stochastic approximation algorithms, IEEE Trans. Inform. Theory 50 (2004), 401–409.
[22] V. Yurinsky, Sums and Gaussian Vectors, Lecture Notes in Mathematics, Vol. 1617, Springer-Verlag, Berlin, 1995.
[23] T. Zhang, Leave-one-out bounds for kernel methods, Neural Comput. 15 (2003), 1397–1437.
[24] M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, Technical report, CMU-CS-03-110, School of Computer Science, Carnegie Mellon University, 2003.