Abstract—A widely studied filtering algorithm in signal processing is the least mean square (LMS) method, due to B. Widrow and M. E. Hoff, 1960. A popular extension of the LMS algorithm, which is also important in deep learning, is the LMS method with momentum, originated by S. Roy and J. J. Shynk back in 1988. This is a fixed gain (or constant step-size) version of the LMS method modified by an additional momentum term that is proportional to the last correction term. Recently, a certain equivalence of the two methods has been rigorously established by K. Yuan, B. Ying and A. H. Sayed, assuming martingale difference gradient noise. The purpose of this paper is to present the outline of a significantly simpler and more transparent asymptotic analysis of the LMS algorithm with momentum under the assumption of stationary, ergodic and mixing signals.

Index Terms—least mean square methods, statistical analysis, recursive estimation, gradient methods, machine learning

This research was partially supported by the Royal Society International Exchange Program, UK, Grant no. IE150128 and the National Research, Development and Innovation Office (NKFIH), Hungary, Grant no. 2018-1.2.1-NKP-00008. B. Cs. Csáji was supported by NKFIH, Grant no. KH 17 125698, and the János Bolyai Research Fellowship, Grant no. BO/00217/16/6. The third author is a Turing fellow and was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 (Turing award number TU/B/000026).
L. Gerencsér and B. Cs. Csáji are with MTA SZTAKI: The Institute for Computer Science and Control of the Hungarian Academy of Sciences, Kende utca 13–17, Budapest, Hungary, 1111; email: {laszlo.gerencser, balazs.csaji}@sztaki.mta.hu
S. Sabanis is with School of Mathematics, University of Edinburgh, UK, and Alan Turing Institute, London, UK; email: [email protected]

I. INTRODUCTION

A classical, widely studied recursive estimation method for determining the mean-square optimal linear filter is the least mean square (LMS) method, due to B. Widrow and M. E. Hoff [1], devised for pattern recognition problems. The algorithm can be seen as a stochastic gradient (SG) method with fixed gain. The fine structure of the estimation error process for small adaptation gain has been studied in a number of works.

A general class of fixed gain recursive estimation methods, under mild ergodicity assumptions, with applications to variants of the LMS algorithm, including the sign-error and sign-sign algorithms, was studied by J. A. Bucklew, T. G. Kurtz and W. A. Sethares [2, Theorem 2], leading to a result establishing the weak convergence of the (piecewise constant extension of the) rescaled estimation error process to the solution of a linear stochastic differential equation on the semi-infinite interval [0, ∞), with a concise and transparent proof.

An alternative general class of fixed gain recursive estimation methods defined in a Markovian framework was studied by A. Benveniste, M. Métivier and P. Priouret, see [3]. They formulate a similar weak convergence result for fixed finite time intervals, see Theorem 7 of Part II, Section 4.4.1 [3]. The advantage of their approach is that their framework allows recursive algorithms with feedback effects, which is typical, e.g., for recursive estimation of linear stochastic systems.

A refined characterization of the (piecewise constant extension of the) estimation error of LMS in a different direction was given by A. J. Heunis and J. A. Joslin [4], providing a limit theorem in the form of a functional law of the iterated logarithm.

Higher order moments of the estimation error of LMS were estimated in [5] for bounded signals satisfying a certain mixing condition, showing that the L_p-norms of these errors are proportional to the square root of the gain. A similar result was established under much weaker conditions for general stochastic approximation (SA) methods, allowing discontinuous correction terms satisfying a relaxed mixing condition, by H. N. Chau, Ch. Kumar, M. Rásonyi and S. Sabanis [6].

A common experimental finding with stochastic gradient methods is that they tend to be slow in the initial phase, especially if the number of parameters is huge, as is the case with problems in deep learning. A currently widely applied modification of standard stochastic gradient methods, resulting in the acceleration of the early stages of the algorithms, is the use of a momentum term, a device that has proven to be successful in deterministic optimization, see Polyak [7]. The original method is also known as the heavy-ball method, referring to the fact that the dynamics of the minimization method can be described as the motion of a heavy ball along a hilly terrain trying to find its way to the absolute minimum while trying to avoid undesirable local minima.

Theoretical justification of the superiority of SG methods with momentum in the early stages is not available in the literature; however, the "steady-state" behavior of the estimator process generated by SG methods with momentum has been known to be inferior to that of the standard SG methods since the works of Polyak [8]. In a paper of 2016, K. Yuan, B. Ying and A. H. Sayed established a remarkable equivalence of SG methods with momentum to the standard SG methods with a rescaled gain [9]. Their result is obtained, among others, under the condition that what is called the gradient noise is a martingale difference. In the case of LMS, paper [9] assumes an independent sequence of observations to ensure this.

The objective of the present paper is to significantly relax the assumptions on the "gradient noise", and to provide an accurate characterization of the relationship between the two estimator processes in an asymptotic sense, relying on weak convergence results developed in [2], leading to a transparent proof. In particular, we show that the asymptotic distributions of
the two estimator processes are identical modulo scaling, and the effect of the various scaling factors is precisely explored. For the sake of simplicity, our results will be presented for the LMS method, but they can be adapted directly to general recursive estimation methods discussed in [2].

II. PRELIMINARIES

Let (x_n, y_n), −∞ < n < +∞, be a jointly wide sense stationary stochastic process, where (x_n) is R^p-valued and (y_n) is real-valued. The best linear mean-square estimator of y_n in terms of the instantaneous signal x_n is defined as the solution of the following minimization problem

    min_θ E[(y_n − x_n^T θ)^2],    (1)

the solution of which will be denoted by θ*. Thus, θ* is the solution of the linear algebraic equation

    E[x_n x_n^T] θ = R* θ = E[x_n y_n]   with   R* := E[x_n x_n^T].    (2)

[C0] We assume that matrix R* is non-singular, so that θ* is uniquely defined as θ* = (R*)^{−1} E[x_0 y_0].

Then, the LMS method is described by the algorithm

    θ_{n+1} = θ_n + µ x_{n+1} (y_{n+1} − x_{n+1}^T θ_n),   n ≥ 0,    (3)

with some non-random initial condition θ_0. Here µ > 0 is a fixed gain or constant step-size, also called learning rate. Introducing an artificial observation error v_n, and the (filter coefficient) estimation error ∆_n as

    v_n := y_n − x_n^T θ*   and   ∆_n := θ_n − θ*,    (4)

the estimation error process (∆_n) follows the dynamics:

    ∆_{n+1} = ∆_n − µ x_{n+1} x_{n+1}^T ∆_n + µ x_{n+1} v_{n+1},   n ≥ 0,    (5)

with ∆_0 = θ_0 − θ*. Note that E[x_n v_n] = 0 for any n ≥ 0, i.e., the observation error v_n is orthogonal to the data x_n for any n ≥ 0.
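For illustration only, a minimal numerical sketch of recursion (3) might look as follows (Python/NumPy); the synthetic data model, the dimension, the horizon and the gain are assumptions made purely for this example and are not part of the analysis:

import numpy as np

rng = np.random.default_rng(0)
p, n_steps, mu = 3, 20000, 0.01              # illustrative dimension, horizon and gain
theta_star = rng.normal(size=p)              # filter generating the synthetic data

theta = np.zeros(p)                          # non-random initial condition theta_0
for _ in range(n_steps):
    x = rng.normal(size=p)                   # regressor x_{n+1} (i.i.d. here for simplicity)
    y = x @ theta_star + 0.1 * rng.normal()  # observation y_{n+1} = x^T theta* + v_{n+1}
    theta = theta + mu * x * (y - x @ theta) # LMS update, cf. (3)

print("estimation error Delta_n:", theta - theta_star)   # cf. (4)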
Henceforth, we shall strengthen our initial condition by assuming the following:

[C1] The joint process (x_n, y_n), −∞ < n < +∞, is a strictly stationary and ergodic stochastic process.

The above algorithm is a special case of the more general stochastic approximation (SA) method defined by

    θ_{n+1} = θ_n + µ H(θ_n, X_{n+1}),   n ≥ 0,    (6)

with some non-random initial condition θ_0, where (X_n) is a strictly stationary, ergodic stochastic process and H(θ, X) is integrable w.r.t. the law of X_0. In the case of the LMS method,

    H(θ, X_n) = x_n (y_n − x_n^T θ) =: H_n(θ),    (7)

with X_n = (x_n^T, y_n)^T.

A standard tool for the analysis of stochastic approximation methods is the associated ODE, two early, scholarly references for which are [2], [3]. The ODE in our case takes the form, with the notation h(θ) := E[x_{n+1}(y_{n+1} − x_{n+1}^T θ)],

    (d/dt) θ̄_t = h(θ̄_t) = b − R* θ̄_t,   t ≥ 0,    (8)

where b := E[x_n y_n]. For the sake of convenience in formulating the relevant results, we set θ̄_0 = θ_0.

One of the benefits of the ODE method is that it provides quantified bounds or even a characterization of the estimation error. To describe the magnitude of the estimator error process (θ_n), let us first consider its piecewise constant extension defined by θ_t^c = θ_n for n ≤ t < n + 1. Equivalently, we may write θ_t^c = θ_{[t]}, where [t] denotes the integer part of t. Then, an early result along the lines of applying the ODE method is that, assuming bounded signals satisfying certain mixing conditions, we have for any fixed T > 0, and k being a non-negative integer, that the following holds:

    sup_{kT ≤ t ≤ (k+1)T} |θ_t^c − θ̄_t| = O_M((µT)^{1/2}),    (9)

assuming the initial condition θ̄_{kT} = θ_{kT}^c, see [5].

The assumption on the boundedness of the signals would ensure that the estimator process itself stays bounded w.p.1, and thus a common problem in recursive estimation, namely the need to enforce the boundedness of the estimator process, does not arise. In the general case of possibly unbounded signals we resort to a standard device, which is the use of truncation. This is in fact applied in our prime reference, [2]. Thus the original LMS algorithm is modified by taking a truncation domain D, where D is the interior of a compact set, and we stop the estimator process (θ_n) if it leaves D. In technical terms,

    τ := inf{t : θ_t^c ∉ D}.    (10)

[C2] We assume that the truncation domain is such that the solution of the ODE (8), with θ̄_0 = θ_0, does not leave D.

To describe the finer structure of the estimator error process (θ_n) let us define the error processes

    θ̃_n := (θ_n − θ̄_n),    (11)

for n ≥ 0, and similarly, set θ̃_t^c := (θ_t^c − θ̄_t). The key object of study for the weak convergence theory of the LMS, and in fact for a more general class of SA processes, is the normalized and time-scaled process (V_t(µ)) defined by

    V_t(µ) := µ^{−1/2} θ̃_{[(t∧τ)/µ]} = µ^{−1/2} θ̃^c_{(t∧τ)/µ}.    (12)

In describing the weak limit of the stopped SA process a crucial role is played by the asymptotic covariance matrices of the empirical means of the centered correction terms (H_n(θ) − h(θ)), which can be expressed, under reasonable conditions, as

    S(θ) := Σ_{k=−∞}^{+∞} E[(H_k(θ) − h(θ))(H_0(θ) − h(θ))^T],    (13)

which series converges, e.g., under various mixing conditions. This is ensured by [C3] below (cf. [10, Theorem 19.1]). For θ = θ*, in the case of the LMS method, we get

    S := S(θ*) = Σ_{k=−∞}^{+∞} E[x_k v_k v_0 x_0^T].    (14)
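For illustration only, the matrix S in (14) could be approximated from a long realization by a truncated sum of empirical autocovariances of the sequence x_n v_n; the truncation lag, the synthetic data model and the sample size below are assumptions made for this sketch (Python/NumPy):

import numpy as np

def estimate_S(x, v, K=20):
    # Truncated empirical version of (14): sum over |k| <= K of E[x_k v_k v_0 x_0^T].
    z = x * v[:, None]                       # z_n = x_n v_n, shape (N, p)
    z = z - z.mean(axis=0)                   # centering; h(theta*) = 0 in theory
    N, p = z.shape
    S = np.zeros((p, p))
    for k in range(-K, K + 1):
        if k >= 0:
            S += z[k:].T @ z[:N - k] / (N - k)
        else:
            S += z[:N + k].T @ z[-k:] / (N + k)
    return S

rng = np.random.default_rng(1)
N, p = 100000, 3
x = rng.normal(size=(N, p))                  # illustrative stationary regressors
v = 0.1 * rng.normal(size=N)                 # observation errors, orthogonal to x
print(estimate_S(x, v))                      # close to 0.01 * I for this data model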
[C3] We assume that the process defined by

    L_t(µ) = Σ_{n=0}^{[t/µ]−1} (H_n(θ̄_{µn}) − h(θ̄_{µn})) √µ,    (15)

converges weakly, as µ → 0, to a time-inhomogeneous zero-mean Brownian motion (L_t) with local covariances (S(θ̄_t)).

We conjecture that for the verification of the above condition, it is sufficient to check that for any fixed θ̄ the process

    L_t(µ) = Σ_{n=0}^{[t/µ]−1} (H_n(θ̄) − h(θ̄)) √µ,    (16)

converges weakly, as µ → 0, to a time-homogeneous zero-mean Brownian motion L_t(θ̄) with covariance matrix S(θ̄).

We note that there is a wide range of results ensuring a Donsker-type theorem as stated above, including stochastic processes with various mixing conditions, or martingales, see [10]. A prominent example is given in [10, Theorem 19.1]. We can conclude, using Theorem 2 of [2], that the following weak convergence result holds:

Theorem 1. Under conditions C0, C1, C2 and C3, process (V_t(µ)) converges weakly, as µ → 0, to a process (Z_t) satisfying the linear stochastic differential equation (SDE),

    dZ_t = −R* Z_t dt + S^{1/2}(θ̄_t) dW_t,    (17)

for t ≥ 0, with initial condition Z_0 = 0, where (W_t) is a standard Brownian motion in R^p.

Let us denote the asymptotic covariance matrix of process (Z_t) by P_0. It is known that matrix P_0 is the unique solution of the algebraic Lyapunov equation

    −R* P_0 − P_0 R* + S = 0,    (18)

where matrix S := S(θ*) is given by equation (14). Although the weak convergence of (V_t(µ)) does not directly imply that the distribution of µ^{−1/2} θ̃_{[(t∧τ)/µ]} converges weakly to N(0, P_0), when µ → 0 and t → ∞, the corresponding claim for general SA processes in a Markovian framework has been established in [3, Part II, Chapter 4, Theorem 15]. Surprisingly, the covariance matrix P_0 will pop up also in the asymptotic analysis of the LMS method with momentum.
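As a numerical illustration (the matrices R* and S below are arbitrary choices made only for this example), P_0 in (18) can be computed with a standard continuous-time Lyapunov solver:

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(2)
p = 3
A = rng.normal(size=(p, p))
R_star = A @ A.T + np.eye(p)                 # illustrative positive definite R*
S = np.eye(p)                                # illustrative S = S(theta*)

# (18):  -R* P0 - P0 R* + S = 0   is   (-R*) P0 + P0 (-R*)^T = -S
P0 = solve_continuous_lyapunov(-R_star, -S)
print(np.allclose(-R_star @ P0 - P0 @ R_star + S, 0))   # True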
III. LMS WITH MOMENTUM

A widely studied modification of the fixed gain LMS method is the LMS method with momentum, using a device that has proven to be successful in deterministic optimization [7]. The original method is also known as the heavy-ball method, since the dynamics of the minimization method can be described as the motion of a heavy ball along a hilly terrain:

    θ_{n+1} = θ_n + µ x_{n+1}(y_{n+1} − x_{n+1}^T θ_n) + γ (θ_n − θ_{n−1}),    (19)

where 0 < γ < 1 and n ≥ 0, with some non-random initial condition θ_0, and θ_{−1} = θ_0. The momentum term introduces some kind of memory into the dynamics, and it is hoped that it has a smoothing effect on the estimator process. Note that the LMS with momentum is driven by a second order dynamics.
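Before turning to the error dynamics, a minimal numerical sketch of recursion (19) is given for illustration only (Python/NumPy; the synthetic data model and the values of µ and γ are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(3)
p, n_steps, mu, gamma = 3, 20000, 0.01, 0.9
theta_star = rng.normal(size=p)

theta_prev = theta = np.zeros(p)             # theta_{-1} = theta_0 = 0
for _ in range(n_steps):
    x = rng.normal(size=p)
    y = x @ theta_star + 0.1 * rng.normal()
    theta_next = theta + mu * x * (y - x @ theta) + gamma * (theta - theta_prev)  # cf. (19)
    theta_prev, theta = theta, theta_next

print("estimation error:", theta - theta_star)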
The parameter-error process, (∆_n), is then defined by the following second order dynamics

    ∆_{n+1} = ∆_n − µ x_{n+1} x_{n+1}^T ∆_n + γ (∆_n − ∆_{n−1}) + µ x_{n+1} v_{n+1},    (20)

for n ≥ 0, with ∆_{−1} = ∆_0.

In order to analyze the behaviour of (∆_n) we follow standard recipes of the theory of linear systems and introduce the state-vector having twice the dimension of that of ∆_n,

    U_n := [∆_n; ∆_{n−1}].    (21)

Then, the state-space dynamics will become:

    U_{n+1} = U_n + A_{n+1} U_n + µ W_{n+1},    (22)

where

    A_{n+1} = [γI − µ x_{n+1} x_{n+1}^T, −γI; I, −I],    (23)

    W_{n+1} = [x_{n+1} v_{n+1}; 0].    (24)

It is not obvious if and how the above dynamics can be interpreted as a SA method. Note that for small µ and γ close to 1 the matrix A_{n+1} is close to the singular matrix

    T_1^+ = [I, −I; I, −I],    (25)

for which we have (T_1^+)^2 = 0.

Linear transformation of the state-space. In order to capture the effect and the interaction of the small parameters µ and 1 − γ on the dynamics (22), following [9], we introduce a linear state-space transformation Ū := TU with

    T := T(γ) = (1/(1 − γ)) [I, −γI; I, −I],    (26)

    T^{−1} := T^{−1}(γ) = [I, −γI; I, −I].    (27)

We decompose A_n into two parts, A_n = A^{(1)} + A_n^{(2)}, where

    A^{(1)} = [γI, −γI; I, −I]   and   A_n^{(2)} = [−µ x_n x_n^T, 0; 0, 0].

Then, multiplying (22) by T from the left, and substituting U = T^{−1} Ū, we get that the new state-transition matrix Ā_n can be written as the sum Ā_n = Ā^{(1)} + Ā_n^{(2)}, where

    Ā^{(1)} = T A^{(1)} T^{−1} = (1/(1 − γ)) [I, −γI; I, −I] [γI, −γI; I, −I] T^{−1}
            = (1/(1 − γ)) [0, 0; (γ − 1)I, (−γ + 1)I] [I, −γI; I, −I]
            = (1 − γ) [0, 0; 0, −I],    (28)

and for Ā_n^{(2)} = T A_n^{(2)} T^{−1} we have
    Ā_n^{(2)} = (1/(1 − γ)) [I, −γI; I, −I] [−µ x_n x_n^T, 0; 0, 0] T^{−1}
              = (1/(1 − γ)) [−µ x_n x_n^T, 0; −µ x_n x_n^T, 0] [I, −γI; I, −I]
              = (1/(1 − γ)) [−µ x_n x_n^T, µγ x_n x_n^T; −µ x_n x_n^T, µγ x_n x_n^T]
              = (µ/(1 − γ)) [−1, γ; −1, γ] ⊗ x_n x_n^T.    (29)

After multiplication by T, the stochastic input becomes

    W̄_n = T µ [x_n v_n; 0] = (µ/(1 − γ)) [I, −γI; I, −I] [x_n v_n; 0] = (µ/(1 − γ)) [x_n v_n; x_n v_n].    (30)

The transformed dynamics. A shorthand description for the dynamics of the transformed state process is

    Ū_{n+1} = Ū_n + Ā_{n+1} Ū_n + W̄_{n+1}.    (31)

For the initial condition we have

    Ū_0 = T [∆_0; ∆_0] = (1/(1 − γ)) [I, −γI; I, −I] [∆_0; ∆_0] = (1/(1 − γ)) [(1 − γ)∆_0; 0] = [∆_0; 0],    (32)

thus the initial condition is independent of µ and γ!

The point of this transformation is to get a fixed gain SA procedure for Ū_n in its standard form. This is achieved by synchronizing the parameters µ and γ. Note that Ā^{(1)} is scaled by 1 − γ, while Ā_n^{(2)} and the input noise are scaled by µ/(1 − γ). Therefore, a natural way of synchronizing them is to set

    µ/(1 − γ) = c(1 − γ),   leading to   µ = c(1 − γ)^2,    (33)

with some fixed constant c > 0. Thus (31) can be rewritten as a SA recursion with the fixed gain λ := 1 − γ as follows:

    Ū_{n+1} = Ū_n + λ B̄_{n+1} Ū_n + λ^2 D̄_{n+1} Ū_n + λ W̄_{n+1},    (34)

for n ≥ 0, where

    B̄_n := [0, 0; 0, −I] + c [−1, 1; −1, 1] ⊗ x_n x_n^T,    (35)

    D̄_n := c [0, −1; 0, −1] ⊗ x_n x_n^T,    (36)

    W̄_n = c [x_n v_n; x_n v_n].    (37)
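As a sanity check, one can verify numerically that the transformed transition matrix indeed splits as in (34)–(37); the random regressor sample and the values of γ and c below are illustrative assumptions, and the check is not part of the argument:

import numpy as np

rng = np.random.default_rng(4)
p, gamma, c = 2, 0.9, 1.0
lam = 1 - gamma
mu = c * lam ** 2                            # synchronization (33)
I, Z = np.eye(p), np.zeros((p, p))

x = rng.normal(size=(p, 1))
XX = x @ x.T                                 # x_n x_n^T

T = np.block([[I, -gamma * I], [I, -I]]) / lam                 # (26)
Tinv = np.block([[I, -gamma * I], [I, -I]])                    # (27)
A_n = np.block([[gamma * I - mu * XX, -gamma * I], [I, -I]])   # (23)

B_n = np.block([[Z, Z], [Z, -I]]) + c * np.block([[-XX, XX], [-XX, XX]])  # (35)
D_n = c * np.block([[Z, -XX], [Z, -XX]])                                   # (36)

print(np.allclose(T @ Tinv, np.eye(2 * p)))                    # (27) is the inverse of (26)
print(np.allclose(T @ A_n @ Tinv, lam * B_n + lam ** 2 * D_n)) # cf. (34)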
Let us approximate (34) by a standard SA recursion where the term with step-size λ^2 has been removed, that is

    Ū*_{n+1} = Ū*_n + λ B̄_{n+1} Ū*_n + λ W̄_{n+1},   with Ū*_0 = Ū_0.    (38)

Using the linearity of the dynamics and under some technical conditions it can be shown for the difference process,

    ∆Ū_n := Ū_n − Ū*_n,    (39)

that ‖∆Ū_n‖ ≤ C_n λ^2, where (C_n) is a strictly stationary process.

The associated ODE. Let us define the random field R^{2p} → R^{2p}, and introduce the notations

    H̄_n(Ū) := (B̄_n + λ D̄_n) Ū + W̄_n,    (40)

    h̄(Ū) := E[H̄_n(Ū)] = B̄_λ Ū,    (41)

where

    B̄_λ := E[B̄_n + λ D̄_n] = [0, 0; 0, −I] + c [−1, 1 − λ; −1, 1 − λ] ⊗ R*.    (42)

Then, the associated ODE takes the form

    (d/dt) Ū̄_t = h̄(Ū̄_t) = B̄_λ Ū̄_t,   t ≥ 0.    (43)

For the sake of convenience, we set Ū̄_0 = Ū_0. The solution for the limit when λ ↓ 0, corresponding to (38), is denoted by Ū̄*_t.

Lemma 1. If λ is sufficiently small, then B̄_λ is stable.

The proof of Lemma 1 can be found in Appendix A. It is straightforward to show that

    ‖Ū̄_t − Ū̄*_t‖ ≤ c̄̄ λ,    (44)

for all t ≥ 0, where c̄̄ is a deterministic constant.

As in the plain LMS case, the assumption on the boundedness of the signals x_n, v_n would ensure that the estimator process itself stays bounded w.p.1. In the general case of possibly unbounded signals we resort to a (virtual) truncation in order to analyze Ū_n. Thus the transformed estimator process is modified by taking a truncation domain D̄, where D̄ is the interior of a compact set, such that Ū* := 0 ∈ D̄, and we stop the process (Ū_n) if it leaves D̄.

[C2'] We assume that the truncation domain is such that the solution of the ODE (43), with Ū̄_0 = Ū_0, does not leave D̄.

We set

    τ̄ := inf{n : Ū_n ∉ D̄}.    (45)

Let us define the error process, for n ≥ 0, as

    Ū̃_n := (Ū_n − Ū̄_n),    (46)

and define the normalized and time-scaled error process as

    V̄_t(λ) := λ^{−1/2} Ū̃_{[(t∧τ̄)/λ]}.    (47)

Analogously, for the process (Ū*_n) we take a truncation domain D̄* such that D̄ ⊆ int(D̄*) and define τ̄* as in (45). Repeating the above procedure we get

    V̄*_t(λ) := λ^{−1/2} Ū̃*_{[(t∧τ̄*)/λ]}.    (48)

It can be shown under suitable and reasonable technical conditions that the following assumption is satisfied:

[CW] V̄_t(λ) − V̄*_t(λ) converges weakly to zero, as λ → 0.

We note in passing that P(τ̄* ≥ τ̄) tends to 1 as λ → 0. Due to assumption [CW] we can work with the asymptotic properties of (Ū*_n) and thus henceforth we will focus on this process.
The asymptotic covariance matrices of the empirical means of the centered correction terms (H̄*_n(Ū) − h̄*(Ū)) can be expressed, under reasonable conditions (e.g., [10]), as

    S̄(Ū) := Σ_{k=−∞}^{+∞} E[(H̄*_k(Ū) − h̄*(Ū))(H̄*_0(Ū) − h̄*(Ū))^T],    (49)

where H̄*_k and h̄* denote the limits of H̄_k and h̄ as λ ↓ 0. It can be easily seen that, in the case of the approximate LMS method with momentum (38), for Ū = Ū* = 0, we get

    S̄ := S̄(0) = c^2 [S, S; S, S].    (50)

In analogy with Condition 2 of [2], we have:

[C3'] We assume that the process defined by

    L̄_t(λ) = Σ_{n=0}^{[t/λ]−1} (H̄*_n(Ū̄*_{λn}) − h̄*(Ū̄*_{λn})) √λ,    (51)

converges weakly, as λ → 0, to a time-inhomogeneous zero-mean Brownian motion (L̄_t) with local covariances (S̄(Ū̄*_t)).

Then, analogously to Theorem 1, also using Theorem 2 of [2], we obtain the following weak convergence result:

Theorem 2. Under conditions C0, C1, C2', C3' and CW, process (V̄_t(λ)) converges weakly, as λ → 0, to a process (Z̄_t) satisfying the linear stochastic differential equation (SDE),

    dZ̄_t = B̄* Z̄_t dt + S̄^{1/2}(Ū̄*_t) dW̄_t,    (52)

for t ≥ 0, with initial condition Z̄_0 = 0, where (W̄_t) is a standard Brownian motion in R^{2p} and B̄* is

    B̄* := lim_{λ↓0} B̄_λ = [0, 0; 0, −I] + c [−1, 1; −1, 1] ⊗ R*.    (53)

Let us denote the asymptotic covariance matrix of the process (Z̄_t) by P̄. Then, matrix P̄ is the unique solution of the algebraic Lyapunov equation

    B̄* P̄ + P̄ B̄*^T + S̄ = 0,    (54)

where matrix S̄ := S̄(0) is given by equation (50).

The relationship between P̄ and P_0 will be given in Lemma 2. Assuming that the weak convergence of (V̄_t(λ)) to N(0, P̄), when λ → 0 and t → ∞, can be established, we will be able to infer a weak convergence result for the original error process.

IV. COMPARING LMS WITH AND WITHOUT MOMENTUM

The main aim of this section is to compute the asymptotic covariance of the weak limit process associated with momentum LMS and compare it to that of plain LMS. We do this in two steps. First, we compute the asymptotic covariance of the transformed process, then, we map it to the original space.

The asymptotic covariance matrix of process (Z̄_t), namely, the one obtained from the extended and transformed filter coefficient estimation error process of LMS with momentum, is denoted by P̄. Matrix P̄ satisfies the Lyapunov equation

    B̄* P̄ + P̄ B̄*^T + S̄ = 0,    (55)

where S̄ and B̄* are defined by (50) and (53), respectively.

Lemma 2. The solution of the Lyapunov equation (55) is

    P̄ = (c/2) [cS + 2P_0, cS; cS, cS].    (56)

The proof of Lemma 2 can be found in Appendix B.

With Theorem 2 and matrix P̄ at hand, we aim at establishing a weak convergence result and a corresponding covariance matrix for the LMS method with momentum.

Recall that the linear transformation introduced for the state space recursion, Ū_n = T U_n, implies that U_n = T^{−1} Ū_n. However, matrix T^{−1} = T^{−1}(γ) depends on γ, and T^{−1}(1) is singular. Nevertheless, since (V̄_t(λ)) ⇒ (Z̄_t), as λ → 0, where "⇒" denotes weak convergence; and T^{−1}(γ) → T_1^+, as γ → 1, where T^{−1}(γ) and T_1^+ are constant matrices; we can apply Slutsky's theorem for Polish spaces to conclude that (T^{−1}(γ) V̄_t(λ)) ⇒ (T_1^+ Z̄_t), as γ → 1 (or, equivalently, λ → 0, since λ = 1 − γ). In other words, we essentially established that, as λ → 0,

    λ^{−1/2} (U_{[t/λ]} − T^{−1} Ū̄_{[t/λ]}) ⇒ (T_1^+ Z̄_t).    (57)

Let us denote the asymptotic covariance matrix of process (T_1^+ Z̄_t) by P. Matrix P can be computed from P̄ by

    P = T_1^+ P̄ (T_1^+)^T = c [P_0, P_0; P_0, P_0],    (58)

using the special structure of matrix T_1^+, see (25). As this matrix was obtained from a "doubled" process, cf. (21), its submatrices provide the corresponding covariance in the original space. Now we can state the following theorem:

Theorem 3. Assume C0, C1, C2, C2', C3, C3', CW and that the weak convergences carry over to N(0, P_0) and N(0, P), as t → ∞, in case of plain and momentum LMS, respectively. Then, the covariance (sub)matrix of the asymptotic distribution associated with LMS with momentum is c · P_0, where P_0 is the corresponding covariance of plain LMS and c = µ/(1 − γ)^2.

Recall that constants µ and γ are the gains of the correction and momentum terms, respectively. Then, for any µ and γ the asymptotic covariances of the associated processes of plain and momentum LMS methods differ only by a constant factor.

If we set c = 1, then the two asymptotic covariances are the same, and in this sense the two algorithms are equivalent. However, while the weak convergence of standard LMS was obtained by normalizing with µ^{−1/2}, in case of LMS with momentum we need to normalize with λ^{−1/2}, where λ = √µ, which implies a slower convergence to the limiting process; in fact there is an order of magnitude difference.

We can decrease the covariance of the asymptotic distribution for the momentum LMS by decreasing c, however, since λ = √(µ/c), this will further slow the convergence down. If, on the contrary, we want a smaller normalization factor for the case of LMS with momentum by setting c large enough, it will obviously increase the covariance of the asymptotic distribution. Therefore, there is a trade-off between achieving a
small asymptotic covariance and having a fast rate (i.e., smaller normalization factors for the weak convergence).
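Purely as an illustration of the above comparison, under an assumed i.i.d. Gaussian data model with R* = I (a special case of our conditions, for which P_0 = (σ_v²/2) I), one can compare the suitably normalized empirical steady-state error covariances of the two algorithms for c = 1, i.e., µ = (1 − γ)²; all parameter values below are example choices (Python/NumPy):

import numpy as np

rng = np.random.default_rng(6)
p, n_steps, gamma, sigma_v = 3, 100000, 0.95, 0.1
lam = 1 - gamma
mu = lam ** 2                                # c = 1, cf. (33)
theta_star = rng.normal(size=p)

def steady_state_cov(momentum):
    th_prev = th = np.zeros(p)
    errs = []
    for n in range(n_steps):
        x = rng.normal(size=p)               # R* = I for this data model
        y = x @ theta_star + sigma_v * rng.normal()
        th_next = th + mu * x * (y - x @ th) + (gamma * (th - th_prev) if momentum else 0.0)
        th_prev, th = th, th_next
        if n > n_steps // 2:                 # crude "steady-state" segment
            errs.append(th - theta_star)
    return np.cov(np.array(errs).T)

# Both normalized covariances should be close to P0 = (sigma_v**2 / 2) * I here.
print(steady_state_cov(momentum=False) / mu)     # plain LMS, normalized by mu
print(steady_state_cov(momentum=True) / lam)     # momentum LMS, normalized by lambda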
V. CONCLUSIONS

In this paper we have presented the outline of a transparent proof related to a recent result [9]. We studied the asymptotic behavior of the LMS method with momentum, under different, but significantly more realistic conditions. The key technical tool of our analysis was a beautiful and powerful weak convergence result of [2]. We slightly extended the setup of [9] by allowing the correction and momentum gains to be independently chosen, resulting in a trade-off between the rate and the covariance of the asymptotic distribution.

REFERENCES

[1] B. Widrow and M. E. Hoff, "Adaptive switching circuits," tech. rep., Stanford Electronics Laboratories, Stanford University, California, 1960.
[2] J. A. Bucklew, T. G. Kurtz, and W. A. Sethares, "Weak convergence and local stability properties of fixed step size recursive algorithms," IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 966–978, 1993.
[3] A. Benveniste, M. Métivier, and P. Priouret, Adaptive algorithms and stochastic approximations. Springer Science & Business Media, 1990.
[4] J. A. Joslin and A. J. Heunis, "Law of the iterated logarithm for a constant-gain linear stochastic gradient algorithm," SIAM Journal on Control and Optimization, vol. 39, no. 2, pp. 533–570, 2000.
[5] L. Gerencsér, "Rate of convergence of the LMS method," Systems & Control Letters, vol. 24, no. 5, pp. 385–388, 1995.
[6] H. N. Chau, C. Kumar, M. Rásonyi, and S. Sabanis, "On fixed gain recursive estimators with discontinuity in the parameters," arXiv preprint arXiv:1609.05166, 2016.
[7] B. T. Polyak, "Some methods of speeding up the convergence of iteration methods," USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
[8] B. T. Polyak, Introduction to Optimization. Optimization Software, 1987.
[9] K. Yuan, B. Ying, and A. H. Sayed, "On the influence of momentum acceleration on online learning," Journal of Machine Learning Research, vol. 17, no. 192, pp. 1–66, 2016.
[10] P. Billingsley, Convergence of Probability Measures. John Wiley & Sons, 2nd ed., 1999.

APPENDIX A
PROOF OF LEMMA 1

Proof. It is sufficient to prove the lemma for λ = 0. We may also assume c = 1, simply replacing R* by cR* in the proof below. Then, using the Schur complement corresponding to the (1, 1) block, the characteristic polynomial of B̄ is

    det(B̄ − ρI) = det [−R* − ρI, R*; −R*, R* − I − ρI]
                = det(−R* − ρI) det(R* − I − ρI + R* (−R* − ρI)^{−1} R*).    (59)

The matrix in the second term can be written, using the commutativity of (−R* − ρI)^{−1} and R*, as

    (−R* − ρI)^{−1} ((−R* − ρI)(R* − I − ρI) + (R*)^2).    (60)

Since R* was assumed to be positive definite, it is sufficient to show that the roots of

    det(ρ^2 I + ρI + R*) = 0    (61)

have negative real parts. Performing a diagonalization of R* via an orthonormal coordinate transformation, and denoting the eigenvalues of R* by σ_k, the left hand side can be written

    ∏_{k=1}^{p} (ρ^2 + ρ + σ_k).    (62)

Now σ_k > 0 for all k implies the claim of the lemma by well-known, elementary calculations.

APPENDIX B
PROOF OF LEMMA 2

Proof. First, we can observe that

    B̄* P̄ = [0, 0; 0, −I] [P̄_11, P̄_12; P̄_21, P̄_22] + c [−R*, R*; −R*, R*] [P̄_11, P̄_12; P̄_21, P̄_22]
          = [0, 0; −P̄_21, −P̄_22] + c [−R*(P̄_11 − P̄_21), −R*(P̄_12 − P̄_22); −R*(P̄_11 − P̄_21), −R*(P̄_12 − P̄_22)],

and thus

    P̄^T B̄*^T = [0, −P̄_21; 0, −P̄_22] + c [−(P̄_11 − P̄_21)R*, −(P̄_11 − P̄_21)R*; −(P̄_12 − P̄_22)R*, −(P̄_12 − P̄_22)R*].

One then observes that the (1,1) element of the (block) matrix B̄* P̄ + P̄^T B̄*^T + S̄ satisfies the equation

    −cR*(P̄_11 − P̄_21) − (P̄_11 − P̄_21)cR* + c^2 S = 0.    (63)

It follows from the uniqueness of the solution of the Lyapunov equation associated with the standard LMS, i.e., (18), that

    P̄_11 − P̄_21 = cP_0.    (64)

The latter also implies (by using transposition) that

    P̄_11 − P̄_12 = cP_0.    (65)

Summing the last two equations yields

    2P̄_11 − P̄_12 − P̄_21 = 2cP_0.    (66)

Moreover, the elements (1,2), (2,1) and (2,2) of the (block) matrix B̄* P̄ + P̄^T B̄*^T + S̄ satisfy the following equations:

    −cR*(P̄_12 − P̄_22) − (P̄_11 − P̄_12)cR* − P̄_12 + c^2 S = 0,    (67)
    −cR*(P̄_11 − P̄_21) − (P̄_21 − P̄_22)cR* − P̄_21 + c^2 S = 0,    (68)
    −cR*(P̄_12 − P̄_22) − (P̄_21 − P̄_22)cR* − 2P̄_22 + c^2 S = 0,    (69)

and recall the equation for the element (1,1), i.e., (63),

    −cR*(P̄_11 − P̄_21) − (P̄_11 − P̄_21)cR* + c^2 S = 0.    (70)

When adding (70) and (69) together and subtracting from them (67) and (68), one concludes that the overall sum of terms having cR* as a multiplier vanishes. Consequently, due to (66),

    P̄_22 = P̄_11 − cP_0,    (71)

which yields, also using (65) and (64), that

    P̄ = [P̄_11, P̄_11 − cP_0; P̄_11 − cP_0, P̄_11 − cP_0].    (72)

Thus, equation (69) is reduced to 2P̄_22 = c^2 S, which yields P̄_22 = c^2 S / 2, and consequently, due to (71), one obtains P̄_11 = c^2 S / 2 + cP_0, and the solution to the Lyapunov equation is (56).