Abstract
In this paper we address the classical optimization problem of minimizing a proper, convex and lower semicontinuous function via a second-order-in-time dynamics combining viscous and Hessian-driven damping with a Tikhonov regularization term. In our analysis we heavily exploit the Moreau envelope of the objective function and its properties, as well as properties of the Tikhonov regularization, which we extend to the nonsmooth case. We introduce a setting which simultaneously guarantees fast convergence of the function (and Moreau envelope) values and strong convergence of the trajectories of the system to the minimal norm solution, that is, the element of minimal norm among all minimizers of the objective. Moreover, we deduce precise rates of convergence of the values for a particular choice of parameters. Various numerical examples are included as an illustration of the theoretical results.
1 Introduction
1.1 The formulation of the problem
In the Hilbert space H, endowed with the inner product \(\langle \cdot , \cdot \rangle \) and the norm \( \Vert \cdot \Vert = \sqrt{\langle \cdot , \cdot \rangle } \), we consider the classical minimization problem
of a proper, convex and lower semicontinuous function \(\Phi \). In order to address this question we would like to use the well-known technique of linking the gradient of the Moreau envelope \(\Phi _\lambda \) of the objective function \(\Phi \) to the second order in time differential equation
and study its convergence properties, showing that along the trajectory—the solution of (1)—the function \(\Phi \) converges to its minimum. The initial conditions are \(x(t_0) = x_0 \in H\) and \(\dot{x}(t_0) = x_1 \in H\), with \( \alpha , \beta , t_0 > 0 \). Here \( \Phi : H \longrightarrow \overline{\mathbb {R}} = \mathbb {R} \cup \{ \pm \infty \} \) is a proper, convex and lower semicontinuous function and \(\Phi _\lambda \) is its Moreau envelope of index \(\lambda > 0\). The function \(\lambda : [t_0, +\infty ) \longrightarrow \mathbb {R}_+\) is assumed to be continuously differentiable and nondecreasing, while the function \(\varepsilon : [t_0, +\infty ) \longrightarrow \mathbb {R}_+\) is continuously differentiable and nonincreasing with \(\lim _{t \rightarrow +\infty } \varepsilon (t) = 0\). In addition, we assume that \(\mathop {\textrm{argmin}}\limits \Phi \), the set of global minimizers of \(\Phi \), is not empty and denote by \(\Phi ^*\) the optimal objective value of \(\Phi \). Finally, for every \(t \ge t_0\) let us introduce the strongly convex function \(\varphi _{\varepsilon (t), \lambda (t)}: H \longrightarrow \mathbb {R}\) defined as \(\varphi _{\varepsilon (t), \lambda (t)} (x) = \Phi _{\lambda (t)}(x) + \frac{\varepsilon (t) \Vert x \Vert ^2}{2}\), and let us denote its unique minimizer as \(x_{\varepsilon (t), \lambda (t)} = \mathop {\textrm{argmin}}\limits _{H} \varphi _{\varepsilon (t), \lambda (t)}\).
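To make the objects \(\Phi _\lambda \) and \(x_{\varepsilon (t), \lambda (t)}\) concrete, here is a small one-dimensional numerical sketch for the toy nonsmooth objective \(\Phi (x) = |x - 1|\) (an assumption made purely for illustration; its unique minimizer, and hence its minimal norm solution, is 1). Since \(\varphi _{\varepsilon , \lambda }\) is \(\varepsilon \)-strongly convex, its derivative is strictly increasing and the unique minimizer can be located by bisection.

```python
def prox_shifted_abs(x, lam):
    """Proximal operator of the toy objective Phi(x) = |x - 1|:
    soft-thresholding shifted to the kink at 1."""
    z = x - 1.0
    if z > lam:
        return 1.0 + z - lam
    if z < -lam:
        return 1.0 + z + lam
    return 1.0

def tikhonov_minimizer(eps, lam, lo=-100.0, hi=100.0, tol=1e-12):
    """Unique minimizer x_{eps,lam} of phi(x) = Phi_lam(x) + eps*x^2/2,
    found by bisecting the strictly increasing derivative
    phi'(x) = (x - prox_{lam*Phi}(x)) / lam + eps*x."""
    def dphi(x):
        return (x - prox_shifted_abs(x, lam)) / lam + eps * x
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dphi(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For this toy objective one can solve the optimality condition by hand: for \(|x - 1| \le \lambda \) it gives \(x_{\varepsilon , \lambda } = \frac{1}{1 + \varepsilon \lambda }\), so as \(\varepsilon \downarrow 0\) the regularized minimizer tends to the minimal norm solution 1, which the code reproduces.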
The main goal of this research is to provide a setting in the nonsmooth case with fast convergence of the function values combined with strong convergence of the trajectories to the element of minimal norm among all minimizers of the objective function. This analysis extends the one conducted in [3] to the case of a nonsmooth objective function. We also provide exact rates of convergence of the values for a polynomial choice of the smoothing parameter \(\lambda \) and the Tikhonov function \(\varepsilon \). Finally, multiple numerical experiments were conducted to allow a better understanding of the theoretical results.
1.2 Related results
The Moreau envelope plays a significant role in nonsmooth optimization. It is defined as (\(\Phi : H \rightarrow \overline{\mathbb {R}}\) is a proper, convex and lower semicontinuous function)
where \(\lambda > 0\). \(\Phi _\lambda \) is convex and continuously differentiable with
and \(\nabla \Phi _\lambda \) is \(\frac{1}{\lambda }\)-Lipschitz continuous. Here,
denotes the proximal operator of \(\Phi \) of parameter \(\lambda \). Moreover (see [1]),
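These identities are easy to check numerically. The sketch below assumes the toy objective \(\Phi (x) = |x|\) (an illustrative choice, whose proximal operator is the soft-thresholding map), evaluates the Moreau envelope through its proximal operator, and verifies the gradient formula \(\nabla \Phi _\lambda (x) = \frac{1}{\lambda } \left( x - \mathop {\textrm{prox}}\nolimits _{\lambda \Phi }(x) \right) \) against a central finite difference.

```python
import math

def prox_abs(x, lam):
    """prox_{lam*Phi}(x) for Phi = |.|: the soft-thresholding operator."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def moreau_abs(x, lam):
    """Phi_lam(x) = min_y |y| + (x - y)^2/(2*lam), attained at y = prox_abs(x, lam)."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)

def grad_moreau_abs(x, lam):
    """Gradient formula: (x - prox_{lam*Phi}(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam

# central finite-difference check of the gradient formula
h = 1e-6
for x in (-2.0, 0.3, 1.7):
    fd = (moreau_abs(x + h, 1.0) - moreau_abs(x - h, 1.0)) / (2.0 * h)
    assert abs(fd - grad_moreau_abs(x, 1.0)) < 1e-5
```

One also sees the \(\frac{1}{\lambda }\)-Lipschitz continuity directly: here the derivative equals \(x / \lambda \) for \(|x| \le \lambda \) and \(\pm 1\) otherwise.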
The work [5] by Attouch and László serves as a starting point for many research directions in nonsmooth optimization. The following dynamics was considered
where \(\alpha > 1\) and \(\beta > 0\), and the term \(\frac{d}{dt} \nabla \Phi _{\lambda (t)}(x(t))\) is inspired by the Hessian driven damping term in the case of smooth functions. For this system multiple fundamental results were proven, such as convergence rates for the Moreau envelope values as well as for the velocity of the system
from which convergence rates for \(\Phi \) along the trajectories themselves were deduced
In addition, convergence rates for the gradient of the Moreau envelope of parameter \(\lambda (t)\) and its time derivative along x(t) were established
Moreover, the weak convergence of the trajectories x(t) to a minimizer of \(\Phi \) as \(t \rightarrow +\infty \) was deduced.
From here one may go in many directions in order to continue investigating the topic of second order dynamics. Time scaling, for instance, can be introduced to improve the speed of convergence of the values, as it was done in [12]. Another way to proceed is to consider the so-called Tikhonov regularization technique, to which we devote the next few pages of our manuscript.
The presence of the Tikhonov term in the system equation dramatically influences the behaviour of its trajectories: namely, under appropriate conditions, it improves the convergence of the trajectories from weak to strong. Not only that, but it also ensures convergence not to an arbitrary element of the set of minimizers of the objective, but to the particular one of smallest norm. In the presence of the Tikhonov term it is still possible to obtain fast rates of convergence of the function values. Systems with Tikhonov regularization were studied, for instance, in [2,3,4, 6, 10, 11, 13, 14].
One of the fine examples in a smooth setting is presented below (see [3])
where \(\varphi _t (x) = \Phi (x) + \frac{\varepsilon (t) \Vert x \Vert ^2}{2}\), \( \Phi : H \longrightarrow \mathbb {R}\) is twice continuously differentiable and convex, \(\varepsilon \) is nonincreasing and goes to zero, as \(t \rightarrow +\infty \), and p is chosen appropriately. This system inherits the properties of fast convergence rates of the function values, being of the order \(\frac{1}{t^2}\), and additionally provides the strong convergence results for the trajectories of the system in the same setting.
Concerning the nonsmooth case we refer to [11], where it was covered for more general systems governed by a maximally monotone operator, but with a different damping. The authors studied the following dynamics
where \(\alpha > 0\), \(\beta \ge 0\), \(0 < q \le 1\) and \(\lambda (t) = \lambda t^{2q}\) for \(\lambda > 0\), A is a maximally monotone operator and \(A_\lambda \) is its Yosida regularization of parameter \(\lambda \). The system (5) is related to the inclusion problem \(0 \in Ax\). The authors showed fast convergence rates for \(\Vert \dot{x}(t) \Vert \), \(\Vert A_{\lambda (t)} (x(t)) \Vert \) and \(\Vert \frac{d}{dt} A_{\lambda (t)} (x(t)) \Vert \), of the order \(\frac{1}{t^q}\), \(\frac{1}{t^{2q}}\) and \(\frac{1}{t^{3q}}\) respectively. Moreover, they established the strong convergence of the trajectories of the system. In Section 4 of [11] the authors also considered a very interesting particular case of \(A = \partial \Phi \), using the well-known connection \(A_\lambda = \left( \partial \Phi \right) _\lambda = \nabla \Phi _\lambda \) for all \(\lambda > 0\). In this connection we would like to formulate the next remark.
Remark 1
We would like to stress that Theorem 11 of [11] does not cover the case presented in this paper.
-
1.
First of all, the systems (1) and (5) have different damping coefficients. The damping in (1) depends on the Tikhonov function \(\varepsilon \), while the damping in (5) is taken in a polynomial form \(\frac{1}{t^q}\). Thus, if we take \(\varepsilon (t) = \frac{1}{t^{2q}}\) in (5) to mimic the relation between the damping parameter and the Tikhonov function as in (1), then one of the conditions of Theorem 11 becomes
$$\begin{aligned} \int _{t_0}^{+\infty } t^{3q} \varepsilon ^2(t) dt \ = \ \int _{t_0}^{+\infty } \frac{1}{t^q} dt \ < \ +\infty , \end{aligned}$$where \(0< q < 1\), which is obviously not fulfilled.
-
2.
Secondly, the smoothing parameter \(\lambda \) in [11] is fixed, while our analysis holds for a more general choice of \(\lambda \). However, if we want to consider the polynomial case of parameters (Sect. 4), then we indeed arrive at a similar restriction for \(\lambda \): in Sect. 4 we will discover that for strong convergence of the trajectories and a polynomial choice of parameters, \(\lambda (t) = t^l\), we have to take \(0 \le l < 2\), which is a wider range than \(0< q < 1\) for \(\lambda (t) = t^{2q}\).
-
3.
Finally, the sets of conditions \(\left( C_0 \right) \)–\(\left( C_4 \right) \) and our assumptions (11)–(14) lead to different settings in case of polynomial choice of the function \(\varepsilon (t) = \frac{1}{t^d}\), \(d > 0\) (Sect. 4). Namely, according to Corollary 9 of [11] the setting to satisfy all the conditions in the analysis in this case for \(\beta > 0\) is \(\max \left\{ 1-q, \frac{3q + 1}{2} \right\} < d \le 1 + q\) with \(0< q < 1\), whereas as we will see later (Sect. 4) our set of conditions allows \(1 \le d \le 2\), which is more flexible in terms of the upper bound while being almost the same in terms of the lower bound. Thus, \(d = 2\) is not an option in [11], but the lower limitation for the choice of d could be wider depending on the choice of q. Since the rates of convergence are better for the bigger values of d (Theorem 8), this additional flexibility of the upper bound justifies, in our opinion, the restrictions for the lower one.
In this paper we aim to develop the ideas presented in [3] for \(p = 0\) to cover the nonsmooth case. The objective function \(\Phi \) is no longer required to be (continuously) differentiable, which gives us more freedom in choosing the latter. Moreover, we show that the main quantities \(\Phi _{\lambda (t)} (x(t)) - \Phi ^*\), \(\Phi \left( \mathop {\textrm{prox}}\limits _{\lambda (t) \Phi } (x(t)) \right) - \Phi ^*\), \(\Vert \mathop {\textrm{prox}}\limits _{\lambda (t) \Phi } (x(t)) - x(t) \Vert \) and \(\Vert x(t) - x_{\varepsilon (t), \lambda (t)} \Vert \) go to zero, as \(t \rightarrow +\infty \), without specifying (as it was done in [3]) the choice of the functions \(\varepsilon \) and \(\lambda \). We are also able to obtain rates of convergence of function values in case of the polynomial choice of parameters \(\varepsilon (t) = \frac{1}{t^d}\) for \(d = 2\), which is not an option in [3].
1.3 Our contribution
Our main focus throughout this manuscript is obtaining fast convergence of function values alongside strong convergence of the trajectories to the minimal norm solution. The main result is given by Theorem 5, namely, for any \(t \ge t_0\)
and
and, as we will see later (Theorem 7), all the quantities on the right-hand side of the inequalities are going to zero, as \(t \rightarrow +\infty \), under some additional, yet not very restrictive, assumptions.
A rather interesting particular case of polynomial parameters (\(\varepsilon (t) = \frac{1}{t^d}\), \(\lambda (t) = t^l\), \(l, d > 0\)) is also covered in this paper which gives us the following precise rates of convergence. For t large enough we deduce
and
where \(1 \le d < 2\) and \(0 \le l < d\). The state-of-the-art rates of convergence of the function values are of the order \(\frac{1}{t^2}\), and since d is assumed to be strictly less than 2, we obtain estimates that are almost state-of-the-art.
Finally, the special case of \(d = 2\) is also considered in this manuscript, which gives the following results depending on the value of the damping coefficient \(\alpha \).
-
1.
If \(0< \alpha < 2\), then for t large enough
$$\begin{aligned}{} & {} \Phi _{\lambda (t)} (x(t)) - \Phi ^* \ \le \ \frac{1}{t^{\frac{\alpha }{2} + 1}},\\{} & {} \Phi \left( \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi }(x(t)) \right) - \Phi ^* \ \le \ \frac{1}{t^{\frac{\alpha }{2} + 1}} \end{aligned}$$and
$$\begin{aligned} \Vert \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi } (x(t)) - x(t) \Vert ^2 \ \le \ \frac{1}{t^{\frac{\alpha }{2} - l + 1}}. \end{aligned}$$ -
2.
If \(\alpha \ge 2\), then for t large enough
$$\begin{aligned}{} & {} \Phi _{\lambda (t)} (x(t)) - \Phi ^* \ \le \ \frac{1}{t^2},\\{} & {} \Phi \left( \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi }(x(t)) \right) - \Phi ^* \ \le \ \frac{1}{t^2} \end{aligned}$$and
$$\begin{aligned} \Vert \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi } (x(t)) - x(t) \Vert ^2 \ \le \ \frac{1}{t^{2-l}}. \end{aligned}$$
In this case we cannot guarantee the strong convergence of the trajectories, but for \(\alpha \ge 2\) (which is often the choice of the damping parameter in the literature) we show the best known rates of convergence.
The paper is structured as follows. Section 2 gathers some preliminary results which we will need in our analysis. The main results of our research are presented in Sect. 3. Section 4 provides the polynomial setting in which the results are valid and the analysis works, and establishes the actual rates of convergence of the values and the trajectories. Finally, Sect. 5 presents numerical experiments which illustrate the theory.
2 Preliminaries
2.1 Auxiliary estimates and properties
Let us begin with introducing the so-called first order optimality condition, as we will require it later in our analysis. In our case it reads as
Now we continue with the following lemma (see [9], Proposition 12.22, for the first item of the lemma and [7], Appendix, A1, for the second one).
Lemma 1
Let \(\Phi : H \longrightarrow \overline{\mathbb {R}}\) be a proper, convex and lower semicontinuous function, \(\lambda , \mu > 0\). Then
-
1.
\((\Phi _\lambda )_\mu = \Phi _{\lambda + \mu }\).
-
2.
\( \mathop {\textrm{prox}}\limits _{\mu \Phi _\lambda } = \frac{\lambda }{\lambda + \mu } \mathop {\textrm{Id}}\limits + \frac{\mu }{\lambda + \mu } \mathop {\textrm{prox}}\limits _{(\lambda + \mu ) \Phi } \).
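Both items of Lemma 1 can be sanity-checked numerically. Assuming again the illustrative objective \(\Phi = |\cdot |\), the sketch below evaluates \((\Phi _\lambda )_\mu \) through the proximal formula of item 2 and compares it with \(\Phi _{\lambda + \mu }\), in accordance with item 1.

```python
def prox_abs(x, lam):
    """Soft-thresholding: prox of Phi = |.| with parameter lam."""
    return (x - lam if x > lam else x + lam if x < -lam else 0.0)

def env_abs(x, lam):
    """Moreau envelope Phi_lam of Phi = |.|."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)

def prox_env(x, lam, mu):
    """Item 2: prox_{mu*Phi_lam} = lam/(lam+mu)*Id + mu/(lam+mu)*prox_{(lam+mu)*Phi}."""
    return lam / (lam + mu) * x + mu / (lam + mu) * prox_abs(x, lam + mu)

def env_env(x, lam, mu):
    """(Phi_lam)_mu evaluated at its minimizer prox_env(x, lam, mu)."""
    p = prox_env(x, lam, mu)
    return env_abs(p, lam) + (x - p) ** 2 / (2.0 * mu)

# Item 1: (Phi_lam)_mu coincides with Phi_{lam+mu} pointwise
for x in (-3.0, -0.4, 0.0, 0.7, 2.0):
    assert abs(env_env(x, 0.3, 0.7) - env_abs(x, 1.0)) < 1e-12
```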
The following estimates will be used later to evaluate the derivative of our energy function.
Lemma 2
The following properties are satisfied:
-
1.
for each \(t \ge t_0\), \(\frac{d}{dt} \left( \varphi _{\varepsilon (t), \lambda (t)} (x_{\varepsilon (t), \lambda (t)}) \right) = \frac{1}{2} \left( {\dot{\varepsilon }}(t) - \dot{\lambda }(t) \varepsilon ^2(t) \right) \Vert x_{\varepsilon (t), \lambda (t)} \Vert ^2\);
-
2.
the function \(t \mapsto x_{\varepsilon (t), \lambda (t)}\) is Lipschitz continuous on compact intervals of \((t_0, +\infty )\) and thus almost everywhere differentiable. Moreover, for almost every \(t \ge t_0\)
$$\begin{aligned} \left( \frac{2 {\dot{\lambda }}(t)}{\lambda (t)} - \frac{\dot{\varepsilon }(t)}{\varepsilon (t)} \right) \Vert x_{\varepsilon (t), \lambda (t)} \Vert \ge \left\| \frac{d}{dt} x_{\varepsilon (t), \lambda (t)} \right\| . \end{aligned}$$
Let us also mention two key properties of the Tikhonov regularization, which we will use later in the analysis. For the next Lemma see also Proposition 5 of [11].
Lemma 3
Suppose that
Then the following properties of the mapping \(t \longrightarrow x_{\varepsilon (t), \lambda (t)}\) are satisfied:
and
Lemmas 2 and 3 will be rigorously proven in the Appendix.
2.2 Existence and uniqueness of the solution of (1)
Our nearest goal is to deduce the existence and uniqueness of the solution of the dynamical system (1). Suppose \(\beta > 0\). Let us integrate (1) from \(t_0\) to t to obtain
Denoting \(z(t):= \int _{t_0}^t \left( \alpha \sqrt{\varepsilon (s)} \dot{x}(s) + \nabla \Phi _{\lambda (s)} (x(s)) + \varepsilon (s) x(s) \right) ds - \big ( \dot{x}(t_0) + \beta \nabla \Phi _{\lambda (t_0)} (x_0) \big )\) for every \(t \ge t_0\) and noticing that \(\dot{z}(t) = \alpha \sqrt{\varepsilon (t)} \dot{x}(t) + \nabla \Phi _{\lambda (t)} (x(t)) + \varepsilon (t) x(t)\), we deduce that (1) is equivalent to
Let us multiply the second equation by \(\beta \) and add it to the first one, which eliminates the gradient of the Moreau envelope from the second equation
We denote now \(y(t) = \beta z(t) + \left( 1 - \alpha \beta \sqrt{\varepsilon (t)} \right) x(t)\), and, after simplification, we obtain the following equivalent formulation for the dynamical system
In case \(\beta = 0\) for every \(t \ge t_0\), (1) can be equivalently written as
Therefore, based on the two reformulations of the dynamical system (1) above we provide the following existence and uniqueness result, which is a consequence of the Cauchy-Lipschitz theorem for strong global solutions. The proof follows the lines of the proofs of Theorem 1 in [5] or of Theorem 1.1 in [8] with some small adjustments.
Theorem 4
Suppose that there exists \(\lambda _0 > 0\) such that \(\lambda (t) \ge \lambda _0\) for all \(t \ge t_0\). Then for every \((x_0, \dot{x}_0) \in H \times H \) there exists a unique strong global solution \(x: [t_0, +\infty ) \longrightarrow H\) of the continuous dynamics (1) which satisfies the Cauchy initial conditions \(x(t_0) = x_0\) and \(\dot{x}(t_0) = \dot{x}_0\).
3 Abstract convergence results of the function values and strong convergence of the trajectories
This section is devoted to establishing some crucial estimates for the following quantities
\(\Phi _{\lambda (t)}(x(t)) - \Phi ^*\) and \(\Vert x(t) - x_{\varepsilon (t), \lambda (t)}\Vert \) for all \(t \ge t_0\). In order to do so we will use the ideas and methods of Lyapunov analysis. We introduce the energy function
where \(\frac{\alpha }{2} \le \gamma < \alpha \). The next theorem provides the main result of this section.
Theorem 5
Let \(x: [t_0, +\infty ) \longrightarrow H\) be a solution of (1). Then for any \(t \ge t_0\)
and
and the trajectory x(t) converges strongly to \(x^*\) as soon as \(\lim _{t \rightarrow +\infty } \frac{E(t)}{\varepsilon (t)} = 0\).
Proof
Consider
Using the definition of E we obtain
By the definition of the proximal mapping
Thus,
and
The second result immediately follows from the \(\varepsilon (t)\)-strong convexity of \(\varphi _{\varepsilon (t), \lambda (t)}\):
and thus
Finally, by \(\lim _{t \rightarrow +\infty } \varepsilon (t) = 0\) and (9) we deduce the strong convergence of the trajectories to \(x^*\) as soon as \(\lim _{t \rightarrow +\infty } \frac{E(t)}{\varepsilon (t)} = 0\). \(\square \)
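For the reader's convenience, the strong convexity estimate invoked in the proof is the standard one: since \(x_{\varepsilon (t), \lambda (t)}\) minimizes the \(\varepsilon (t)\)-strongly convex function \(\varphi _{\varepsilon (t), \lambda (t)}\), the first-order term in the strong convexity inequality vanishes there, and for every \(x \in H\)

$$\begin{aligned} \varphi _{\varepsilon (t), \lambda (t)} (x) - \varphi _{\varepsilon (t), \lambda (t)} \left( x_{\varepsilon (t), \lambda (t)} \right) \ \ge \ \frac{\varepsilon (t)}{2} \left\| x - x_{\varepsilon (t), \lambda (t)} \right\| ^2 , \end{aligned}$$

which is applied along the trajectory at \(x = x(t)\).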
Theorem 5 provided some abstract estimates for the important quantities. In order to show that these estimates are actually meaningful, we will have to first estimate the energy functional E. The idea is to show that this energy function satisfies the following differential inequality, as it was done in [3],
where \(\mu (t) = \left( \alpha - \gamma \right) \sqrt{\varepsilon (t)} - \frac{{\dot{\varepsilon }}(t)}{2 \varepsilon (t)}\) and g is a positive function. The next theorem provides the analysis needed to obtain the desired inequality.
Theorem 6
Let \(x: [t_0, +\infty ) \longrightarrow H\) be a solution of (1). Assume that (7) holds and suppose that there exist \(a, c > 0\) such that for t large enough it holds that
and
Then there exists \(t_1 \ge t_0\) such that for all \(t \ge t_1\)
and
where \(\Gamma (t) = \exp \left( \int _{t_1}^t \mu (s) ds \right) \) and \(g(t) = {\dot{\lambda }}(t) \varepsilon ^2(t) - {\dot{\varepsilon }}(t) + \frac{\gamma \beta {\dot{\varepsilon }}(t) \sqrt{\varepsilon (t)}}{2} + \gamma (2a + c \gamma ) \sqrt{\varepsilon (t)} \left( \frac{2 \dot{\lambda }(t)}{\lambda (t)} - \frac{{\dot{\varepsilon }}(t)}{\varepsilon (t)} \right) ^2 \).
Proof
We start with computing the derivative of the energy function (10). Let us denote \(v(t) = \gamma \sqrt{\varepsilon (t)} \left( x(t) - x_{\varepsilon (t), \lambda (t)} \right) + \dot{x}(t) + \beta \nabla \Phi _{\lambda (t)} (x(t))\). Once again, by the classical chain rule, using (1) from Lemma 2 and (3), we obtain for all \(t \ge t_0\)
Our nearest goal is to obtain the upper bound for \(\dot{E}\). Let us calculate for all \(t \ge t_0\)
where above we used (1). Thus, for all \(t \ge t_0\)
Let us use the previous estimates to evaluate the quantity \(\langle \dot{v}(t), v(t) \rangle \). Namely, by the \(\varepsilon (t)\)-strong convexity of \(\varphi _{\varepsilon (t), \lambda (t)}\) for all \(t \ge t_0\)
and then for all \(t \ge t_0\)
Again, by the \(\varepsilon (t)\)-strong convexity of \(\varphi _{\varepsilon (t), \lambda (t)}\) since \({\dot{\varepsilon }}(t) \ \le \ 0\) for all \(t \ge t_0\)
Furthermore,
It is true that for all \(a > 0\)
as well as
In the same spirit for all \(b > 0\)
Furthermore,
and
Combining all the estimates above we arrive for all \(t \ge t_0\) at
Returning to the expression for \(\dot{E}(t)\) we notice that the terms \(\left\langle \nabla \varphi _{\varepsilon (t), \lambda (t)} (x(t)), \dot{x}(t) \right\rangle \) cancel each other out.
Let us now consider
since
Therefore, using \(\mu (t) = \left( \alpha - \gamma \right) \sqrt{\varepsilon (t)} - \frac{{\dot{\varepsilon }}(t)}{2 \varepsilon (t)}\) (the terms with \(\langle x(t) - x_{\varepsilon (t), \lambda (t)}, \dot{x}(t) \rangle \) disappear), we obtain for all \(t \ge t_0\)
Further we have (\({\dot{\varepsilon }}(t) \le 0\) for all \(t \ge t_0\))
As we have established earlier by Lemma 2 item 2 and (8)
and since there exists \(t_1 \ge t_0\) such that \(\left( \sqrt{\varepsilon (t)} \rightarrow 0, \text { as } t \rightarrow +\infty \right) \)
we deduce for all \(t \ge t_1\)
Choosing \(b = c \gamma \) with \(c > 0\) we obtain for all \(t \ge t_1\)
Let us investigate the signs of the terms in the inequality above when t is large enough to satisfy what we assumed before (11)–(14). First of all,
due to (11). Secondly,
due to (12). Next we have
due to (11). Then,
due to (14). Finally,
due to (13), since
So, at the end we deduce for all \(t \ge t_1\)
Integrating (16) from \(t_1\) to t we obtain
or, neglecting the positive terms,
From (16) we also obtain for all \(t \ge t_1\)
Multiplying this with \(\Gamma (t) = \exp \left( \int _{t_1}^t \mu (s) ds \right) \) and integrating again on \([t_1, t]\) we deduce
\(\square \)
Now that we have an estimate for the energy function as well, we are able to derive under which conditions the quantities on the right-hand side of the estimates of Theorem 5 converge to zero, as \(t \rightarrow +\infty \). This result is given by the next theorem.
Theorem 7
Let \(x: [t_0, +\infty ) \longrightarrow H\) be a solution of (1). Assume that (7) holds and suppose that there exist \(a, c > 0\) such that for t large enough assumptions (11)–(14) hold. Suppose additionally that
Then
The reader may find the proof of this theorem in the Appendix. As we can see, the results of Theorem 7 together with (7) and \(\lim _{t \rightarrow +\infty } \varepsilon (t) = 0\) guarantee the convergence to zero, as \(t \rightarrow +\infty \), of the quantities in Theorem 5.
4 Polynomial choice of parameters
In this section we would like to specify the form of the functions \(\lambda \) and \(\varepsilon \), namely, taking \(\lambda (t) = t^l\) and \(\varepsilon (t) = \frac{1}{t^d}\), \(l \ge 0\) and \(d > 0\), and show that the main results still hold in this case. First of all, equation (1) becomes
The second step would be to show that the main result of Theorem 6 is valid and obtain the precise rates of convergence for the function values and trajectories using Theorem 5. In order to do so let us formulate the next theorem.
Theorem 8
Let \(x: [t_0, +\infty ) \longrightarrow H\) be a solution of (18). Assume that \(1 \le d < 2\) and \(0 \le l < d\). Then for t large enough
and
If \(d = 2\) and \(0 \le l < 2\) we deduce the following estimates for t large enough.
If \(0< \alpha < 2\), then
and
If \(\alpha \ge 2\), then
and
Proof
It is easy to check (see Appendix) that the above choice of d and l satisfies conditions (7) and (11)–(14). Therefore the results of Theorem 6 are valid in this case. Namely, in Theorem 6 we have obtained
where \(g(t) = {\dot{\lambda }}(t) \varepsilon ^2(t) - {\dot{\varepsilon }}(t) + \frac{\gamma \beta {\dot{\varepsilon }}(t) \sqrt{\varepsilon (t)}}{2} + \gamma (2a + c \gamma ) \sqrt{\varepsilon (t)} \left( \frac{2 \dot{\lambda }(t)}{\lambda (t)} - \frac{{\dot{\varepsilon }}(t)}{\varepsilon (t)} \right) ^2\), \(\Gamma (t) = \exp \left( \int _{t_1}^t \mu (s) ds \right) \) and \(\mu (t) = \left( \alpha - \gamma \right) \sqrt{\varepsilon (t)} - \frac{{\dot{\varepsilon }}(t)}{2 \varepsilon (t)}\). Now, our goal is to deduce the actual rates of convergence of the function values and trajectories. The proof will be divided into several sections for the convenience of the reader.
The functions \(\mu \) and \(\Gamma \)
Let us consider the case when \(1 \le d < 2\). The case \(d = 2\) will be treated separately. The function \(\mu \) thus writes as \(\mu (t) \ = \ \frac{\alpha - \gamma }{t^{\frac{d}{2}}} + \frac{d}{2t}\). Then,
$$\begin{aligned} \Gamma (t) \ = \ \exp \left( \int _{t_1}^t \mu (s) ds \right) \ = \ C \, t^{\frac{d}{2}} \exp \left[ \frac{\alpha - \gamma }{1 - \frac{d}{2}} t^{1 - \frac{d}{2}} \right] , \end{aligned}$$where \(C = \left( t_1^\frac{d}{2} \exp \left[ \frac{\alpha - \gamma }{1 - \frac{d}{2}} t_1^{1 - \frac{d}{2}} \right] \right) ^{-1}\). So, \(\frac{\Gamma (t_1) E(t_1)}{\Gamma (t)}\) goes to zero exponentially, as time goes to infinity, due to \(1 \le d < 2\).
The function g
First notice that
where \(C_1 = \gamma (2a + c \gamma ) (2l + d)^2\). Then,
Let us notice that the behaviour of \(\frac{l}{t^{\frac{3d}{2} + 1 - l}} + \frac{d}{t^{\frac{d}{2}+1}} - \frac{\gamma \beta d}{2 t^{d+1}} + \frac{C_1}{t^2}\) is dictated by the term \(\frac{1}{t^{\frac{d}{2}+1}}\), as \(t \rightarrow +\infty \), since \(1 \le d < 2\) and \(0 \le l < d\).
Integrating the product \(\Gamma g\)
The technique, which will be used in this section, is inspired by [3]. First of all, notice that for some \(\delta > 0\)
Secondly, there exists \(\delta \) such that, starting from some \(t_2 \ge t_1\), it holds that
Thus,
where \(C_2 = C \frac{\exp \left( \frac{\alpha - \gamma }{1 - \frac{d}{2}} t_2^{1 - \frac{d}{2}} \right) }{\delta t_2}\).
Finalizing the estimates
Let us return to
This expression converges to zero at the speed of the slowest decaying term (all the others decay exponentially):
Thus, there exists a constant \(C_3 > 0\) such that for all \(t \ge t_2\)
The rates themselves
Now we can deduce the actual rates for the quantities in Theorem 5. For all \(t \ge t_2\)
and
Finally, there exist constants \(C_4, C_5 > 0\) such that for all \(t \ge t_2\)
and
The rates of convergence of the function values in case \(d=2\)
This particular case is of great interest, as it is in a sense a borderline case, where one cannot show the strong convergence of the trajectories, but can still show the fast convergence of the values. In this case the functions \(\mu \) and \(\Gamma \) are
$$\begin{aligned} \mu (t) \ = \ \frac{\alpha - \gamma }{t} + \frac{1}{t} \ = \ \frac{\alpha - \gamma + 1}{t} \end{aligned}$$and
$$\begin{aligned} \Gamma (t) \ = \ C \, t^{\alpha - \gamma + 1}, \end{aligned}$$where \(C = \frac{1}{t_1^{\alpha - \gamma + 1}}\).
where \(C_1 = 4 \gamma (2a + c \gamma ) (l + 1)^2\). Thus,
So,
where \(C_2 = C \left( \frac{l t_1^{\alpha - \gamma + l - 3}}{\alpha - \gamma + l - 3} + \frac{\left( 2 + C_1 \right) t_1^{\alpha - \gamma - 1}}{\alpha - \gamma - 1} - \frac{\gamma \beta t_1^{\alpha - \gamma - 2}}{\alpha - \gamma - 2} \right) \). By Theorem 5 we have
where \(C_3 = \frac{2C t_1^{\alpha - \gamma + 1} E(t_1) - C_2 \Vert x^* \Vert ^2}{2C}\). We know that \(\frac{\alpha }{2} \le \gamma < \alpha \) and \(0 \le l < 2\). Thus, in the brackets the term with \(t^{-2}\) is dominating, as \(t \rightarrow +\infty \). Moreover, \(\alpha - \gamma + 1 > 1\). So, the behaviour of the entire expression depends on the value of \(\alpha \). There exists a constant \(C_4\) such that for all \(t \ge t_1\)
That leads us to the following rates for all \(t \ge t_1\)
and
As we can see, the strong convergence of the trajectories can no longer be shown. Nevertheless, for \(C_5 = \frac{2C_4 + \Vert x^* \Vert ^2}{2}\) we deduce for all \(t \ge t_1\)
and
Since we are free to choose \(\gamma \) such that \(\frac{\alpha }{2} \le \gamma < \alpha \), and since we want to have as fast rates as possible, we should take \(\gamma = \frac{\alpha }{2}\).
and
Here we have to consider several cases.
-
1.
If \(0< \alpha < 2\), then there exists \(C_6\) such that for all \(t \ge t_1\)
$$\begin{aligned}{} & {} \Phi _{\lambda (t)} (x(t)) - \Phi ^* \ \le \ \frac{C_6}{t^{\frac{\alpha }{2} + 1}},\\{} & {} \Phi \left( \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi }(x(t)) \right) - \Phi ^* \ \le \ \frac{C_6}{t^{\frac{\alpha }{2} + 1}} \end{aligned}$$and
$$\begin{aligned} \Vert \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi } (x(t)) - x(t) \Vert ^2 \ \le \ \frac{2C_6}{t^{\frac{\alpha }{2} - l + 1}}. \end{aligned}$$ -
2.
If \(\alpha \ge 2\), then there exists \(C_6\) such that for all \(t \ge t_1\)
$$\begin{aligned}{} & {} \Phi _{\lambda (t)} (x(t)) - \Phi ^* \ \le \ \frac{C_6}{t^2},\\{} & {} \Phi \left( \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi }(x(t)) \right) - \Phi ^* \ \le \ \frac{C_6}{t^2} \end{aligned}$$and
$$\begin{aligned} \Vert \mathop {\textrm{prox}}\limits \nolimits _{\lambda (t) \Phi } (x(t)) - x(t) \Vert ^2 \ \le \ \frac{2C_6}{t^{2-l}}. \end{aligned}$$
\(\square \)
Remark 2
It is plausible that the weak convergence of the trajectories to a minimizer of the objective function can be shown in the case \(d=2\).
5 Numerical examples
5.1 The rates of convergence of the Moreau envelope values
Let us consider the following objective function \(\Phi : \mathbb {R} \rightarrow \mathbb {R}\), \(\Phi (x) = |x| + \frac{x^2}{2}\) and plot the values of its Moreau envelope for different polynomial functions \(\lambda \) and \(\varepsilon \) in order to illustrate the theoretical results with some numerical examples. We set \(\lambda (t) = t^l\) and \(\varepsilon (t) = \frac{1}{t^d}\) with \(x(t_0) = x_0 = 10\), \(\dot{x}(t_0) = 0\), \(\alpha = 10\), \(\beta = 1\) and \(t_0 = 1\).
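The experiments below can be reproduced with a few lines of code. The following is a minimal sketch, not the implementation used for the actual figures: it applies an explicit Euler scheme to the equivalent first-order system derived in Sect. 2.2 (\(\dot{x}(t) + \beta \nabla \Phi _{\lambda (t)}(x(t)) + z(t) = 0\) together with the stated equation for \(\dot{z}\)), and for \(\Phi (x) = |x| + \frac{x^2}{2}\) the proximal operator is soft-thresholding followed by scaling with \(\frac{1}{1 + \lambda }\). The step size and horizon are ad hoc choices for illustration.

```python
import math

def prox(x, lam):
    # prox of Phi(x) = |x| + x^2/2: soft-thresholding, then scaling by 1/(1+lam)
    return math.copysign(max(abs(x) - lam, 0.0), x) / (1.0 + lam)

def grad_env(x, lam):
    # gradient of the Moreau envelope: (x - prox_{lam*Phi}(x)) / lam
    return (x - prox(x, lam)) / lam

def trajectory(alpha=10.0, beta=1.0, l=1.0, d=1.9,
               x0=10.0, t0=1.0, horizon=200.0, h=1e-3):
    """Explicit Euler on the equivalent first-order system
        x' = -beta * grad_env(x, lam(t)) - z,
        z' = alpha * sqrt(eps(t)) * x' + grad_env(x, lam(t)) + eps(t) * x,
    with lam(t) = t^l, eps(t) = t^(-d), x(t0) = x0 and x'(t0) = 0."""
    t, x = t0, x0
    z = -(0.0 + beta * grad_env(x0, t0 ** l))   # encodes x'(t0) = 0
    while t < horizon:
        lam, eps = t ** l, t ** (-d)
        g = grad_env(x, lam)
        xdot = -beta * g - z
        z += h * (alpha * math.sqrt(eps) * xdot + g + eps * x)
        x += h * xdot
        t += h
    return x
```

Starting from \(x_0 = 10\), the discretized trajectory decays towards 0, the unique (hence minimal norm) minimizer of \(\Phi \).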
Consider different Moreau envelope parameters \(\lambda \) with \(d = 1.9\) (Fig. 1):
We notice that a faster growing function \(\lambda \) implies faster convergence of the Moreau envelope of the objective function \(\Phi \).
Increasing the speed of decay of the Tikhonov function \(\varepsilon \) for a fixed \(l = 1\) accelerates the convergence of the Moreau envelope values, which was predicted by the theory (Fig. 2):
5.2 Strong convergence of the trajectories
For a different objective function, let us investigate the strong convergence of the trajectories of (1) and show some examples where the trajectories actually diverge because one of the key assumptions of the analysis is not fulfilled. We define
The set \(\mathop {\textrm{argmin}}\limits \Phi \) is the segment \([-1, 1]\) and clearly 0 is its element of the minimal norm. Let us investigate the influence of the Tikhonov term on the behaviour of the trajectories of the system for \(\lambda (t) = t\) (Fig. 3).
As we can see, when the Tikhonov function is missing the trajectories converge to the minimizer 1 of \(\Phi \); the Tikhonov term, however, ensures convergence to the minimal norm solution 0.
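This behaviour can be reproduced with the same discretization idea as above. Since \(\Phi \) is specified here through its argmin, the sketch below assumes the stand-in objective \(\Phi (x) = \max \{ |x| - 1, 0 \}\), which also has \(\mathop {\textrm{argmin}}\limits \Phi = [-1, 1]\) with minimal norm element 0 and admits a closed-form proximal operator; the parameters \(\lambda (t) = t\), \(\varepsilon (t) = t^{-3/2}\) and \(\alpha = 4\) are illustrative choices within the admissible range \(1 \le d < 2\), not the ones behind the figures.

```python
import math

def prox_flat(x, lam):
    """prox of Phi(x) = max(|x| - 1, 0): identity on [-1, 1], clipping to the
    interval for 1 < |x| <= 1 + lam, and a shift by lam further out."""
    return math.copysign(max(min(abs(x), 1.0), abs(x) - lam), x)

def grad_env(x, lam):
    # gradient of the Moreau envelope of Phi
    return (x - prox_flat(x, lam)) / lam

def trajectory(alpha=4.0, beta=1.0, d=1.5, x0=3.0, t0=1.0,
               horizon=3000.0, h=5e-3):
    """Explicit Euler on the first-order reformulation of Sect. 2.2 with
    lam(t) = t and eps(t) = t^(-d)."""
    t, x = t0, x0
    z = -beta * grad_env(x0, t0)        # encodes x'(t0) = 0
    while t < horizon:
        eps = t ** (-d)
        g = grad_env(x, t)
        xdot = -beta * g - z
        z += h * (alpha * math.sqrt(eps) * xdot + g + eps * x)
        x += h * xdot
        t += h
    return x
```

Starting from \(x_0 = 3\), the trajectory passes the nearest minimizer 1 and settles near the minimal norm solution 0; removing the Tikhonov term (\(\varepsilon \equiv 0\)), which in (1) also switches off the viscous damping \(\alpha \sqrt{\varepsilon (t)}\), removes this drift towards 0, in line with the behaviour reported in Fig. 3.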
Finally, for the same choice of \(\lambda \) and \(\Phi \) let us take different Tikhonov functions to study their effect on the trajectories of (1). For this purpose we increase the starting point to \(x(t_0) = 100\) (Fig. 4).
As we can see, the faster \(\varepsilon \) decays, the slower the trajectories converge, which fully corresponds to the theoretical results.
To end this section let us break some of the fundamental conditions of our analysis and show that there is no convergence of the trajectories in this case (Fig. 5).
Availability of data and materials
No datasets were generated or analysed in this manuscript, owing to the purely theoretical nature of this research.
References
Attouch, H., Cabot, A.: Convergence of damped inertial dynamics governed by regularized maximally monotone operators. J. Differ. Equ. 264, 7138–7182 (2018)
Attouch, H., Balhag, A., Chbani, Z., Riahi, H.: Damped inertial dynamics with vanishing Tikhonov regularization: strong asymptotic convergence towards the minimum norm solution. J. Differ. Equ. 311, 29–58 (2022)
Attouch, H., Balhag, A., Chbani, Z., Riahi, H.: Accelerated gradient methods combining Tikhonov regularization with geometric damping driven by the Hessian. Appl. Math. Optim. 88, 29 (2023)
Attouch, H., Chbani, Z., Riahi, H.: Combining fast inertial dynamics for convex optimization with Tikhonov regularization. J. Math. Anal. Appl. 457, 1065–1094 (2018)
Attouch, H., László, S.C.: Continuous Newton-like inertial dynamics for monotone inclusions. Set-valued Var. Anal. 29, 555–581 (2021)
Attouch, H., László, S.C.: Convex optimization via inertial algorithms with vanishing Tikhonov regularization: fast convergence to the minimum norm solution. arXiv:2104.11987 (2021)
Attouch, H., Peypouquet, J.: Convergence of the inertial dynamics and proximal algorithms governed by maximally monotone operators. Math. Program. 174, 391–432 (2019)
Attouch, H., Peypouquet, J., Redont, P.: Fast convex optimization via inertial dynamics with Hessian driven damping. J. Differ. Equ. 261(10), 5734–5783 (2016)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, Berlin (2016)
Boţ, R.I., Csetnek, E.R., László, S.C.: Tikhonov regularization of a second order dynamical system with Hessian driven damping. Math. Program. 189, 151–186 (2021)
Boţ, R.I., Csetnek, E.R., László, S.C.: On the strong convergence of continuous Newton-like inertial dynamics with Tikhonov regularization for monotone inclusions. J. Math. Anal. Appl. 530(2), 127689 (2023)
Boţ, R.I., Karapetyants, M.A.: A fast continuous time approach with time scaling for nonsmooth convex optimization. Advances in Continuous and Discrete Models 2022, 73 (2022)
Csetnek, E.R., Karapetyants, M.A.: A fast continuous time approach for non-smooth convex optimization with time scaling and Tikhonov regularization, preprint (2022)
László, S.C.: On the strong convergence of the trajectories of a Tikhonov regularized second order dynamical system with asymptotically vanishing damping. J. Differ. Equ. 362, 355–381 (2022)
Acknowledgements
The author is immensely grateful to Professor R.I. Boţ (University of Vienna) and to three anonymous reviewers for valuable comments and fruitful discussions, which significantly improved the quality of this manuscript.
Funding
Open access funding provided by Austrian Science Fund (FWF).
Ethics declarations
Conflict of interest
Not applicable.
Research supported by the Doctoral Programme Vienna Graduate School on Computational Optimization (VGSCO) which is funded by FWF (Austrian Science Fund), project W 1260.
Appendices
Appendix
Proof of Lemma 2
Proof
By the definition of \(\varphi _{\varepsilon (t), \lambda (t)}\)
by item 1 of Lemma 1. Thus,
where the second equality comes from item 2 of Lemma 1. Combining the last two equalities with
we obtain
which is the first claim.
To obtain the second claim we start with (6) noticing that for \(h > 0\)
Consider
Taking the inner product of each part of this equality with \(x_{\varepsilon (t+h), \lambda (t+h)} - x_{\varepsilon (t), \lambda (t)}\), we notice that
by the monotonicity of \(\nabla \Phi _{\lambda (t+h)}\). So,
Let us divide the last inequality by \(h^2\) to obtain
Now notice that the mapping \(t \mapsto x_{\varepsilon (t), \lambda (t)}\) is Lipschitz continuous on compact intervals of \(\mathbb {R}_+ \setminus \{0\}\) (according to [1]) and therefore almost everywhere differentiable. Letting h tend to zero we deduce for almost every \(t \ge t_0\)
where we used the following estimate from [12]
On the other hand, the Cauchy-Schwarz inequality yields
Combining the last two inequalities we arrive at
Replacing \(\nabla \Phi _{\lambda (t)} (x_{\varepsilon (t), \lambda (t)})\) using (6) gives us the second claim. \(\square \)
Proof of Lemma 3
Proof
Suppose that \(t \ge t_0\). By the monotonicity of \(\nabla \Phi _{\lambda (t)}\) we deduce
By (6) we obtain
Using the Cauchy-Schwarz inequality we derive
This proves the first claim. For the second one consider (6) again and note that it is equivalent to
by item 2 of Lemma 1. Note that by (7) we have \(\lambda (t) + \frac{1}{\varepsilon (t)} \rightarrow +\infty \) and \(\lambda (t) \varepsilon (t) + 1 \rightarrow 1\), as \(t \rightarrow +\infty \). From now on the proof is inspired by Theorem 23.44 of [9]. Take \(z \in \mathop {\textrm{argmin}}\limits \Phi = \mathop {\textrm{argmin}}\limits \Phi _\lambda \) for each \(\lambda > 0\). From (19), the fact that the resolvent of a maximally monotone operator is maximally monotone and firmly nonexpansive (see, for instance, Corollary 23.11(i) of [9]) and the Cauchy-Schwarz inequality it follows that for all \(t \ge t_0\) (note that z can be represented as \(z = \mathop {\textrm{prox}}\limits \nolimits _{\frac{1}{\varepsilon (t)}\Phi _{\lambda (t)}} (z)\))
which gives the boundedness of \(x_{\varepsilon (t), \lambda (t)}\) for all \(t \ge t_0\). Now, let y be a weak sequential cluster point of \(\{ x_{\varepsilon (t_n), \lambda (t_n)} \}_{n \in \mathbb {N}}\), namely, \(x_{\varepsilon (t_{k_n}), \lambda (t_{k_n})} \rightharpoonup y\), as \(n \rightarrow +\infty \). From (6) we deduce
Using
we further obtain
which is equivalent to
or
The sequence
lies in \(\mathop {\textrm{gra}}\limits \partial \Phi \) by (21) and converges to (y, 0) in \(H^{weak} \times H^{strong}\), since the sequence \(\{ x_{\varepsilon (t_{k_n}), \lambda (t_{k_n})} \}_{n \in \mathbb {N}}\) is bounded and (7) holds. Therefore, since \(\mathop {\textrm{gra}}\limits \partial \Phi \) is sequentially closed (see Proposition 20.38(ii) of [9]), it follows that \(y \in \mathop {\textrm{argmin}}\limits \Phi \). From (20) we derive
by the definition of weak convergence, thus, \(x_{\varepsilon (t_{k_n}), \lambda (t_{k_n})} \rightarrow y\), as \(n \rightarrow +\infty \). On the other hand, (20) leads to
and thus \(y = x^*\) by the characterization of \(x^*\), namely,
So, \(x^*\) being the only weak sequential cluster point of the bounded sequence \(\left\{ x_{\varepsilon (t_n), \lambda (t_n)} \right\} _{n \in \mathbb {N}}\) means that \(x_{\varepsilon (t_n), \lambda (t_n)} \rightharpoonup x^*\), as \(n \rightarrow +\infty \), by Lemma 2.46 of [9]. By (20) again we deduce
and so the second claim follows. \(\square \)
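The firm nonexpansiveness of the resolvent invoked above is easy to probe numerically. A minimal sketch, using the illustrative choice \(\Phi (u) = |u| + \frac{u^2}{2}\) (for which the prox has the closed form below, derived for this particular \(\Phi \)); the inequality checked is \(\Vert \mathop {\textrm{prox}}(x) - \mathop {\textrm{prox}}(y) \Vert ^2 \le \langle \mathop {\textrm{prox}}(x) - \mathop {\textrm{prox}}(y), x - y \rangle \):

```python
import numpy as np

rng = np.random.default_rng(0)

def prox(x, lam):
    # closed-form prox of Phi(u) = |u| + u^2/2 (illustrative choice)
    return np.sign(x) * max(abs(x) - lam, 0.0) / (1.0 + lam)

lam = 0.7
for _ in range(1000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    px, py = prox(x, lam), prox(y, lam)
    # firm nonexpansiveness: |px - py|^2 <= (px - py) * (x - y)
    assert (px - py) ** 2 <= (px - py) * (x - y) + 1e-12
```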
Proof of Theorem 7
Proof
Let us notice that the proof can be simplified somewhat due to
The proof of the theorem will be divided into several steps.
The asymptotic behaviour of the function \(\Gamma \)
Let us start with the function \(\Gamma (t) \ = \ \exp \left( \int _{t_1}^t \mu (s) ds \right) \) for \(\mu (t) = -\frac{\dot{\varepsilon }(t)}{2 \varepsilon (t)} + \left( \alpha - \gamma \right) \sqrt{\varepsilon (t)}\).
Since \(\varepsilon \) is nonincreasing and \(\gamma < \alpha \), we have \(\mu (t) \ge 0\) for all \(t \ge t_1 \ge t_0\), so the integral is nonnegative and the whole exponential is bounded below by 1. Using the property of the Tikhonov function, namely, \(\lim _{t \rightarrow +\infty } \varepsilon (t) = 0\), we deduce that
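For the polynomial choice \(\varepsilon (t) = \frac{1}{t^d}\) used later, this can be made explicit. A short sketch of the computation (assuming \(0< d < 2\), \(\gamma < \alpha \) and \(t_1 = 1\) for brevity):

```latex
\mu(t) = \frac{d}{2t} + (\alpha - \gamma)\, t^{-d/2},
\qquad
\int_1^t \mu(s)\, ds
  = \frac{d}{2}\,\ln t
  + \frac{2(\alpha - \gamma)}{2 - d}\left(t^{1 - \frac{d}{2}} - 1\right),
```

so that \(\Gamma (t) = t^{d/2} \exp \left( \frac{2(\alpha - \gamma )}{2 - d} \left( t^{1 - \frac{d}{2}} - 1 \right) \right) \rightarrow +\infty \), as \(t \rightarrow +\infty \).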
The asymptotic behaviour of the function E
For now it is enough to assume the following, but later we will have to strengthen these assumptions:
Let us recall the form of the energy function
where \(\frac{\alpha }{2} \le \gamma < \alpha \). Let us study the behaviour of the function
as \(t \rightarrow +\infty \). Since \(\tilde{g}(t) \Gamma (t) \ \ge \ 0\) for all \(t \ge t_1\), so is the integral \(\int _{t_1}^t \tilde{g}(s) \Gamma (s) ds\). If this integral remains bounded, that is, \(\int _{t_1}^{+\infty } \tilde{g}(s) \Gamma (s) ds \ < \ +\infty \), then E(t) goes to zero as \(t \rightarrow +\infty \) due to the properties of h and Theorem 6. Otherwise, we may apply L’Hospital’s rule to obtain
provided the latter limit exists, which we now show. Consider
since \(- \frac{{\dot{\varepsilon }}(t)}{\varepsilon ^\frac{3}{2}(t)} \ge 0\). Notice that
So, by (22) we deduce
Consider now
Again, by (11) we know that
So, again using (22) we deduce that \(\lim _{t \rightarrow +\infty } E(t) \ = \ 0\).
The asymptotic behaviour of the function \(\frac{E}{\varepsilon }\)
For this part we assume the full set of conditions (17). In the same spirit let us analyse the asymptotic behaviour of \(\frac{E(t)}{\varepsilon (t)}\) as \(t \rightarrow +\infty \). From Theorem 6 we know that
By (17) we immediately deduce that \(\lim _{t \rightarrow +\infty } \frac{E(t_1)}{\varepsilon (t) \Gamma (t)} = 0\). For the first term let us use the same technique as in the previous step and apply L’Hospital’s rule to obtain
Thus, we have established that \(\lim _{t \rightarrow +\infty } \frac{E(t)}{\varepsilon (t)} \ = \ 0\).
The asymptotic behaviour of the function \(\lambda E\)
In this section we need to assume the full set of conditions (17) again. We will study the behaviour of \(\lambda (t) E(t)\), as \(t \rightarrow +\infty \). Again, from Theorem 6 we know that
We immediately obtain that \(\frac{E(t_1) \lambda (t)}{\Gamma (t)} \rightarrow 0\) as \(t \rightarrow +\infty \), since
Arguing in the same way we deduce for the first term
Consider
since
and
Thus, we have established that \(\lim _{t \rightarrow +\infty } \lambda (t) E(t) \ = \ 0\). \(\square \)
Polynomial choice of parameters satisfies the key assumptions (7) and (11)–(14)
The set of assumptions for \(\lambda (t) = t^l\) and \(\varepsilon (t) = \frac{1}{t^d}\), l, \(d > 0\) becomes
-
(i)
\( \lim _{t \rightarrow +\infty } t^{l-d} \ = \ 0 \);
and there exist \(\frac{\alpha }{2} \le \gamma < \alpha \), \(a > 0\) and \(c > 0\) such that for all t large enough
-
(ii)
\( \frac{d}{2} t^{\frac{d}{2}-1} \ \le \ \min \left\{ 2 \gamma - \alpha + \frac{\gamma \beta d}{2t}, \ \alpha - \gamma \frac{a + 1}{a} \right\} \);
-
(iii)
\( \left( 2 \gamma (\alpha - \gamma ) + \frac{\gamma }{c} - 1 \right) \frac{1}{t^d} + \frac{d \beta }{t^{d+1}} \ \le \ 0 \);
-
(iv)
\( \frac{2 \beta }{t^{2d}} - d \left( 2 - \frac{\gamma \beta }{t^\frac{d}{2}} \right) \frac{1}{t^{d+1}} \ \le \ 0 \) and
-
(v)
\( \left( \frac{\gamma }{a} + 2 (\alpha - \gamma ) \right) \beta ^2 \frac{1}{t^\frac{d}{2}} + \frac{3d \beta ^2}{2 t} - l t^{l-1} \ \le \ \beta \).
The conditions above are, in turn, equivalent to
-
(i)
\( l < d \);
-
(ii)
\(d \le 2\);
-
(iii)
\( 2 \gamma (\alpha - \gamma ) \ < \ 1 \);
-
(iv)
\(d \ge 1\) and
-
(v)
is always satisfied for t large enough.
Finally, we deduce for l and d
which is the desired setting satisfying all the conditions and guaranteeing that the main results of this paper hold.
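These reductions can be sanity-checked numerically. The sketch below uses an illustrative parameter choice (not taken from the paper): \(\alpha = 10\), \(\beta = 1\), \(l = 1\), \(d = 1.5\), with \(\gamma \), a and c picked so that \(\frac{\alpha }{2} \le \gamma < \alpha \) and \(2 \gamma (\alpha - \gamma ) < 1\) hold, and verifies the original conditions (ii)-(v) at large t.

```python
# illustrative parameters: gamma close to alpha makes 2*gamma*(alpha-gamma)
# small; a large and c large keep conditions (ii) and (iii) satisfiable
alpha, beta, l, d = 10.0, 1.0, 1.0, 1.5
gamma, a, c = 9.99, 2000.0, 100.0

assert alpha / 2 <= gamma < alpha
assert 2 * gamma * (alpha - gamma) < 1               # reduced condition (iii)

for t in [1e9, 1e10, 1e12]:
    # (ii)
    lhs = (d / 2) * t ** (d / 2 - 1)
    rhs = min(2 * gamma - alpha + gamma * beta * d / (2 * t),
              alpha - gamma * (a + 1) / a)
    assert lhs <= rhs
    # (iii)
    assert (2 * gamma * (alpha - gamma) + gamma / c - 1) / t**d \
           + d * beta / t ** (d + 1) <= 0
    # (iv)
    assert 2 * beta / t ** (2 * d) \
           - d * (2 - gamma * beta / t ** (d / 2)) / t ** (d + 1) <= 0
    # (v)
    assert (gamma / a + 2 * (alpha - gamma)) * beta**2 / t ** (d / 2) \
           + 3 * d * beta**2 / (2 * t) - l * t ** (l - 1) <= beta
```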
Remark 3
Condition (iii) does not conflict with the choice of \(\gamma \); namely, \(\gamma \) can be chosen to satisfy both requirements at the same time:
Indeed, (iii) implies
If \(\alpha < \sqrt{2}\), then \(\gamma ^2 - \alpha \gamma + \frac{1}{2}\) is always positive and we are free to choose \(\gamma \) such that \(\frac{\alpha }{2} \le \gamma < \alpha \). Otherwise, \(\alpha \ge \sqrt{2}\) means that
and thus we take
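This choice of \(\gamma \) can be verified numerically; in the sketch below (an illustration, with \(\gamma \) sampled as the midpoint of the admissible interval, which is our own choice) we check that it satisfies both \(\frac{\alpha }{2} \le \gamma < \alpha \) and \(2 \gamma (\alpha - \gamma ) < 1\).

```python
import math

for alpha in [1.2, 2.0, 5.0, 50.0]:
    if alpha < math.sqrt(2):
        gamma = 0.75 * alpha          # any gamma in [alpha/2, alpha) works
    else:
        # larger root of g(x) = x^2 - alpha*x + 1/2; take gamma above it
        root = (alpha + math.sqrt(alpha**2 - 2)) / 2
        gamma = (root + alpha) / 2    # midpoint of (root, alpha)
    assert alpha / 2 <= gamma < alpha
    assert 2 * gamma * (alpha - gamma) < 1
```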
Cite this article
Karapetyants, M.A. A fast continuous time approach for non-smooth convex optimization using Tikhonov regularization technique. Comput Optim Appl 87, 531–569 (2024). https://doi.org/10.1007/s10589-023-00536-6
Keywords
- Nonsmooth convex optimization
- Damped inertial dynamics
- Hessian-driven damping
- Moreau envelope
- Proximal operator
- Tikhonov regularization
- Strong convergence