Iterative regularization in classification via hinge loss diagonal descent

Vassilis Apidopoulos (1, 2), Tomaso Poggio (3), Lorenzo Rosasco (1, 3, 4), Silvia Villa (5)

1 MaLGa, DIBRIS, Università di Genova, Genoa, Italy. E-mail: [email protected] (Vassilis Apidopoulos)
2 Archimedes, Athena RC, Athens, Greece.
3 CBMM, MIT, MA, USA. E-mail: [email protected] (Tomaso Poggio), [email protected] (Lorenzo Rosasco)
4 Istituto Italiano di Tecnologia, Genoa, Italy.
5 MaLGa, DIMA, Università di Genova, Genoa, Italy. E-mail: [email protected] (Silvia Villa)
Abstract

Iterative regularization is a classic idea in regularization theory that has recently become popular in machine learning. On the one hand, it allows one to design efficient algorithms that control numerical and statistical accuracy at the same time. On the other hand, it sheds light on the learning curves observed while training neural networks. In this paper, we focus on iterative regularization in the context of classification. After contrasting this setting with that of linear inverse problems, we develop an iterative regularization approach based on the use of the hinge loss function. More precisely, we consider a diagonal approach for a family of algorithms, for which we prove convergence, rates of convergence, and stability results for a suitable classification noise model. Our approach compares favorably with other alternatives, as confirmed by numerical simulations.

1 Introduction

Estimating a quantity of interest from finite measurements is a central problem in a number of fields, including inverse problems, machine learning, statistics and signal processing. In this context, a key idea is that reliable estimation requires imposing some prior assumptions on the problem at hand. Regularization theory for inverse problems provides a principled framework to formalize this idea [31]. The quantity of interest is typically seen as a function, or a vector, and prior assumptions take the form of suitable functionals, called regularizers. Following this idea, Tikhonov regularization provides a classic approach to estimate solutions [91, 92]. The latter are found by minimizing an empirical objective in which a data fit term is penalized by adding a chosen regularizer. Other regularization approaches are classic in inverse problems, and in particular iterative regularization has become popular in machine learning, see e.g. [95, 78, 88]. This approach is based on the observation that iterative optimization procedures have a self-regularizing property, so that a chosen regularization can be enforced implicitly along the iterations. Such an observation seems to shed light on some theoretical properties of deep learning approaches in machine learning [41, 69, 37, 38, 77]. More generally, iterative regularization provides an approach to design algorithms striking a balance between statistical accuracy and computational efficiency [15, 98, 79, 61].

In this paper, we focus on iterative regularization in the context of classification, perhaps the most classical among machine learning problems [27, 89, 85]. After discussing the differences and similarities between classification and classical linear inverse problems, we recall how different iterative regularization schemes for classification can be defined depending on the considered loss function. Then, we focus on the hinge loss used in support vector machines [89]. In this case, unlike for other loss functions such as the exponential and logistic loss [66, 87, 46], a simple gradient approach does not allow one to establish iterative regularization properties. Instead, we propose a diagonal approach in the same spirit as [34] and its inertial version [21], and prove their regularization properties, including convergence and stability. The proposed approach compares favorably with analogous results for the logistic loss [66, 87, 46, 44], and also with recent approaches considering the hinge loss [63]. Indeed, we show that fast convergence rates are possible, largely improving previous results. Further, we prove the first stability results under a suitable classification noise model, which is inspired by the deterministic noise classically considered in linear inverse problems. We note that in a noiseless setting, our approach can also be seen as a way to solve the basic separable support vector machines problem introduced in the seminal work [27]. In this view, relevant studies, among the rich literature, can be found for example in [27, 33, 28, 23, 86]. Our theoretical results are illustrated via numerical simulations, where we also investigate empirically the stability properties of the proposed methods.

The rest of the paper is organized as follows. In Section 2, we briefly discuss explicit and implicit regularization approaches for regression problems. In Section 3, we review these approaches in the context of classification problems. In Section 4, we describe the main proposed schemes, based on a diagonal iterative regularization procedure for the hinge loss. In Section 5, we present the main results and provide the corresponding convergence and stability analysis. Finally, in Section 6, we illustrate the performance of the proposed algorithms on some simple numerical examples. Appendix A contains some general facts and all the technical proofs and lemmas.

1.1 Notation

We first introduce some notation and recall a few basic notions that will be needed throughout the paper. The interested reader can consult [10] regarding the main tools and their associated properties used in this work. Let $\langle\cdot,\cdot\rangle$ denote the standard Euclidean inner product and $\lVert\cdot\rVert$ the associated Euclidean norm. For a linear operator $Z:\mathbb{R}^{n}\to\mathbb{R}^{m}$, we denote by $\Im(Z)$ and $\ker(Z)$ its range and kernel, respectively. We denote by $\lVert Z\rVert_{op}$ and $\lVert Z\rVert_{F}$ its operator and Frobenius norms, respectively, by Id the identity operator, and by $\mathbf{1}$ the $n$-vector with entry $1$ in each coordinate.

Given a convex and closed set $C\subset\mathbb{R}^{n}$, the distance of a point $x\in\mathbb{R}^{n}$ from the set $C$ is $\text{dist}(x,C)=\inf_{y\in C}\lVert x-y\rVert$. The indicator function of $C$, $\iota_{C}(\cdot):\mathbb{R}^{n}\to\bar{\mathbb{R}}:=\mathbb{R}\cup\{+\infty\}$, is defined as $\iota_{C}(x)=\begin{cases}0 & \text{if } x\in C\\ +\infty & \text{if not.}\end{cases}$

For a proper, convex and lower semicontinuous function $f:\mathbb{R}^{n}\to\bar{\mathbb{R}}$, we define its subdifferential $\partial f:\mathbb{R}^{n}\to 2^{\mathbb{R}^{n}}$ as $\partial f(x)=\{u\in\mathbb{R}^{n}~:~f(y)\geq f(x)+\langle u,y-x\rangle,~\forall y\in\mathbb{R}^{n}\}$. For $\gamma>0$, the proximal operator of $f$ with step $\gamma$, $\operatorname{prox}_{\gamma f}:\mathbb{R}^{n}\to\mathbb{R}^{n}$, is defined by $\operatorname{prox}_{\gamma f}(x)=\operatorname*{arg\,min}_{y\in\mathbb{R}^{n}}\{f(y)+\frac{1}{2\gamma}\lVert x-y\rVert^{2}\}$, for all $x\in\mathbb{R}^{n}$. The projection operator onto a convex and closed set $C\subset\mathbb{R}^{n}$ is defined as $\operatorname{proj}_{C}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ such that $\operatorname{proj}_{C}(x)=\operatorname*{arg\,min}_{y\in C}\lVert y-x\rVert$. We denote the Fenchel conjugate of $f$ as $f^{\ast}:\mathbb{R}^{n}\to\bar{\mathbb{R}}$, such that $f^{\ast}(x)=\sup_{y\in\mathbb{R}^{n}}\{\langle x,y\rangle-f(y)\}$.
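To make the definitions of the proximal operator, the projection and the distance concrete, the following is a minimal numerical sketch. It relies on two standard facts that are not stated above and are used here as assumptions: the proximal operator of $f(y)=|y|$ is soft-thresholding, and the projection onto a box is coordinate-wise clipping. All function names are hypothetical and chosen for illustration only.

```python
import numpy as np

def prox_abs(x, gamma):
    """Closed-form prox of f(y) = |y| with step gamma (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_by_grid(f, x, gamma, grid):
    """Brute-force prox: minimize f(y) + (1 / (2 * gamma)) * (x - y)**2 over a grid."""
    values = f(grid) + (x - grid) ** 2 / (2.0 * gamma)
    return grid[np.argmin(values)]

def proj_box(x, lo, hi):
    """Projection onto the closed convex set C = [lo, hi]^n (coordinate-wise clipping)."""
    return np.clip(x, lo, hi)

# Scalar check of the prox definition against the closed form.
grid = np.linspace(-5.0, 5.0, 100001)          # grid spacing 1e-4
x, gamma = 1.7, 0.5
p_closed = prox_abs(x, gamma)
p_grid = prox_by_grid(np.abs, x, gamma, grid)

# Projection and distance for C = [0, 1]^3: dist(x, C) = ||x - proj_C(x)||.
x_vec = np.array([2.0, -0.5, 0.3])
p_vec = proj_box(x_vec, 0.0, 1.0)
dist_to_C = np.linalg.norm(x_vec - p_vec)
```

The grid search only approximates the minimizer up to the grid spacing, so the two values of the proximal point are expected to agree to about three decimal places, not exactly.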

2 Background: explicit and implicit regularization in regression

The classical regression problem in supervised learning and statistics corresponds to estimating a function of interest $f$, given a finite number of (possibly noisy) evaluations at a number of input points. The problem takes a simple and familiar form if $f$ is assumed to be linear. Indeed, in this case it corresponds to estimating $w\in\mathbb{R}^{d}$, given a set of equations,

$$\langle w,x_{i}\rangle=y_{i},\qquad i=1,\dots,n, \tag{2.1}$$

where $x_{i}\in\mathbb{R}^{d}$ and $y_{i}\in\mathbb{R}$.

The above problem can be restated as the linear inverse problem

$$Xw=y, \tag{2.2}$$

where $X:\mathbb{R}^{d}\to\mathbb{R}^{n}$ is the matrix whose rows are the input points, and $y\in\mathbb{R}^{n}$ is the vector whose entries are the outputs. Moreover, the simple linear case can be generalized to the case where $f$ belongs to a reproducing kernel Hilbert space (RKHS).

Remark 2.1 (Regression in RKHS [89]).

Recall that a Hilbert space $\mathcal{H}$ of real valued functions on a set $\mathcal{X}$ is called an RKHS if for all $x\in\mathcal{X}$ there exists $C_{x}\in\mathbb{R}$ such that $|f(x)|\leq C_{x}\lVert f\rVert_{\mathcal{H}}$ for all $f\in\mathcal{H}$. From the above definition and the Riesz lemma it follows immediately that, for every $x\in\mathcal{X}$, there exists $k_{x}\in\mathcal{H}$ such that $f(x)=\langle f,k_{x}\rangle_{\mathcal{H}}$ for all $f\in\mathcal{H}$. Then we can write the regression problem as the problem of finding $f\in\mathcal{H}$, given a set of equations,

$$\langle f,k_{x_{i}}\rangle_{\mathcal{H}}=y_{i},\qquad i=1,\dots,n,$$

where $x_{i}\in\mathcal{X}$ and $y_{i}\in\mathbb{R}$.
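As a hedged illustration of the remark, the sketch below solves the interpolation equations $\langle f,k_{x_{i}}\rangle_{\mathcal{H}}=y_{i}$ for a Gaussian kernel. The choice of kernel, its width `sigma`, the toy data and all function names are assumptions made for illustration. The sketch also uses the standard fact that an interpolant can be sought in the span of $k_{x_{1}},\dots,k_{x_{n}}$, so that the equations reduce to a linear system in the kernel matrix.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); an assumed choice of kernel."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

# Toy one-dimensional inputs and real valued outputs (hypothetical values).
x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 0.5, -1.0])

# K[i, j] = k(x_i, x_j) = <k_{x_i}, k_{x_j}>_H by the reproducing property.
K = gaussian_kernel(x_train, x_train)

# Look for f = sum_j c_j k_{x_j}; the equations <f, k_{x_i}>_H = y_i become K c = y.
c = np.linalg.solve(K, y_train)

def f(x_new):
    """Evaluate f(x) = <f, k_x>_H = sum_j c_j k(x_j, x)."""
    return gaussian_kernel(np.atleast_2d(x_new), x_train) @ c

fitted = f(x_train)   # should reproduce y_train up to rounding
```

In this sketch the kernel matrix plays the role that $XX^{\top}$ plays in the linear case.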

Keeping the above remark in mind, in the following we primarily focus on the linear case to ease the notation.

A key observation is that problem (2.2) might admit no solution, or multiple ones. The latter situation is the most common in machine learning, where high dimensional (overparameterized), or even infinite dimensional (like RKHS) models are often considered. In the linear setting, this corresponds to the case where $n<d$ in (2.2) (assuming the inputs to be linearly independent). In this case, problem (2.2) admits infinitely many solutions and a classic way to select one is to consider the minimum norm solution

$$w^{\dagger}=\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}\big\{\lVert w\rVert~:~\langle w,x_{i}\rangle=y_{i},\quad i=1,\dots,n\big\}. \tag{2.3}$$

The minimum norm solution can be written in terms of the pseudoinverse of $X$ as $w^{\dagger}=X^{\dagger}y$ [31, Definition 2.2]. This makes it clear that instability in the solution might occur when $X$ is ill conditioned. The basic idea of regularization is to consider a family of estimates $\{w_{\nu}\}_{\nu>0}$ that approaches $w^{\dagger}$ with better stability properties. In this sense, the regularization property of such estimates is related to i) the convergence of $w_{\nu}$ to the minimal norm solution $w^{\dagger}$ and ii) its stability. Here, stability expresses how close the estimates $\{w_{\nu}\}_{\nu>0}$ and $\{\tilde{w}_{\nu}\}_{\nu>0}$, generated from the true output $y$ and from a noisy version $\tilde{y}$ respectively, are to each other, provided $y$ and $\tilde{y}$ are close enough (see e.g. [31, Section 3 & Definition 3.1] for a detailed definition). We next recall two basic approaches with this property.
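As a minimal numerical sketch of (2.3), the snippet below builds a random overparameterized problem, computes $w^{\dagger}=X^{\dagger}y$ with NumPy's `np.linalg.pinv`, and compares it with another interpolant obtained by adding an element of $\ker(X)$. The sizes, the seed and the Gaussian data are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                       # overparameterized regime: n < d
X = rng.standard_normal((n, d))    # rows are the input points x_i
y = rng.standard_normal(n)         # outputs y_i

X_pinv = np.linalg.pinv(X)
w_dagger = X_pinv @ y              # minimum norm solution of X w = y

# Any other interpolant differs from w_dagger by an element of ker(X).
v = rng.standard_normal(d)
v_ker = v - X_pinv @ (X @ v)       # component of v lying in ker(X)
w_other = w_dagger + v_ker

residual_dagger = np.linalg.norm(X @ w_dagger - y)
residual_other = np.linalg.norm(X @ w_other - y)
norm_dagger = np.linalg.norm(w_dagger)
norm_other = np.linalg.norm(w_other)
```

Both vectors solve $Xw=y$ up to rounding, and `w_dagger` has the smaller norm because it is orthogonal to $\ker(X)$.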

Tikhonov (explicit) regularization.

The most classic regularization approach is Tikhonov regularization

$$w_{\lambda}=\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}\lVert Xw-y\rVert^{2}+\lambda\lVert w\rVert^{2}, \tag{2.4}$$

where $\lambda>0$ is called the regularization parameter. The set of solutions corresponding to different values of $\lambda$ defines the regularization method, and a direct computation shows that $w_{\lambda}=(X^{\top}X+\lambda\,\text{Id})^{-1}X^{\top}y$. From this expression it is easy to see that $w_{\lambda}$ converges to $w^{\dagger}$ as $\lambda$ tends to zero (see e.g. [31, Theorem 5.2]). The above ideas can be generalized, as discussed in Remark 2.2.
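The convergence of the Tikhonov solutions to the minimum norm solution can be checked numerically with the closed form above. The sketch below is only illustrative: the random data, the sizes, the seed and the grid of regularization parameters are assumptions, and `tikhonov` is a hypothetical helper name.

```python
import numpy as np

def tikhonov(X, y, lam):
    """Closed-form Tikhonov solution (X^T X + lam * Id)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w_dagger = np.linalg.pinv(X) @ y           # minimum norm solution (2.3)

lambdas = [1.0, 1e-2, 1e-4, 1e-6]
errors = [np.linalg.norm(tikhonov(X, y, lam) - w_dagger) for lam in lambdas]
```

On this example the distance to $w^{\dagger}$ shrinks as $\lambda$ decreases. For very small $\lambda$ the linear system becomes ill conditioned, which is one practical reason not to take $\lambda$ arbitrarily close to zero.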

Remark 2.2 (Loss and regularizers).

The ideas in (2.3), (2.4) can be generalized by replacing the squared norm in (2.4) with other regularizers $R:\mathbb{R}^{d}\to[0,\infty)$, the $\ell_{1}$ norm being a popular example. Similarly, instead of the least squares error in (2.4), other error cost functions (loss functions) $\ell:\mathbb{R}\to[0,\infty)$ can be considered. In regression, loss functions depend on the difference $y-f(x)$; examples besides the square loss include the absolute value loss and the $\epsilon$-insensitive loss used in support vector machine regression [89]. For general losses, the corresponding solutions might not admit a closed form expression, but it is still possible to prove that the Tikhonov regularized solutions converge in some proper sense to the corresponding minimum norm solutions (see e.g. [30, 5]). For the sake of completeness, we provide a proof in a general setting in Lemma A.1 in Appendix A.

In machine learning, regularization à la Tikhonov is sometimes called explicit since a penalty is added to the data fit term. We next recall iterative regularization and discuss why it is called implicit in machine learning [40].

Iterative (implicit) regularization.

The simplest example of iterative regularization is the gradient descent iteration of the least squares error, that is

$$w_{t+1}=w_{t}-\gamma X^{\top}(Xw_{t}-y)$$

for some suitable initialization $w_{0}$ and step size $\gamma$. It is well known that the above iteration converges to the minimal norm solution (2.3), and further that the stability of the solution varies along the iterations [31, Section 6.1]. In this view, the stopping time becomes the regularization parameter defining a family of regularized solutions. This kind of regularization is called implicit in machine learning since there is no explicit penalty or constraint in the optimization model [70, 69], and stability relies on the self-regularizing properties of the optimization process. These ideas have recently received a lot of attention in machine learning, since they are useful to understand the theoretical properties of more complex non-linear systems like neural networks [40, 24]. Further, they have been advocated as a way to design resource efficient machine learning algorithms controlling at the same time numerical and statistical accuracy. In the remainder of the paper we discuss how these ideas extend beyond regression to the classification setting.
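The two properties just mentioned can be observed on a small example. The sketch below runs gradient descent on the least squares error from $w_{0}=0$, once with clean outputs and once with perturbed outputs. The random data, the noise level, the number of iterations and the step size $1/\lVert X\rVert_{op}^{2}$ are assumptions made for illustration, and `gradient_descent` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
y_noisy = y + 0.1 * rng.standard_normal(n)   # perturbed outputs (assumed noise level)
w_dagger = np.linalg.pinv(X) @ y             # minimum norm solution for the clean data

gamma = 1.0 / np.linalg.norm(X, 2) ** 2      # assumed step size: 1 / ||X||_op^2

def gradient_descent(X, y, gamma, num_iters):
    """Iterates w_1, ..., w_T of w_{t+1} = w_t - gamma X^T (X w_t - y), with w_0 = 0."""
    w = np.zeros(X.shape[1])
    iterates = []
    for _ in range(num_iters):
        w = w - gamma * X.T @ (X @ w - y)
        iterates.append(w.copy())
    return np.array(iterates)

W_clean = gradient_descent(X, y, gamma, 2000)
W_noisy = gradient_descent(X, y_noisy, gamma, 2000)

# Convergence: distance of the clean iterates to the minimum norm solution.
dist_to_min_norm = np.linalg.norm(W_clean - w_dagger, axis=1)
# Stability: distance between clean and noisy iterates at the same iteration.
gap_clean_vs_noisy = np.linalg.norm(W_clean - W_noisy, axis=1)
```

On this example `dist_to_min_norm` decays to zero, while `gap_clean_vs_noisy` grows with the iteration count. Early iterates are therefore less sensitive to the perturbation, which is the sense in which the stopping time acts as a regularization parameter.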

We first add some remarks. First, we note that optimization approaches other than gradient descent can be, and have been, considered, including accelerated [31, 68, 72, 52, 99] and stochastic methods [80, 64, 29], as well as mirror descent approaches [39, 47]. Second, we note that, compared to Tikhonov regularization, for iterative regularization the extension to other losses and regularizers is not straightforward. The case of regularizers that are norms in reflexive Banach spaces has been considered in [84, 20], whereas the case of strongly convex regularizers has been considered in [61, 37]. The case of convex regularizers has been recently considered in [62]. The case of smooth loss functions for regression is quite straightforward, considering the gradient iteration

$$w_{t+1}=w_{t}+\gamma\sum_{i=1}^{n}x_{i}\,\ell^{\prime}(y_{i}-\langle w_{t},x_{i}\rangle),\quad t=0,\dots,T, \tag{2.5}$$

where $\ell^{\prime}$ is the derivative of the loss. The case of convex but nonsmooth loss functions can also be considered using subgradient methods [54]. Note that, if the above iteration is initialized in the span of the input points, it remains in the span. This observation is at the basis of the proof that the above iteration converges to the minimal norm interpolant (2.3), see Lemma A.2 in Appendix A.
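The span argument can be written in one line. Let $c_{t}\in\mathbb{R}^{n}$ denote the vector whose $i$-th entry is the scalar multiplying $x_{i}$ in the update (2.5), including the step size and the sign (the symbol $c_{t}$ is introduced here only for illustration). Then:

```latex
w_{t+1} - w_t \;=\; \sum_{i=1}^{n} (c_t)_i\, x_i \;=\; X^{\top} c_t \;\in\; \Im(X^{\top}),
\qquad\text{hence}\qquad
w_0 \in \Im(X^{\top}) \;\Longrightarrow\; w_t \in \Im(X^{\top}) \quad \text{for all } t \geq 0 .
```

Since $\Im(X^{\top})=\ker(X)^{\perp}$ is closed and the minimal norm solution (2.3) is the only solution of $Xw=y$ lying in $\ker(X)^{\perp}$, any limit point of the iterates that interpolates the data must coincide with $w^{\dagger}$.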

With the above background, we next discuss the case of classification.

3 Implicit regularization: from regression to classification

In this section, we introduce the problem of classification and investigate how the ideas reviewed in the previous section for regression and linear inverse problems translate to this setting. In particular, we first discuss a notion of minimal norm solution for classification.

Similarly to regression, in classification the goal is to estimate a functional relationship, but the difference is that the outputs are binary valued, that is $y_{i}\in\{-1,1\}$. Estimating a binary valued function directly is computationally infeasible, and a classic approach relies on estimating a real valued function $f$ of which the sign is then taken, that is $\text{sign}(f(x))=1$ if $f(x)>0$, and $\text{sign}(f(x))=-1$ otherwise (here ties are broken arbitrarily). In this context, a natural quantity is the product $yf(x)$, called the margin of $f$ at $(x,y)$. If the margin is positive, $f$ classifies the input point correctly; if the margin is large, we can intuitively expect a confident prediction.

For linear functions we can formalize this idea (see e.g. [27, 89]) by considering the problem of finding $w\in\mathbb{R}^{d}$ satisfying the set of inequalities

$$y_{i}\langle w,x_{i}\rangle>0,\quad i=1,\dots,n. \tag{3.1}$$

If such a $w$ exists we say that the data $(x_{1},y_{1}),\dots,(x_{n},y_{n})$ are linearly separable. Deriving necessary and sufficient conditions for the above inequalities to be feasible is not an elementary problem. As shown below, it is easy to see that overparametrization ($n<d$), hence interpolation, is a sufficient condition for linear separability. Since this is the relevant setting for us, from now on, we assume that the data are linearly separable:

Assumption 1 (Linear separability).

There exists some $w\in\mathbb{R}^{d}$ that separates the data, i.e.

$$y_{i}\langle w,x_{i}\rangle>0\qquad\forall i=1,\dots,n. \tag{3.2}$$
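A small numerical sketch illustrates why interpolation implies separability. If $n<d$ and the inputs are linearly independent, the system $Xw=y$ is solvable for any labels $y_{i}\in\{-1,1\}$, and any solution has margins $y_{i}\langle w,x_{i}\rangle=y_{i}^{2}=1>0$. The random data, the sizes and the seed below are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50                              # overparameterized regime: n < d
X = rng.standard_normal((n, d))            # rows are the inputs (linearly independent a.s.)
y = rng.choice([-1.0, 1.0], size=n)        # arbitrary binary labels

w = np.linalg.pinv(X) @ y                  # one interpolating solution: <w, x_i> = y_i
margins = y * (X @ w)                      # y_i <w, x_i> = y_i^2 = 1
predictions = np.where(X @ w > 0, 1.0, -1.0)
```

Every margin equals one up to rounding, so this $w$ satisfies the inequalities of Assumption 1 regardless of how the labels were drawn.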

In general, also in this setting we can expect multiple solutions, and the classical way to approach the problem is to consider the so-called max margin solution. The linear margin $M:\mathbb{R}^{d}\to\mathbb{R}$ for a given dataset is defined as

$$M(w)=\min_{i=1,\dots,n}y_{i}\langle w,x_{i}\rangle, \tag{3.3}$$

and, correspondingly, the maximum (max) margin problem is defined as

$$w_{+}=\operatorname*{arg\,max}\{M(w)~:~\lVert w\rVert=1\}. \tag{MM}$$

We make a few observations. First, the intuition underlying the above problem is that, among all separating solutions, we are interested in one for which the margin is maximized. Second, we note that without the unit norm constraint the problem of maximizing the margin (3.3) is degenerate, since the margin of any separating solution can be made arbitrarily large by rescaling. Indeed, since $M(cw)=cM(w)$ for every $c>0$ while the induced classifier does not depend on the scale of $w$, max margin becomes a direction problem, which naturally leads to considering the constrained problem (MM) (see e.g. [27]). Third, it is possible to show that the max margin problem has an equivalent formulation that highlights the connection to minimal norm solutions. Indeed, consider the problem

$$w_{\ast}=\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}\big\{\lVert w\rVert~:~y_{i}\langle w,x_{i}\rangle\geq 1,~\forall i\leq n\big\}. \tag{MN}$$

Then the following result holds.

Lemma 3.1.

Problem (MM) is equivalent to Problem (MN). In particular, if $w_{\ast}$ is a solution of Problem (MN) then

$$w_{+}=\frac{w_{\ast}}{\lVert w_{\ast}\rVert}$$

is a solution of Problem (MM) and $M(w_{+})=\frac{1}{\lVert w_{\ast}\rVert}$. Moreover, if $w_{+}$ is the solution of Problem (MM) then

$$w_{\ast}=\frac{w_{+}}{M(w_{+})}$$

is a solution of Problem (MN). Further, it holds that $M(w_{\ast})=1$.

A few observations can be made. The above lemma is a classical result in the theory of SVM. Indeed, Problem (MN) is known as the hard margin support vector machine [27]. For the sake of completeness we provide its proof (see Lemma A.3 in Appendix A). In terms of the constrained min-norm problem (MN), Assumption 1 ensures that the feasible set is non-empty and hence that such a solution $w_{\ast}$ (and thus $w_{+}$) exists. In addition, since (MN) can be equivalently expressed as

$$w_{\ast}=\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}\left\{\frac{\lVert w\rVert^{2}}{2} ~:~ y_{i}\langle w,x_{i}\rangle\geq 1,~\forall i\leq n\right\}, \tag{3.4}$$

the solution $w_{\ast}$ (and thus $w_{+}$) is unique, thanks to the strong convexity of the squared norm in (3.4). Note that, from an optimization point of view, formulation (3.4) is more useful and will be used hereafter instead of (MN).

From a regularization perspective, the above result shows that the max margin problem can be seen as a minimum norm problem akin to Problem (2.3) in linear inverse problems, but with the linear equations replaced by inequalities. This can be made even more explicit by noting that, for binary valued outputs, it holds that

$$y_{i}=\langle w,x_{i}\rangle ~\Leftrightarrow~ y_{i}\langle w,x_{i}\rangle=1.$$

This last expression also clarifies the earlier observation that interpolation implies separation, and that $n<d$ with linearly independent inputs is a sufficient condition for linear separability.

We next discuss how the ideas of explicit and implicit regularization can be adapted to the classification context. Note that in the following, in analogy to the regression case, we will say that a family of solutions has the regularization property if it converges to the max margin (min norm) solution and is stable with respect to a noisy version of the true output $y$ (see the related discussion in paragraph 5.2).

Explicit regularization for classification.

To extend the Tikhonov regularization approach to classification, an appropriate loss function $\ell:\mathbb{R}\to[0,\infty)$ needs to be considered. As it turns out, in classification loss functions depend on the margin $yf(x)$, rather than on the difference $y-f(x)$ as in regression, and are indeed called margin loss functions. Some popular examples are the following:

  • hinge loss: $\ell(a)=\lvert 1-a\rvert_{+}=\max\{0,1-a\}$

  • exponential loss: $\ell(a)=e^{-a}$

  • logistic loss: $\ell(a)=\ln(1+e^{-a})$

Figure 1: The exponential (blue), logistic (red) and hinge (green) loss function.
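As a quick illustration, the three margin losses listed above can be written in a few lines. This is a minimal NumPy sketch; the function names are our own and are not taken from the paper.

```python
import numpy as np

def hinge_loss(a):
    """Hinge loss |1 - a|_+ = max(0, 1 - a), applied elementwise to the margin a."""
    return np.maximum(0.0, 1.0 - np.asarray(a, dtype=float))

def exponential_loss(a):
    """Exponential loss e^{-a}."""
    return np.exp(-np.asarray(a, dtype=float))

def logistic_loss(a):
    """Logistic loss ln(1 + e^{-a})."""
    # logaddexp(0, -a) = ln(e^0 + e^{-a}); avoids overflow for very negative a
    return np.logaddexp(0.0, -np.asarray(a, dtype=float))
```

All three functions take the margin $a = yf(x)$ as input. Among them, only the hinge loss is exactly zero for $a \geq 1$; the exponential and logistic losses are strictly positive everywhere.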

Given a convex margin loss function, and considering again linear models, Tikhonov regularization in classification corresponds to

$$w_{\lambda}=\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}\sum_{i=1}^{n}\ell(y_{i}\langle w,x_{i}\rangle)+\lambda\lVert w\rVert^{2}, \tag{3.5}$$

for $\lambda>0$. It is natural to ask whether the above approach can be shown to be regularizing, in the sense that the sequence $\{w_{\lambda}\}_{\lambda>0}$ converges to the minimum norm solution (MN). Indeed, for the hinge loss this can be proven, while for the exponential or logistic loss, where the set of minimizers of the unpenalized problem is empty, one can show that $w_{\lambda}$ diverges as $\lambda\to 0$ (see e.g. Lemma A.1 in Appendix A). Nevertheless, under suitable assumptions on the loss function, convergence in direction, that is,

$$\frac{w_{\lambda}}{\lVert w_{\lambda}\rVert}\underset{\lambda\to 0}{\longrightarrow}\frac{w_{\ast}}{\lVert w_{\ast}\rVert}=w_{+},$$

has been proven for a wide class of margin loss functions (see e.g. [82, Theorem 2.1] and [42, 81]). Notice that according to the previous discussion and Lemma 3.1, since the max margin problem (MM) is a direction problem, it is still relevant to consider convergence in direction. We next turn our attention to implicit regularization, reviewing some recent results.

Implicit regularization for classification.

Following the same reasoning as in regression for the minimal norm interpolating solution (2.3) and the gradient method (2.5), it is natural to ask whether it is possible to prove the regularization properties of the gradient iteration applied to a margin loss function, i.e.

$$w_{t+1}=w_{t}-\gamma\sum_{i=1}^{n}y_{i}x_{i}\,\ell^{\prime}(y_{i}\langle w_{t},x_{i}\rangle),\quad t=0,\dots,T, \tag{3.6}$$

where $\gamma>0$ is a suitable step size. The above iteration is well defined for smooth loss functions such as the exponential or the logistic loss. However, for losses like the exponential or the logistic, which do not admit any minimizer, the gradient descent iteration in (3.6) diverges. Interestingly, in a recent line of work (see [87, 66, 46] and references therein), convergence in direction of gradient descent has been proved for these loss functions, i.e.

$$\lim_{t\to\infty}\frac{w_{t}}{\lVert w_{t}\rVert}=w_{+}.$$
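The behaviour described above (diverging norm, converging direction) can be observed on a small example. The following is only a sketch under our own assumptions: the toy data set, the step size `gamma = 0.1`, the zero initialization and the function name `logistic_gd` are illustrative choices, not taken from the paper. For this data the max margin direction is $w_{+}=(1,0)$.

```python
import numpy as np

def logistic_gd(X, y, gamma=0.1, T=10_000):
    """Gradient iteration (3.6) for the logistic loss, started at w_0 = 0."""
    Z = y[:, None] * X                       # rows z_i = y_i x_i
    w = np.zeros(X.shape[1])
    for _ in range(T):
        margins = Z @ w                      # y_i <w_t, x_i>
        # derivative of ln(1 + e^{-a}) is -1 / (1 + e^{a}),
        # written with tanh for numerical stability
        dloss = -0.5 * (1.0 - np.tanh(0.5 * margins))
        w = w - gamma * (Z.T @ dloss)
    return w

# Hypothetical separable toy data; its min-norm solution is w_* = (1, 0).
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])

w_short = logistic_gd(X, y, T=100)
w_long = logistic_gd(X, y, T=10_000)
# cosine between the normalized iterate and w_+ = (1, 0)
cosine = w_long[0] / np.linalg.norm(w_long)
```

Comparing `np.linalg.norm(w_short)` with `np.linalg.norm(w_long)` shows the norm still growing with the number of iterations, while `cosine` approaches 1 only slowly.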

In addition, rates of convergence for the normalized iterates $\left\lVert\frac{w_{t}}{\lVert w_{t}\rVert}-w_{+}\right\rVert$, the angle gap $1-\left\langle\frac{w_{t}}{\lVert w_{t}\rVert},w_{+}\right\rangle$, and the normalized margin gap $M\left(\frac{w_{t}}{\lVert w_{t}\rVert}\right)-M(w_{+})$ were also proved, but are very slow. For example, for the logistic and exponential losses the margin rate is of order $O\left(\frac{1}{\log t}\right)$ (see [87]). Improved rates of order $O\left(\frac{\log t}{\sqrt{t}}\right)$ were given recently in [66] by considering a variable step-size version of gradient descent. Related works also include (accelerated) mirror descent approaches [44, 45] that provide margin rates up to $O\left(\frac{\log t}{t^{2}}\right)$. Concerning the hinge loss, the natural extension of the above idea is to consider a subgradient iteration

$$w_{t+1}=w_{t}-\gamma_{t}\sum_{i=1}^{n}y_{i}x_{i}\,g_{i}(w_{t}),\quad t=0,\dots,T, \tag{3.7}$$

for some sequence of step-sizes $(\gamma_{t})_{t}$, where $g_{i}(w_{t})\in\partial\ell(y_{i}\langle w_{t},x_{i}\rangle)$ is an element of the subdifferential of the loss. In this case, the minimization problem

$$\min_{w\in\mathbb{R}^{d}}\sum_{i=1}^{n}\lvert 1-y_{i}\langle w,x_{i}\rangle\rvert_{+}$$

does have a solution, and indeed the iteration in (3.7) converges to such a solution. However, the solution minimizing the hinge loss error cannot be expected to be the max margin (min norm) solution in general, and thus the subgradient iteration in (3.7) does not provide regularization properties.
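A two-point example makes the last remark concrete: any $w$ with zero hinge loss is a fixed point of (3.7), whether or not it has minimal norm. The sketch below is ours, not the paper's: it assumes a constant step size, a particular subgradient selection, and a hypothetical toy data set and starting point.

```python
import numpy as np

def hinge_subgradient_descent(X, y, w0, gamma=0.1, T=100):
    """Subgradient iteration (3.7) for the hinge loss with a constant step size."""
    Z = y[:, None] * X                       # rows z_i = y_i x_i
    w = np.array(w0, dtype=float)
    for _ in range(T):
        margins = Z @ w
        # one valid element of the subdifferential of |1 - a|_+ :
        # -1 where a < 1 and 0 where a >= 1
        g = np.where(margins < 1.0, -1.0, 0.0)
        w = w - gamma * (Z.T @ g)
    return w

# Hypothetical toy data: the min-norm solution of (MN) is (1, -1).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w_min_norm = np.array([1.0, -1.0])

# Start from a separating point that already has zero hinge loss.
w_T = hinge_subgradient_descent(X, y, w0=[5.0, -1.0])
hinge_value = np.maximum(0.0, 1.0 - (y[:, None] * X) @ w_T).sum()
```

Here the iteration never moves away from the starting point $(5,-1)$: it minimizes the hinge loss, yet its norm is larger than that of the min-norm solution $(1,-1)$.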

Next, we focus on the hinge loss and provide two iterative regularization schemes via a diagonal principle. For ease of reading, we first present and describe the two main iterative methods and then the main convergence and stability results characterizing their regularization properties.

4 Iterative regularization for hinge loss via diagonal descent

In this section, we present two iterative regularization approaches based on the hinge loss. The first is given in Algorithm 1. The second, given in Algorithm 2, is a variant that is faster in practice.

Algorithm 1 Projected iterative GD on the dual

Let $\{\lambda_{t}\}_{t\geq 0}$ be a decreasing-to-zero sequence of positive numbers, $u_{0}\in[-\lambda_{0}^{-1},0]^{n}$ and $0<\gamma\leq\lVert XX^{\top}\rVert_{\text{op}}^{-1}$. For all $t\geq 0$, consider $g_{t}=(g_{t}^{i})_{i=1,\dots,n}$, $u_{t}=(u_{t}^{i})_{i=1,\dots,n}$ and $w_{t}=(w_{t}^{l})_{l=1,\dots,d}$, such that for all $i=1,\dots,n$ and $l=1,\dots,d$:

$$g_{t}^{i}=u_{t}^{i}-\gamma\sum_{j=1}^{n}y_{i}y_{j}\langle x_{i},x_{j}\rangle u_{t}^{j} \tag{4.1}$$

$$u_{t+1}^{i}=\begin{cases}-\frac{1}{\lambda_{t}}&\text{ if }g_{t}^{i}<\gamma-\frac{1}{\lambda_{t}}\\ g_{t}^{i}-\gamma&\text{ if }g_{t}^{i}\in[\gamma-\frac{1}{\lambda_{t}},\gamma]\\ 0&\text{ if }g_{t}^{i}>\gamma\end{cases} \tag{4.2}$$

$$w_{t+1}^{l}=-\sum_{i=1}^{n}y_{i}u_{t+1}^{i}x_{i}^{l} \tag{4.3}$$

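Steps (4.1)-(4.3) translate almost line by line into vectorized code. The following is a minimal NumPy sketch rather than a reference implementation: the function name, the choice $u_{0}=0$, the schedule $\lambda_{t}=1/(t+1)$, the default step size $\gamma=\lVert XX^{\top}\rVert_{\text{op}}^{-1}$ and the toy data are our own assumptions.

```python
import numpy as np

def diagonal_dual_gd(X, y, lambdas, gamma=None, u0=None):
    """Sketch of Algorithm 1; returns the last primal and dual iterates."""
    Z = y[:, None] * X                   # rows z_i = y_i x_i
    K = Z @ Z.T                          # K_ij = y_i y_j <x_i, x_j>
    if gamma is None:
        # ||Z Z^T||_op equals ||X X^T||_op because every y_i is +1 or -1
        gamma = 1.0 / np.linalg.norm(K, 2)
    u = np.zeros(len(y)) if u0 is None else np.array(u0, dtype=float)
    for lam in lambdas:
        g = u - gamma * (K @ u)                  # forward step (4.1)
        u = np.clip(g - gamma, -1.0 / lam, 0.0)  # thresholding (4.2)
    w = -Z.T @ u                                 # primal iterate (4.3)
    return w, u

# Hypothetical separable toy data; its min-norm solution is w_* = (1, 0).
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
lambdas = 1.0 / (1.0 + np.arange(500))   # decreasing to zero, lambda_0 = 1
w, u = diagonal_dual_gd(X, y, lambdas)
```

Note that the three cases in (4.2) collapse into a single `np.clip` call, since they describe the projection of $g_{t}^{i}-\gamma$ onto the interval $[-1/\lambda_{t},0]$. On this toy data the returned `w` is numerically close to $(1,0)$, and the dual coordinate of the point $(3,1)$, which is not a support vector, is zero.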
Algorithm 2 Projected iterative i-GD on the dual

Let $\{\lambda_{t}\}_{t\geq 0}$ be a decreasing-to-zero sequence of positive numbers, $u_{0}=u_{1}\in[-\lambda_{0}^{-1},0]^{n}$ and $0<\gamma\leq\lVert XX^{\top}\rVert_{\text{op}}^{-1}$. For all $t\geq 1$, consider $q_{t}=(q_{t}^{i})_{i=1,\dots,n}$, $g_{t}=(g_{t}^{i})_{i=1,\dots,n}$, $u_{t}=(u_{t}^{i})_{i=1,\dots,n}$ and $w_{t}=(w_{t}^{l})_{l=1,\dots,d}$, such that for all $i=1,\dots,n$ and $l=1,\dots,d$:

$$q_{t}^{i}=u_{t}^{i}+\alpha_{t}(u_{t}^{i}-u_{t-1}^{i}),\quad\alpha_{t}=\frac{t}{t+\alpha} \tag{4.4}$$

$$g_{t}^{i}=q_{t}^{i}-\gamma\sum_{j=1}^{n}y_{i}y_{j}\langle x_{i},x_{j}\rangle q_{t}^{j} \tag{4.5}$$

$$u_{t+1}^{i}=\begin{cases}-\frac{1}{\lambda_{t}}&\text{ if }g_{t}^{i}<\gamma-\frac{1}{\lambda_{t}}\\ g_{t}^{i}-\gamma&\text{ if }g_{t}^{i}\in[\gamma-\frac{1}{\lambda_{t}},\gamma]\\ 0&\text{ if }g_{t}^{i}>\gamma\end{cases} \tag{4.6}$$

$$w_{t+1}^{l}=-\sum_{i=1}^{n}y_{i}u_{t+1}^{i}x_{i}^{l} \tag{4.7}$$
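The inertial variant (4.4)-(4.7) differs from the basic scheme only in the extrapolation step. The sketch below is again illustrative and rests on our own assumptions: the function name, $u_{0}=u_{1}=0$, the value `alpha = 3.0` (the text above does not fix $\alpha$), the schedule $\lambda_{t}=1/t$, the default step size and the toy data are not taken from the paper.

```python
import numpy as np

def diagonal_dual_inertial_gd(X, y, lambdas, alpha=3.0, gamma=None):
    """Sketch of Algorithm 2; lambdas[0] is used at t = 1, lambdas[1] at t = 2, ..."""
    Z = y[:, None] * X                   # rows z_i = y_i x_i
    K = Z @ Z.T                          # K_ij = y_i y_j <x_i, x_j>
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(K, 2)
    u_prev = np.zeros(len(y))            # u_0
    u = u_prev.copy()                    # u_1 = u_0
    for t, lam in enumerate(lambdas, start=1):
        alpha_t = t / (t + alpha)
        q = u + alpha_t * (u - u_prev)               # inertial step (4.4)
        g = q - gamma * (K @ q)                      # forward step (4.5)
        u_prev, u = u, np.clip(g - gamma, -1.0 / lam, 0.0)  # thresholding (4.6)
    w = -Z.T @ u                                     # primal iterate (4.7)
    return w, u

# Hypothetical separable toy data; its min-norm solution is w_* = (1, 0).
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
lambdas = 1.0 / (1.0 + np.arange(500))   # decreasing to zero
w, u = diagonal_dual_inertial_gd(X, y, lambdas)
```

The only extra state compared with the non-inertial scheme is the previous dual iterate `u_prev`, so the per-iteration cost is essentially unchanged.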

We add a few comments before providing a detailed derivation of the two procedures above. In both Algorithms 1 and 2, $\gamma$ denotes a constant step-size and $\lambda_{t}$ a parameter that vanishes along the iterations. Both procedures are based on simple, easy-to-implement iterations that require only vector multiplication and thresholding operations. The sequence $\{u_{t}\}_{t\geq 0}$ represents a dual variable and is computed coordinate-wise via a simple projection operation (see (4.2) and (4.6)), while $g_{t}$ corresponds to a classical gradient (forward) step related to the squared norm $\lVert\cdot\rVert^{2}/2$. The difference between the two schemes is that in Algorithm 2 the sequence $g_{t}$ is computed via the auxiliary sequence $q_{t}$ instead of $u_{t}$. The sequence $q_{t}$ equals $u_{t}$ extrapolated by the term $\alpha_{t}(u_{t}-u_{t-1})$, called the inertial term, which plays an important role in the convergence speed of Algorithm 2. Finally, $\{w_{t}\}_{t>0}$ corresponds to the primal sequence designed to approximate the min-norm solution $w_{\ast}$. In the next section, we discuss the derivation of the above two procedures.

4.1 Diagonal methods via dual hinge loss

Algorithms 1 and 2 are based on a so-called diagonal regularization process [9] applied to a suitably defined dual problem. We next describe these ideas in some detail. We note in passing that, while diagonal approaches have been considered before for inverse problems [8, 59, 19, 9, 34, 21], we are not aware of their application to classification.

The basic idea of diagonal approaches is perhaps best explained by recalling that the minimizer of a penalized functional such as (3.5) converges to the minimal norm separating solution (3.4) as the regularization parameter goes to zero. The idea is to start from an optimization procedure for solving Problem (3.5), and then to modify it by letting the regularization parameter decrease at each iteration. In this way, the resulting iteration no longer converges to a solution of Problem (3.5), but rather directly to the minimal norm separating solution (3.4). To derive Algorithms 1 and 2, this basic idea is actually applied to the dual formulation of Problem (3.5).

To describe the above ideas more precisely, we begin by rewriting Problem (3.5) for the special case of the hinge loss $\ell(\cdot)=\lvert 1-\cdot\rvert_{+}$. To ease the notation, we define the matrix $Z=\operatorname{diag}(Y)X\in\mathbb{R}^{n\times d}$, whose rows are $z_{i}=y_{i}x_{i}$ for $i\leq n$, and we set $\mathcal{L}:\mathbb{R}^{n}\to\mathbb{R}$, with $\mathcal{L}(u)=\sum_{i=1}^{n}\ell(u_{i})$. Then solving Problem (3.5) is equivalent, up to a rescaling of $\lambda$, to solving

$$\min_{w\in\mathbb{R}^{d}}\frac{1}{\lambda}\mathcal{L}(Zw)+\frac{\lVert w\rVert^{2}}{2}. \tag{4.8}$$

The objective function in (4.8) is the sum of a smooth term, the squared norm, and a convex nonsmooth one, $\frac{1}{\lambda}\mathcal{L}\circ Z$. Problems such as (4.8) are called structured composite convex minimization problems, and can often be solved efficiently by proximal-gradient methods [26], a class of first-order methods that split the contributions of the smooth and the nonsmooth parts. At every iteration, the smooth part is activated through a gradient step, while the nondifferentiable one requires the computation of its proximity operator. Implementing a proximal gradient algorithm to solve problem (4.8) would require the proximity operator of $\mathcal{L}\circ Z$, which is not available in closed form and may be computationally expensive. Therefore, we consider the dual problem (see for example [10, Definition 15.10]) associated to (4.8) (which is equivalent to (MN)), which is given by

\[
\min_{u\in\mathbb{R}^{n}}\ \frac{\lVert Z^{\top}u\rVert^{2}}{2}+\frac{1}{\lambda}\mathcal{L}^{\ast}(\lambda u), \tag{4.9}
\]

where $\mathcal{L}^{\ast}$ denotes the Fenchel conjugate of $\mathcal{L}$ defined in Section 1.1. Its computation is simplified by the fact that $\mathcal{L}$ is a sum of separable functions, so that it can be written as $\mathcal{L}^{\ast}(u)=\sum_{i=1}^{n}\ell^{\ast}(u^{i})=\sum_{i=1}^{n}u^{i}+\iota_{[-1,0]}(u^{i})$, which implies

\[
\frac{1}{\lambda}\mathcal{L}^{\ast}(\lambda u)=\sum_{i=1}^{n}u^{i}+\iota_{[-1/\lambda,0]}(u^{i}). \tag{4.10}
\]
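For completeness, the conjugate of the hinge loss used in (4.10) can be recovered directly from the definition of the Fenchel conjugate. The following is a short sketch of that computation (our own derivation, not reproduced from the paper's appendix). Splitting the supremum at $x=1$,

\begin{align*}
\ell^{\ast}(s)=\sup_{x\in\mathbb{R}}\big\{sx-\left|1-x\right|_{+}\big\}
=\max\Big\{\sup_{x\leq 1}\big((s+1)x-1\big),\ \sup_{x\geq 1}sx\Big\}.
\end{align*}

The first supremum is finite only if $s\geq -1$ and the second only if $s\leq 0$; in both cases it is attained at $x=1$ with value $s$. Hence $\ell^{\ast}(s)=s+\iota_{[-1,0]}(s)$, and replacing $s$ by $\lambda s$ and dividing by $\lambda>0$ gives $\frac{1}{\lambda}\ell^{\ast}(\lambda s)=s+\iota_{[-1/\lambda,0]}(s)$, which is (4.10) component by component.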

It is also important to write the dual problem associated to the min-norm problem (3.4), which is given by

\[
\min_{u\in\mathbb{R}^{n}}D_{\infty}(u)=\frac{\lVert Z^{\top}u\rVert^{2}}{2}+\sum_{i=1}^{n}u^{i}+\iota_{(-\infty,0]^{n}}(u). \tag{4.11}
\]

Indeed, it is possible to show (see Lemma A.6) that, for $\lambda\to 0$, the dual regularized Problem (4.9) converges point-wise to Problem (4.11). To pass from convergence properties in the dual space to the primal space, recall that, if strong duality holds (see e.g. [10, Section 15.2]), the value of problem (4.9) is the same as the value of (4.8) and, for every $\lambda>0$, one can recover a solution $w_{\lambda}$ of the primal problem (4.8) from a solution $u_{\lambda}$ of the dual problem (4.9) via the formula [10, Section 19]

\[
w_{\lambda}=-Z^{\top}u_{\lambda}. \tag{4.12}
\]

Problem (4.9) is another composite convex optimization problem, where $\frac{\lVert Z^{\top}\cdot\rVert^{2}}{2}$ has a $\lVert XX^{\top}\rVert_{\text{op}}$-Lipschitz gradient, and $\mathcal{L}^{\ast}$ is the nonsmooth part, whose proximity operator can be easily computed, as we show below. We can then implement a diagonal proximal gradient iteration on the dual function defined in (4.9) as follows. For a given starting point $u_{0}$, some step-size $\gamma>0$ and a decreasing sequence $\lambda_{t}$, the (diagonal) proximal gradient iteration corresponding to Problem (4.9) is given by

\[
u_{t+1}=\operatorname{prox}_{\frac{\gamma}{\lambda_{t}}\mathcal{L}^{\ast}(\lambda_{t}\cdot)}\big(u_{t}-\gamma ZZ^{\top}u_{t}\big). \tag{4.13}
\]

We add several comments. First, if $\lambda_{t}$ is taken to be constant, then the above iteration solves Problem (4.9), provided that the stepsize satisfies $\gamma\leq\lVert XX^{\top}\rVert_{\text{op}}^{-1}$. Following the previous discussion, by letting $\lambda_{t}$ decrease at each iteration we obtain a diagonal process solving Problem (4.11). Second, the computation of the proximity operator is simplified by the fact that $\mathcal{L}^{\ast}(\lambda_{t}\cdot)$ is a sum of separable functions, as can be seen from (4.10). Indeed, this allows us to compute the proximal operator component-wise

\[
\operatorname{prox}_{\mathcal{L}^{\ast}(\lambda_{t}\cdot)}(u)=\big(\operatorname{prox}_{\ell^{\ast}(\lambda_{t}\cdot)}(u^{i})\big)_{i\leq n},
\]

and derive the following iteration

\[
u_{t+1}=\Big(\operatorname{prox}_{\frac{\gamma}{\lambda_{t}}\ell^{\ast}(\lambda_{t}\cdot)}\big(\left(u_{t}-\gamma ZZ^{\top}u_{t}\right)_{i}\big)\Big)_{i\leq n}. \tag{4.14}
\]

Note that the proximal operator of $\ell^{\ast}$ can be computed in closed form. Indeed, for any $\gamma,\lambda>0$ and $p\in\mathbb{R}$,

\begin{align}
\operatorname{prox}_{\frac{\gamma}{\lambda}\ell^{\ast}(\lambda\cdot)}(p)
&=\operatorname*{arg\,min}_{s\in\mathbb{R}}\Big\{s+\iota_{[-1/\lambda,0]}(s)+\frac{\lVert s-p\rVert^{2}}{2\gamma}\Big\} \tag{4.15}\\
&=\operatorname{proj}_{[-1/\lambda,0]}\bigl(p-\gamma\bigr)=
\begin{cases}
-1/\lambda & p<\gamma-1/\lambda\\
p-\gamma & p\in[\gamma-1/\lambda,\gamma]\\
0 & p>\gamma.
\end{cases} \notag
\end{align}
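The closed form (4.15) is a shifted projection, so it reduces to a single array operation. The following NumPy sketch applies it elementwise; the function name `prox_hinge_conjugate` is ours (not from the paper), and the code assumes $\gamma,\lambda>0$.

```python
import numpy as np

def prox_hinge_conjugate(p, gamma, lam):
    """Closed form (4.15): project p - gamma onto [-1/lam, 0], elementwise.

    Assumes gamma > 0 and lam > 0; p may be a scalar or an array.
    """
    shifted = np.asarray(p, dtype=float) - gamma
    return np.clip(shifted, -1.0 / lam, 0.0)
```

Because the operator acts independently on each coordinate, the same call can be applied to the whole vector $u_{t}-\gamma ZZ^{\top}u_{t}$ in (4.14).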

Finally, putting together all the above observations we derive Algorithm 1. Algorithm 2 is derived by considering a classic variation of the proximal gradient iteration in Algorithm 1, namely a so-called inertial step. The latter corresponds to replacing $u_{t}$ in (4.14) with $q_{t}=u_{t}+\frac{t}{t+\alpha}(u_{t}-u_{t-1})$. For both Algorithm 1 and Algorithm 2 the last step corresponds to the dual-to-primal update $w_{t+1}=-Z^{\top}u_{t+1}$.
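Since the pseudocode of Algorithms 1 and 2 is not reproduced in this section, the following NumPy sketch only illustrates the ingredients described above: the iteration (4.14) with the closed-form prox (4.15), the optional inertial point $q_{t}$, and the dual-to-primal update. The function name, the zero starting point $u_{0}=0$, the step size $\gamma=\lVert ZZ^{\top}\rVert_{\text{op}}^{-1}$ and the way the schedule $\lambda_{t}$ is passed in are assumptions of this sketch, and labels are assumed to lie in $\{-1,+1\}$.

```python
import numpy as np

def diagonal_dual_prox_grad(X, y, lambdas, alpha=None):
    """Sketch of iteration (4.14); pass alpha (e.g. 3.0) for the inertial variant.

    X: (n, d) data, y: (n,) labels in {-1, +1}, lambdas: decreasing positive values.
    Returns the primal iterate w = -Z^T u and the last dual iterate u.
    """
    Z = y[:, None] * X                       # Z = diag(Y) X
    G = Z @ Z.T                              # Z Z^T, used in the gradient step
    gamma = 1.0 / np.linalg.norm(G, 2)       # 1 / ||Z Z^T||_op = 1 / ||X X^T||_op
    u = np.zeros(X.shape[0])                 # u_0 = 0 (an assumption of this sketch)
    u_prev = u
    for t, lam in enumerate(lambdas):
        # Inertial point q_t; it reduces to u_t when alpha is None.
        q = u if alpha is None else u + t / (t + alpha) * (u - u_prev)
        u_prev = u
        # Gradient step on the smooth part, then the prox (4.15) as a clipped shift.
        u = np.clip(q - gamma * (G @ q) - gamma, -1.0 / lam, 0.0)
    return -Z.T @ u, u                       # dual-to-primal update w = -Z^T u
```

One possible decreasing schedule is `lambdas = 1.0 / (1.0 + np.arange(2000))`; the rates stated in Section 5 additionally require $\lambda_{0}\leq\lVert u_{\ast}\rVert^{-1}$.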

Remark 4.1.

Notice that Algorithms 1 and 2 can be obtained by replacing $Z$ with $\text{diag}(Y)X$. In fact, our analysis also extends to general linearly parametrized models of the form $f_{w}(x)=\text{sign}(\langle w,\phi(x)\rangle)$, where $\phi:\mathbb{R}^{d}\to\mathcal{H}$ denotes some feature mapping (possibly infinite dimensional) which may be specified explicitly, or via some kernel operator, i.e. $K(x_{i},x_{j})=\langle\phi(x_{i}),\phi(x_{j})\rangle$. (Footnote 6: The use of this feature mapping $\phi$ allows one to consider non-linear classifiers for possibly non-linearly separable data; see for example [27, 89].) If we set $\phi(X)=(\phi(x_{i}))_{i\leq n}$, the penalized hinge-loss problem associated to the minimal norm separating problem and its corresponding dual are:

\begin{align}
&\min_{w\in\mathcal{H}}\left\{\frac{1}{\lambda}\sum_{i=1}^{n}\ell(\langle\phi(x_{i}),w\rangle)+\frac{\lVert w\rVert^{2}}{2}\right\} \tag{4.16}\\
&\min_{u\in\mathbb{R}^{n}}\left\{\sum_{i=1}^{n}\frac{\ell^{\ast}(\lambda u^{i})}{\lambda}+\frac{\lVert\phi(X)^{\top}u\rVert^{2}}{2}=\sum_{i=1}^{n}\frac{\ell^{\ast}(\lambda u^{i})}{\lambda}+\frac{\sum_{i,j=1}^{n}u^{i}K(x_{i},x_{j})u^{j}}{2}\right\} \tag{4.17}
\end{align}

Since Algorithms 1 and 2 are designed from the dual problem (4.17), the only information needed for the dual update $u_{t+1}$ is the kernel evaluation at each data-point, and not $\phi(X)$ itself (see e.g. [89]). In this case Algorithms 1 and 2 can be rewritten by replacing $ZZ^{\top}$ with the operator $K$. Finally, while the primal update $w_{t}$ may not be computable, the predictor can still be computed as $f_{w_{t}}(x)=\langle w_{t},\phi(x)\rangle=\langle K(x_{i},x),u_{t}\rangle$ via the dual iterates and the kernel evaluation.
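As an illustration of this remark, the sketch below runs the non-inertial dual iteration using only kernel evaluations and then predicts through the dual iterate. It is a hedged example rather than the paper's implementation: the Gaussian kernel, the function names, the zero starting point and the step size are our choices, and the labels are written out explicitly (the Gram matrix is $\text{diag}(Y)K\,\text{diag}(Y)$ and the predictor is $-\sum_{i}u^{i}y_{i}K(x_{i},x)$), which is what $Z=\text{diag}(Y)\phi(X)$ and (4.12) give when the labels are not absorbed into the notation.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_dual_iterates(K, y, lambdas):
    """Dual iteration (4.14) with Z Z^T replaced by the labelled kernel matrix."""
    G = (y[:, None] * K) * y[None, :]        # diag(Y) K diag(Y)
    gamma = 1.0 / np.linalg.norm(G, 2)       # step size from the operator norm
    u = np.zeros(len(y))                     # u_0 = 0 (an assumption of this sketch)
    for lam in lambdas:
        u = np.clip(u - gamma * (G @ u) - gamma, -1.0 / lam, 0.0)
    return u

def kernel_predict(u, y, K_test_train):
    """Sign of <w, phi(x)> with w = -sum_i u_i y_i phi(x_i); phi is never formed."""
    return np.sign(-K_test_train @ (u * y))
```

Here `K_test_train[j, i]` is assumed to hold $K(x_{j}^{\text{test}},x_{i})$, so new points only require one row of kernel evaluations each.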

We end this section with two remarks commenting on some related literature and then analyze the properties of Algorithm 1 and Algorithm 2 in the next section.

Remark 4.2 (Implicit regularization via homotopic subgradient).

Another implicit regularization approach for the hinge loss was recently studied in [63] and derived using a homotopic subgradient method for the primal problem (4.8). Our approach, in contrast, is based on a diagonal process on the dual problem (4.9) and, as discussed later, leads to faster convergence rates.

Remark 4.3 (Hard-SVM).

The dual formulation (4.9) is the one used for solving linear SVM problems (see [89]). Indeed, tackling the max-margin problem via its dual formulation (4.9) is popular due to its favorable structure, and there is a very rich literature on methods to solve it (see e.g. interior-point methods [18, 32] or decomposition methods [27, 89], [32, 97] and [74, 50, 56]). Compared to these methods, our diagonal approach enjoys good theoretical guarantees while providing a direct link to regularization methods.

5 Main results and convergence analysis

In this section we present and discuss the main results of this work, deferring the associated proofs to the Appendix. Before stating the main theorems, we discuss a key property of the sequence of dual problems. In particular, there exist some positive constants $\mu$, $M$ and $R$, such that, for all $t\geq 0$, each of the dual functions $D_{t}(\cdot)=\frac{1}{\lambda_{t}}\mathcal{L}^{\ast}(\lambda_{t}\cdot)+\frac{1}{2}\lVert Z^{\top}\cdot\rVert^{2}$ related to problem (4.9) satisfies the $\mu$-Łojasiewicz condition in $[D_{t}\leq\min D_{t}+M]\cap\mathbb{B}(\mathbf{0},R)$, i.e. for all $u\in[D_{t}\leq\min D_{t}+M]\cap\mathbb{B}(\mathbf{0},R)$ and $t\geq 0$, it holds

\[
D_{t}(u)-\min D_{t}\leq\frac{1}{2\mu}\operatorname{dist}\left(\partial D_{t}(u),0\right)^{2}, \tag{5.1}
\]

as we will show in Lemma A.9. The previous condition is a relaxation of strong convexity, and it is well known to imply linear convergence of the standard Forward-Backward scheme (see e.g. [55, 14, 35, 48] and references therein). Theorem 5.1 extends these classical results to the diagonal setting. The value of $\mu$ is crucial, since it determines the constant appearing in the linear convergence bound in (5.4). As can be seen from (5.1), $\mu$ is independent of $t$, and an explicit expression for $\mu$ is given by

\[
\mu=\frac{1}{8\tau^{2}\left(\left(3\sqrt{\lVert X^{\top}u_{0}\rVert^{2}-\lVert X^{\top}u_{\ast}\rVert^{2}+\langle\mathbf{1},u_{0}-u_{\ast}\rangle}+\sqrt{2}\lVert X\rVert_{\text{op}}\left(\lVert u_{0}\rVert+2\lVert u_{\ast}\rVert\right)\right)^{2}+2\right)}, \tag{5.2}
\]

where $\tau$ is the Hoffman constant (see e.g. [43, 36] for a definition) of a system of linear inequalities and equalities describing the set of minimizers of the dual objective function $D_{t}$. The explicit computation of this constant is expensive, but an expression is available in closed form, and is given by (see e.g. [94, Lemma 15]):

\[
\tau=\sup_{(u,v)\in\mathbb{R}^{2n}\times\mathbb{R}^{d+1}}\left\{\left\lVert\begin{pmatrix}u\\ v\end{pmatrix}\right\rVert~:~\begin{aligned}&\lVert A^{\top}u+E^{\top}v\rVert=1,~u\geq 0.\text{ The rows of $A$, $E$}\\&\text{corresponding to non-zero components}\\&\text{of $u$ and $v$ are linearly independent.}\end{aligned}\right\} \tag{5.3}
\]

where $A=\begin{bmatrix}\text{Id}_{n}\\ -\text{Id}_{n}\end{bmatrix}\in\mathbb{R}^{(2n)\times n}$ and $E=\begin{bmatrix}Z^{\top}\\ \mathbf{1}^{\top}\end{bmatrix}\in\mathbb{R}^{(d+1)\times n}$.

We are now ready to state the main results. In Theorem 5.1 we state the convergence results for the sequence generated by Algorithm 1 to the minimal norm separating solution $w_{\ast}$ (see (MN)), and in Theorem 5.2 we state the convergence results for the inertial Algorithm 2.

Theorem 5.1.

Let $w_{\ast}$ be the solution of (MN), and let $\{u_{t}\}_{t\geq 0}$ and $\{w_{t}\}_{t\geq 0}$ be the sequences generated by Algorithm 1. Then $\{w_{t}\}_{t\geq 0}$ converges to $w_{\ast}$. In addition, if $u_{\ast}$ is a solution of the associated dual problem (4.11) and $\lambda_{0}\leq\lVert u_{\ast}\rVert^{-1}$, then for all $t\geq 1$, the following estimate holds true:

\[
\lVert w_{t}-w_{\ast}\rVert\leq C\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{\frac{t}{2}} \tag{5.4}
\]

where $C=\sqrt{\lVert X^{\top}u_{0}\rVert^{2}-\lVert X^{\top}u_{\ast}\rVert^{2}+2\langle\mathbf{1},u_{0}-u_{\ast}\rangle}$ and $\mu$ is defined in (5.2). In addition, there exists some $t^{\ast}>0$ such that, for all $t\geq t^{\ast}$, the following rates hold true for the angle and the margin gap (respectively):

\begin{align}
1-\frac{\langle w_{t},w_{\ast}\rangle}{\lVert w_{t}\rVert\lVert w_{\ast}\rVert}&\leq\frac{C^{2}}{\lVert w_{\ast}\rVert^{2}}\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{t} \tag{5.5}\\
M\left(\frac{w_{\ast}}{\lVert w_{\ast}\rVert}\right)-M\left(\frac{w_{t}}{\lVert w_{t}\rVert}\right)&\leq\frac{2C\lVert X\rVert_{F}}{\lVert w_{\ast}\rVert^{2}}\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{\frac{t}{2}} \tag{5.6}
\end{align}

Theorem 5.2.

Let $w_{\ast}$ be the solution of (MN), and let $\{u_{t}\}_{t\geq 0}$ and $\{w_{t}\}_{t\geq 0}$ be the sequences generated by Algorithm 2 with $\alpha\geq 3$. Then $\{w_{t}\}_{t\geq 0}$ converges to $w_{\ast}$. In particular, if $u_{\ast}$ is a solution of the associated dual problem (4.11) and $\lambda_{0}\leq\lVert u_{\ast}\rVert^{-1}$, then for all $t\geq 1$, the following estimate holds true:

\[
\lVert w_{t}-w_{\ast}\rVert\leq\frac{C}{t+\alpha-1} \tag{5.7}
\]

where $C=(\alpha-1)\Big(\lVert X^{\top}u_0\rVert^{2} - \lVert X^{\top}u_\ast\rVert^{2} + 2\langle \mathbf{1}, u_0-u_\ast\rangle + \frac{\lVert u_0-u_\ast\rVert^{2}}{\gamma}\Big)^{1/2}$.

In addition, there exists some $t^{\ast}>0$ such that for all $t\geq t^{\ast}$, the following rates hold true for the angle and the margin gap (respectively):

$$1-\frac{\langle w_t, w_\ast\rangle}{\lVert w_t\rVert\,\lVert w_\ast\rVert} \leq \frac{C^{2}}{\lVert w_\ast\rVert^{2}t^{2}} \tag{5.8}$$

$$M\Big(\frac{w_\ast}{\lVert w_\ast\rVert}\Big) - M\Big(\frac{w_t}{\lVert w_t\rVert}\Big) \leq \frac{2C\lVert Z\rVert_F}{\lVert w_\ast\rVert\, t} \tag{5.9}$$

We add several remarks discussing the results in Theorems 5.1 and 5.2 before comparing them to some recent related works and deriving their proofs.

Remark 5.1.

In Theorem 5.1 we derived the linear convergence of the sequence $\{w_t\}_{t\geq 0}$ thanks to condition (5.1), as discussed above. Although the inertial version would be expected to converge faster than the basic iteration, Theorem 5.2 provides only a sublinear rate of convergence. We believe this is due to technical, rather than fundamental, reasons. The numerical results in Section 6 suggest that inertial variants can indeed provide faster convergence, but proving linear rates of convergence for inertial variants is a challenging question and an active area of research in the optimization literature, see e.g. the discussion in [35, 3].

Remark 5.2 (Error metrics).

Theorems 5.1 and 5.2 provide rates of convergence for the distance of the iterates $w_t$ to the minimal norm solution, as well as the angle gap and the margin gap of the normalized iterates $\frac{w_t}{\lVert w_t\rVert}$ to the max-margin solution $w_{+}=\frac{w_\ast}{\lVert w_\ast\rVert}$, for Algorithms 1 and 2 (respectively). As mentioned in Section 3, since the original max-margin problem (MM) is a direction problem [27, 89], the margin and the angle gap are relevant quantities to measure the performance of the proposed methods, see [66, 87, 63, 46, 44, 45].
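The error metrics of this remark can be sketched numerically as follows. This is a minimal illustration, not the authors' code: it assumes the margin functional is $M(w)=\min_i y_i\langle w, x_i\rangle$ (its definition appears earlier in the paper), it uses the four support vectors listed in the caption of Figure 2, and the iterate `w_t` is a hypothetical value chosen only for the example.

```python
import numpy as np

def margin(w, X, y):
    """Assumed margin functional: M(w) = min_i y_i * <w, x_i>."""
    return float(np.min(y * (X @ w)))

def angle_gap(w_t, w_star):
    """1 - cos(angle between w_t and w_star), the left-hand side of (5.8)."""
    cos = float(w_t @ w_star) / (np.linalg.norm(w_t) * np.linalg.norm(w_star))
    return 1.0 - cos

def margin_gap(w_t, w_star, X, y):
    """Margin of normalized w_star minus margin of normalized w_t, as in (5.6) and (5.9)."""
    m_star = margin(w_star / np.linalg.norm(w_star), X, y)
    m_t = margin(w_t / np.linalg.norm(w_t), X, y)
    return m_star - m_t

# Support vectors from the caption of Figure 2 and the solution w_* = (1/2, 1/2).
X = np.array([[0.5, 1.5], [1.5, 0.5], [-0.5, -1.5], [-1.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_star = np.array([0.5, 0.5])
w_t = np.array([0.6, 0.4])  # hypothetical iterate, for illustration only

print("distance  :", np.linalg.norm(w_t - w_star))
print("angle gap :", angle_gap(w_t, w_star))
print("margin gap:", margin_gap(w_t, w_star, X, y))
```

Both gaps are invariant to positive rescaling of `w_t`, which reflects the fact that (MM) is a direction problem.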

Remark 5.3 (Parameter choice).

In both Theorems 5.1 and 5.2, the requirement $\lambda_0\leq \lVert u_\ast\rVert^{-1}$, where $u_\ast\in\operatorname{arg\,min} D_{\infty}$, allows one to deduce the bounds (5.4) and (5.7) for all $t\geq 0$. If this condition is not satisfied, the estimates in Theorems 5.1 and 5.2 still hold true asymptotically, because $\lambda_t$ is decreasing. In addition, one can freely choose the rate at which $\lambda_t$ decays to zero. In Section 6 we numerically evaluate the impact of different choices of $(\lambda_t)$ on the performance of the method.

5.1 Comparison to other convergence results for implicit regularization in classification

We next compare the convergence results of Theorems 5.1 and 5.2 with existing results in the related literature. We begin by noting that Theorems 5.1 and 5.2 provide improved rates compared to those for classical perceptron variants [71], which are of order $O(1/\sqrt{t})$, see for example [75, Theorem 4]. Margin rates similar to those in (5.6) and (5.9) have been derived for other optimization procedures applied to different losses and regularizers. For the iterates generated by gradient descent applied to exponentially-tailed losses (such as the logistic or exponential loss), a margin rate of order $O(1/\log(t))$ is derived in [66, 87, 46]. For the iterates of the same algorithm with adaptive step size variants, the margin rates are of order $O(\log(t)/\sqrt{t})$ ([66, Theorem 5]) or $O(1/\sqrt{t})$ [25, 75]. In all these cases, the rates are worse than the ones we obtain in (5.6) and (5.9). The rates for the margin in Theorem 5.2 for Algorithm 2 match the ones in [44] (see Theorem 7). They are slightly worse than those found in [45, Theorem 3.1], which are of order $O(\ln t/t^{2})$ and are obtained considering a mirror-descent method on the smoothed margin for the exponential and logistic loss [58, 44, 45]. Finally, we compare to the results in [63], which consider a different implicit regularization approach based on a homotopic subgradient method to minimize the primal penalized hinge loss. The rates given in Theorems 5.1 and 5.2 are considerably better than the ones in [63], which are approximately of order $O(t^{-\frac{1}{6}})$ (see [63, Corollary 1, Lemma 2]). None of the existing results provides linear rates such as the ones we derive for Algorithm 1.
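To give a rough feeling for how far apart these orders are, the following sketch computes the iteration count at which each rate expression first drops below a target accuracy. It is only an order-of-magnitude illustration: all multiplicative constants are set to 1, and the values of `eps` and `rho` are hypothetical choices rather than quantities taken from the theorems.

```python
import math

eps = 1e-2  # target accuracy (illustrative)
rho = 0.1   # hypothetical value of gamma*mu / (1 + gamma*mu) for the linear rate

# Smallest t (up to rounding) such that rate(t) <= eps, with all constants set to 1.
iters = {
    "1/log(t)": math.exp(1 / eps),
    "t^(-1/6)": eps ** -6,
    "1/sqrt(t)": eps ** -2,
    "1/t": 1 / eps,
    "(1-rho)^(t/2)": 2 * math.log(1 / eps) / math.log(1 / (1 - rho)),
}

for name, t in sorted(iters.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: t ~ {t:.3g}")
```

The ranking, not the absolute numbers, is the point: the actual constants in (5.6) and (5.9) depend on the data and on the initialization.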

5.2 Stability

In Theorems 5.1 and 5.2 we established the regularization properties of Algorithms 1 and 2 in the sense of convergence to the minimal norm separating solution (MN) corresponding to the true labels $Y=(y_i)_{i\leq n}$. In practice, labels are typically corrupted by noise and regularization methods should provide stable solutions. In this section, we study the stability of Algorithm 1 by introducing a suitable notion of label noise (analogous results could also be derived for Algorithm 2 and are left for future study).

For classical inverse problems noise is measured with respect to some norm in the data space. In the context of classification, a possible noise model is to consider a fraction of labels to be wrong [1, 49]. Since the data are binary valued, a natural way to measure the discrepancy between correct outputs $Y=(y_i)_{i\leq n}$ and mislabeled outputs $\tilde{Y}=(\tilde{y}_i)_{i\leq n}$ is to assume there exists $0\leq N\leq n/2$ such that

$$d_H(Y,\tilde{Y}) = N \tag{5.10}$$

where $d_H$ is the Hamming distance, defined as

$$d_H(Y,\tilde{Y}) = \sum_{i\,:\,y_i\neq\tilde{y}_i} 1 \tag{5.11}$$

The Hamming distance replaces the norm in the data space to quantify the label noise. Here, $N$ can be seen as the noise level. The constraint $N\leq n/2$ is natural since higher values would correspond to simply renaming the classes. The case $N=0$ corresponds to noiseless data, that is, where no mislabelling is present. Note that Assumption (5.10) implies that there is a set of indices $S_N\subset\{1,\dots,n\}$ with cardinality $N$, such that

$$\tilde{y}_i y_i = -1, \quad \text{ for all } i\in S_N. \tag{5.12}$$
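The noise model (5.10) to (5.12) can be simulated as in the sketch below. The helper names `hamming_distance` and `flip_labels` are hypothetical, and choosing the index set $S_N$ uniformly at random is an assumption made for the example; the model itself only requires that exactly $N$ labels differ.

```python
import numpy as np

def hamming_distance(y, y_tilde):
    """Number of positions where the two label vectors differ, as in (5.11)."""
    return int(np.sum(y != y_tilde))

def flip_labels(y, N, rng):
    """Return a copy of y with exactly N labels flipped, together with the index set S_N."""
    S_N = rng.choice(len(y), size=N, replace=False)  # N distinct indices
    y_tilde = y.copy()
    y_tilde[S_N] *= -1                               # enforces y_tilde_i * y_i = -1 on S_N
    return y_tilde, S_N

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=20)          # clean labels (illustrative)
y_tilde, S_N = flip_labels(y, N=5, rng=rng)

print("noise level N =", hamming_distance(y, y_tilde))
```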

In the above setting, if $\tilde{w}(N)$ is a solution obtained using labels with noise level $N$, then the goal is to derive error estimates with respect to the true solution $w_\ast$ in terms of $N$. The error estimates should decrease as $N$ decreases, so that the correct solution is recovered as the noise vanishes. In the case of the iterative regularization procedure defined by Algorithm 1, this requires specifying a suitable choice for the stopping time. The following theorem provides such a choice and the corresponding error estimate.

Theorem 5.3 (Stability).

Let $w_\ast$ and $u_\ast$ be the solutions of problems (MN) and (4.11) respectively and $K=\max_{1\leq i\leq n}\{\lVert x_i\rVert\}$. Let $\tilde{Y}=(\tilde{y}_i)_{1\leq i\leq n}$ be a vector of noisy outputs satisfying Assumption (5.10). Finally, let $\{\tilde{u}_t\}_{t\geq 0}$ and $\{\tilde{w}_t\}_{t\geq 0}$ be the sequences generated by Algorithm 1 applied to the data $X=(x_i)_{1\leq i\leq n}$ and $\tilde{Y}=(\tilde{y}_i)_{1\leq i\leq n}$, with $\lambda_0\leq \lVert u_\ast\rVert^{-1}$. Let $\rho=\frac{\gamma\mu}{1+\gamma\mu}$, with $\mu$ as defined in (5.2). Then for all $t\geq 1$, the following estimate holds true:

$$\lVert \tilde{w}_t - w_\ast\rVert \leq C_1\sqrt{2(n+1-2N)N}\,t + C_2\sqrt{N} + C(1-\rho)^{\frac{t}{2}} \tag{5.13}$$

In particular, for the stopping time

$$t_\ast(N) := \max\left\{1,\ \frac{2}{\ln\left(\frac{1}{1-\rho}\right)}\ln\left(\frac{C\ln\left(\frac{1}{1-\rho}\right)}{2C_1\sqrt{2(n+1-2N)N}}\right)\right\},$$

the following bound holds,

$$\begin{aligned}
\lVert \tilde{w}_{t_\ast(N)} - w_\ast\rVert &\leq \frac{2C_1\sqrt{2(n+1-2N)N}}{\ln\left(\frac{1}{1-\rho}\right)}\ln\left(\frac{C\ln\left(\frac{1}{1-\rho}\right)}{C_1\sqrt{2(n+1-2N)N}}\right) + C_2\sqrt{N} + \frac{2C_1\sqrt{2(n+1-2N)N}}{\ln\left(\frac{1}{1-\rho}\right)} \\
&= O\left(\sqrt{N}\right)
\end{aligned} \tag{5.14}$$

where $C_1=2\sqrt{2}K^{2}\left(\lVert u_0-u_\ast\rVert+\lVert u_\ast\rVert\right)$, $C_2=2K\left(\lVert u_0-u_\ast\rVert+\lVert u_\ast\rVert\right)$, $C=\sqrt{\lVert X^{\top}u_0\rVert^{2}-\lVert X^{\top}u_\ast\rVert^{2}+2\langle\mathbf{1},u_0-u_\ast\rangle}$.

The proof of the above result can be found in Section A.5 of Appendix A. We add two remarks to discuss the above result.

Remark 5.4 (Stopping time and stability).

The above result shows that the best stopping time choice arises from a trade-off between stability and convergence. More precisely, the error estimate in (5.13) is composed of three terms. The first two terms are related to the stability of the algorithm and are due to the presence of the label noise; the first one grows linearly along the iterations. The last term is related to the convergence of the algorithm already analyzed in Theorem 5.1 in the absence of noise. The best stopping time is derived by optimizing the bound (5.13) with respect to $t$. In this sense, this is an a priori choice. Deriving an appropriate a posteriori choice is an interesting question left for future study. Here, we note that the optimal stopping time is larger when the noise level is smaller, and that the corresponding error bound decreases as the noise level decreases. Another interesting question would be to derive corresponding lower bounds.
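The trade-off described in this remark can be explored numerically with the sketch below, which evaluates the right-hand side of (5.13) and the stopping time $t_\ast(N)$ of Theorem 5.3. The numerical values of `n`, `C`, `C1`, `C2` and `rho` are hypothetical placeholders: in the theorem they depend on the data, on $u_0$ and on $u_\ast$. The formula requires $N\geq 1$, since the noise-dependent term vanishes for $N=0$.

```python
import math

def noise_slope(n, N, C1):
    """Coefficient of t in (5.13): C1 * sqrt(2 (n + 1 - 2N) N)."""
    return C1 * math.sqrt(2 * (n + 1 - 2 * N) * N)

def stability_bound(t, n, N, C, C1, C2, rho):
    """Right-hand side of (5.13) at iteration t."""
    return noise_slope(n, N, C1) * t + C2 * math.sqrt(N) + C * (1 - rho) ** (t / 2)

def stopping_time(n, N, C, C1, rho):
    """A priori stopping time t_*(N) of Theorem 5.3 (needs N >= 1)."""
    L = math.log(1 / (1 - rho))
    a = noise_slope(n, N, C1)
    return max(1.0, (2 / L) * math.log(C * L / (2 * a)))

# Hypothetical constants, chosen only to make the trade-off visible.
n, C, C1, C2, rho = 100, 1e4, 0.1, 1.0, 0.2

for N in (1, 5, 10, 25):
    t_star = stopping_time(n, N, C, C1, rho)
    bound = stability_bound(t_star, n, N, C, C1, C2, rho)
    print(f"N={N:2d}  t_*={t_star:6.1f}  bound={bound:8.2f}")
```

With these placeholder constants the computed stopping time shrinks as $N$ grows, in line with the remark that less noise allows more iterations before stopping.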

Remark 5.5 (Noise model).

The classification noise model considered above is simple and inspired by the classic deterministic noise in inverse problems. It allows us to take a first step towards understanding the stability properties of iterative regularization in classification. We note that other, possibly more complex, noise models are possible. For example, stochastic noise could be considered, possibly together with so-called margin conditions [60]. A more substantial development would be to consider random input data, as often done in machine learning. This is likely to require results from empirical process theory [16] and possibly different statistical notions of stability already used in machine learning [17].

6 Numerical results

In this section, we investigate numerically the properties of Algorithms 1 and 2. First, we analyze their convergence and stability on some synthetic datasets. Second, we study their performance on two benchmark datasets, and compare them to some recent related works.

6.1 Synthetic data-set

Following [87] and [63], we consider a solution vector $w_\ast=\left(\frac{1}{2},\frac{1}{2}\right)$ defining the maximal margin separator $f(x)=\langle w_\ast, x\rangle$ and two pairs of support vectors $x_1=(\frac{1}{2},\frac{3}{2})$, $x_2=(\frac{3}{2},\frac{1}{2})$ labeled with $y_1=y_2=1$ and $x_3=(-\frac{1}{2},-\frac{3}{2})$, $x_4=(-\frac{3}{2},-\frac{1}{2})$ labeled with $y_3=y_4=-1$. We then generate $80$ data-points and assign them to the two classes, so that the support vectors do not change, i.e. the new points have a larger distance from the separator $f(x)=\langle w_\ast, x\rangle$ than the points $x_1$, $x_2$, $x_3$ and $x_4$, see Figure 2.
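A data set of this kind can be generated along the following lines. This is a sketch under stated assumptions: the paper does not specify how the additional points are sampled, so the uniform sampling box, the `slack` parameter and the split into 4 support vectors plus 76 extra points are all hypothetical choices. The support vectors are the ones given in the caption of Figure 2.

```python
import numpy as np

w_star = np.array([0.5, 0.5])
# Support vectors +-(1/2, 3/2) and +-(3/2, 1/2), each with y_i * <w_*, x_i> = 1.
support = np.array([[0.5, 1.5], [1.5, 0.5], [-0.5, -1.5], [-1.5, -0.5]])
y_support = np.array([1, 1, -1, -1])

def sample_extra(n_extra, rng, box=4.0, slack=0.1):
    """Rejection-sample points whose functional margin |<w_*, x>| exceeds 1 + slack."""
    pts = []
    while len(pts) < n_extra:
        x = rng.uniform(-box, box, size=2)
        if abs(w_star @ x) >= 1.0 + slack:  # strictly farther than the support vectors
            pts.append(x)
    return np.array(pts)

rng = np.random.default_rng(0)
X_extra = sample_extra(76, rng)
X = np.vstack([support, X_extra])
y = np.concatenate([y_support, np.sign(X_extra @ w_star).astype(int)])

# The smallest functional margin is attained only at the four support vectors.
print(X.shape, np.min(y * (X @ w_star)))
```

Because every extra point satisfies the constraint $y_i\langle w_\ast, x_i\rangle\geq 1$ strictly, adding them leaves the minimal norm solution $w_\ast$ unchanged.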

We test Algorithms 1 and 2 for $T=1000$ iterations with regularization parameter $\lambda_t=\frac{4}{t}$ for all $t\leq T$. In Figure 3 we illustrate the convergence results in terms of the margin and angle gap, and of the error $\lVert w_t-w_\ast\rVert$, as analyzed in Theorems 5.1 and 5.2.

In this toy example we can notice that while the theoretical worst-case bounds for Algorithm 1 are better than the ones for Algorithm 2, as expressed in Theorems 5.1 and 5.2, this is not necessarily reflected in Figure 3. This is due to the pessimistic worst-case bound found in Theorem 5.2 for the inertial Algorithm 2, rather than to a numerical issue. Indeed, this mismatch between theory and practice for inertial methods is commonly observed in similar settings and is an active area of research, which is left for future study. A second remark on this example concerns the influence of the over-relaxation parameter $\alpha$ on the convergence behavior of Algorithm 2. As one can notice in Figure 3, the choice of $\alpha$ can strongly affect the performance of Algorithm 2 in relation to the stopping time. This observation raises an interesting question about the tuning of the parameter $\alpha$, which is left for future study (see also the discussion in [3]).

Figure 2: Data-set consisting of $80$ labeled points with given support vector-points $\pm(\frac{1}{2},\frac{3}{2})$ and $\pm(\frac{3}{2},\frac{1}{2})$. In dashed lines the (overlapping) max-margin separating hyperplanes formed by the last iterate of every scheme (Algorithms 1 and 2 with $\alpha=10$, $30$ and $50$ respectively).
Figure 3: Values of the normalized error gap $\lvert w_t/\lVert w_t\rVert - w_\ast/\lVert w_\ast\rVert\rvert$ (first figure), the normalized margin gap $M(w_\ast)/\lVert w_\ast\rVert - M(w_t)/\lVert w_t\rVert$ (second figure) and the normalized angle gap $1-\frac{\langle w_t, w_\ast\rangle}{\lVert w_t\rVert\,\lVert w_\ast\rVert}$ (third figure), as a function of the iterations $t$. Here we illustrate the performance of Algorithm 1 (green) and of Algorithm 2 with $3$ different choices for the parameter $\alpha$: $\alpha=10$ (red), $\alpha=30$ (magenta) and $\alpha=50$ (blue). As it appears in this example, the choice of $\alpha$ decisively affects the performance of Algorithm 2.

Second, we consider a data set of $1200$ points in $\mathbb{R}^{2}$ that consists of two classes distributed independently $\sim\mathcal{N}(\pm(\frac{1}{2},\frac{1}{2}),0.4)$, and we split them equally into training and test data. In this case, the training points are not linearly separable and Algorithms 1 and 2 are implemented with a Gaussian kernel with parameter $\sigma^{2}=0.15$, see Remark 4.1. The total number of iterations (kernel evaluations) is set to $T=2000$. In this experiment, we aim at illustrating the convergence but also the stability of the proposed methods in the presence of the noise induced by mislabeling a fraction of the training data. In particular, we are interested in the effect of the parameter $\lambda_t$ on the convergence behavior, as well as on the stability performance, of Algorithms 1 and 2. In these experiments, convergence is measured in terms of the margin, and stability via the error on the test data. In Figure 4, we plot the margin gap and the test error of Algorithms 1 and 2 with parameter $\lambda_t=\lambda_0/t$ with $\lambda_0\in\{0.01,1,10,100\}$, for three different levels of noise ($p=$ percentage of flipped training labels): $p=0\%$ (no noise), $p=10\%$ (moderate noise) and $p=20\%$ (strong noise). In Figure 5, we repeat the same experiment by choosing various orders of decay for $\lambda_t$, that is $\lambda_t\in\{8/\log(t),\,8/\sqrt{t},\,8/t,\,8/t^{2},\,8/2^{t}\}$. Several comments can be made. First, we note that a larger initialization $\lambda_0$ (e.g. $\lambda_0=100$ in Figure 4) or a slower decay rate (e.g. $\lambda_t=8/\log(t)$ or $8/\sqrt{t}$ in Figure 5) leads to slower margin convergence, but better generalization properties, especially in the presence of errors (second and third rows in Figures 4 and 5). In addition, while Algorithm 1 seems more robust with respect to the various changes of $\lambda_t$, the situation is different for the inertial variant, Algorithm 2. Both in Figures 4 and 5, for Algorithm 1, the behavior of the margin gap and the test error does not change radically unless the initialization is very large (e.g. $\lambda_t=100/t$ in Figure 4) or the decay rate is very slow (e.g. $\lambda_t=8/\log(t)$ or $8/\sqrt{t}$ in Figure 5). On the other hand, Algorithm 2 seems more sensitive to the different choices of the regularization parameter $\lambda_t$, both in terms of margin gap and test error.
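The experimental setup described above can be reproduced in outline as follows. This is a sketch with several assumptions that the text leaves open: the value $0.4$ is interpreted as the standard deviation of an isotropic Gaussian (it could equally denote the variance), the Gaussian kernel is taken as $k(x,x')=\exp(-\lVert x-x'\rVert^{2}/(2\sigma^{2}))$, and the train/test split and flipped indices are drawn uniformly at random. The function names are hypothetical.

```python
import numpy as np

def make_gaussian_classes(n, rng, mean=(0.5, 0.5), scale=0.4):
    """n points: half around +mean (label +1), half around -mean (label -1)."""
    half = n // 2
    mean = np.asarray(mean)
    X = np.vstack([rng.normal(mean, scale, size=(half, 2)),
                   rng.normal(-mean, scale, size=(n - half, 2))])
    y = np.concatenate([np.ones(half), -np.ones(n - half)])
    return X, y

def gaussian_kernel(A, B, sigma2=0.15):
    """Assumed convention: k(a, b) = exp(-||a - b||^2 / (2 * sigma2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma2))

rng = np.random.default_rng(0)
X, y = make_gaussian_classes(1200, rng)

# Equal split into training and test data.
perm = rng.permutation(len(y))
train, test = perm[:600], perm[600:]

# Flip a fraction p of the training labels (p = 0, 0.10 or 0.20 in the experiments).
p = 0.10
flip = rng.choice(len(train), size=round(p * len(train)), replace=False)
y_train = y[train].copy()
y_train[flip] *= -1

K_train = gaussian_kernel(X[train], X[train])      # used in place of X X^T
K_test = gaussian_kernel(X[test], X[train])        # for evaluating the test error
print(K_train.shape, K_test.shape, int(np.sum(y_train != y[train])))
```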

In the noiseless case (first row in Figures 4 and 5) it seems that choosing $\lambda_t$ small enough (or decaying moderately fast to zero) can be a good policy. Note however that choosing too small an initialization (or too fast a decay to zero) does not offer any significant advantage with respect to more moderate choices of $\lambda_t$ (see in particular the blue, magenta and khaki lines in the first row of Figures 4 and 5). On the other hand, in the presence of errors (second and third rows in Figures 4 and 5), a larger initialization (like $\lambda_0 = 10$ or $1$ in Figure 4) and a slower decay rate (as $\lambda_t = 8/t$ or $8/\sqrt{t}$ in Figure 5) may offer a better trade-off between margin convergence and test error.
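The five decay rates compared in Figure 5 can be written down as a minimal Python sketch. The dictionary keys and the helper `schedule_values` are hypothetical names introduced here for illustration, and starting at $t = 2$ is an assumption: the text does not say how the logarithmic rule is handled at $t = 1$, where $\log(t) = 0$.

```python
import math

LAMBDA_0 = 8.0  # initial value used for Figure 5 in the text

# Hypothetical labels for the five decay rates compared in Figure 5.
# Note: 2.0**t overflows a float for t >= 1024; math.ldexp(lam0, -t)
# is a safer way to evaluate the last rule over very long runs.
SCHEDULES = {
    "log": lambda t, lam0=LAMBDA_0: lam0 / math.log(t),
    "sqrt": lambda t, lam0=LAMBDA_0: lam0 / math.sqrt(t),
    "linear": lambda t, lam0=LAMBDA_0: lam0 / t,
    "quadratic": lambda t, lam0=LAMBDA_0: lam0 / t**2,
    "exponential": lambda t, lam0=LAMBDA_0: lam0 / 2.0**t,
}


def schedule_values(name, t_max, t_min=2):
    """Return [lambda_t for t = t_min, ..., t_max] for the named rule.

    t_min defaults to 2 (an assumption) because lam0 / log(t) is
    undefined at t = 1.
    """
    rule = SCHEDULES[name]
    return [rule(t) for t in range(t_min, t_max + 1)]


if __name__ == "__main__":
    for name in SCHEDULES:
        vals = schedule_values(name, t_max=100)
        print(f"{name:12s} first={vals[0]:.4f} last={vals[-1]:.3e}")
```

Listing the rules side by side makes the ordering explicit: for any fixed $t \geq 5$, the logarithmic rule keeps the largest regularization and the exponential rule the smallest, which is the sense in which the former is the "slowest" decay discussed above.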

Figure 4: Margin and test error performance of Algorithms 1 and 2 on the noisy dataset with $\lambda_t = \lambda_0/t$, for different initial values $\lambda_0$ ($\lambda_0 = 100$ in green, $\lambda_0 = 10$ in red, $\lambda_0 = 1$ in blue and $\lambda_0 = 0.01$ in magenta). Each row corresponds to a different noise level (% of flipped labels): 0% (first row), 10% (second row) and 20% (third row). The first and third columns show the margin and the test error of Algorithm 1, respectively, while the second and fourth columns show the margin and the test error of Algorithm 2.
Figure 5: Margin and test error performance of Algorithms 1 and 2 on the noisy dataset for different decay rates of $\lambda_t$ ($\lambda_t = \lambda_0/\log(t)$ in green, $\lambda_t = \lambda_0/\sqrt{t}$ in red, $\lambda_t = \lambda_0/t$ in blue, $\lambda_t = \lambda_0/t^2$ in magenta and $\lambda_t = \lambda_0/2^t$ in khaki), where $\lambda_0 = 8$. Each row corresponds to a different noise level (% of flipped labels): 0% (first row), 10% (second row) and 20% (third row). The first and third columns show the margin and the test error of Algorithm 1, respectively, while the second and fourth columns show the margin and the test error of Algorithm 2.

6.2 Real data-set

Finally, we test the proposed methods on the MNIST dataset (see [53]) and on the HTRU2 dataset (see [57, 76]). HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South) [51]. In particular, we compare the performance of Algorithms 1 and 2 with some recently proposed methods for binary classification, namely [66, 87] and [44, 45], in terms of margin convergence and test error. The method in [66, 87] is a normalized gradient descent with variable stepsize on the exponential loss, and the one in [44, 45] is based on an accelerated mirror descent approach on the smoothed margin of the exponential loss. We stress that this comparison is indicative of the theoretical convergence properties of the tested methods and is not meant to be exhaustive. The results are reported in Figure 6.

Figure 6: Margin and test error comparison on three pairs of MNIST digits (first to third columns) and the HTRU2 dataset (fourth column). The first row corresponds to the margin gap, the second one to the test error. Algorithms 1 and 2 are shown in green and red, respectively. The variable stepsize normalized gradient method proposed in [87, 66] is shown in blue, and the inertial mirror descent approach in [45] in magenta.

All the data were standardized to zero mean and unit standard deviation, and the binary labels of each dataset are set to $-1$ and $1$. The MNIST dataset is restricted to two-digit comparisons: 3 vs 5, 3 vs 8 and 4 vs 9. For the HTRU2 dataset, the training part was chosen by randomly picking 70% of the whole dataset, while the test part is composed of the remaining 30% (see e.g. [76]). All methods are implemented with the Gaussian kernel with parameter $\sigma^2 = 3.4$ for the MNIST digits and $\sigma^2 = 4$ for the HTRU2 dataset. Each update uses only kernel evaluations, and thus all the schemes have the same computational complexity per iteration. The test error is measured by the standard zero-one loss.
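The preprocessing described above could be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the standardization statistics are taken from the training part, and the kernel is normalized as $\exp(-\lVert x - x' \rVert^2 / (2\sigma^2))$, a convention that this section does not specify.

```python
import numpy as np


def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Random split, e.g. 70% / 30% as described for HTRU2."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    tr, te = perm[:n_train], perm[n_train:]
    return X[tr], y[tr], X[te], y[te]


def standardize(X_train, X_test):
    """Zero mean, unit standard deviation per feature.

    Assumption: the statistics are computed on the training part only.
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std


def gaussian_kernel(A, B, sigma2):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma2)).

    The 1 / (2 * sigma2) scaling is an assumed convention.
    """
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma2))


def zero_one_error(y_true, scores):
    """Fraction of points whose predicted sign differs from the label in {-1, +1}."""
    pred = np.where(scores >= 0, 1, -1)
    return float(np.mean(pred != y_true))
```

With these helpers, a kernel method only needs `gaussian_kernel(X_train, X_train, sigma2)` during training and `gaussian_kernel(X_test, X_train, sigma2)` at test time, which matches the remark that each update uses only kernel evaluations.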

As a general remark, Algorithm 2 has the best performance in terms of margin convergence and test error on the MNIST datasets. For HTRU2, while Algorithm 2 still provides the fastest margin convergence, it performs slightly worse than the method in [45] in terms of test error, whereas Algorithm 1 and the method proposed in [66] seem to give better test error results.

7 Conclusion and possible future work

In this work, we propose and study iterative regularization for classification in machine learning. Considering the hinge loss function, we derive an iterative regularization method defined by a dual diagonal optimization approach and further consider its accelerated variant, see [34, 21]. We provide convergence results, as well as convergence rates, to the minimum norm separating solution. Moreover we derive stability results for a natural classification noise model.

Several further research directions can be explored. For example, it would be interesting to consider other forms of regularization, extending to classification the results for linear inverse problems, see [62] and references therein. Moreover, it would be interesting to consider different data models, including stochastic noise and random input data, as often done in statistical learning theory [93]. Finally, it would be very interesting to consider nonlinear models following [77, 65].

Acknowledgements: We acknowledge the financial support of the European Research Council (grant SLING 819789), the AFOSR (European Office of Aerospace Research and Development) projects FA9550-18-1-7009 and FA8655-22-1-7034, the EU H2020-MSCA-RISE project NoMADS - DLV-777826, the H2020-MSCA-ITN Project Trade-OPT 2019; L. R. acknowledges the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216 and IIT. S.V. and L.R. acknowledge the support of the Ministry of Education, University and Research (PRIN 202244A7YL project “Gradient Flows and Non-Smooth Geometric Structures with Applications to Optimization and Machine Learning”) is part of the Indam group “Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro applicazioni”. This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.

Appendix A Appendix

The Appendix is organized as follows. In paragraph A.1 we recall some basic results concerning Tikhonov regularization and gradient descent for linear problems, as well as some basic facts on the max-margin problem (MM). Paragraph A.2 contains some auxiliary results that are needed for the proofs of the main Theorems 5.1, 5.2 and 5.3. In paragraphs A.3 and A.4 we analyze Algorithms 1 and 2, respectively, and provide the proofs of Theorems 5.1 and 5.2. Finally, in paragraph A.5 we provide the proof of Theorem 5.3 regarding the stability of Algorithm 1.

A.1 General Lemmas

In the following two lemmas we establish the regularization properties of Tikhonov regularization and gradient descent to the minimal norm interpolating solution, for general losses.

Lemma A.1 (Tikhonov).

Let $V:\mathbb{R}^d\to\mathbb{R}_+$ be a continuous and convex function. For all $\lambda>0$, consider:

\[
w_\lambda = \operatorname*{arg\,min}_{w\in\mathbb{R}^d} V(w)+\lambda\lVert w\rVert^2. \tag{A.1}
\]

  1. If $\operatorname*{arg\,min} V\neq\emptyset$, then

    \[
    \lim_{\lambda\to 0} w_\lambda = w^\dagger := \operatorname*{arg\,min}\{\lVert w\rVert ~:~ w\in\operatorname*{arg\,min} V\}.
    \]

  2. If $\operatorname*{arg\,min} V=\emptyset$, then $\lVert w_\lambda\rVert\to+\infty$ as $\lambda\to 0$.

Proof.

For the first point, let $w^\ast\in\operatorname*{arg\,min}V$ and $J_\lambda(w)=V(w)+\lambda\lVert w\rVert^2$. By the definition of $w_\lambda$ and $w^\ast$,

\[
J_\lambda(w_\lambda)\leq J_\lambda(w^\ast)\leq V(w_\lambda)+\lambda\lVert w^\ast\rVert^2,
\]

yielding

\[
\lVert w_\lambda\rVert\leq\lVert w^\ast\rVert, \tag{A.2}
\]

which allows us to deduce that the family $\{w_\lambda\}_{\lambda>0}$ is uniformly bounded. This implies that (up to a subsequence) $w_\lambda$ converges to an element $\bar{w}$. By lower semi-continuity of $V$, we have:

\[
V(\bar{w})\leq\liminf V(w_\lambda)\leq\liminf J_\lambda(w_\lambda)\leq\liminf J_\lambda(w^\ast)=V(w^\ast),
\]

which shows that $\bar{w}\in\operatorname*{arg\,min}V$. In addition, from the (weak) lower semi-continuity of the norm and relation (A.2), we have:

\[
\lVert\bar{w}\rVert\leq\liminf\lVert w_\lambda\rVert\leq\lVert w^\ast\rVert,
\]

and since the last inequality holds for an arbitrary $w^\ast\in\operatorname*{arg\,min}V$, we deduce that $\bar{w}=w^\dagger$. Since the family is bounded and every convergent subsequence has the same limit $w^\dagger$, the whole family $w_\lambda$ converges to $w^\dagger$ as $\lambda\to 0$.

The second point can be proven by contraposition. Let us assume the existence of some $M>0$ such that $\lVert w_\lambda\rVert\leq M$ along some sequence $\lambda\to 0$. By following the arguments of the proof of the first point, we deduce the existence of a limit point $\bar{w}\in\operatorname*{arg\,min}V$ of this sequence, so that $\operatorname*{arg\,min}V\neq\emptyset$, which allows us to conclude. ∎
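The first point of Lemma A.1 can be checked numerically on a small example. The sketch below is an illustration only: it assumes the particular choice $V(w)=\lVert Xw-y\rVert^2$ with an underdetermined $X$ (the matrix, the vector and the function names are made up for this purpose), for which the Tikhonov solution has a closed form and the minimal norm minimizer is given by the pseudo-inverse.

```python
import numpy as np

# Small underdetermined system (n = 2 < d = 4); the numbers are illustrative only.
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, -1.0]])
y = np.array([1.0, 2.0])


def tikhonov_solution(lam):
    """Minimizer of ||Xw - y||^2 + lam * ||w||^2, via the normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


# Minimal norm minimizer of V(w) = ||Xw - y||^2 (pseudo-inverse solution).
w_dagger = np.linalg.pinv(X) @ y


def distance_to_min_norm(lams):
    """Distances ||w_lambda - w_dagger|| for each lambda in lams."""
    return [float(np.linalg.norm(tikhonov_solution(lam) - w_dagger)) for lam in lams]


if __name__ == "__main__":
    for lam in [1.0, 1e-2, 1e-4, 1e-6]:
        w_lam = tikhonov_solution(lam)
        # ||w_lambda|| stays below ||w_dagger||, as in (A.2), and the gap shrinks.
        print(lam, np.linalg.norm(w_lam), np.linalg.norm(w_lam - w_dagger))
```

In this example the distance to $w^\dagger$ decreases as $\lambda$ decreases, while the norm of $w_\lambda$ never exceeds that of $w^\dagger$, in agreement with the bound (A.2).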

Lemma A.2 (Gradient descent).

Let $X\in\mathbb{R}^{n\times d}$ be a linear operator, with $d>n$, $y\in\operatorname{Im}(X)$ and $\mathcal{L}:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}_+$ be a proper, convex function with $L$-Lipschitz gradient, such that

\[
(\forall (u,v)\in\mathbb{R}^n\times\mathbb{R}^n)\quad \mathcal{L}(u,v)=0 \Leftrightarrow u=v. \tag{A.3}
\]

Let also $w_0\in\operatorname{Im}(X^\top)$, $0<\gamma\leq 1/(L\lVert X\rVert_{op}^2)$ and consider the gradient iteration on the first argument of $\mathcal{L}\circ X$, i.e.

\[
w_{t+1}=w_t-\gamma X^\top\nabla\mathcal{L}(Xw_t,y). \tag{A.4}
\]

Then

\[
\lim_{t\to+\infty}w_t=w^\dagger:=\operatorname*{arg\,min}\{\lVert w\rVert ~:~ Xw=y\}. \tag{A.5}
\]
Proof.

First of all, since $d>n$ and $y\in\operatorname{Im}(X)$, the set of minimizers of $\mathcal{L}(X\cdot,y)$ is non-empty (i.e. $\exists w\in\mathbb{R}^d$ such that $Xw=y$). By the standard gradient descent analysis (see for example [67, Paragraph 2.1.5]), it holds that $w_t\to\bar{w}\in\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)$ as $t\to\infty$.

In addition, since $w_0\in\operatorname{Im}(X^\top)$, by using (A.4) and an induction argument, we have that $\{w_t\}_{t\geq 0}\subset\operatorname{Im}(X^\top)$. By closedness of $\operatorname{Im}(X^\top)$, it follows that $\bar{w}\in\operatorname{Im}(X^\top)$. By definition of $\mathcal{L}$, the set of minimizers $\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)=\{w ~:~ Xw=y\}$ is an affine space, hence it can be expressed as $\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)=\operatorname{proj}_{\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)}(0)+V$, for some suitable linear space $V$, where $\operatorname{proj}_{\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)}(0)$ denotes the projection of the $0$ element onto $\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)$ and is equal to $w^\dagger$, by definition.

Let us show that $V=\ker(X)$. Indeed, for any $w\in\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)$, we have $w=w^\dagger+v$ with $v\in V$, and $y=Xw=Xw^\dagger+Xv=y+Xv$, which gives $V\subset\ker(X)$. On the other hand, if $u\in\ker(X)$ and $w'=w^\dagger+u$, we have that $Xw'=Xw^\dagger=y$, which allows us to conclude that $\operatorname*{arg\,min}\mathcal{L}(X\cdot,y)=\{w^\dagger\}+\ker(X)$. By definition of the minimal norm solution, it holds that $w^\dagger\in(\ker X)^\perp$ (see [31, Proposition 2.3 and Theorem 2.5]). By combining the facts $\bar{w}=w^\dagger+u$ with $u\in\ker X$, $\bar{w}\in\operatorname{Im}(X^\top)=(\ker X)^\perp$ and $w^\dagger\in(\ker X)^\perp$, we get $u=\bar{w}-w^\dagger\in\ker X\cap(\ker X)^\perp=\{0\}$, and it follows that $\bar{w}=w^\dagger$. ∎
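Lemma A.2 can be illustrated with a short numerical sketch. It assumes the quadratic loss $\mathcal{L}(u,v)=\tfrac{1}{2}\lVert u-v\rVert^2$ (so that $L=1$ and (A.3) holds) and the stepsize $\gamma=1/\lVert X\rVert_{op}^2$; the data and the function name are illustrative choices, not taken from the experiments of the paper.

```python
import numpy as np

# Illustrative underdetermined system (n = 2 < d = 4).
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, -1.0]])
y = np.array([1.0, 2.0])

w_dagger = np.linalg.pinv(X) @ y  # minimal norm solution of Xw = y


def gradient_descent(w0, n_iter=500):
    """Iteration (A.4) for the assumed loss L(u, v) = 0.5 * ||u - v||^2."""
    gamma = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (L * ||X||_op^2) with L = 1
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        w = w - gamma * X.T @ (X @ w - y)
    return w


if __name__ == "__main__":
    # w_0 = 0 lies in Im(X^T): the limit is the minimal norm interpolant.
    print(np.linalg.norm(gradient_descent(np.zeros(4)) - w_dagger))

    # v lies in ker(X): starting there, the kernel component is never
    # removed, and the iterates converge to the interpolant w_dagger + v.
    v = np.array([-1.0, 0.0, 1.0, 1.0])
    print(np.linalg.norm(gradient_descent(v) - (w_dagger + v)))
```

The second run shows why the assumption $w_0\in\operatorname{Im}(X^\top)$ matters: the gradient steps only move the iterates within $\operatorname{Im}(X^\top)$, so any component of $w_0$ in $\ker X$ survives in the limit.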

The next lemma establishes the equivalence between the max-margin problem (MM) and the min-norm problem (MN). In fact, the result holds true for general positively $1$-homogeneous features, i.e. by replacing $M(w)=\min_{i=1,\dots,n}y_i\langle w,x_i\rangle$ in (MM) and (MN) by the general margin:

\[
M(w)=\min_{i=1,\dots,n}y_ih_i(w). \tag{A.6}
\]
Lemma A.3 (Max margin & min norm).

Let $h_i:\mathbb{R}^d\to\mathbb{R}$ be a family of positively $1$-homogeneous functions for all $i\leq n$ and consider the following optimization problems:

\[
w_+=\operatorname*{arg\,max}\{M(w)=\min_{i=1,\dots,n}y_ih_i(w) ~:~ \lVert w\rVert=1\} \tag{A.7}
\]
\[
w_\ast=\operatorname*{arg\,min}\{\lVert w\rVert ~:~ M(w)=\min_{i=1,\dots,n}y_ih_i(w)\geq 1\}. \tag{A.8}
\]

Then Problem (A.7) is equivalent to Problem (A.8). In particular, if $w_\ast$ is the solution of Problem (A.8) then $w_+=w_\ast/\lVert w_\ast\rVert$ is a solution of Problem (A.7). Further, if $w_+$ is the solution of Problem (A.7) then $w_\ast=\frac{w_+}{M(w_+)}$ is a solution of Problem (A.8).

Proof.

First of all, from the definition of $M$, see (A.6), we have that for all $w\in\mathbb{R}^d\setminus\{0\}$

\[
M(w)=\max_{\gamma>0}\gamma\qquad\text{s.t. }~\lVert w\rVert=1~~\&\quad y_ih_i(w)\geq\gamma,\qquad i=1,\dots,n. \tag{A.9}
\]

By making the change of variable $w=\frac{w'}{\lVert w'\rVert}$ and taking into consideration (A.9), we can rewrite the max margin problem related to (A.7) as

\[
\max_{w'\in\mathbb{R}^d\setminus\{0\},\,\gamma>0}\gamma\qquad\text{s.t.}\qquad y_ih_i(w'/\lVert w'\rVert)\geq\gamma,\qquad i=1,\dots,n,
\]

and using the homogeneity property of $h_i$,

\[
\max_{w'\in\mathbb{R}^d\setminus\{0\},\,\gamma>0}\gamma\qquad\text{s.t.}\qquad y_i\frac{1}{\lVert w'\rVert}h_i(w')\geq\gamma,\qquad i=1,\dots,n. \tag{A.10}
\]

Then setting γ~=γw~𝛾𝛾delimited-∥∥superscript𝑤\tilde{\gamma}=\gamma\left\lVert{w^{\prime}}\right\rVertover~ start_ARG italic_γ end_ARG = italic_γ ∥ italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥, (A.10) can be equivalently written as

maxwd{0},γ~>0γ~ws.t.yihi(w)γ~,i=1,,n.formulae-sequencesubscriptformulae-sequencesuperscript𝑤superscript𝑑0~𝛾0~𝛾delimited-∥∥superscript𝑤s.t.subscript𝑦𝑖subscript𝑖superscript𝑤~𝛾𝑖1𝑛\max_{w^{\prime}\in\mathbb{R}^{d}\setminus\{0\},\tilde{\gamma}>0}\frac{\tilde{% \gamma}}{\left\lVert{w^{\prime}}\right\rVert}\qquad\text{s.t.}\qquad y_{i}h_{i% }(w^{\prime})\geq\tilde{\gamma},\qquad i=1,\dots,n.roman_max start_POSTSUBSCRIPT italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ roman_ℝ start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ∖ { 0 } , over~ start_ARG italic_γ end_ARG > 0 end_POSTSUBSCRIPT divide start_ARG over~ start_ARG italic_γ end_ARG end_ARG start_ARG ∥ italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ end_ARG s.t. italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ≥ over~ start_ARG italic_γ end_ARG , italic_i = 1 , … , italic_n . (A.11)

Since the above problem is still scale invariant, by letting $v = \frac{w'}{\tilde{\gamma}}$, (A.11) is equivalent to

$$\max_{v \in \mathbb{R}^{d} \setminus \{0\}} \frac{1}{\lVert v \rVert} \qquad \text{s.t.} \qquad y_{i} h_{i}(v) \geq 1, \qquad i = 1, \dots, n,$$

which is equivalent to the min-norm problem related to (A.8) (in the sense that they have the same set of solutions). Taking into account all the changes of variables, it follows that $w_{+} = \frac{w_{\ast}}{\lVert w_{\ast} \rVert}$ and $w_{\ast} = \frac{w_{+}}{M(w_{+})}$, and also that $M(w_{\ast}) = 1$ and $M(w_{+}) = \frac{1}{\lVert w_{\ast} \rVert}$. ∎
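The relations between $w_{+}$ and $w_{\ast}$ can be checked by hand on a tiny example. The sketch below is only an illustration under stated assumptions: the data `Z` (rows $z_i = y_i x_i$) are hypothetical, the margin is taken to be $M(w) = \min_i \langle w, z_i \rangle$ with linear $h_i$, and the min-norm solution `w_star` is computed by hand for this data rather than by a solver.

```python
import math

# Hypothetical toy data: rows z_i = y_i * x_i of a separable problem in R^2.
Z = [(2.0, 0.0), (0.0, 2.0)]

def margin(w):
    """M(w) = min_i <w, z_i> (assumed definition of the margin)."""
    return min(w[0] * z[0] + w[1] * z[1] for z in Z)

def norm(w):
    return math.hypot(w[0], w[1])

# Min-norm separating solution for this data, found by hand:
# both constraints 2*w1 >= 1 and 2*w2 >= 1 are active.
w_star = (0.5, 0.5)

# Normalized direction w_+ = w_* / ||w_*||.
w_plus = tuple(c / norm(w_star) for c in w_star)

# Brute-force the max-margin problem over unit vectors (cos a, sin a).
angles = [2.0 * math.pi * k / 3600 for k in range(3600)]
best_margin = max(margin((math.cos(a), math.sin(a))) for a in angles)
```

On this example the grid search recovers the margin of `w_plus`, which equals `1 / norm(w_star)`, and rescaling `w_plus` by its margin gives back `w_star`.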

The following lemma bounds the angle gap and the margin gap in terms of the distance between a method's iterates and the hard-margin solution $w_{\ast}$.

Lemma A.4 (Lemma 2 [63]: Bounds for angle and margin).

Let $\delta > 0$ and $c > 0$, and let $w_{\ast} \neq 0$ be the min-norm solution as defined in (MN). Let $\{w_{t}\}_{t \geq 1}$ be a sequence such that $\lVert w_{t} \rVert \geq \delta$. Then the following estimates hold:

$$1 - \frac{\langle w_{t}, w_{\ast} \rangle}{\lVert w_{t} \rVert \lVert w_{\ast} \rVert} \leq \frac{1}{2 \delta \lVert w_{\ast} \rVert} \lVert w_{t} - w_{\ast} \rVert^{2} \tag{A.12}$$

$$M\left(\frac{w_{\ast}}{\lVert w_{\ast} \rVert}\right) - M\left(\frac{w_{t}}{\lVert w_{t} \rVert}\right) \leq \frac{\lVert X \rVert_{F}}{\delta \lVert w_{\ast} \rVert} \lVert w_{t} - w_{\ast} \rVert \tag{A.13}$$

Proof.

Since $\lVert w_{t} \rVert \geq \delta$, by using the identity

$$\lVert u - v \rVert^{2} = \lVert u \rVert^{2} + \lVert v \rVert^{2} - 2 \langle u, v \rangle \quad \forall u, v \in \mathbb{R}^{d} \tag{A.14}$$

we find:

$$\begin{aligned} 1 - \frac{\langle w_{t}, w_{\ast} \rangle}{\lVert w_{t} \rVert \lVert w_{\ast} \rVert} &= \frac{2 \lVert w_{t} \rVert \lVert w_{\ast} \rVert + \lVert w_{t} - w_{\ast} \rVert^{2} - \lVert w_{t} \rVert^{2} - \lVert w_{\ast} \rVert^{2}}{2 \lVert w_{t} \rVert \lVert w_{\ast} \rVert} = \frac{\lVert w_{t} - w_{\ast} \rVert^{2} - \big(\lVert w_{t} \rVert - \lVert w_{\ast} \rVert\big)^{2}}{2 \lVert w_{t} \rVert \lVert w_{\ast} \rVert} \\ &\leq \frac{1}{2 \delta \lVert w_{\ast} \rVert} \lVert w_{t} - w_{\ast} \rVert^{2} \end{aligned} \tag{A.15}$$

which proves the first estimate (A.12).

Finally, if $j = \operatorname{arg\,min}_{i \leq n} \{ y_{i} \langle w_{t}, x_{i} \rangle \}$, then by definition of the margin $M$ and of $w_{\ast}$, since $M(w_{\ast}) = 1$ (see Lemma 3.1), we obtain:

$$\begin{aligned} M\left(\frac{w_{\ast}}{\lVert w_{\ast} \rVert}\right) - M\left(\frac{w_{t}}{\lVert w_{t} \rVert}\right) &= \frac{1}{\lVert w_{\ast} \rVert} - \frac{y_{j} \langle w_{t}, x_{j} \rangle}{\lVert w_{t} \rVert} \leq \frac{y_{j} \langle w_{\ast}, x_{j} \rangle}{\lVert w_{\ast} \rVert} - \frac{y_{j} \langle w_{t}, x_{j} \rangle}{\lVert w_{t} \rVert} = y_{j} \left\langle \frac{w_{\ast}}{\lVert w_{\ast} \rVert} - \frac{w_{t}}{\lVert w_{t} \rVert}, x_{j} \right\rangle \\ &\leq \left\lVert \frac{w_{\ast}}{\lVert w_{\ast} \rVert} - \frac{w_{t}}{\lVert w_{t} \rVert} \right\rVert \lVert x_{j} \rVert \leq \lVert X \rVert_{F} \left\lVert \frac{w_{\ast}}{\lVert w_{\ast} \rVert} - \frac{w_{t}}{\lVert w_{t} \rVert} \right\rVert \end{aligned} \tag{A.16}$$

and since

$$\left\lVert \frac{w_{t}}{\lVert w_{t} \rVert} - \frac{w_{\ast}}{\lVert w_{\ast} \rVert} \right\rVert^{2} = 1 + 1 - 2 \frac{\langle w_{t}, w_{\ast} \rangle}{\lVert w_{t} \rVert \lVert w_{\ast} \rVert} = 2 \bigg(1 - \frac{\langle w_{t}, w_{\ast} \rangle}{\lVert w_{t} \rVert \lVert w_{\ast} \rVert}\bigg) \tag{A.17}$$

the conclusion follows by combining (A.15) and (A.16). ∎
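The angle estimate (A.12) and the identity (A.17) are easy to test numerically. The following is a minimal sketch, not part of the original argument: it draws random Gaussian vectors in place of $w_{t}$ and $w_{\ast}$ (an assumption made only for illustration, since both relations hold for arbitrary nonzero vectors) and uses the largest admissible $\delta = \lVert w_{t} \rVert$. All function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def angle_gap(w_t, w_star):
    """Left-hand side of (A.12): 1 - cos(angle between w_t and w_star)."""
    return 1.0 - dot(w_t, w_star) / (norm(w_t) * norm(w_star))

def angle_bound(w_t, w_star, delta):
    """Right-hand side of (A.12)."""
    diff = [a - b for a, b in zip(w_t, w_star)]
    return dot(diff, diff) / (2.0 * delta * norm(w_star))

def normalized_dist_sq(w_t, w_star):
    """Left-hand side of (A.17): squared distance between the unit vectors."""
    n_t, n_s = norm(w_t), norm(w_star)
    diff = [a / n_t - b / n_s for a, b in zip(w_t, w_star)]
    return dot(diff, diff)

def estimates_hold(num_trials=1000, dim=5, seed=0, tol=1e-9):
    """Check (A.12) and (A.17) on random pairs of vectors."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        w_star = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        w_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        delta = norm(w_t)  # largest delta allowed by ||w_t|| >= delta
        gap = angle_gap(w_t, w_star)
        if gap > angle_bound(w_t, w_star, delta) + tol:
            return False
        if abs(normalized_dist_sq(w_t, w_star) - 2.0 * gap) > tol:
            return False
    return True
```

Choosing `delta = norm(w_t)` is the tightest case of (A.12); any smaller `delta` only enlarges the right-hand side.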

The next lemma is a general descent lemma that is classically used for proximal gradient methods applied to structured composite convex optimization problems, such as (4.9). The interested reader can find a proof in [13, Lemma 2.3].

Lemma A.5.

Let $F = f + g : \mathbb{R}^{n} \longrightarrow \mathbb{R}$, where $f$ and $g$ are convex lower semicontinuous functions and $f$ is continuously differentiable with $L$-Lipschitz gradient. For all $0 < \gamma \leq \frac{1}{L}$, consider the operator $T_{\gamma} : \mathbb{R}^{n} \longrightarrow \mathbb{R}^{n}$ such that $T_{\gamma}(x) := \operatorname{prox}_{\gamma g}\big(x - \gamma \nabla f(x)\big)$. Then for all $(x, y) \in (\mathbb{R}^{n})^{2}$ it holds:

$$2 \gamma \big(F(T_{\gamma}(y)) - F(x)\big) \leq \lVert y - x \rVert^{2} - \lVert T_{\gamma}(y) - x \rVert^{2} \tag{A.18}$$
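As an illustration of (A.18), the sketch below checks the inequality numerically on a problem chosen only for this example: $f(x) = \frac{1}{2} \lVert Ax - b \rVert^{2}$ and $g = \mu \lVert \cdot \rVert_{1}$, whose proximal operator is soft-thresholding. The data `A`, `b`, the weight `mu` and all function names are hypothetical, and the step size uses $\lVert A \rVert_{F}^{2}$ as an upper bound on the Lipschitz constant $L$.

```python
import random

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def sq_norm(v):
    return sum(c * c for c in v)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied componentwise."""
    return [max(abs(c) - tau, 0.0) * (1.0 if c > 0 else -1.0) for c in v]

def descent_inequality_holds(num_trials=500, seed=0, mu=0.3, tol=1e-9):
    """Check (A.18) for f = 0.5*||Ax - b||^2 and g = mu*||x||_1."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
    b = [rng.gauss(0.0, 1.0) for _ in range(4)]
    # ||A||_F^2 upper-bounds the Lipschitz constant L of grad f,
    # so this gamma satisfies gamma <= 1/L.
    gamma = 1.0 / sum(sq_norm(row) for row in A)

    def F(x):
        residual = [r - bi for r, bi in zip(matvec(A, x), b)]
        return 0.5 * sq_norm(residual) + mu * sum(abs(c) for c in x)

    def T(y):
        # One forward-backward step: prox_{gamma g}(y - gamma * grad f(y)).
        residual = [r - bi for r, bi in zip(matvec(A, y), b)]
        grad = rmatvec(A, residual)
        return soft_threshold([yi - gamma * gi for yi, gi in zip(y, grad)], gamma * mu)

    for _ in range(num_trials):
        x = [rng.gauss(0.0, 2.0) for _ in range(3)]
        y = [rng.gauss(0.0, 2.0) for _ in range(3)]
        Ty = T(y)
        lhs = 2.0 * gamma * (F(Ty) - F(x))
        rhs = sq_norm([a - c for a, c in zip(y, x)]) - sq_norm([a - c for a, c in zip(Ty, x)])
        if lhs > rhs + tol:
            return False
    return True
```

Note that `x` is an arbitrary comparison point, not necessarily a minimizer, which is what makes the lemma usable for both convergence and rate arguments.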

A.2 Preliminary results

In this paragraph, we state some basic facts concerning the properties of the dual regularized problem (4.9), necessary for the analysis of Algorithms 1 and 2.

We first recall the objective function associated to the dual penalized hinge loss problem (4.9) (see also (4.10)), where we take the regularization parameter to be given by a sequence $\{\lambda_{t}\}_{t}$:

$$D_{t}(u) = \frac{\lVert Z^{\top} u \rVert^{2}}{2} + \frac{1}{\lambda_{t}} \mathcal{L}^{\ast}(\lambda_{t} u) = \frac{1}{2} \lVert Z^{\top} u \rVert^{2} + \sum_{i=1}^{n} u^{i} + \iota_{[-1,0]^{n}}(\lambda_{t} u), \tag{A.19}$$

and the dual problem of (3.4) (see (4.11)), that is

$$D_{\infty}(u) = \frac{1}{2} \lVert Z^{\top} u \rVert^{2} + \sum_{i=1}^{n} u^{i} + \iota_{(-\infty,0]^{n}}(u). \tag{A.20}$$

Below we state some fundamental properties of the dual objective function $D_{t}$ that will be useful for the convergence analysis of both Algorithms 1 and 2. We start by showing that the sequence of regularized dual functions $D_{t}$ is monotonically pointwise decreasing to $D_{\infty}$.

Lemma A.6.

Let $\{\lambda_{t}\}_{t \geq 0}$ be a sequence of positive parameters decreasing to zero, and consider the functions $D_{t}$ and $D_{\infty}$ defined in (A.19) and (A.20) respectively. Then, for all $u \in \mathbb{R}^{n}$ the sequence $\{D_{t}(u)\}_{t \geq 0}$ is non-increasing. In addition, for any $u \in \mathbb{R}^{n}$, it holds

$$D_{t}(u) \underset{t \to \infty}{\longrightarrow} D_{\infty}(u). \tag{A.21}$$

Proof of Lemma A.6.

Since $\lambda_{t}$ is non-increasing, for every $u \in \mathbb{R}^{n}$ it follows that the function $t \mapsto \frac{1}{\lambda_{t}} \mathcal{L}^{\ast}(\lambda_{t} u) = \sum_{i=1}^{n} u^{i} + \iota_{[-1,0]}(\lambda_{t} u^{i})$ is non-increasing in $[0, +\infty]$.

In addition, by direct computation for all un𝑢superscript𝑛u\in\mathbb{R}^{n}italic_u ∈ roman_ℝ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, it holds:

$$\frac{1}{\lambda_{t}} \mathcal{L}^{\ast}(\lambda_{t} u) = \sum_{i=1}^{n} u^{i} + \iota_{[-1,0]^{n}}(\lambda_{t} u) \underset{t \to \infty}{\longrightarrow} \sum_{i=1}^{n} u^{i} + \iota_{(-\infty,0]^{n}}(u) \tag{A.22}$$

Thus, for all $u \in \mathbb{R}^{n}$, the function $D_{t}(u)$ is non-increasing in $t \geq 0$ and $D_{t}(u) \underset{t \to \infty}{\longrightarrow} D_{\infty}(u)$.
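The monotone behaviour described in Lemma A.6 can be observed on a small instance. The sketch below is only illustrative: the matrix `Z`, the point `u` and the schedule $\lambda_{t} = 1/(t+1)$ are hypothetical choices, and the indicator functions are encoded by returning `math.inf` outside the feasible box.

```python
import math

# Hypothetical data matrix Z with rows z_i = y_i * x_i.
Z = [(1.0, 0.0), (2.0, 1.0)]

def smooth_part(u):
    """0.5 * ||Z^T u||^2 + sum_i u_i, the part shared by D_t and D_inf."""
    w = [sum(Z[i][j] * u[i] for i in range(len(Z))) for j in range(len(Z[0]))]
    return 0.5 * sum(c * c for c in w) + sum(u)

def D_t(u, lam):
    """Regularized dual objective (A.19) for regularization parameter lam."""
    feasible = all(-1.0 <= lam * ui <= 0.0 for ui in u)
    return smooth_part(u) if feasible else math.inf

def D_inf(u):
    """Limit dual objective (A.20)."""
    return smooth_part(u) if all(ui <= 0.0 for ui in u) else math.inf

# Assumed schedule: positive parameters decreasing to zero.
lambdas = [1.0 / (t + 1) for t in range(50)]
u = [-2.5, -0.5]
values = [D_t(u, lam) for lam in lambdas]
```

For this `u` the first values are infinite, because `lam * u` leaves the box $[-1,0]^{n}$, and once $\lambda_{t} \leq 0.4$ the sequence is constant and equal to `D_inf(u)`.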

The following lemma is well known (see e.g. [83, Proposition 94]), and we recall it here for the sake of completeness. It plays a fundamental role in our analysis, since it bounds the distance of the primal iterates from the min-norm solution (MN) in terms of the gap between the dual objective function and its minimum.

Lemma A.7.

Let $w_{\ast}$ be the minimal norm separating solution defined in (MN), and $D_{\infty}$ the associated dual problem defined in (A.20). Then

$$\operatorname{arg\,min} D_{\infty} \neq \emptyset. \tag{A.23}$$

In addition, for every $u \in \mathbb{R}^{n}$ with $w = -Z^{\top} u$ and every $u_{\ast} \in \operatorname{arg\,min} D_{\infty}$, we have

$$\frac{1}{2} \lVert w - w_{\ast} \rVert^{2} \leq D_{\infty}(u) - D_{\infty}(u_{\ast}). \tag{A.24}$$

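Before turning to the proof, the bound (A.24) can be illustrated on a toy problem where everything is known in closed form. This is a sketch under explicit assumptions: the matrix `Z` is hypothetical, and `w_star` and `u_star` were computed by hand for this data (the constraints are $w_1 \geq 1$ and $2 w_1 + w_2 \geq 1$, so only the first one is active).

```python
import math
import random

# Hypothetical data matrix Z with rows z_i = y_i * x_i.
Z = [(1.0, 0.0), (2.0, 1.0)]

# Hand-computed min-norm solution and a dual minimizer for this Z.
w_star = (1.0, 0.0)
u_star = (-1.0, 0.0)

def primal_from_dual(u):
    """w = -Z^T u."""
    return [-sum(Z[i][j] * u[i] for i in range(len(Z))) for j in range(len(Z[0]))]

def D_inf(u):
    """Dual objective (A.20); math.inf encodes the indicator function."""
    if any(ui > 0.0 for ui in u):
        return math.inf
    w = primal_from_dual(u)
    return 0.5 * sum(c * c for c in w) + sum(u)

def gap_bound_holds(num_trials=1000, seed=0, tol=1e-9):
    """Check (A.24) on random dual-feasible points u <= 0."""
    rng = random.Random(seed)
    d_star = D_inf(u_star)
    for _ in range(num_trials):
        u = [-rng.uniform(0.0, 3.0) for _ in Z]
        w = primal_from_dual(u)
        lhs = 0.5 * sum((a - b) ** 2 for a, b in zip(w, w_star))
        if lhs > D_inf(u) - d_star + tol:
            return False
    return True
```

Points `u` with a positive entry are not sampled, since there `D_inf(u)` is infinite and the bound holds trivially.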
Proof of Lemma A.7.

From the separability assumption (1) there exists some $\tilde{w} \in \mathbb{R}^{d}$ such that $\langle \tilde{w}, z_{i} \rangle > 0$ for all $i \leq n$. Let $i_{min} = \operatorname{arg\,min} \{ \langle \tilde{w}, z_{i} \rangle : i \leq n \}$ and let $M(\tilde{w}) = \langle \tilde{w}, z_{i_{min}} \rangle > 0$. Then, by setting $w' = \frac{2 \tilde{w}}{M(\tilde{w})}$, we have $\langle w', z_{i} \rangle = \frac{\langle 2 \tilde{w}, z_{i} \rangle}{M(\tilde{w})} \geq 2 > 1$ for all $i \leq n$. Thus we deduce the existence of an element $w' \in \operatorname{dom} \frac{\lVert \cdot \rVert^{2}}{2} = \mathbb{R}^{d}$ such that $\mathcal{L}$ is continuous at $X w'$ (since $\mathcal{L}$ is continuous and $w' \in \operatorname{dom} \mathcal{L}$). By [73, Corollary 3.31] and the optimality condition for the min-norm separating solution $w_{\ast}$, the sum rule for the subdifferential holds and thus we have

$$0 \in \partial \left( \frac{\lVert \cdot \rVert^{2}}{2} + \iota_{[1,+\infty)^{n}} \circ Z \right)(w_{\ast}) = \nabla \left( \frac{\lVert \cdot \rVert^{2}}{2} \right)(w_{\ast}) + Z^{\top} \partial \iota_{[1,+\infty)^{n}}(Z w_{\ast}) = \{w_{\ast}\} + Z^{\top} \partial \iota_{[1,+\infty)^{n}}(Z w_{\ast}) \tag{A.25}$$

From (A.25), there exists some $u_{\ast} \in \partial \iota_{[1,+\infty)^{n}}(Z w_{\ast})$ such that $w_{\ast} = -Z^{\top} u_{\ast}$, or equivalently $Z w_{\ast} \in \partial \iota^{\ast}_{[1,+\infty)^{n}}(u_{\ast})$. Hence, by replacing $w_{\ast} = -Z^{\top} u_{\ast}$, we obtain:

\[
\begin{aligned}
& -ZZ^{\top}u_{\ast} \in \partial\iota^{\ast}_{[1,+\infty)^{n}}(u_{\ast}) \\
\Leftrightarrow\;& 0 \in -Z\nabla\left(\frac{\lVert\cdot\rVert^{2}}{2}\right)(-Z^{\top}u_{\ast}) + \partial\iota^{\ast}_{[1,+\infty)^{n}}(u_{\ast}) \subset \partial\left(\frac{\lVert -Z^{\top}\cdot\rVert^{2}}{2} + \iota^{\ast}_{[1,+\infty)^{n}}(\cdot)\right)(u_{\ast})
\end{aligned}
\tag{A.26}
\]

which shows that $u_{\ast} \in \operatorname*{arg\,min} D_{\infty}$. In order to prove (A.24), let $u \in \mathbb{R}^{n}$ and $w = -Z^{\top}u$.

Since $Zw_{\ast} \in \partial\iota^{\ast}_{[1,+\infty)^{n}}(u_{\ast})$ and $-Z^{\top}u = w$, by using the Fenchel-Young equality for $\frac{\lVert -Z^{\top}u_{\ast}\rVert^{2}}{2}$ and $\frac{\lVert -Z^{\top}u\rVert^{2}}{2}$, we find:

\[
\begin{aligned}
D_{\infty}(u) - D_{\infty}(u_{\ast}) &= \frac{\lVert -Z^{\top}u\rVert^{2}}{2} - \frac{\lVert -Z^{\top}u_{\ast}\rVert^{2}}{2} + \iota^{\ast}_{[1,+\infty)^{n}}(u) - \iota^{\ast}_{[1,+\infty)^{n}}(u_{\ast}) \\
&= \langle -Z^{\top}u, w\rangle - \frac{\lVert w\rVert^{2}}{2} + \langle Z^{\top}u_{\ast}, w_{\ast}\rangle + \frac{\lVert w_{\ast}\rVert^{2}}{2} + \iota^{\ast}_{[1,+\infty)^{n}}(u) - \iota^{\ast}_{[1,+\infty)^{n}}(u_{\ast}) \\
&= \frac{\lVert w_{\ast}\rVert^{2}}{2} - \frac{\lVert w\rVert^{2}}{2} - \langle -Z^{\top}u, w_{\ast} - w\rangle \\
&\quad + \iota^{\ast}_{[1,+\infty)^{n}}(u) - \iota^{\ast}_{[1,+\infty)^{n}}(u_{\ast}) - \langle Zw_{\ast}, u - u_{\ast}\rangle
\end{aligned}
\tag{A.27}
\]

By using the strong convexity of $\frac{\lVert\cdot\rVert^{2}}{2}$ (with parameter $1$) and the convexity of $\iota^{\ast}_{[1,+\infty)^{n}}(\cdot)$, we conclude that:

\[ D_{\infty}(u) - D_{\infty}(u_{\ast}) \geq \frac{\lVert w - w_{\ast}\rVert^{2}}{2} \tag{A.28} \]
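The bound (A.28) can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrix `Z` is random synthetic data (with $d > n$ so that $Zw \geq 1$ is feasible), the sizes and seed are hypothetical, and $u_{\ast}$ is approximated by a plain projected gradient loop rather than by the paper's algorithm. It uses the standard fact that the support function $\iota^{\ast}_{[1,+\infty)^{n}}(u)$ equals $\langle \mathbf{1}, u\rangle$ when $u \leq 0$ and $+\infty$ otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # hypothetical sizes; d > n makes Zw >= 1 feasible
Z = rng.standard_normal((n, d))   # synthetic stand-in for Z = diag(Y) X

def D_inf(u):
    """Limit dual objective on its domain u <= 0: 0.5*||Z^T u||^2 + <1, u>."""
    return 0.5 * np.sum((Z.T @ u) ** 2) + np.sum(u)

# Approximate u_* by projected gradient descent on {u <= 0} (assumed solver).
gamma = 1.0 / np.linalg.norm(Z @ Z.T, 2)   # step 1/L with L = ||Z Z^T||_op
u_star = np.zeros(n)
for _ in range(5000):
    grad = Z @ (Z.T @ u_star) + 1.0
    u_star = np.minimum(u_star - gamma * grad, 0.0)
w_star = -Z.T @ u_star

# Compare both sides of (A.28) at random dual points with u <= 0.
gaps = []
for _ in range(100):
    u = -rng.random(n)
    w = -Z.T @ u
    lhs = D_inf(u) - D_inf(u_star)
    rhs = 0.5 * np.sum((w - w_star) ** 2)
    gaps.append(lhs - rhs)
min_gap = min(gaps)
print("smallest value of lhs - rhs:", min_gap)
```

Up to the accuracy of the approximate $u_{\ast}$, the printed gap should be non-negative, and `Z @ w_star` should satisfy the primal constraint $Zw_{\ast} \geq 1$.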

As a consequence of Lemma A.6, the following lemma provides some basic estimates for the sequence $\{D_{t}(u_{t+1})\}_{t\geq 0}$ generated by Algorithm 1.

Lemma A.8.

Let $u_{\ast} \in \operatorname*{arg\,min} D_{\infty}$ and let $\{u_{t}\}_{t\geq 0}$ be the sequence generated by Algorithm 1. Then the following estimate holds for all $u \in \mathbb{R}^{n}$:

\[ D_{t}(u_{t+1}) - D_{\infty}(u_{\ast}) + \frac{1}{2\gamma}\lVert u_{t+1} - u\rVert^{2} \leq D_{t}(u) - D_{\infty}(u_{\ast}) + \frac{1}{2\gamma}\lVert u_{t} - u\rVert^{2} \tag{A.29} \]

In addition, $\{D_{t}(u_{t})\}_{t\geq 0}$ is non-increasing and

\[ D_{t}(u_{t}) \underset{t\to\infty}{\longrightarrow} D_{\infty}(u_{\ast}) \tag{A.30} \]
Proof of Lemma A.8.

Let $u_{\ast} \in \operatorname*{arg\,min} D_{\infty}$. Since $\frac{\lVert\cdot\rVert^{2}}{2} \circ Z^{\top}$ has $\lVert XX^{\top}\rVert_{op}$-Lipschitz gradient, by applying the Descent Lemma A.5 with $y = u_{t}$, for all $u \in \mathbb{R}^{n}$ and $t \geq 0$ it holds:

\[ D_{t}(u_{t+1}) + \frac{1}{2\gamma}\lVert u_{t+1} - u\rVert^{2} \leq D_{t}(u) + \frac{1}{2\gamma}\lVert u_{t} - u\rVert^{2} \tag{A.31} \]

which, after subtracting $D_{\infty}(u_{\ast})$ from both sides, yields (A.29).

By choosing $u = u_{t}$ in (A.31) and using the non-increasing property of $D_{t}$ in $t \geq 0$ (see Lemma A.6), we obtain:

\[ D_{t+1}(u_{t+1}) - D_{\infty}(u_{\ast}) \leq D_{t}(u_{t+1}) - D_{\infty}(u_{\ast}) \leq D_{t}(u_{t}) - D_{\infty}(u_{\ast}) - \frac{1}{2\gamma}\lVert u_{t+1} - u_{t}\rVert^{2} \tag{A.32} \]

which shows that the sequence $r_{t} = D_{t}(u_{t}) - D_{\infty}(u_{\ast})$ is non-increasing in $t$. Since $D_{t}(u)$ is also non-increasing in $t$ and convergent to $D_{\infty}(u)$, from (A.31) we also have that $\frac{1}{2\gamma}\lVert u_{t} - u\rVert^{2}$ is bounded for all $u \in \mathbb{R}^{n}$.
In addition, $r_{t} = D_{t}(u_{t}) - D_{\infty}(u_{\ast}) \geq D_{\infty}(u_{t}) - D_{\infty}(u_{\ast}) \geq 0$, which shows that $r_{t}$ is bounded from below by zero, and therefore converges to a non-negative limit.

By adding and subtracting $D_{\infty}(u_{\ast})$ in (A.31), for all $u \in \mathbb{R}^{n}$, we find:

\[ r_{t+1} - \big(D_{t}(u) - D_{\infty}(u_{\ast})\big) \leq \frac{1}{2\gamma}\lVert u_{t} - u\rVert^{2} - \frac{1}{2\gamma}\lVert u_{t+1} - u\rVert^{2} \tag{A.33} \]

which, by summing up to $t \geq 1$, gives:

\[ \sum_{s=1}^{t}\big(r_{s} - (D_{s-1}(u) - D_{\infty}(u_{\ast}))\big) \leq \frac{1}{2\gamma}\lVert u_{0} - u\rVert^{2} \quad \forall u \in \mathbb{R}^{n} \tag{A.34} \]

The last relation allows us to conclude that $\lim\limits_{t\to\infty}\big(r_{t} - (D_{t-1}(u) - D_{\infty}(u_{\ast}))\big) = 0$, and since $\lim\limits_{t\to\infty} D_{t-1}(u) = D_{\infty}(u) \geq D_{\infty}(u_{\ast})$ and $\lim\limits_{t\to\infty} r_{t} \geq 0$, we have that $\lim\limits_{t\to\infty} r_{t} = 0$. ∎
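The monotonicity argument in (A.32) can be illustrated with a short numerical sketch. Algorithm 1 is not reproduced in this part of the appendix, so the code assumes it is a projected gradient step on the dual with step size $\gamma = 1/\lVert ZZ^{\top}\rVert_{op}$. It also assumes that $D_{t}$ restricts $F(u) = \frac{1}{2}\lVert Z^{\top}u\rVert^{2} + \langle\mathbf{1},u\rangle$ to a box $[-c_{t},0]^{n}$, where `box_radius` is a hypothetical non-decreasing schedule chosen so that the boxes are nested and $D_{t}$ is non-increasing in $t$, as Lemma A.6 asserts. The data `Z` are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 15                       # hypothetical problem sizes
Z = rng.standard_normal((n, d))    # synthetic stand-in for Z = diag(Y) X
gamma = 1.0 / np.linalg.norm(Z @ Z.T, 2)   # step size 1/L, L = ||Z Z^T||_op

def F(u):
    """Smooth part of the dual: 0.5*||Z^T u||^2 + <1, u>."""
    return 0.5 * np.sum((Z.T @ u) ** 2) + np.sum(u)

def box_radius(t):
    """Hypothetical non-decreasing radius c_t of the box [-c_t, 0]^n."""
    return 0.05 * (t + 1)

T = 200
u = np.zeros(n)                    # u_0 = 0 lies in every box
values = [F(u)]                    # values[t] plays the role of D_t(u_t)
decrease_ok = []
for t in range(T):
    grad = Z @ (Z.T @ u) + 1.0
    # Assumed update: gradient step followed by projection onto [-c_t, 0]^n.
    u_next = np.clip(u - gamma * grad, -box_radius(t), 0.0)
    # Right-hand inequality of (A.32):
    # D_t(u_{t+1}) <= D_t(u_t) - ||u_{t+1} - u_t||^2 / (2*gamma).
    bound = F(u) - np.sum((u_next - u) ** 2) / (2.0 * gamma)
    decrease_ok.append(F(u_next) <= bound + 1e-10)
    u = u_next
    values.append(F(u))

print("sufficient decrease held at every step:", all(decrease_ok))
```

Because the boxes are nested, each iterate stays feasible for the next problem, so `values` is a non-increasing sequence in this sketch, mirroring the behaviour of $r_{t}$ in the proof.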

In the next lemma we prove that each of the regularized dual problems $D_{t}$ satisfies the Łojasiewicz condition (5.1) with a common constant $\mu$, not depending on $t$. The proof of Lemma A.9 (see below) is inspired by the analysis presented in [12, Lemma 2.5], with some modifications. More precisely, we will prove that there exist some positive constants $M$, $R$, $\theta$ such that, for all $t \geq 0$, $D_{t}$ satisfies the following growth condition:

\[ \left(\forall u \in [D_{t} \leq \min D_{t} + M] \cap \mathbb{B}(\mathbf{0},R)\right), \qquad \frac{\theta}{2}\,\text{dist}\left(u, \operatorname*{arg\,min} D_{t}\right)^{2} \leq D_{t}(u) - \min D_{t} \tag{A.35} \]

In fact, relation (A.35) is known in the literature as quadratic growth (see e.g. [35]) and is equivalent to (5.1).

It is worth mentioning that we cannot directly apply the proof in [12, Lemma 2.5], because $\operatorname*{arg\,min} D_{t}$ may become unbounded as $t \to \infty$. Indeed, by applying [12, Lemma 2.5] directly we can deduce the existence of $\theta_{t}$ such that (A.35) holds true. However, this is not sufficient for establishing linear convergence of the proposed scheme (see Proposition A.1), since, in general, $\theta_{t}$ may vanish asymptotically.

Lemma A.9.

Let $R > 0$, $M > 0$ and let $\{\lambda_{t}\}_{t\geq 0}$ be a sequence of positive parameters decreasing to zero. Consider the functions $D_{t}$ and $D_{\infty}$ defined in (A.19) and (A.20) respectively. Then there exists some $\mu > 0$ such that, for all $t \geq 0$, $D_{t}$ satisfies the $\mu$-Łojasiewicz condition (5.1) in $[D_{t} \leq \min D_{t} + M] \cap \mathbb{B}(\mathbf{0},R)$, with $\mu$ given by (5.2), i.e.

\[ \left(\forall u \in [D_{t} \leq \min D_{t} + M] \cap \mathbb{B}(\mathbf{0},R)\right), \qquad D_{t}(u) - \min D_{t} \leq \frac{1}{2\mu}\,\text{dist}\left(\partial D_{t}(u), 0\right)^{2}. \tag{A.36} \]
Proof of Lemma A.9.

For all $t \geq 0$, the problem $\underset{u\in\mathbb{R}^{n}}{\min}\, D_{t}(u)$ is equivalent to the following constrained optimization problem:

\[ \min\left\{F(u) := \frac{1}{2}\lVert Z^{\top}u\rVert^{2} + \langle\mathbf{1}, u\rangle ~:~ u \in \mathcal{U}_{t} := [-\lambda_{t}, 0]^{n}\right\}. \tag{A.37} \]

If $u \notin \mathcal{U}_{t}$, relation (A.35) holds, so without loss of generality we assume that $u \in \mathcal{U}_{t}$, so that $F(u) = D_{t}(u)$.

For all $t \geq 0$, let $\bar{u}_{t} \in \underset{u\in\mathbb{R}^{n}}{\operatorname*{arg\,min}}\, D_{t}(u)$. By the optimality conditions for problem (A.37), for all $u \in \mathcal{U}_{t}$, it holds:

\[ \left\langle \nabla F(\bar{u}_{t}), u - \bar{u}_{t}\right\rangle \geq 0 \tag{A.38} \]

Notice that $\mathcal{U}_{t}$ is a polyhedral set, since $\mathcal{U}_{t} = \{u \in \mathbb{R}^{n} ~:~ Au \leq a_{t}\}$, with $A = \begin{bmatrix}\text{Id}_{n} \\ -\text{Id}_{n}\end{bmatrix} \in \mathbb{R}^{2n\times n}$ and $a_{t} = -\lambda_{t}^{-1}\begin{bmatrix}\mathbf{0} \\ \mathbf{1}\end{bmatrix} \in \mathbb{R}^{2n}$.

By [12, Lemma 2.3], there exist a unique vector $\bar{v} \in \mathbb{R}^{d}$ and a unique scalar $\bar{s} \in \mathbb{R}$ such that the following equivalence holds true:

\[ \bar{u}_{t} \in \operatorname*{arg\,min} D_{t} ~\Leftrightarrow~ \begin{bmatrix} Z^{\top} \\ \mathbf{1}^{\top}\end{bmatrix}\bar{u}_{t} = \begin{bmatrix}\bar{v} \\ \bar{s}\end{bmatrix} \quad\text{and}\quad A\bar{u}_{t} \leq a_{t} \tag{A.39} \]

so that $\operatorname*{arg\,min} D_{t} = S \cap \mathcal{U}_{t}$, where $S = \left\{u \in \mathbb{R}^{n} ~:~ \begin{bmatrix} Z^{\top} \\ \mathbf{1}^{\top}\end{bmatrix}u = \begin{bmatrix}\bar{v} \\ \bar{s}\end{bmatrix}\right\}$ and $\mathcal{U}_{t} = \{u \in \mathbb{R}^{n} ~:~ Au \leq a_{t}\}$.

According to Hoffman’s lemma [43] (see also [94, Lemma 15]) for the polyhedral sets $S$ and $\mathcal{U}_{t}$, by setting $E = \begin{bmatrix} Z^{\top} \\ \mathbf{1}^{\top}\end{bmatrix} \in \mathbb{R}^{(d+1)\times n}$, there exists some positive constant $\tau$, given by (5.3), such that

\[
\operatorname{dist}\left(u, \operatorname*{arg\,min} D_t\right) = \operatorname{dist}\left(u, S \cap \mathcal{U}_t\right) \leq \tau \left\lVert Eu - \begin{bmatrix} \bar{v} \\ \bar{s} \end{bmatrix} \right\rVert \tag{A.40}
\]

It is important to stress that the Hoffman error bound constant $\tau$ depends only on the matrices $E$ and $A$ (see e.g. [96, Remark 1] and the associated references). In our setting this means that the constant $\tau$ depends only on the data matrix $Z = \text{diag}(Y)X$.
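As a concrete illustration of the objects entering Hoffman's bound, the following minimal Python sketch assembles $Z = \text{diag}(Y)X$ and $E$ for a toy data set. It assumes that $X$ stores one sample per row (so that $X \in \mathbb{R}^{n \times d}$, consistent with $E \in \mathbb{R}^{(d+1) \times n}$) and that the labels lie in $\{-1, +1\}$. The function name `build_Z_and_E` and the data are hypothetical and are not taken from the paper.

```python
def build_Z_and_E(X, y):
    """Return Z = diag(y) X (n x d) and E = [Z^T; 1^T] ((d+1) x n) as nested lists."""
    n, d = len(X), len(X[0])
    # Row i of Z is the i-th sample scaled by its label y_i.
    Z = [[y[i] * X[i][j] for j in range(d)] for i in range(n)]
    # The first d rows of E are the rows of Z^T; the last row is the all-ones vector.
    E = [[Z[i][j] for i in range(n)] for j in range(d)]
    E.append([1.0] * n)
    return Z, E


# Toy data (hypothetical): n = 3 samples in dimension d = 2, labels in {-1, +1}.
X = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]]
y = [1, -1, 1]
Z, E = build_Z_and_E(X, y)
```

Since $E$ is built from $Z$ and the all-ones vector alone, any constant that depends on $E$ only is determined by the data, which is the point made above.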

By taking the squares in (A.40), we find:

\[
\operatorname{dist}\left(u, \operatorname*{arg\,min} D_t\right)^2 \leq \tau^2 \left( \left\lVert Z^\top (u - \bar{u}_t) \right\rVert^2 + \left( \left\langle \mathbf{1}, u - \bar{u}_t \right\rangle \right)^2 \right) \tag{A.41}
\]

Let us now bound appropriately the two terms in the right-hand-side of (A.41).

For the first term, expanding the square gives:

\begin{align*}
\frac{1}{2} \left\lVert Z^\top (u - \bar{u}_t) \right\rVert^2 &= \frac{1}{2} \left\lVert Z^\top u \right\rVert^2 - \left\langle Z^\top u, Z^\top \bar{u}_t \right\rangle + \frac{1}{2} \left\lVert Z^\top \bar{u}_t \right\rVert^2 \tag{A.42} \\
&= \frac{1}{2} \left\lVert Z^\top u \right\rVert^2 - \frac{1}{2} \left\lVert Z^\top \bar{u}_t \right\rVert^2 - \left\langle Z^\top \bar{u}_t, Z^\top (u - \bar{u}_t) \right\rangle \\
&\leq \left\langle \nabla F(\bar{u}_t), u - \bar{u}_t \right\rangle + \frac{1}{2} \left\lVert Z^\top u \right\rVert^2 - \frac{1}{2} \left\lVert Z^\top \bar{u}_t \right\rVert^2 - \left\langle Z^\top \bar{u}_t, Z^\top (u - \bar{u}_t) \right\rangle
\end{align*}

where in the last inequality we used the optimality condition (A.38). In addition, since $\nabla F(\bar{u}_t) = \mathbf{1} + ZZ^\top \bar{u}_t$, from (A.42) it follows:

\[
\frac{1}{2} \left\lVert Z^\top (u - \bar{u}_t) \right\rVert^2 \leq \frac{1}{2} \left\lVert Z^\top u \right\rVert^2 + \left\langle \mathbf{1}, u \right\rangle - \frac{1}{2} \left\lVert Z^\top \bar{u}_t \right\rVert^2 - \left\langle \mathbf{1}, \bar{u}_t \right\rangle = D_t(u) - \min D_t \tag{A.43}
\]

Let us now provide an upper bound for the term $\left( \left\langle \mathbf{1}, u - \bar{u}_t \right\rangle \right)^2$. On the one hand we have:

\begin{align*}
\left\langle \mathbf{1}, \bar{u}_t - u \right\rangle &= \left\langle \nabla F(\bar{u}_t), \bar{u}_t - u \right\rangle - \left\langle Z^\top \bar{u}_t, Z^\top (\bar{u}_t - u) \right\rangle \tag{A.44} \\
&= \left\langle \nabla F(\bar{u}_t), \bar{u}_t - u \right\rangle - \left\lVert Z^\top (u - \bar{u}_t) \right\rVert^2 - \left\langle Z^\top u, Z^\top (\bar{u}_t - u) \right\rangle \\
&\leq \left\langle \nabla F(\bar{u}_t), \bar{u}_t - u \right\rangle + \left\lVert Z^\top u \right\rVert \left\lVert Z^\top (\bar{u}_t - u) \right\rVert \\
&\leq \left\lVert Z^\top u \right\rVert \left\lVert Z^\top (\bar{u}_t - u) \right\rVert
\end{align*}

where in the first inequality we used the Cauchy-Schwarz inequality and for the last one the optimality condition (A.38). On the other hand we find:

\begin{align*}
\left\langle \mathbf{1}, u - \bar{u}_t \right\rangle &= \left\langle \nabla F(\bar{u}_t), u - \bar{u}_t \right\rangle - \left\langle Z^\top \bar{u}_t, Z^\top (u - \bar{u}_t) \right\rangle \tag{A.45} \\
&= \left\langle \nabla F(\bar{u}_t), u - \bar{u}_t \right\rangle + \left\lVert Z^\top (u - \bar{u}_t) \right\rVert^2 - \left\langle Z^\top u, Z^\top (u - \bar{u}_t) \right\rangle \\
&\leq \left\langle \nabla F(\bar{u}_t), u - \bar{u}_t \right\rangle + \left\lVert Z^\top (u - \bar{u}_t) \right\rVert^2 + \left\lVert Z^\top u \right\rVert \left\lVert Z^\top (u - \bar{u}_t) \right\rVert \\
&= \left\langle \nabla F(\bar{u}_t), u - \bar{u}_t \right\rangle + \left( \left\lVert Z^\top (u - \bar{u}_t) \right\rVert + \left\lVert Z^\top u \right\rVert \right) \left\lVert Z^\top (u - \bar{u}_t) \right\rVert \\
&\leq D_t(u) - \min D_t + \left( \left\lVert Z^\top (u - \bar{u}_t) \right\rVert + \left\lVert Z^\top u \right\rVert \right) \left\lVert Z^\top (u - \bar{u}_t) \right\rVert
\end{align*}

where in the first inequality we used the Cauchy-Schwarz inequality and in the last one the convexity of $F$. By relations (A.44) and (A.45), it follows that:

\begin{align*}
\left\lvert \left\langle \mathbf{1}, u - \bar{u}_t \right\rangle \right\rvert &\leq D_t(u) - \min D_t + \left( \left\lVert Z^\top (u - \bar{u}_t) \right\rVert + \left\lVert Z^\top u \right\rVert \right) \left\lVert Z^\top (u - \bar{u}_t) \right\rVert \tag{A.46} \\
&\leq \left( 3 \sqrt{D_t(u) - \min D_t} + \sqrt{2} \left\lVert Z^\top u \right\rVert \right) \sqrt{D_t(u) - \min D_t} \\
&\leq \left( 3 \sqrt{M} + \sqrt{2} R \left\lVert X \right\rVert_{\text{op}} \right) \sqrt{D_t(u) - \min D_t}
\end{align*}

where in the second inequality we used (A.43), which gives $\left\lVert Z^\top (u - \bar{u}_t) \right\rVert \leq \sqrt{2} \sqrt{D_t(u) - \min D_t}$, and in the last one the fact that $u \in [D_t \leq \min D_t + M] \cap \mathbb{B}(0, R)$ together with the identity $\left\lVert Z \right\rVert_{\text{op}} = \left\lVert X \right\rVert_{\text{op}}$ for the operator norm. By substituting relations (A.43) and (A.46) into (A.41), for all $u \in [D_t \leq \min D_t + M] \cap \mathbb{B}(0, R)$ we find:

\begin{align*}
\operatorname{dist}\left(u, \operatorname*{arg\,min} D_t\right)^2 &\leq \tau^2 \left( 2 \left( D_t(u) - \min D_t \right) + \left( 3 \sqrt{M} + \sqrt{2} R \left\lVert X \right\rVert_{\text{op}} \right)^2 \left( D_t(u) - \min D_t \right) \right) \tag{A.47} \\
&= \tau^2 \left( 9M + 6 \sqrt{2M} R \left\lVert X \right\rVert_{\text{op}} + 2 R^2 \left\lVert X \right\rVert_{\text{op}}^2 + 2 \right) \left( D_t(u) - \min D_t \right)
\end{align*}

which shows that for all $t \geq 0$, $D_t$ satisfies the growth condition (A.35) in $[D_t \leq \min D_t + M] \cap \mathbb{B}(0, R)$, with $\theta = \left( 2 \tau^2 \left( 9M + 6 \sqrt{2M} R \left\lVert X \right\rVert_{\text{op}} + 2 R^2 \left\lVert X \right\rVert_{\text{op}}^2 + 2 \right) \right)^{-1}$.

Finally, by using the equivalence between the $\theta$-growth condition (A.35) and the $\mu$-Łojasiewicz condition (5.1) with $\mu = \frac{\theta}{4}$ (see e.g. [4, Proposition 1] or [14, Theorem 5]), we deduce that for all $t \geq 0$, $D_t$ satisfies (5.1), with

\[
\mu = \frac{1}{8 \tau^2 \left( 9M + 6 \sqrt{2M} R \left\lVert X \right\rVert_{\text{op}} + 2 R^2 \left\lVert X \right\rVert_{\text{op}}^2 + 2 \right)} \tag{A.48}
\]

where $\tau$ is Hoffman's constant as defined in (5.3). ∎
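To make the dependence of the constants on the problem parameters tangible, here is a minimal Python sketch that evaluates $\theta$ and $\mu = \theta / 4$ from the formulas above. The function name and the numerical inputs are hypothetical. In particular, Hoffman's constant $\tau$ is treated as a given input, since the sketch makes no attempt to compute it.

```python
import math


def growth_and_lojasiewicz_constants(tau, M, R, X_op):
    """Evaluate theta (growth condition) and mu = theta / 4 as in (A.48).

    tau  : Hoffman's constant (assumed known, not computed here)
    M    : sublevel parameter in [D_t <= min D_t + M]
    R    : radius of the ball B(0, R)
    X_op : operator norm of the data matrix X
    """
    # Expanded form of (3*sqrt(M) + sqrt(2)*R*X_op)**2 + 2.
    c = 9.0 * M + 6.0 * math.sqrt(2.0 * M) * R * X_op + 2.0 * R**2 * X_op**2 + 2.0
    theta = 1.0 / (2.0 * tau**2 * c)
    mu = theta / 4.0
    return theta, mu


# Hypothetical values, chosen only to exercise the formula.
theta, mu = growth_and_lojasiewicz_constants(tau=2.0, M=1.0, R=1.0, X_op=1.0)
```

Both constants decrease as $\tau$, $M$, $R$ or $\lVert X \rVert_{\text{op}}$ grow, so a larger region or a worse-conditioned data matrix yields a weaker Łojasiewicz constant.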

A.3 Proof of Theorem 5.1

In this paragraph we provide the proof of Theorem 5.1, concerning Algorithm 1. We start with the following proposition, which allows us to deduce an upper bound for the gap between the dual objective function $D_t(u_t)$ and its minimum value.

Proposition A.1.

Let $u_\ast \in \operatorname*{arg\,min} D_\infty$ and let $\{u_t\}_{t \geq 0}$ be the sequence generated by Algorithm 1 with $\lambda_0 \leq \left\lVert u_\ast \right\rVert^{-1}$. Then, the following estimate holds for all $t \geq 1$:

\[
D_t(u_t) - D_\infty(u_\ast) \leq \left( D_0(u_0) - D_\infty(u_\ast) \right) \left( 1 - \frac{\gamma \mu}{1 + \gamma \mu} \right)^t \tag{A.49}
\]

where $\mu$ is given by (5.2).
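The linear rate in (A.49) can be explored numerically. The sketch below, in Python, evaluates the right-hand side of (A.49) and returns an iteration count for which that bound falls below a tolerance. The function names, the tolerance `eps`, and the sample values are hypothetical, and the sketch only evaluates the bound; it does not run Algorithm 1.

```python
import math


def dual_gap_bound(initial_gap, gamma, mu, t):
    """Right-hand side of (A.49) for an initial gap D_0(u_0) - D_inf(u_*)."""
    rate = 1.0 - gamma * mu / (1.0 + gamma * mu)  # equals 1 / (1 + gamma * mu)
    return initial_gap * rate**t


def iterations_for_tolerance(initial_gap, gamma, mu, eps):
    """Return t >= 1 such that the bound (A.49) is at most eps.

    Assumes initial_gap > 0, eps > 0 and gamma * mu > 0.
    """
    # Solve initial_gap * (1 + gamma * mu)**(-t) <= eps for t and round up.
    t = max(1, math.ceil(math.log(initial_gap / eps) / math.log1p(gamma * mu)))
    # Guard against floating-point rounding in the closed-form estimate.
    while dual_gap_bound(initial_gap, gamma, mu, t) > eps:
        t += 1
    return t


# Hypothetical values: the gap bound halves at every iteration when gamma * mu = 1.
bound_after_three = dual_gap_bound(8.0, 1.0, 1.0, 3)
t_needed = iterations_for_tolerance(8.0, 1.0, 1.0, 0.9)
```

Because the contraction factor equals $1 / (1 + \gamma \mu)$, the required number of iterations grows like $\log(1/\varepsilon) / \log(1 + \gamma \mu)$, which makes explicit how a small $\mu$ slows down the guaranteed decrease.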

Proof of Proposition A.1.

Let $u_\ast \in \operatorname*{arg\,min} D_\infty$ and $\lambda_0 > 0$, such that $\lambda_0 \leq \left\lVert u_\ast \right\rVert^{-1}$. Following the proof of Lemma A.8, by choosing $u = u_\ast$ in (A.31), we find:

\[
D_t(u_{t+1}) - D_t(u_\ast) + \frac{1}{2\gamma} \left\lVert u_{t+1} - u_\ast \right\rVert^2 \leq \frac{1}{2\gamma} \left\lVert u_t - u_\ast \right\rVert^2 \tag{A.50}
\]

Since $D_t(u_\ast) \leq D_t(u)$ for all $u \in \mathbb{R}^n$, by neglecting the non-negative term $D_t(u_{t+1}) - D_t(u_\ast)$ in (A.50) and iterating over $t \geq 0$, we deduce that $\left\lVert u_t - u_\ast \right\rVert \leq \left\lVert u_0 - u_\ast \right\rVert$ for all $t \geq 0$. By the triangle inequality, $\left\lVert u_t \right\rVert \leq \left\lVert u_t - u_\ast \right\rVert + \left\lVert u_\ast \right\rVert \leq \left\lVert u_0 \right\rVert + 2 \left\lVert u_\ast \right\rVert$, therefore the sequence $\{u_t\}_{t \geq 0}$ is bounded by $R = 2 \left\lVert u_\ast \right\rVert + \left\lVert u_0 \right\rVert$.

On the other hand, by choosing $u = u_t$ in (A.31) and using the non-increasing property of $D_t$ (Lemma A.6), we find:

\[
D_{t+1}(u_{t+1}) - D_t(u_t) \leq D_t(u_{t+1}) - D_t(u_t) \leq -\frac{1}{2\gamma} \left\lVert u_{t+1} - u_t \right\rVert^2 \tag{A.51}
\]

which allows us to conclude that, for all $t \geq 0$, $u_t \in [D_t \leq D_0(u_0)] = \{u \in \mathbb{R}^n : D_t(u) \leq D_0(u_0)\}$.

By definition of $u_{t+1} = \lambda_t^{-1} \operatorname{prox}_{\gamma \lambda_t \mathcal{L}^\ast} \left( \lambda_t \left( u_t - \gamma ZZ^\top u_t \right) \right)$ (see Algorithm 1) and the characterization of the proximal operator, for all $t \geq 0$, we have $\lambda_t u_{t+1} + \gamma \lambda_t \partial \mathcal{L}^\ast(\lambda_t u_{t+1}) \ni \lambda_t \left( u_t - \gamma ZZ^\top u_t \right)$ or equivalently

$$
\left(\text{Id}-\gamma ZZ^{\top}\right)(u_{t}-u_{t+1})\in\gamma\partial\mathcal{L}^{\ast}(\lambda_{t}u_{t+1})+\gamma ZZ^{\top}u_{t+1}=\partial\gamma D_{t}(u_{t+1}) \tag{A.52}
$$
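For readers who want the intermediate algebra, the passage from the proximal inclusion to the inclusion in (A.52) can be spelled out as follows. This is only a sketch of the routine steps, assuming $\lambda_{t}>0$ so that dividing by $\lambda_{t}$ is allowed.

```latex
\begin{align*}
\lambda_{t}u_{t+1}+\gamma\lambda_{t}\,\partial\mathcal{L}^{\ast}(\lambda_{t}u_{t+1})
  &\ni \lambda_{t}\left(u_{t}-\gamma ZZ^{\top}u_{t}\right) \\
% divide by \lambda_t and move u_{t+1} to the other side
u_{t}-u_{t+1}-\gamma ZZ^{\top}u_{t}
  &\in \gamma\,\partial\mathcal{L}^{\ast}(\lambda_{t}u_{t+1}) \\
% add \gamma ZZ^{\top}u_{t+1} to both sides and factor the left-hand side
\left(\text{Id}-\gamma ZZ^{\top}\right)(u_{t}-u_{t+1})
  &\in \gamma\,\partial\mathcal{L}^{\ast}(\lambda_{t}u_{t+1})+\gamma ZZ^{\top}u_{t+1}
\end{align*}
```

The final equality in (A.52) then identifies the right-hand side with $\partial\gamma D_{t}(u_{t+1})$ through the definition of $D_{t}$ given earlier in the paper.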

Hence (A.52), together with the contraction property of the operator $\text{Id}-\gamma ZZ^{\top}$ for all $\gamma\leq\frac{1}{\left\lVert XX^{\top}\right\rVert}$, yields:

$$
\text{dist}\left(\partial D_{t}(u_{t+1}),0\right)\leq\gamma^{-1}\left\lVert\left(\text{Id}-\gamma ZZ^{\top}\right)(u_{t}-u_{t+1})\right\rVert\leq\gamma^{-1}\left\lVert u_{t+1}-u_{t}\right\rVert \tag{A.53}
$$
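The second inequality in (A.53) only uses that $\text{Id}-\gamma ZZ^{\top}$ does not increase norms. A short justification is sketched below, under the assumption (not checked here) that $\left\lVert ZZ^{\top}\right\rVert=\left\lVert XX^{\top}\right\rVert$, as is the case for instance when $Z$ is obtained from $X$ by multiplying each row by a label in $\{-1,+1\}$. The symbols $\sigma_{i}$ and $v$ are introduced only for this remark.

```latex
% ZZ^T is symmetric positive semidefinite with eigenvalues \sigma_i,
% so Id - \gamma ZZ^T is symmetric with eigenvalues 1 - \gamma\sigma_i.
\begin{align*}
0\leq\sigma_{i}\leq\left\lVert ZZ^{\top}\right\rVert
\ \text{ and }\
\gamma\leq\frac{1}{\left\lVert ZZ^{\top}\right\rVert}
\quad&\Longrightarrow\quad
0\leq 1-\gamma\sigma_{i}\leq 1, \\
\text{hence}\quad
\left\lVert\left(\text{Id}-\gamma ZZ^{\top}\right)v\right\rVert
&\leq\max_{i}\,\lvert 1-\gamma\sigma_{i}\rvert\,\left\lVert v\right\rVert
\leq\left\lVert v\right\rVert
\quad\text{for all } v\in\mathbb{R}^{n}.
\end{align*}
```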

By combining relations (A.51) and (A.53), we obtain:

$$
D_{t}(u_{t+1})-D_{t}(u_{t})\leq-\frac{\gamma}{2}\,\text{dist}\left(\partial D_{t}(u_{t+1}),0\right)^{2} \tag{A.54}
$$

Since $u_{t}\in[D_{t}\leq\min D_{t}+M]\cap\mathbb{B}_{R}(\mathbf{0})$ with $M=D_{0}(u_{0})-D_{\infty}(u_{\ast})$ and $R=\left\lVert u_{0}\right\rVert+2\left\lVert u_{\ast}\right\rVert$, by using (5.1) and Lemma A.9 we find:

$$
D_{t}(u_{t+1})-D_{t}(u_{t})\leq-\gamma\mu\left(D_{t}(u_{t+1})-\min D_{t}(u)\right) \tag{A.55}
$$

By adding and subtracting $D_{\infty}(u_{\ast})$ and using that $D_{t+1}(u_{t+1})\leq D_{t}(u_{t+1})$ and $D_{\infty}(u)\leq\min D_{t}(u)$ (see Lemma A.6), for all $t\geq 1$, we derive:

$$
(1+\gamma\mu)\left(D_{t+1}(u_{t+1})-D_{\infty}(u_{\ast})\right)\leq\left(D_{t}(u_{t})-D_{\infty}(u_{\ast})\right) \tag{A.56}
$$

By induction on (A.56), it follows that, for all $t\geq 0$,

$$
D_{t}(u_{t})-D_{\infty}(u_{\ast})\leq\left(D_{0}(u_{0})-D_{\infty}(u_{\ast})\right)\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{t} \tag{A.57}
$$

which concludes the proof. ∎
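For completeness, the induction behind (A.57) can be unrolled explicitly. The shorthand $a_{t}$ is introduced only for this remark, and the sketch assumes that (A.56) is available at every index used.

```latex
% Shorthand (assumption of this remark): a_t := D_t(u_t) - D_\infty(u_\ast).
\begin{align*}
(1+\gamma\mu)\,a_{t+1}\leq a_{t}
\quad&\Longleftrightarrow\quad
a_{t+1}\leq\frac{1}{1+\gamma\mu}\,a_{t}
=\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)a_{t}, \\
\text{so that}\quad
a_{t}&\leq\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{t}a_{0}.
\end{align*}
```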

The proof of Theorem 5.1 can now be derived from Proposition A.1.

Proof of Theorem 5.1.

Let $u_{\ast}\in\operatorname*{arg\,min}D_{\infty}$ such that $\lambda_{0}\leq\left\lVert u_{\ast}\right\rVert^{-1}$. Since $w_{t}=-Z^{\top}u_{t}$, Lemma A.7 yields

$$
\left\lVert w_{t}-w_{\ast}\right\rVert\leq\sqrt{2\left(D_{\infty}(u_{t})-D_{\infty}(u_{\ast})\right)}
$$

Lemma A.6 and the non-increasing property of $D_{t}(u_{t})$ proved in Lemma A.8 imply

$$
\left\lVert w_{t}-w_{\ast}\right\rVert\leq\sqrt{2\left(D_{t}(u_{t})-D_{\infty}(u_{\ast})\right)}
$$

We then derive from Proposition A.1 that

$$
\left\lVert w_{t}-w_{\ast}\right\rVert\leq\sqrt{2\left(D_{0}(u_{0})-D_{\infty}(u_{\ast})\right)}\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{\frac{t}{2}}, \tag{A.58}
$$

from which (5.4) follows using the definition of $D_{0}$ and $D_{\infty}$. The inequality $\left\lVert w_{\ast}\right\rVert\leq\left\lVert w_{t}\right\rVert+\left\lVert w_{t}-w_{\ast}\right\rVert$ and the bound (A.58) give

$$
\begin{aligned}
\left\lVert w_{t}\right\rVert\geq\left\lVert w_{\ast}\right\rVert-\left\lVert w_{t}-w_{\ast}\right\rVert
&\geq\left\lVert w_{\ast}\right\rVert-\sqrt{\left\lVert w_{0}\right\rVert^{2}-\left\lVert w_{\ast}\right\rVert^{2}+\left\langle\mathbf{1},u_{0}-u_{\ast}\right\rangle}\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{\frac{t}{2}} \\
&=\left\lVert w_{\ast}\right\rVert\left(1-\sqrt{\frac{\left\lVert w_{0}\right\rVert^{2}}{\left\lVert w_{\ast}\right\rVert^{2}}-1+\frac{1}{\left\lVert w_{\ast}\right\rVert^{2}}\left\langle\mathbf{1},u_{0}-u_{\ast}\right\rangle}\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)^{\frac{t}{2}}\right)
\end{aligned}
\tag{A.59}
$$

By setting

$$
t^{\ast}=\frac{\log\left(\frac{1}{4}\right)-\log\left(\frac{\left\lVert w_{0}\right\rVert^{2}}{\left\lVert w_{\ast}\right\rVert^{2}}-1+\frac{1}{\left\lVert w_{\ast}\right\rVert^{2}}\left\langle\mathbf{1},u_{0}-u_{\ast}\right\rangle\right)}{\log\left(1-\frac{\gamma\mu}{1+\gamma\mu}\right)}, \tag{A.60}
$$

for all $t\geq t^{\ast}$, it holds that $\left\lVert w_{t}\right\rVert\geq\frac{1}{2}\left\lVert w_{\ast}\right\rVert$. From Lemma A.4 we derive:

$$
1-\frac{\left\langle w_{t},w_{\ast}\right\rangle}{\left\lVert w_{t}\right\rVert\left\lVert w_{\ast}\right\rVert}\leq\frac{1}{\left\lVert w_{\ast}\right\rVert^{2}}\left\lVert w_{t}-w_{\ast}\right\rVert^{2}
\quad\text{ and }\quad
M\Big(\frac{w_{\ast}}{\left\lVert w_{\ast}\right\rVert}\Big)-M\Big(\frac{w_{t}}{\left\lVert w_{t}\right\rVert}\Big)\leq\frac{2\left\lVert X\right\rVert_{F}}{\left\lVert w_{\ast}\right\rVert^{2}}\left\lVert w_{t}-w_{\ast}\right\rVert \tag{A.61}
$$

which, together with (A.58), concludes the proof of Theorem 5.1.
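The threshold $t^{\ast}$ of (A.60) is easy to evaluate numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, `offset` stands for $\langle\mathbf{1},u_{0}-u_{\ast}\rangle$, the numerical values are made up, and we assume $\gamma\mu>0$ and a positive quantity under the square root of (A.59) so that both logarithms are defined and nonzero.

```python
import math


def contraction_factor(gamma: float, mu: float) -> float:
    """Per-iteration factor 1 - gamma*mu / (1 + gamma*mu) from (A.57)."""
    return 1.0 - gamma * mu / (1.0 + gamma * mu)


def initial_ratio(w0_sq: float, w_star_sq: float, offset: float) -> float:
    """Quantity under the square root in the last line of (A.59)."""
    return w0_sq / w_star_sq - 1.0 + offset / w_star_sq


def stopping_index(w0_sq, w_star_sq, offset, gamma, mu) -> float:
    """Threshold t* of (A.60); assumes gamma*mu > 0 and a positive ratio."""
    ratio = initial_ratio(w0_sq, w_star_sq, offset)
    if ratio <= 0.0:
        raise ValueError("the ratio must be positive for its logarithm to exist")
    rho = contraction_factor(gamma, mu)
    return (math.log(0.25) - math.log(ratio)) / math.log(rho)


def lower_bound_factor(t, w0_sq, w_star_sq, offset, gamma, mu) -> float:
    """Factor multiplying ||w_*|| in the last line of (A.59)."""
    ratio = initial_ratio(w0_sq, w_star_sq, offset)
    rho = contraction_factor(gamma, mu)
    return 1.0 - math.sqrt(ratio) * rho ** (t / 2.0)


# Hypothetical numbers, chosen only to exercise the formulas.
params = dict(w0_sq=4.0, w_star_sq=1.0, offset=1.0, gamma=0.5, mu=0.2)
t_star = stopping_index(**params)
t_int = math.ceil(t_star)  # first integer iteration at or after t*
print(t_star, lower_bound_factor(t_int, **params))
```

By construction the factor equals $\frac{1}{2}$ at $t=t^{\ast}$ and increases with $t$, which is the statement $\lVert w_{t}\rVert\geq\frac{1}{2}\lVert w_{\ast}\rVert$ for $t\geq t^{\ast}$.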

A.4 Proof of Theorem 5.2

In this section, we turn to the convergence properties of Algorithm 2, and hence to the proof of Theorem 5.2. The analysis is based on discrete Lyapunov-energy techniques, which have recently become popular for studying inertial schemes such as Algorithm 2. It follows the approach adopted in a recent line of work, see [90, 22, 6, 7, 2, 3, 21] and the references therein. The proof of Theorem 5.2 relies on the following proposition, which bounds the dual objective gap $D_{t}(u_{t})-D_{\infty}(u_{\ast})$.

Proposition A.2.

Let $u_{\ast}$ be a solution of the dual problem (4.9) and let $\{u_{t}\}_{t\geq 0}$ be the sequence generated by Algorithm 2 with $\alpha\geq 3$ and $\lambda_{0}\leq\left\lVert u_{\ast}\right\rVert^{-1}$. Then the following estimate holds true for the dual-objective function:

$$
D_{t}(u_{t})-D_{\infty}(u_{\ast})\leq\frac{C^{2}}{(t+\alpha-1)^{2}}, \tag{A.62}
$$

where

$$
C=(\alpha-1)\bigg(\big(D_{0}(u_{0})-D_{\infty}(u_{\ast})\big)+\frac{\left\lVert u_{0}-u_{\ast}\right\rVert^{2}}{2\gamma}\bigg)^{1/2}. \tag{A.63}
$$
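To get a feeling for the estimate (A.62), one can evaluate its right-hand side directly. The sketch below is illustrative only: the function names and the numerical inputs are hypothetical, `gap0` stands for $D_{0}(u_{0})-D_{\infty}(u_{\ast})$ and `dist0_sq` for $\lVert u_{0}-u_{\ast}\rVert^{2}$.

```python
import math


def constant_C(alpha: float, gap0: float, dist0_sq: float, gamma: float) -> float:
    """Constant C of (A.63): (alpha - 1) * sqrt(gap0 + dist0_sq / (2 * gamma))."""
    return (alpha - 1.0) * math.sqrt(gap0 + dist0_sq / (2.0 * gamma))


def dual_gap_bound(t: int, alpha: float, gap0: float, dist0_sq: float, gamma: float) -> float:
    """Right-hand side of (A.62): C^2 / (t + alpha - 1)^2."""
    c = constant_C(alpha, gap0, dist0_sq, gamma)
    return c ** 2 / (t + alpha - 1.0) ** 2


# Hypothetical inputs, chosen only for illustration.
alpha, gap0, dist0_sq, gamma = 3.0, 1.0, 2.0, 0.5
for t in (0, 2, 10, 100):
    print(t, dual_gap_bound(t, alpha, gap0, dist0_sq, gamma))
```

At $t=0$ the bound reduces to $\big(D_{0}(u_{0})-D_{\infty}(u_{\ast})\big)+\frac{\lVert u_{0}-u_{\ast}\rVert^{2}}{2\gamma}$, and it decays like $t^{-2}$ afterwards.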
Proof of Proposition A.2.

Let $u_{\ast}\in\operatorname*{arg\,min}D_{\infty}$ and let $\{u_{t}\}_{t\geq 0}$ be the sequence generated by Algorithm 2 with $\lambda_{0}\leq\left\lVert u_{\ast}\right\rVert^{-1}$. For all $\nu>0$, let us define the following auxiliary sequences:

$$
r_{t}=D_{t}(u_{t})-D_{\infty}(u_{\ast})~,\quad\delta_{t}=\left\lVert u_{t}-u_{t-1}\right\rVert^{2}~,\quad h_{t}=\left\lVert u_{t}-u_{\ast}\right\rVert^{2} \tag{A.64}
$$
$$
k_{t}=t+\alpha-1~\text{ and }~v_{t}=\left\lVert\nu(u_{t-1}-u_{\ast})+k_{t}(u_{t}-u_{t-1})\right\rVert^{2} \tag{A.65}
$$

The following energy sequence will play a fundamental role in our analysis:

$$
\begin{aligned}
E_{t}&=(t+\alpha-1)^{2}\big(D_{t}(u_{t})-D_{\infty}(u_{\ast})\big)+\frac{1}{2\gamma}\left\lVert\nu(u_{t-1}-u_{\ast})+k_{t}(u_{t}-u_{t-1})\right\rVert^{2} \\
&=k_{t}^{2}r_{t}+\frac{1}{2\gamma}v_{t}
\end{aligned}
\tag{A.66}
$$

We will show that, by properly tuning the parameters $\alpha$ and $\nu$, the sequence $E_{t}$ is non-increasing.
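The energy (A.66) can be evaluated from the current and previous iterates once the dual gap $r_{t}$ is known. The sketch below is only illustrative: the function name is hypothetical, vectors are plain Python lists, and the choices $\nu=\alpha-1$ and $u_{-1}=u_{0}$ used in the example are assumptions made here and are not taken from the text.

```python
def energy(t, alpha, nu, gamma, r_t, u_t, u_prev, u_star):
    """E_t = k_t^2 * r_t + v_t / (2 * gamma), following (A.64)-(A.66)."""
    k_t = t + alpha - 1.0
    # v_t = || nu * (u_{t-1} - u_*) + k_t * (u_t - u_{t-1}) ||^2
    v_t = sum(
        (nu * (prev - star) + k_t * (cur - prev)) ** 2
        for cur, prev, star in zip(u_t, u_prev, u_star)
    )
    return k_t ** 2 * r_t + v_t / (2.0 * gamma)


# Hypothetical data: u_{-1} = u_0 (assumption) and nu = alpha - 1 (assumption).
alpha, gamma = 3.0, 0.5
u0, u_star = [1.0, 1.0], [0.0, 0.0]
e0 = energy(0, alpha, alpha - 1.0, gamma, r_t=1.0, u_t=u0, u_prev=u0, u_star=u_star)
print(e0)
```

With these assumed choices, $E_{0}=(\alpha-1)^{2}\big(r_{0}+\frac{h_{0}}{2\gamma}\big)$, which is exactly $C^{2}$ in (A.63). A non-increasing energy then gives $k_{t}^{2}r_{t}\leq E_{t}\leq E_{0}$, which is the shape of the bound (A.62).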

The Descent Lemma A.5 (with $y=q_{t}$ and $x=u_{\ast}$) implies

$$
\begin{aligned}
2\gamma\big(D_{t}(u_{t+1})-D_{t}(u_{\ast})\big)&\leq\left\lVert u_{t}-u_{\ast}+\alpha_{t}(u_{t}-u_{t-1})\right\rVert^{2}-h_{t+1} \\
&=h_{t}-h_{t+1}+\alpha_{t}^{2}\delta_{t}+2\alpha_{t}\left\langle u_{t}-u_{t-1},u_{t}-u_{\ast}\right\rangle \\
&=h_{t}-h_{t+1}+\alpha_{t}^{2}\delta_{t}+\alpha_{t}\big(\delta_{t}+h_{t}-h_{t-1}\big)
\end{aligned}
\tag{A.67}
$$
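The last equality in (A.67) rests on a standard three-point identity. As a reminder, here is a short verification, where $a$, $b$, $c$ are generic vectors introduced only for this remark:

```latex
\begin{align*}
\left\lVert b-c\right\rVert^{2}
&=\left\lVert (a-c)-(a-b)\right\rVert^{2}
=\left\lVert a-c\right\rVert^{2}-2\left\langle a-b,a-c\right\rangle+\left\lVert a-b\right\rVert^{2}, \\
\text{hence}\quad
2\left\langle a-b,a-c\right\rangle
&=\left\lVert a-b\right\rVert^{2}+\left\lVert a-c\right\rVert^{2}-\left\lVert b-c\right\rVert^{2}.
\end{align*}
% With a = u_t, b = u_{t-1}, c = u_\ast this reads
% 2 <u_t - u_{t-1}, u_t - u_\ast> = \delta_t + h_t - h_{t-1}.
```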

By multiplying (A.67) by $\nu k_{t+1}$ and using the definition of $\alpha_{t}$ we derive:

$$
\begin{aligned}
2\gamma\nu k_{t+1}\big(D_{t}(u_{t+1})-D_{t}(u_{\ast})\big)&\leq\nu k_{t+1}\big(h_{t}-h_{t+1}\big)+\nu k_{t+1}\alpha_{t}^{2}\delta_{t}+\nu k_{t+1}\alpha_{t}\big(\delta_{t}+h_{t}-h_{t-1}\big) \\
&=\nu k_{t+1}\big(h_{t}-h_{t+1}\big)+\frac{\nu t^{2}}{k_{t+1}}\delta_{t}+\nu t\big(\delta_{t}+h_{t}-h_{t-1}\big)
\end{aligned}
\tag{A.68}
$$

On the other hand, the Descent Lemma A.5 (with $y=q_{t}$ and $x=u_{t}$) yields:

$$
2\gamma\big(D_{t}(u_{t+1})-D_{t}(u_{t})\big)\leq\alpha_{t}^{2}\delta_{t}-\delta_{t+1} \tag{A.69}
$$

which, after multiplying by $k_{t+1}^{2}\big(1-\frac{\nu}{k_{t+1}}\big)$ (here $\nu\leq\alpha+1$) and using the definition of $\alpha_{t}$, gives

\[
2\gamma\big(k_{t+1}^2-\nu k_{t+1}\big)\big(D_t(u_{t+1})-D_t(u_t)\big)\leq\Big(t^2-\frac{\nu t^2}{k_{t+1}}\Big)\delta_t-\big(k_{t+1}^2-\nu k_{t+1}\big)\delta_{t+1}
\tag{A.70}
\]

By adding relations (A.67) and (A.70), we obtain

\[
\begin{aligned}
2\gamma\Big(k_{t+1}^2D_t(u_{t+1})-k_{t+1}^2D_t(u_t)+\nu k_{t+1}\big(D_t(u_t)-D_t(u_\ast)\big)\Big)
&\leq\Big(t^2-\frac{\nu t^2}{k_{t+1}}\Big)\delta_t-\big(k_{t+1}^2-\nu k_{t+1}\big)\delta_{t+1}\\
&\qquad+\nu k_{t+1}\big(h_t-h_{t+1}\big)+\frac{\nu t^2}{k_{t+1}}\delta_t+\nu t\big(\delta_t+h_t-h_{t-1}\big)\\
&=t^2\delta_t-k_{t+1}^2\delta_{t+1}+\nu t\big(\delta_t+h_t-h_{t-1}\big)-\nu k_{t+1}\big(h_{t+1}-h_t-\delta_{t+1}\big)
\end{aligned}
\tag{A.71}
\]

Next, by using the non-increasing property of $D_t$ (in particular $D_{t+1}(\cdot)\leq D_t(\cdot)$, see Lemma A.6), adding and subtracting $(k_{t+1}^2-\nu k_{t+1})D_\infty(u_\ast)$ on the left-hand side of (A.71), and using the definition of $r_t$, we derive:

\[
\begin{aligned}
2\gamma\Big(k_{t+1}^2r_{t+1}-\big(k_{t+1}^2-\nu k_{t+1}\big)r_t&-\nu k_{t+1}\big(D_t(u_\ast)-D_\infty(u_\ast)\big)\Big)\leq\\
&t^2\delta_t-k_{t+1}^2\delta_{t+1}+\nu t\big(\delta_t+h_t-h_{t-1}\big)-\nu k_{t+1}\big(h_{t+1}-h_t-\delta_{t+1}\big).
\end{aligned}
\tag{A.72}
\]

Since $\lambda_t\leq\lVert u_\ast\rVert^{-1}$, we have $\lambda_tu_\ast\in[-1,0]^n$ and $D_t(u_\ast)=D_\infty(u_\ast)$ for all $t\geq 1$; thus from the previous relation we get:

\[
2\gamma\Big(k_{t+1}^2r_{t+1}-\big(k_{t+1}^2-\nu k_{t+1}\big)r_t\Big)\leq t^2\delta_t-k_{t+1}^2\delta_{t+1}+\nu t\big(\delta_t+h_t-h_{t-1}\big)-\nu k_{t+1}\big(h_{t+1}-h_t-\delta_{t+1}\big)
\tag{A.73}
\]

Rearranging and subtracting $2\gamma k_t^2r_t$ from both sides of the previous inequality, we have

\[
\begin{aligned}
2\gamma k_{t+1}^2r_{t+1}-2\gamma k_t^2r_t&\leq 2\gamma\rho_{t+1}r_t+t^2\delta_t-k_{t+1}^2\delta_{t+1}+\nu t\big(\delta_t+h_t-h_{t-1}\big)\\
&\qquad-\nu k_{t+1}\big(h_{t+1}-h_t-\delta_{t+1}\big)
\end{aligned}
\tag{A.74}
\]

with $\rho_{t+1}=k_{t+1}^2-\nu k_{t+1}-k_t^2$.

Considering now the variation of the sequence $v_t$ and performing some basic algebraic computations, we have:

\[
\begin{aligned}
v_{t+1}-v_t&=\nu^2\big(\lVert u_t-u_\ast\rVert^2-\lVert u_{t-1}-u_\ast\rVert^2\big)+k_{t+1}^2\lVert u_{t+1}-u_t\rVert^2-k_t^2\lVert u_t-u_{t-1}\rVert^2\\
&\qquad+2\nu k_{t+1}\langle u_{t+1}-u_t,\,u_t-u_\ast\rangle-2\nu k_t\langle u_t-u_{t-1},\,u_{t-1}-u_\ast\rangle\\
&=\nu^2\big(h_t-h_{t-1}\big)+k_{t+1}^2\delta_{t+1}-k_t^2\delta_t+\nu k_{t+1}\big(h_{t+1}-h_t-\delta_{t+1}\big)-\nu k_t\big(h_t-h_{t-1}-\delta_t\big)
\end{aligned}
\tag{A.75}
\]
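The identity (A.75) is purely algebraic, so it can be sanity-checked numerically on random vectors. The sketch below does this under assumptions read off from the first line of (A.75), since the definitions themselves are stated earlier in the appendix: $v_t=\lVert\nu(u_{t-1}-u_\ast)+k_t(u_t-u_{t-1})\rVert^2$, $h_t=\lVert u_t-u_\ast\rVert^2$ and $\delta_t=\lVert u_t-u_{t-1}\rVert^2$. The function names and the random test vectors are hypothetical and not part of the paper.

```python
import random


def sq_norm(x):
    """Squared Euclidean norm of a list of floats."""
    return sum(xi * xi for xi in x)


def sub(x, y):
    """Componentwise difference x - y."""
    return [xi - yi for xi, yi in zip(x, y)]


def v(nu, k, u_prev, u_cur, u_star):
    # Assumed definition: v_t = || nu (u_{t-1} - u_*) + k_t (u_t - u_{t-1}) ||^2
    a = sub(u_prev, u_star)
    b = sub(u_cur, u_prev)
    return sq_norm([nu * ai + k * bi for ai, bi in zip(a, b)])


def check_identity(seed=0, n=5, nu=2.0, alpha=3.0, t=4):
    """Return both sides of (A.75) for random stand-ins of u_{t-1}, u_t, u_{t+1}, u_*."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    u_star, u_tm1, u_t, u_tp1 = rand_vec(), rand_vec(), rand_vec(), rand_vec()
    k_t, k_tp1 = t + alpha - 1, t + alpha  # k_t = t + alpha - 1

    # h_s = ||u_s - u_*||^2 and delta_s = ||u_s - u_{s-1}||^2 (assumed notation)
    h_tm1 = sq_norm(sub(u_tm1, u_star))
    h_t = sq_norm(sub(u_t, u_star))
    h_tp1 = sq_norm(sub(u_tp1, u_star))
    d_t = sq_norm(sub(u_t, u_tm1))
    d_tp1 = sq_norm(sub(u_tp1, u_t))

    lhs = v(nu, k_tp1, u_t, u_tp1, u_star) - v(nu, k_t, u_tm1, u_t, u_star)
    rhs = (nu ** 2 * (h_t - h_tm1) + k_tp1 ** 2 * d_tp1 - k_t ** 2 * d_t
           + nu * k_tp1 * (h_tp1 - h_t - d_tp1)
           - nu * k_t * (h_t - h_tm1 - d_t))
    return lhs, rhs
```

Both returned values should agree up to floating-point rounding for any choice of the seed, of `nu` and of `alpha`, since only the expansion of a squared norm and the polarization identity are involved.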

By adding $v_{t+1}-v_t$ to both sides of (A.74) and using (A.75), we obtain:

\[
\begin{aligned}
2\gamma k_{t+1}^2r_{t+1}+v_{t+1}-2\gamma k_t^2r_t-v_t&\leq 2\gamma\rho_{t+1}r_t+\big(t^2+\nu t+\nu k_t-k_t^2\big)\delta_t\\
&\quad+\nu\big(\nu+t-k_t\big)\big(h_t-h_{t-1}\big)
\end{aligned}
\tag{A.76}
\]

Since $k_t=t+\alpha-1$, (A.76) and the definition of $E_t$ (A.66) yield:

\[
2\gamma\big(E_{t+1}-E_t\big)\leq 2\gamma\rho_{t+1}r_t+(\nu-\alpha+1)\big(2t+\alpha-1\big)\delta_t+\nu(\nu-\alpha+1)\big(h_t-h_{t-1}\big)
\tag{A.77}
\]

Thus, setting $\nu=\alpha-1$ and $\alpha\geq 3$ (notice that in this case $\rho_{t+1}=k_{t+1}^2-\nu k_{t+1}-k_t^2=(3-\alpha)k_t+2-\alpha\leq 0$, while the factor $\nu-\alpha+1$ vanishes), we get:

\[
E_{t+1}-E_t\leq 0,
\tag{A.78}
\]
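The step from (A.76) to (A.78) rests on two elementary identities in $t$, $\alpha$ and $\nu$. As a quick check, the following sketch verifies them with exact integer arithmetic over a small grid; the function names and the grid of values are illustrative choices, while the formulas are those of (A.74), (A.76) and (A.77) with $k_t=t+\alpha-1$.

```python
def rho(t, alpha, nu):
    """rho_{t+1} = k_{t+1}^2 - nu * k_{t+1} - k_t^2 with k_t = t + alpha - 1."""
    k_t, k_tp1 = t + alpha - 1, t + alpha
    return k_tp1 ** 2 - nu * k_tp1 - k_t ** 2


def delta_coeff(t, alpha, nu):
    """Coefficient of delta_t in (A.76): t^2 + nu*t + nu*k_t - k_t^2."""
    k_t = t + alpha - 1
    return t ** 2 + nu * t + nu * k_t - k_t ** 2


def check(alpha_values=range(3, 8), t_values=range(1, 50)):
    """Verify the simplifications used between (A.76) and (A.78) on an integer grid."""
    for alpha in alpha_values:
        for t in t_values:
            k_t = t + alpha - 1
            # Factorization behind (A.77), valid for any nu.
            for nu in range(0, alpha + 2):
                assert delta_coeff(t, alpha, nu) == (nu - alpha + 1) * (2 * t + alpha - 1)
            # With nu = alpha - 1 the delta_t coefficient vanishes and rho_{t+1} <= 0.
            nu = alpha - 1
            assert delta_coeff(t, alpha, nu) == 0
            assert rho(t, alpha, nu) == (3 - alpha) * k_t + 2 - alpha
            assert rho(t, alpha, nu) <= 0
    return True
```

Integer values of $\alpha$ are used only so that the comparisons are exact; the identities themselves are polynomial and hold for real $\alpha\geq 3$ as well.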

which shows that the energy sequence $E_t$ is non-increasing. In addition, by neglecting the non-negative terms in the definition of $E_t$ (A.66) and using its non-increasing property, we derive:

\[
k_t^2\big(D_t(u_t)-D_\infty(u_\ast)\big)\leq E_t\leq E_0
\tag{A.79}
\]

Finally, by the definition of $E_t$ (A.66), we get:

\[
r_t=D_t(u_t)-D_\infty(u_\ast)\leq\frac{C}{\big(t+\alpha-1\big)^2}
\tag{A.80}
\]
\[
\text{with}\qquad C=E_0=(\alpha-1)^2\bigg(2\gamma\big(D_0(u_0)-D_\infty(u_\ast)\big)+\frac{\lVert u_0-u_\ast\rVert^2}{2\gamma}\bigg)
\tag{A.81}
\]

which concludes the proof of Proposition A.2. ∎

Proof of Theorem 5.2.

Lemma A.7 and the non-increasing property of $D_t$ given in Lemma A.6, together with the upper bound (A.62) in Proposition A.2 for the dual iterates, yield

\[
\lVert w_t-w_\ast\rVert\leq\sqrt{2\big(D_\infty(u_t)-D_\infty(u_\ast)\big)}\leq\sqrt{2\big(D_t(u_t)-D_\infty(u_\ast)\big)}\leq\frac{C}{t+\alpha-1},
\tag{A.82}
\]

where $C=(\alpha-1)\bigg(2\big(D_0(u_0)-D_\infty(u_\ast)\big)+\frac{\lVert u_0-u_\ast\rVert^2}{\gamma}\bigg)^{1/2}$, as follows from (A.63). Using the definitions of $D_0$ and $D_\infty$ in the bound (A.82) concludes the proof of the first part of Theorem 5.2. Regarding the margin and the angle gap rates, we proceed as in the proof of Theorem 5.1. By using the triangle inequality and (A.82), and setting

\[
t^\ast=(\alpha-1)\left(2\left(\frac{\lVert w_0\rVert^2}{\lVert w_\ast\rVert^2}-1+2\langle\mathbf{1},\,u_0-u_\ast\rangle+\frac{\lVert u_0-u_\ast\rVert^2}{\gamma\lVert w_\ast\rVert^2}\right)^{1/2}-1\right),
\tag{A.83}
\]

we deduce that for all $t\geq t^\ast$, we have $\lVert w_t\rVert\geq\frac{1}{2}\lVert w_\ast\rVert$. Therefore, thanks to Lemma A.4, the proof follows directly from the bound (A.82). ∎
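The shape of the threshold (A.83) can be motivated by the following short computation, given here as a sketch for a generic constant. It assumes only the bound (A.82), written as $\lVert w_t-w_\ast\rVert\leq C/(t+\alpha-1)$ with $C=(\alpha-1)K$; the symbol $K$ is introduced here for illustration and stands for the square-root factor in $C$.

```latex
\begin{align*}
\lVert w_t\rVert
  &\geq \lVert w_\ast\rVert-\lVert w_t-w_\ast\rVert
   \geq \lVert w_\ast\rVert-\frac{(\alpha-1)K}{t+\alpha-1},\\
\lVert w_\ast\rVert-\frac{(\alpha-1)K}{t+\alpha-1}\geq\frac{1}{2}\lVert w_\ast\rVert
  &\iff t+\alpha-1\geq\frac{2(\alpha-1)K}{\lVert w_\ast\rVert}
   \iff t\geq(\alpha-1)\left(\frac{2K}{\lVert w_\ast\rVert}-1\right).
\end{align*}
```

The first line is the reverse triangle inequality; the second shows that a threshold of the form $(\alpha-1)\big(2K/\lVert w_\ast\rVert-1\big)$ is enough to guarantee $\lVert w_t\rVert\geq\frac{1}{2}\lVert w_\ast\rVert$.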

A.5 Proof of Theorem 5.3

In this paragraph we provide the proof of Theorem 5.3 regarding the stability properties of Algorithm 1 in the presence of noise, as discussed in Section 5.2.

Proof of Theorem 5.3.

Without loss of generality, let us assume that $S_N=\{1,\dots,N\}$, i.e. the set of flipped labels consists of the first $N$ indices (up to a re-indexing, one can always reduce to this case). Let also $\tilde{u}_0=u_0=0$ and let $\{(\tilde{w}_t,\tilde{u}_t)\}_{t\geq 1}$, $\{w_t,u_t\}_{t\geq 0}$ be the pairs of sequences generated by Algorithm 1 applied to the noisy data $(X,\tilde{Y})$ and to the true data $(X,Y)$, respectively.

By the triangle inequality, the definition of $\tilde{w}_t$ and $w_t$ in Algorithm 1, and the definition of the operator norm, for all $t\geq 0$ it holds:

\[
\begin{aligned}
\lVert\tilde{w}_t-w_\ast\rVert&\leq\lVert\tilde{w}_t-w_t\rVert+\lVert w_t-w_\ast\rVert\\
&=\lVert\tilde{Z}^\top\tilde{u}_t-Z^\top u_t\rVert+\lVert w_t-w_\ast\rVert\\
&\leq\lVert\tilde{Z}^\top\rVert_{op}\lVert\tilde{u}_t-u_t\rVert+\lVert(\tilde{Z}^\top-Z^\top)u_t\rVert+\lVert w_t-w_\ast\rVert
\end{aligned}
\tag{A.84}
\]
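The second inequality in (A.84) comes from writing $\tilde{Z}^\top\tilde{u}_t-Z^\top u_t=\tilde{Z}^\top(\tilde{u}_t-u_t)+(\tilde{Z}^\top-Z^\top)u_t$ and applying the triangle inequality. The sketch below checks this split numerically on synthetic data. It assumes that the rows of $Z$ are $y_ix_i$ and that $\tilde{Z}$ is obtained by flipping the first `n_flipped` labels; the data, the stand-in dual vectors and all function names are hypothetical, and the further bound by the operator norm is not computed.

```python
import math
import random


def zt_times(X, y, u):
    """Return Z^T u = sum_i u_i * y_i * x_i, assuming the rows of Z are y_i * x_i."""
    d = len(X[0])
    return [sum(u[i] * y[i] * X[i][j] for i in range(len(X))) for j in range(d)]


def norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in vec))


def split_terms(seed=0, n=8, d=3, n_flipped=2):
    """Return ||Z~^T u~ - Z^T u|| and the two stability terms that bound it."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [rng.choice([-1, 1]) for _ in range(n)]
    # Noisy labels: the first n_flipped labels are flipped, as in the proof.
    y_noisy = [-yi if i < n_flipped else yi for i, yi in enumerate(y)]
    # Arbitrary stand-ins for the dual iterates u_t and u~_t.
    u = [rng.uniform(-1.0, 0.0) for _ in range(n)]
    u_noisy = [rng.uniform(-1.0, 0.0) for _ in range(n)]

    w = zt_times(X, y, u)                     # Z^T u
    w_noisy = zt_times(X, y_noisy, u_noisy)   # Z~^T u~
    lhs = norm([a - b for a, b in zip(w_noisy, w)])

    diff_u = [a - b for a, b in zip(u_noisy, u)]
    term_dual = norm(zt_times(X, y_noisy, diff_u))  # ||Z~^T (u~ - u)||
    term_data = norm([a - b for a, b in zip(zt_times(X, y_noisy, u), w)])  # ||(Z~ - Z)^T u||
    return lhs, term_dual, term_data
```

For any seed, the first returned value should not exceed the sum of the other two. Setting `n_flipped=0` makes the data term vanish, which isolates the contribution of the difference between the dual iterates.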

The right-hand side of (A.84) consists of three terms: the first two are related to the stability properties of Algorithm 1, while the last one is related to its optimization properties. By using the convergence result (5.4) in Theorem 5.1, the bound (A.84) takes the following form:

$$
\left\lVert \tilde{w}_{t}-w_{\ast} \right\rVert \leq \left\lVert \tilde{Z}^{\top}(\tilde{u}_{t}-u_{t}) \right\rVert + \left\lVert (\tilde{Z}^{\top}-Z^{\top})u_{t} \right\rVert + C(1-\rho)^{\frac{t}{2}}
\tag{A.85}
$$

where $C=\sqrt{\left\lVert Z^{\top}u_{0} \right\rVert^{2}-\left\lVert Z^{\top}u_{\ast} \right\rVert^{2}+2\left\langle \mathbf{1}, u_{0}-u_{\ast} \right\rangle}$ and $\rho=\frac{\gamma\mu}{1+\gamma\mu}$.

From the definition of the noise model $\tilde{y}_{i}y_{i}=-1$, for all $i\in S_{N}$, the $i$-th row of the matrix $\tilde{Z}^{\top}-Z^{\top}$ can be expressed as follows:

$$
(\tilde{Z}^{\top}-Z^{\top})_{i}=(\tilde{y}_{i}x_{i}-y_{i}x_{i})=
\begin{cases}
-2y_{i}x_{i} & \text{if } i\in S_{N} \\
0 & \text{if } i\notin S_{N}
\end{cases}
\tag{A.86}
$$

By using the expression (A.86) the second term in the right-hand side of (A.84) can be bounded as follows:

$$
\begin{aligned}
\left\lVert (\tilde{Z}^{\top}-Z^{\top})u_{t} \right\rVert
&= 2\left(\sum_{i=1}^{N}(u_{t}^{i})^{2}\left\lVert x_{i} \right\rVert^{2}\right)^{\frac{1}{2}} \leq 2K\sqrt{N}\left\lVert u_{t} \right\rVert \\
&\leq 2K\sqrt{N}\left(\left\lVert u_{t}-u_{\ast} \right\rVert + \left\lVert u_{\ast} \right\rVert\right)
\end{aligned}
\tag{A.87}
$$

where we used that $\left\lVert x_{i} \right\rVert\leq K$ for all $i\in\{1,\dots,n\}$ and $\sum_{i=1}^{N}\lvert u_{t}^{i}\rvert\leq\left\lVert u_{t} \right\rVert$. In addition, since $u_{t}$ is generated by Algorithm 1, it satisfies the Fejér property with respect to any minimizer $u_{\ast}$ of $D_{\infty}$, i.e. $\left\lVert u_{t}-u_{\ast} \right\rVert\leq\left\lVert u_{0}-u_{\ast} \right\rVert$ (this is an immediate consequence of Lemma A.8). Therefore, from (A.87), for all $t\geq 1$, it follows:

$$
\left\lVert (\tilde{Z}^{\top}-Z^{\top})u_{t} \right\rVert \leq 2K\left(\left\lVert u_{0}-u_{\ast} \right\rVert + \left\lVert u_{\ast} \right\rVert\right)\sqrt{N}
\tag{A.88}
$$

For the first term in the right-hand side of (A.84), by using Algorithm 1, we obtain:

$$
\begin{aligned}
\left\lVert \tilde{u}_{t+1}-u_{t+1} \right\rVert
&= \left\lVert \operatorname{prox}_{\frac{\gamma}{\lambda_{t}}\mathcal{L}^{\ast}(\lambda_{t}\cdot)}\left(\tilde{u}_{t}-\gamma\tilde{Z}\tilde{Z}^{\top}\tilde{u}_{t}\right) - \operatorname{prox}_{\frac{\gamma}{\lambda_{t}}\mathcal{L}^{\ast}(\lambda_{t}\cdot)}\left(u_{t}-\gamma ZZ^{\top}u_{t}\right) \right\rVert \\
&= \left\lVert \lambda_{t}^{-1}\operatorname{prox}_{\gamma\lambda_{t}\mathcal{L}^{\ast}}\left(\lambda_{t}\left(\tilde{u}_{t}-\gamma\tilde{Z}\tilde{Z}^{\top}\tilde{u}_{t}\right)\right) - \lambda_{t}^{-1}\operatorname{prox}_{\gamma\lambda_{t}\mathcal{L}^{\ast}}\left(\lambda_{t}\left(u_{t}-\gamma ZZ^{\top}u_{t}\right)\right) \right\rVert \\
&\leq \left\lVert \tilde{u}_{t}-\gamma\tilde{Z}\tilde{Z}^{\top}\tilde{u}_{t}-u_{t}+\gamma ZZ^{\top}u_{t} \right\rVert \\
&\leq \left\lVert \left(\text{Id}-\gamma\tilde{Z}\tilde{Z}^{\top}\right)(\tilde{u}_{t}-u_{t}) \right\rVert + \gamma\left\lVert \left(ZZ^{\top}-\tilde{Z}\tilde{Z}^{\top}\right)u_{t} \right\rVert \\
&\leq \left\lVert \tilde{u}_{t}-u_{t} \right\rVert + \gamma\left\lVert \left(ZZ^{\top}-\tilde{Z}\tilde{Z}^{\top}\right)u_{t} \right\rVert
\end{aligned}
\tag{A.89}
$$

where in the second equality we used the scaling property of the proximal operator [11, Theorem 6.12] and in the first inequality its non-expansiveness [10, Proposition 12.28]. The second inequality is the triangle inequality, and the last one follows from the non-expansiveness of the operator $\text{Id}-\gamma\tilde{Z}\tilde{Z}^{\top}$ (since $\gamma\leq\left\lVert \tilde{Z}\tilde{Z}^{\top} \right\rVert_{op}^{-1}$).
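The last step can be checked numerically. The following is a minimal sketch, assuming synthetic Gaussian data and assuming that the rows of $\tilde{Z}$ are the vectors $\tilde{y}_{i}x_{i}$; the names `X`, `y_noisy`, `Z_tilde` and `M` are illustrative and do not come from the paper. It verifies that $\text{Id}-\gamma\tilde{Z}\tilde{Z}^{\top}$ has operator norm at most one when $\gamma$ is the largest admissible step size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 5

# Synthetic data: an assumption made only for this illustration.
X = rng.normal(size=(n, d))
y_noisy = rng.choice([-1.0, 1.0], size=n)

# Assumed layout: row i of Z_tilde is y~_i * x_i.
Z_tilde = y_noisy[:, None] * X
G = Z_tilde @ Z_tilde.T                  # n x n positive semidefinite Gram matrix

# Largest step size allowed by the condition gamma <= ||Z~ Z~^T||_op^{-1}.
gamma = 1.0 / np.linalg.norm(G, 2)

# The eigenvalues of M lie in [0, 1], so M is non-expansive.
M = np.eye(n) - gamma * G
op_norm = np.linalg.norm(M, 2)

# Direct check on one random pair of points.
a, b = rng.normal(size=n), rng.normal(size=n)
lhs = np.linalg.norm(M @ (a - b))
rhs = np.linalg.norm(a - b)
print(op_norm, lhs <= rhs)
```

Since $\tilde{Z}\tilde{Z}^{\top}$ is positive semidefinite, any smaller positive $\gamma$ also keeps the eigenvalues of the operator in $[0,1]$.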

By applying recursively relation (A.89), since $\tilde{u}_{0}=u_{0}$, for all $t\geq 1$, it follows:

$$
\left\lVert \tilde{u}_{t}-u_{t} \right\rVert \leq \sum_{s=0}^{t-1}\left\lVert Bu_{s} \right\rVert
\tag{A.90}
$$

where we used the notation $B:=\tilde{Z}\tilde{Z}^{\top}-ZZ^{\top}$.

By using the noise model $\tilde{y}_{i}y_{i}=-1$, for all $i\in S_{N}$, the matrix $B=\tilde{Z}\tilde{Z}^{\top}-ZZ^{\top}$ can be expressed element-wise as follows:

$$
(B)_{i,j}=(\tilde{y}_{i}\tilde{y}_{j}-y_{i}y_{j})\left\langle x_{i}, x_{j} \right\rangle=
\begin{cases}
-2\left\langle x_{i}, x_{j} \right\rangle & \text{if } (i,j)\in\left(S_{N}^{c}\times S_{N}\right)\cup\left(S_{N}\times S_{N}^{c}\right) \\
0 & \text{if } (i,j)\in\left(S_{N}\times S_{N}\right)\cup\left(S_{N}^{c}\times S_{N}^{c}\right)
\end{cases}
\tag{A.91}
$$
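The block structure of $B$ can be illustrated numerically. The sketch below assumes synthetic data, takes $S_{N}$ to be the first $N$ indices, and assumes that the rows of $Z$ and $\tilde{Z}$ are $y_{i}x_{i}$ and $\tilde{y}_{i}x_{i}$; all variable names are hypothetical. It checks that $B$ vanishes when both indices are flipped or both are clean, and that each mixed entry equals $(\tilde{y}_{i}\tilde{y}_{j}-y_{i}y_{j})\langle x_{i},x_{j}\rangle=-2y_{i}y_{j}\langle x_{i},x_{j}\rangle$, whose magnitude is $2\lvert\langle x_{i},x_{j}\rangle\rvert$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, N = 7, 4, 3

# Synthetic points and clean labels: assumptions for illustration only.
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Noise model: labels are flipped exactly on S_N (here, the first N indices).
y_tilde = y.copy()
y_tilde[:N] *= -1.0

# Assumed layout: row i of Z is y_i * x_i, row i of Z_tilde is y~_i * x_i.
Z = y[:, None] * X
Z_tilde = y_tilde[:, None] * X
B = Z_tilde @ Z_tilde.T - Z @ Z.T

gram = X @ X.T
flipped = np.zeros(n, dtype=bool)
flipped[:N] = True

# mixed[i, j] is True when exactly one of i, j belongs to S_N.
mixed = flipped[:, None] ^ flipped[None, :]

# Entry-wise prediction: -2 y_i y_j <x_i, x_j> on mixed pairs, 0 elsewhere.
expected = np.where(mixed, -2.0 * np.outer(y, y) * gram, 0.0)
print(np.allclose(B, expected))
```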

By the definition of the Euclidean norm and the expression (A.91), for any $u\in\mathbb{R}^{n}$, the term $\left\lVert Bu \right\rVert$ can be bounded as follows:

$$
\begin{aligned}
\left\lVert Bu \right\rVert^{2}
&= \left\lVert \left(\tilde{Z}\tilde{Z}^{\top}-ZZ^{\top}\right)u \right\rVert^{2} \\
&= 4\Bigg(\left(\sum_{i=N+1}^{n}u_{i}\left\langle x_{1}, x_{i} \right\rangle\right)^{2}+\dots+\left(\sum_{i=N+1}^{n}u_{i}\left\langle x_{N}, x_{i} \right\rangle\right)^{2} \\
&\qquad\quad +\left(\sum_{i=1}^{N}u_{i}\left\langle x_{N+1}, x_{i} \right\rangle\right)^{2}+\dots+\left(\sum_{i=1}^{N}u_{i}\left\langle x_{n}, x_{i} \right\rangle\right)^{2}\Bigg) \\
&\leq 8\Bigg(\left(\left\lVert x_{1} \right\rVert^{2}+\dots+\left\lVert x_{N} \right\rVert^{2}\right)\sum_{i=N+1}^{n}(u_{i})^{2}\left\lVert x_{i} \right\rVert^{2} \\
&\qquad\quad +\left(\left\lVert x_{N+1} \right\rVert^{2}+\dots+\left\lVert x_{n} \right\rVert^{2}\right)\sum_{i=1}^{N}(u_{i})^{2}\left\lVert x_{i} \right\rVert^{2}\Bigg) \\
&\leq 8\left(K^{4}N\sum_{i=N+1}^{n}(u_{i})^{2}+K^{4}(n-N)\sum_{i=1}^{N}(u_{i})^{2}\right) \\
&= 8K^{4}\left(N\left\lVert u \right\rVert^{2}+(n-2N)\sum_{i=1}^{N}(u_{i})^{2}\right) \\
&\leq 8K^{4}(n+1-2N)N\left\lVert u \right\rVert^{2}
\end{aligned}
\tag{A.92}
$$

where in the first inequality we used the convexity of the squared norm, in the second one the definition of $K=\max_{i\leq n}\{\left\lVert x_{i} \right\rVert\}$, and in the last one the convention $N\leq\frac{n}{2}$.

Therefore, by using (A.92) in (A.90), it follows that:

$$
\begin{aligned}
\left\lVert \tilde{u}_{t}-u_{t} \right\rVert
&\leq 2\sqrt{2}K^{2}\sqrt{2(n+1-2N)N}\sum_{s=0}^{t-1}\left\lVert u_{s} \right\rVert \\
&\leq 2\sqrt{2}K^{2}\sqrt{2(n+1-2N)N}\sum_{s=0}^{t-1}\left(\left\lVert u_{s}-u_{\ast} \right\rVert+\left\lVert u_{\ast} \right\rVert\right) \\
&\leq 2\sqrt{2}K^{2}\sqrt{2(n+1-2N)N}\left(\left\lVert u_{0}-u_{\ast} \right\rVert+\left\lVert u_{\ast} \right\rVert\right)t
\end{aligned}
\tag{A.93}
$$

where we used the triangle inequality $\left\lVert u_{s} \right\rVert\leq\left\lVert u_{s}-u_{\ast} \right\rVert+\left\lVert u_{\ast} \right\rVert$ and the Fejér property of the sequence $u_{s}$ with respect to any minimizer $u_{\ast}$ of $D_{\infty}$, i.e. $\left\lVert u_{s}-u_{\ast} \right\rVert\leq\left\lVert u_{0}-u_{\ast} \right\rVert$ (as an immediate consequence of Lemma A.8).

Finally, by using the bounds (A.88) and (A.93) in the estimate (A.85), for all $t\geq 1$, it holds:

$$
\left\lVert \tilde{w}_{t}-w_{\ast} \right\rVert \leq C_{1}\sqrt{2(n+1-2N)N}\,t+C_{2}\sqrt{N}+C(1-\rho)^{\frac{t}{2}}
\tag{A.94}
$$

where $C_{1}=2\sqrt{2}K^{2}\left(\left\lVert u_{0}-u_{\ast} \right\rVert+\left\lVert u_{\ast} \right\rVert\right)$, $C_{2}=2K\left(\left\lVert u_{0}-u_{\ast} \right\rVert+\left\lVert u_{\ast} \right\rVert\right)$, $C=\sqrt{\left\lVert Z^{\top}u_{0} \right\rVert^{2}-\left\lVert Z^{\top}u_{\ast} \right\rVert^{2}+2\left\langle \mathbf{1}, u_{0}-u_{\ast} \right\rangle}$ and $\rho=\frac{\gamma\mu}{1+\gamma\mu}$.

The (worst-case) optimal stopping time

$$
t_{\ast}(N):=\max\left\{1,\ \frac{2}{\ln\left(\frac{1}{1-\rho}\right)}\ln\left(\frac{C\ln\left(\frac{1}{1-\rho}\right)}{2C_{1}\sqrt{2(n+1-2N)N}}\right)\right\},
$$

as expressed in Theorem 5.3, follows by minimizing the right-hand side of (A.94) over $t$. Evaluating (A.94) at $t=t_{\ast}(N)$ then yields (5.14), which concludes the proof of Theorem 5.3. ∎
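As a sanity check on the stopping time, the sketch below evaluates the right-hand side of (A.94) and compares the closed-form $t_{\ast}(N)$ with a grid search. The numerical values of `C`, `C1`, `C2`, `rho`, `n` and `N` are hypothetical, chosen only so that the minimizer is larger than one; the function names are illustrative as well.

```python
import math

def noise_slope(C1, n, N):
    """Coefficient of t in (A.94): C1 * sqrt(2 (n + 1 - 2N) N)."""
    return C1 * math.sqrt(2.0 * (n + 1 - 2 * N) * N)

def bound(t, C, C1, C2, rho, n, N):
    """Right-hand side of (A.94) at iteration t."""
    return noise_slope(C1, n, N) * t + C2 * math.sqrt(N) + C * (1.0 - rho) ** (t / 2.0)

def stopping_time(C, C1, rho, n, N):
    """Closed-form stationary point of the bound in t, clipped below at 1."""
    L = math.log(1.0 / (1.0 - rho))
    return max(1.0, (2.0 / L) * math.log(C * L / (2.0 * noise_slope(C1, n, N))))

# Hypothetical constants, for illustration only.
params = dict(C=1e4, C1=0.01, rho=0.1, n=100, N=5)
C2 = 0.1

t_star = stopping_time(**params)
best = bound(t_star, C2=C2, **params)

# Grid search over t in [1, 400]; the bound is convex in t,
# so the stationary point should not be beaten on the grid.
grid = [1.0 + 0.5 * k for k in range(799)]
grid_min = min(bound(t, C2=C2, **params) for t in grid)
print(t_star, best, grid_min)
```

The bound is the sum of a linear term and a decaying exponential in $t$, hence convex, so setting its derivative to zero gives the global minimizer whenever that point is at least one.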

References

  • [1] D. Angluin and P. Laird, Learning from noisy examples, Machine learning, 2 (1988), pp. 343–370.
  • [2] V. Apidopoulos, J.-F. Aujol, and C. Dossal, Convergence rate of inertial forward–backward algorithm beyond Nesterov’s rule, Mathematical Programming, 180 (2020), pp. 137–156.
  • [3] V. Apidopoulos, J.-F. Aujol, C. Dossal, and A. Rondepierre, Convergence rates of an inertial gradient descent algorithm under growth and flatness conditions, Mathematical Programming, 187 (2021), pp. 151–193.
  • [4] V. Apidopoulos, N. Ginatta, and S. Villa, Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition, Journal of Global Optimization, (2022), pp. 1–27.
  • [5] H. Attouch, Viscosity solutions of minimization problems, SIAM Journal on Optimization, 6 (1996), pp. 769–806.
  • [6] H. Attouch, A. Cabot, Z. Chbani, and H. Riahi, Rate of convergence of inertial gradient dynamics with time-dependent viscous damping coefficient, Evolution Equations and Control Theory, 7 (2018), pp. 353–371.
  • [7] H. Attouch, Z. Chbani, J. Peypouquet, and P. Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Mathematical Programming, 168 (2018), pp. 123–175.
  • [8] A. Auslender, J. Crouzeix, and P. Fedit, Penalty-proximal methods in convex programming, Journal of Optimization Theory and Applications, 55 (1987), pp. 1–21.
  • [9] M. Bahraoui and B. Lemaire, Convergence of diagonally stationary sequences in convex optimization, Set-Valued Analysis, 2 (1994), pp. 49–61.
  • [10] H. H. Bauschke and P. L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, Springer Science & Business Media, 2011.
  • [11] A. Beck, First-order methods in optimization, SIAM, 2017.
  • [12] A. Beck and S. Shtern, Linearly convergent away-step conditional gradient for non-strongly convex functions, Mathematical Programming, 164 (2017), pp. 1–27.
  • [13] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM journal on imaging sciences, 2 (2009), pp. 183–202.
  • [14] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter, From error bounds to the complexity of first-order descent methods for convex functions, Mathematical Programming, 165 (2017), pp. 471–507.
  • [15] L. Bottou and O. Bousquet, The tradeoffs of large-scale learning, Optimization for machine learning, (2011), pp. 351–368.
  • [16] S. Boucheron, O. Bousquet, and G. Lugosi, Theory of classification: a survey of some recent advances, ESAIM: Probability and Statistics, 9 (2010), pp. 323–375.
  • [17] O. Bousquet and A. Elisseeff, Stability and generalization, Journal of Machine Learning Research, 2 (2002), pp. 499–526.
  • [18] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge university press, 2004.
  • [19] R. Boyer, Quelques algorithmes diagonaux en optimisation convexe, PhD thesis, Université de Provence, 1974.
  • [20] P. Brianzi, F. Di Benedetto, and C. Estatico, Preconditioned iterative regularization in Banach spaces, Computational Optimization and Applications, 54 (2013), pp. 263–282.
  • [21] L. Calatroni, G. Garrigos, L. Rosasco, and S. Villa, Accelerated iterative regularization via dual diagonal descent, arXiv preprint arXiv:1912.12153, (2019).
  • [22] A. Chambolle and C. Dossal, On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”, Journal of Optimization Theory and Applications, 166 (2015), pp. 968–982.
  • [23] O. Chapelle, Training a support vector machine in the primal, Neural computation, 19 (2007), pp. 1155–1178.
  • [24] L. Chizat and F. Bach, Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss, in Conference on Learning Theory, PMLR, 2020, pp. 1305–1338.
  • [25] K. L. Clarkson, E. Hazan, and D. P. Woodruff, Sublinear optimization for machine learning, Journal of the ACM (JACM), 59 (2012), pp. 1–49.
  • [26] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation, 4 (2005), pp. 1168–1200.
  • [27] C. Cortes and V. Vapnik, Support-vector networks, Machine learning, 20 (1995), pp. 273–297.
  • [28] N. Cristianini, J. Shawe-Taylor, et al., An introduction to support vector machines and other kernel-based learning methods, Cambridge university press, 2000.
  • [29] A. Dieuleveut, A. Durmus, and F. Bach, Bridging the gap between constant step size stochastic gradient descent and Markov chains, The Annals of Statistics, 48 (2020), pp. 1348–1382.
  • [30] A. Dontchev and F. Lempio, Difference methods for differential inclusions: A survey, SIAM Review, 34 (1992), pp. 263–294.
  • [31] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of inverse problems, vol. 375, Springer Science & Business Media, 1996.
  • [32] M. C. Ferris and T. S. Munson, Interior-point methods for massive support vector machines, SIAM Journal on Optimization, 13 (2002), pp. 783–804.
  • [33] Y. Freund and R. E. Schapire, Large margin classification using the perceptron algorithm, in Proceedings of the eleventh annual conference on Computational learning theory, 1998, pp. 209–217.
  • [34] G. Garrigos, L. Rosasco, and S. Villa, Iterative regularization via dual diagonal descent, Journal of Mathematical Imaging and Vision, 60 (2018), pp. 189–215.
  • [35] G. Garrigos, L. Rosasco, and S. Villa, Convergence of the forward-backward algorithm: Beyond the worst case with the help of geometry, Mathematical Programming, (2023), pp. 937–996.
  • [36] O. Güler, Foundations of optimization, vol. 258, Springer Science & Business Media, 2010.
  • [37] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro, Characterizing implicit bias in terms of optimization geometry, in International Conference on Machine Learning, PMLR, 2018, pp. 1832–1841.
  • [38] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, Implicit bias of gradient descent on linear convolutional networks, in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Curran Associates, Inc., 2018, pp. 9461–9471.
  • [39] S. Gunasekar, B. Woodworth, and N. Srebro, Mirrorless mirror descent: A natural derivation of mirror descent, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 2305–2313.
  • [40] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, Implicit regularization in matrix factorization, in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., vol. 30, Curran Associates, Inc., 2017.
  • [41] M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better: Stability of stochastic gradient descent, in International Conference on Machine Learning, PMLR, 2016, pp. 1225–1234.
  • [42] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, The entire regularization path for the support vector machine, Journal of Machine Learning Research, 5 (2004), pp. 1391–1415.
  • [43] A. J. Hoffman, On approximate solutions of systems of linear inequalities, Journal of Research of the National Bureau of Standards, 49 (1952).
  • [44] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky, Gradient descent follows the regularization path for general losses, in Conference on Learning Theory, 2020, pp. 2109–2136.
  • [45] Z. Ji, N. Srebro, and M. Telgarsky, Fast margin maximization via dual acceleration, in International Conference on Machine Learning, PMLR, 2021, pp. 4860–4869.
  • [46] Z. Ji and M. Telgarsky, The implicit bias of gradient descent on nonseparable data, in Conference on Learning Theory, 2019, pp. 1772–1798.
  • [47] Q. Jin, X. Lu, and L. Zhang, Stochastic mirror descent method for linear ill-posed problems in Banach spaces, Inverse Problems, 39 (2023), p. 065010.
  • [48] H. Karimi, J. Nutini, and M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, in Machine Learning and Knowledge Discovery in Databases, P. Frasconi, N. Landwehr, G. Manco, and J. Vreeken, eds., Cham, 2016, Springer International Publishing, pp. 795–811.
  • [49] M. J. Kearns, R. E. Schapire, and L. M. Sellie, Toward efficient agnostic learning, in Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp. 341–352.
  • [50] S. Keerthi and E. Gilbert, Convergence of a generalized SMO algorithm for SVM classifier design, Machine Learning, 46 (2002), pp. 351–360.
  • [51] M. J. Keith, A. Jameson, W. van Straten, M. Bailes, S. Johnston, M. Kramer, A. Possenti, S. D. Bates, N. D. R. Bhat, M. Burgay, S. Burke-Spolaor, N. D’Amico, L. Levin, P. L. McMahon, S. Milia, and B. W. Stappers, The High Time Resolution Universe Pulsar Survey – I. System configuration and initial discoveries, Monthly Notices of the Royal Astronomical Society, 409 (2010), pp. 619–627.
  • [52] S. Kindermann, Optimal-order convergence of Nesterov acceleration for linear ill-posed problems, Inverse Problems, 37 (2021), p. 065002.
  • [53] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
  • [54] J. Lin, L. Rosasco, and D.-X. Zhou, Iterative regularization for learning with convex loss functions, Journal of Machine Learning Research, 17 (2016), pp. 1–38.
  • [55] S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, in Les Équations aux Dérivées Partielles (Paris, 1962), Éditions du Centre National de la Recherche Scientifique, Paris, 1963, pp. 87–89.
  • [56] J. Lopez and J. Dorronsoro, The convergence rate of linearly separable SMO, in Proceedings of the International Joint Conference on Neural Networks, Aug. 2013, pp. 1–7.
  • [57] R. J. Lyon, B. Stappers, S. Cooper, J. M. Brooke, and J. D. Knowles, Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 459 (2016), pp. 1104–1123.
  • [58] K. Lyu and J. Li, Gradient descent maximizes the margin of homogeneous neural networks, arXiv preprint arXiv:1906.05890, (2019).
  • [59] B. Martinet, Perturbation des méthodes d’optimisation. applications, RAIRO. Analyse numérique, 12 (1978), pp. 153–171.
  • [60] P. Massart and É. Nédélec, Risk bounds for statistical learning, The Annals of Statistics, (2006), pp. 2326–2366.
  • [61] S. Matet, L. Rosasco, S. Villa, and B. C. Vu, Implicit regularization with strongly convex bias: Stability and acceleration, Analysis and Applications, (2023), pp. 165–191.
  • [62] C. Molinari, M. Massias, L. Rosasco, and S. Villa, Iterative regularization for convex regularizers, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1684–1692.
  • [63] D. Molitor, D. Needell, and R. Ward, Bias of homotopic gradient descent for the hinge loss, Applied Mathematics & Optimization, (2020), pp. 1–27.
  • [64] N. Mücke, G. Neu, and L. Rosasco, Beating SGD saturation with tail-averaging and minibatching, Advances in Neural Information Processing Systems, 32 (2019), pp. 12568–12577.
  • [65] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry, Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, 09–15 Jun 2019, pp. 4683–4692.
  • [66] M. S. Nacson, J. Lee, S. Gunasekar, P. H. Savarese, N. Srebro, and D. Soudry, Convergence of gradient descent on separable data, arXiv preprint arXiv:1803.01905, (2018).
  • [67] Y. Nesterov, Introductory lectures on convex optimization: A basic course, 2013.
  • [68] A. Neubauer, On Nesterov acceleration for Landweber iteration of linear ill-posed problems, Journal of Inverse and Ill-posed Problems, 25 (2016), pp. 381–390.
  • [69] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro, Geometry of optimization and implicit regularization in deep learning, arXiv preprint arXiv:1705.03071, (2017).
  • [70] B. Neyshabur, R. Tomioka, and N. Srebro, In search of the real inductive bias: On the role of implicit regularization in deep learning, arXiv preprint arXiv:1412.6614, (2014).
  • [71] A. B. Novikoff, On convergence proofs for perceptrons, tech. rep., Stanford research institution menlo park CA, 1963.
  • [72] N. Pagliana and L. Rosasco, Implicit regularization of accelerated methods in Hilbert spaces, in Advances in Neural Information Processing Systems, 2019, pp. 14481–14491.
  • [73] J. Peypouquet, Convex optimization in normed spaces: theory, methods and examples, Springer, 2015.
  • [74] J. C. Platt, Fast training of support vector machines using sequential minimal optimization, in Advances in kernel methods: support vector learning, MIT Press, 1999, pp. 185–208.
  • [75] A. Ramdas and J. Pena, Towards a deeper geometric, analytic and algorithmic understanding of margins, Optimization Methods and Software, 31 (2016), pp. 377–391.
  • [76] M. Rando, L. Carratino, S. Villa, and L. Rosasco, Ada-BKB: Scalable Gaussian process optimization on continuous domains by adaptive discretization, in International Conference on Artificial Intelligence and Statistics, PMLR, 2022, pp. 7320–7348.
  • [77] A. Rangamani, M. Xu, A. Banburski, Q. Liao, and T. Poggio, Dynamics and neural collapse in deep classifiers trained with the square loss, Center for Brains, Minds and Machines (CBMM) Memo, (2021).
  • [78] G. Raskutti, M. J. Wainwright, and B. Yu, Early stopping and non-parametric regression: an optimal data-dependent stopping rule, The Journal of Machine Learning Research, 15 (2014), pp. 335–366.
  • [79] L. Rosasco and S. Villa, Learning with incremental iterative regularization, in Advances in Neural Information Processing Systems, 2015, pp. 1630–1638.
  • [80] L. Rosasco, S. Villa, and B. C. Vũ, Convergence of stochastic proximal gradient algorithm, Applied Mathematics and Optimization, (2020), pp. 891–917.
  • [81] S. Rosset, J. Zhu, and T. Hastie, Boosting as a regularized path to a maximum margin classifier, Journal of Machine Learning Research, 5 (2004), pp. 941–973.
  • [82] S. Rosset, J. Zhu, and T. J. Hastie, Margin maximizing loss functions, in Advances in neural information processing systems, 2004, pp. 1237–1244.
  • [83] S. Salzo and S. Villa, Proximal gradient methods for machine learning and imaging, in Harmonic and applied analysis—from Radon transforms to machine learning, Appl. Numer. Harmon. Anal., Birkhäuser/Springer, Cham, 2021, pp. 149–244.
  • [84] T. Schuster, B. Kaltenbacher, B. Hofmann, and K. S. Kazimierski, Regularization methods in Banach spaces, in Regularization Methods in Banach Spaces, de Gruyter, 2012.
  • [85] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms, Cambridge university press, 2014.
  • [86] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, Pegasos: Primal estimated sub-gradient solver for SVM, Mathematical Programming, 127 (2011), pp. 3–30.
  • [87] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, The implicit bias of gradient descent on separable data, J. Mach. Learn. Res., 19 (2018), pp. 2822–2878.
  • [88] B. Stankewitz, N. Mücke, and L. Rosasco, From inexact optimization to learning via gradient concentration, arXiv preprint arXiv:2106.05397, (2021).
  • [89] I. Steinwart and A. Christmann, Support vector machines, Springer Science & Business Media, 2008.
  • [90] W. Su, S. Boyd, and E. J. Candes, A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, Journal of Machine Learning Research, 17 (2016), pp. 1–43.
  • [91] A. N. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Math., 4 (1963), pp. 1035–1038.
  • [92] A. N. Tikhonov and V. Arsenine, Solutions of ill-posed problems, John Wiley and Sons, 1977.
  • [93] V. Vapnik, The nature of statistical learning theory, Springer science & business media, 2013.
  • [94] P.-W. Wang and C.-J. Lin, Iteration complexity of feasible descent methods for convex optimization, The Journal of Machine Learning Research, 15 (2014), pp. 1523–1548.
  • [95] Y. Yao, L. Rosasco, and A. Caponnetto, On early stopping in gradient descent learning, Constructive Approximation, 26 (2007), pp. 289–315.
  • [96] C. Zǎlinescu, Sharp estimates for Hoffman’s constant for systems of linear inequalities and equalities, SIAM Journal on Optimization, 14 (2003), pp. 517–533.
  • [97] G. Zanghirati and L. Zanni, A parallel solver for large quadratic programs in training support vector machines, Parallel Computing, 29 (2003), pp. 535–551. Parallel computing in numerical optimization.
  • [98] T. Zhang and B. Yu, Boosting with early stopping: Convergence and consistency, The Annals of Statistics, 33 (2005), pp. 1538–1579.
  • [99] Y. Zhang, On the acceleration of optimal regularization algorithms for linear ill-posed inverse problems, Calcolo, 60 (2023), p. 6.