Iterative regularization in classification via hinge loss diagonal descent
Abstract
Iterative regularization is a classic idea in regularization theory that has recently become popular in machine learning. On the one hand, it allows one to design efficient algorithms that control numerical and statistical accuracy at the same time. On the other hand, it sheds light on the learning curves observed while training neural networks. In this paper, we focus on iterative regularization in the context of classification. After contrasting this setting with that of linear inverse problems, we develop an iterative regularization approach based on the hinge loss function. More precisely, we consider a diagonal approach for a family of algorithms, for which we prove convergence, rates of convergence, and stability results for a suitable classification noise model. Our approach compares favorably with other alternatives, as confirmed by numerical simulations.
1 Introduction
Estimating a quantity of interest from finite measurements is a central problem in a number of fields, including inverse problems, machine learning, statistics and signal processing. In this context, a key idea is that reliable estimation requires imposing some prior assumptions on the problem at hand. Regularization theory for inverse problems provides a principled framework to formalize this idea [31]. The quantity of interest is typically seen as a function, or a vector, and prior assumptions take the form of suitable functionals, called regularizers. Following this idea, Tikhonov regularization provides a classic approach to estimate solutions [91, 92]. The latter are found by minimizing an empirical objective where a data fit term is penalized by adding a chosen regularizer. Other regularization approaches are classic in inverse problems, and in particular iterative regularization has become popular in machine learning, see e.g. [95, 78, 88]. This approach is based on the observation that iterative optimization procedures have a self-regularizing property, so that a chosen regularization can be enforced implicitly along the iterations. Such an observation seems to shed light on some theoretical properties of deep learning approaches in machine learning [41, 69, 37, 38, 77]. More generally, iterative regularization provides an approach to design algorithms striking a balance between statistical accuracy and computational efficiency [15, 98, 79, 61].
In this paper, we focus on iterative regularization in the context of classification, perhaps the most classical among machine learning problems [27, 89, 85]. After discussing the differences and similarities between classification and classical linear inverse problems, we recall how different iterative regularization schemes for classification can be defined depending on the considered loss function. Then, we focus on the hinge loss used in support vector machines [89]. In this case, in contrast to other loss functions such as the exponential and logistic loss [66, 87, 46], a simple gradient approach does not allow one to establish iterative regularization properties. Therefore, we propose a diagonal approach in the same spirit as [34] and its inertial version [21], and prove their regularization properties, including convergence and stability. The proposed approach compares favorably to analogous results for the logistic loss [66, 87, 46, 44], but also with recent approaches considering the hinge loss [63]. Indeed, we show that fast convergence rates are possible, largely improving previous results. Further, we prove the first stability results under a suitable classification noise model which is inspired by the deterministic noise classically considered in linear inverse problems. We note that in a noiseless setting, our approach can also be seen as a way to solve the basic separable support vector machines problem introduced in the seminal work [27]. In this view, relevant studies, among the rich literature, can be found for example in [27, 33, 28, 23, 86]. Our theoretical results are illustrated via numerical simulations, where we also investigate empirically the stability properties of the proposed methods.
The rest of the paper is organized as follows. In Section 2, we briefly discuss the ideas of explicit and implicit regularization for regression problems. In Section 3, we review these approaches in the context of classification problems. In Section 4, we describe the main proposed schemes, based on a diagonal iterative regularization procedure for the hinge loss. In Section 5, we present the main results and provide the corresponding convergence and stability analysis. Finally, in Section 6, we illustrate the performance of the proposed algorithms on some simple numerical examples. Appendix A contains some general facts and all the technical proofs and lemmas.
1.1 Notation
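We first introduce some notation and recall a few basic notions that will be needed throughout the paper. The interested reader can consult [10] regarding the main tools and their associated properties used in this work. Let $\langle\cdot,\cdot\rangle$ denote the standard Euclidean inner product and $\|\cdot\|$ the associated Euclidean norm. For a linear operator $A$, we denote by $\operatorname{ran}(A)$ and $\ker(A)$ its range and kernel, respectively. We also denote by $\|A\|_{\mathrm{op}}$ and $\|A\|_F$ its operator and Frobenius norm, respectively, by Id the identity operator, and by $\mathbf{1}_n$ the $n$-vector with entry $1$ in each coordinate.
Given a convex and closed set $C$, the distance of a point $x$ from the set $C$ is $\operatorname{dist}(x, C) = \min_{z \in C}\|x - z\|$. The indicator function of $C$, denoted $\iota_C$, is defined as $\iota_C(x) = 0$ if $x \in C$, and $\iota_C(x) = +\infty$ otherwise.
For a proper, convex and lower semicontinuous function $f$, we define its subdifferential $\partial f$ as $\partial f(x) = \{u : f(z) \geq f(x) + \langle u, z - x\rangle \text{ for all } z\}$. For $\gamma > 0$, the proximal operator of $f$ with step $\gamma$ is defined by $\operatorname{prox}_{\gamma f}(x) = \operatorname*{arg\,min}_{z}\{f(z) + \frac{1}{2\gamma}\|z - x\|^2\}$, for all $x$. The projection operator onto a convex and closed set $C$ is defined as $P_C$, such that $P_C(x) = \operatorname*{arg\,min}_{z \in C}\|z - x\|$. We denote the Fenchel conjugate of $f$ as $f^*$, such that $f^*(u) = \sup_{x}\{\langle u, x\rangle - f(x)\}$.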
2 Background: explicit and implicit regularization in regression
The classical regression problem in supervised learning and statistics corresponds to estimating a function of interest $f$, given a finite number of (possibly noisy) evaluations at a number of input points. The problem takes a simple and familiar form if $f$ is assumed to be linear. Indeed, in this case it corresponds to estimating $w \in \mathbb{R}^d$, given a set of equations,
$$\langle w, x_i\rangle = y_i, \qquad i = 1, \dots, n, \tag{2.1}$$
where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
The above problem can be restated as the linear inverse problem
$$Xw = y, \tag{2.2}$$
where $X \in \mathbb{R}^{n \times d}$ is a matrix with rows the input points, and $y \in \mathbb{R}^n$ is a vector with entries the outputs. Moreover, the simple linear case can be generalized if $f$ belongs to a reproducing kernel Hilbert space (RKHS).
Remark 2.1 (Regression in RKHS [89]).
Recall that a Hilbert space of real valued functions on a set is called a RKHS if for all and , there exists such that From the above definition and Riesz lemma it follows immediately that there exists a such that Then we can write the regression problem as the problem of finding , given a set of equations,
where and .
Keeping the above remark in mind, in the following we primarily focus on the linear case to ease the notation.
A key observation is that the problem (2.2) might not admit solutions or admit multiple ones. This latter situation is the most common in machine learning, where high dimensional (overparameterized), or even infinite dimensional (like RKHS) models are often considered. In the linear setting, this corresponds to the case where, in (2.2), $d > n$ (assuming the inputs to be linearly independent). In this case, problem (2.2) admits infinitely many solutions and a classic way to select one is to consider the minimum norm solution
$$w^\dagger = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \{\|w\| : Xw = y\}. \tag{2.3}$$
The minimum norm solution can be written in terms of the pseudoinverse $X^\dagger$ of $X$ as $w^\dagger = X^\dagger y$ [31, Definition ]. This makes it clear that instability in the solution might occur when $X$ is ill conditioned. The basic idea of regularization is to consider a family of estimates that approaches $w^\dagger$ with better stability properties. In this sense, the regularization property of such estimates is related to i) their convergence to the minimal norm solution $w^\dagger$ and ii) their stability. Here the notion of stability expresses how close the estimates generated from the true output and from a noisy version of it are, provided the two outputs are close enough (see e.g. [31, Section & Definition ] for a detailed definition). We next recall two basic approaches with this property.
Tikhonov (explicit) regularization.
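The most classic regularization approach is Tikhonov regularization
$$w_\lambda = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \; \|Xw - y\|^2 + \lambda\|w\|^2, \tag{2.4}$$
where $\lambda > 0$ is called the regularization parameter. The set of solutions $(w_\lambda)_{\lambda > 0}$ corresponding to different values of $\lambda$ defines the regularization method and a direct computation shows that $w_\lambda = (X^\top X + \lambda\,\mathrm{Id})^{-1}X^\top y$. From this expression it is easy to see that $w_\lambda$ converges to $w^\dagger$ as $\lambda$ tends to zero (see e.g. [31, Theorem ]). The above ideas can be generalized as we discuss next.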
Remark 2.2 (Loss and regularizers).
The ideas in (2.3), (2.4) can be generalized by replacing the squared norm in (2.4) with other regularizers, the $\ell_1$ norm being a popular example. Similarly, instead of the least squares error in (2.4), other error cost functions (loss functions) can be considered. In regression, loss functions depend on the difference $y_i - \langle w, x_i\rangle$; examples besides the square loss include the absolute value loss and the $\varepsilon$-insensitive loss used in support vector machine regression [89]. For general losses, the corresponding solutions might not admit a closed form expression, but it is still possible to prove that the Tikhonov regularized solutions converge in some proper sense to the corresponding minimum norm solutions (see e.g. [30, 5]). For the sake of completeness we provide a proof in a general setting in Lemma A.1 in Appendix A.
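For the square loss with a squared-norm penalty, the convergence of the Tikhonov solutions to the minimum norm solution can be checked numerically. The following is a minimal sketch, assuming the unnormalized objective $\|Xw - y\|^2 + \lambda\|w\|^2$ (this scaling, the random data, and the function name `tikhonov` are illustrative assumptions, not taken from the paper).

```python
import numpy as np

def tikhonov(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2 (assumed scaling)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative overparameterized problem: n = 5 equations, d = 20 unknowns.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
y = rng.standard_normal(5)

# Minimum norm interpolant via the pseudoinverse.
w_dagger = np.linalg.pinv(X) @ y

for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    err = np.linalg.norm(tikhonov(X, y, lam) - w_dagger)
    print(f"lambda = {lam:.0e}   ||w_lambda - w_dagger|| = {err:.2e}")
```

The printed distances should shrink as the regularization parameter decreases, in line with the convergence statement above.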
In machine learning, regularization à la Tikhonov is sometimes called explicit since a penalty is added to the data fit term. We next recall iterative regularization and discuss why it is called implicit in machine learning [40].
Iterative (implicit) regularization.
The simplest example of iterative regularization is the gradient descent iteration of the least squares error, that is
$$w_{k+1} = w_k - \gamma X^\top (X w_k - y),$$
for some suitable initialization $w_0$ and step size $\gamma > 0$. It is well known that the above iteration, when initialized at zero or more generally in the span of the input points, converges to the minimal norm solution (2.3), and further that the stability of the solution varies along the iterations [31, Section ]. In this view, the stopping time becomes the regularization parameter defining a family of regularized solutions. This kind of regularization is called implicit in machine learning since there is no explicit penalty or constraint in the optimization model [70, 69], and stability relies on the self-regularizing properties of the optimization process. These ideas have recently received a lot of attention in machine learning, since they are useful to understand the theoretical properties of more complex non-linear systems like neural networks [40, 24]. Further, they have been advocated as a way to design resource efficient machine learning algorithms controlling at the same time numerical and statistical accuracy. In the remainder of the paper we discuss how these ideas extend beyond regression to the classification setting.
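As a minimal sketch of the iteration described above, the code below runs gradient descent on the least squares error from a zero initialization and compares the result with the pseudoinverse solution. The step size $1/\|X\|_{\mathrm{op}}^2$, the random data, and the function name are illustrative assumptions.

```python
import numpy as np

def gradient_descent_ls(X, y, n_iter):
    """Gradient descent on 0.5 * ||Xw - y||^2, started at w = 0."""
    w = np.zeros(X.shape[1])
    gamma = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
    for _ in range(n_iter):
        w = w - gamma * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))      # n = 5 < d = 20: many interpolants
y = rng.standard_normal(5)

w_dagger = np.linalg.pinv(X) @ y      # minimum norm interpolant
w_gd = gradient_descent_ls(X, y, n_iter=2000)
print(np.linalg.norm(w_gd - w_dagger))
```

Starting at zero keeps every iterate in the row space of `X`, which is why the limit is the minimum norm interpolant rather than some other solution of the linear system. Stopping earlier (a smaller `n_iter`) plays the role of a larger regularization parameter.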
We first add some remarks. First, we note that optimization approaches other than gradient descent can be, and have been, considered, including accelerated [31, 68, 72, 52, 99] and stochastic methods [80, 64, 29], as well as mirror descent approaches [39, 47]. Second, we note that, compared to Tikhonov regularization, for iterative regularization the extension to other losses and regularizers is not straightforward. The case of regularizers that are norms in reflexive Banach spaces has been considered in [84, 20], whereas the case of strongly convex regularizers has been considered in [61, 37]. The case of convex regularizers has been recently considered in [62]. The case of smooth loss functions for regression is quite straightforward, considering the gradient iteration
(2.5) |
where $\ell'$ is the derivative of the loss. The case of convex but nonsmooth loss functions can also be treated using subgradient methods [54]. Note that, if the above iteration is initialized in the span of the input points, it remains in the span. This observation is at the basis of the proof that the above iteration converges to the minimal norm interpolant (2.3), see Lemma A.2 in Appendix A.
With the above background, we next discuss the case of classification.
3 Implicit regularization: from regression to classification
In this section, we introduce the problem of classification and investigate how the ideas reviewed in the previous section for regression and linear inverse problems translate to this setting. In particular, we first discuss a notion of minimal norm solution for classification.
Similarly to regression, in classification the goal is to estimate a functional relationship, but the difference is that the outputs are binary valued, that is . Estimating a binary valued function is computationally unfeasible and a classic approach relies on estimating a real valued function of which the sign is then taken, that is if , and otherwise (here ties are broken arbitrarily). In this context, a natural quantity is the product , called the margin of at . If the margin is positive it means that will classify correctly the input point, if the margin is large we can intuitively expect a confident prediction.
For linear functions we can formalize this idea (see e.g. [27, 89]) considering the problem of finding satisfying the set of inequalities
(3.1) |
If such a exists we say that the data are linearly separable. Deriving necessary and sufficient conditions for the above inequalities to be feasible is not an elementary problem. As shown below, it is easy to see that overparametrization (), hence interpolation, will be a sufficient condition for linear separability. Since this is the relevant setting for us, from now on, we assume that the data are linearly separable:
Assumption 1 (Linear separability).
There exists some that separates the data, i.e.
(3.2) |
In general, multiple solutions can be expected in this setting as well, and the classical way to approach the problem is to consider the so-called max margin solution. The linear margin for a given dataset is defined as
(3.3) |
and, correspondingly, the maximum (max) margin problem is defined as
(MM) |
We make a few observations. First, the intuition underlying the above problem is that among all separating solutions, we are interested in one for which the margin is maximized. Second, we note that without the unit norm constraint the problem of maximizing the margin (3.3) is degenerate and one can obtain trivial solutions by rescaling arbitrarily any separating solution. Indeed, since the margin is scale invariant, the max margin becomes a direction problem, which naturally leads to considering the constrained problem (MM) (see e.g. [27]). Third, it is possible to show that the max margin problem has an equivalent formulation that highlights the connection to minimal norm solutions. Indeed, consider the problem
(MN) |
Then the following result holds.
Lemma 3.1.
A few observations can be made. The above lemma is a classical result in the theory of SVM. Indeed, Problem (MN) is called hard margin support vector machines [27]. For the sake of completeness we provide its proof (see Lemma A.3 in Appendix A). In terms of the constrained min-norm problem (MN), Assumption 1 ensures that the feasible set is non-empty and hence such a solution (thus ) exists. In addition, since (MN) can also be equivalently expressed as
(3.4) |
the solution (thus ) is unique, thanks to the strong convexity of the squared norm in (3.4). Note that, from an optimization point of view, formulation (3.4) is more useful and will be used hereafter, instead of (MN).
From a regularization perspective, the above result shows that the max margin problem can be seen as a minimum norm problem akin to Problem (2.3) in linear inverse problems, but the linear equations are now replaced by inequalities. This can be made even more explicit by noting that for binary valued outputs $y_i \in \{-1, 1\}$, since $y_i^2 = 1$, it holds:
$$\langle w, x_i\rangle = y_i \iff y_i\langle w, x_i\rangle = 1, \qquad i = 1, \dots, n.$$
This last expression also clarifies the earlier observation that interpolation implies separation, and that with linearly independent inputs is a sufficient condition for linear separability.
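The observation that interpolation implies separation, together with the scale invariance of the margin, can be illustrated with a short numerical sketch. The normalized margin $\min_i y_i\langle w, x_i\rangle / \|w\|$ used below is an assumed convention, and the random data are purely illustrative.

```python
import numpy as np

def normalized_margin(w, X, y):
    """min_i y_i <w, x_i> / ||w||; positive exactly when w separates the data."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

rng = np.random.default_rng(0)
n, d = 5, 20                                # overparameterized: d > n
X = rng.standard_normal((n, d))             # rows are the input points
y = rng.choice([-1.0, 1.0], size=n)         # arbitrary binary labels

# Minimum norm interpolant: solves X w = y since the rows are independent.
w_interp = np.linalg.pinv(X) @ y

# Interpolation gives y_i <w, x_i> = y_i^2 = 1 for every i, hence separation.
margins = y * (X @ w_interp)
print(margins)

# Rescaling w does not change the normalized margin (a direction problem).
print(normalized_margin(w_interp, X, y), normalized_margin(10.0 * w_interp, X, y))
```

Note that any labeling of linearly independent inputs is separable in this way, which is why overparameterization is a sufficient condition.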
We next discuss how the ideas of explicit and implicit regularization can be adapted to the classification context. Note that in the following, in analogy to the regression case, we will say that a family of solutions has the regularization property if it converges to the max margin (min norm) solution and is stable with respect to a noisy version of the true output (see the related discussion in Section 5.2).
Explicit regularization for classification.
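To extend the Tikhonov regularization approach to classification, an appropriate loss function needs to be considered. As it turns out, in classification loss functions depend on the margin $y_i\langle w, x_i\rangle$, rather than on the difference $y_i - \langle w, x_i\rangle$ as in regression, and are therefore called margin loss functions. Some popular examples, written as functions $\ell$ of the margin $a \in \mathbb{R}$, are the following:
• hinge loss: $\ell(a) = \max\{0, 1 - a\}$
• exponential loss: $\ell(a) = e^{-a}$
• logistic loss: $\ell(a) = \log(1 + e^{-a})$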
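The three margin losses can be written down directly, assuming their standard forms $\max\{0, 1-a\}$, $e^{-a}$ and $\log(1+e^{-a})$. The sketch below also highlights a difference between them: the hinge loss attains its minimum value zero for every margin $a \geq 1$, while the exponential and logistic losses are strictly positive everywhere and only approach zero as the margin grows.

```python
import numpy as np

def hinge(a):
    """Hinge loss max(0, 1 - a)."""
    return np.maximum(0.0, 1.0 - a)

def exponential(a):
    """Exponential loss exp(-a)."""
    return np.exp(-a)

def logistic(a):
    """Logistic loss log(1 + exp(-a)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -a)

margins = np.array([-1.0, 0.0, 1.0, 5.0, 50.0])
print("hinge      ", hinge(margins))
print("exponential", exponential(margins))
print("logistic   ", logistic(margins))
```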
Given a convex margin loss function, and considering again linear models, Tikhonov regularization in classification corresponds to
(3.5) |
for . It is natural to ask whether the above approach can be shown to be regularizing in the sense that the sequence converges to the minimum norm solution (MN). Indeed, for the hinge loss this can be proven, while in the cases of the exponential or logistic loss, where the set of minimizers is empty, one can show that diverges (see e.g. Lemma A.1 in Appendix A). Nevertheless under some suitable assumptions on the loss function, convergence in direction that is
has been proven for a wide class of margin loss functions (see e.g. [82, Theorem ] and [42, 81]). Notice that according to the previous discussion and Lemma 3.1, since the max margin problem (MM) is a direction problem, it is still relevant to consider convergence in direction. We next turn our attention to implicit regularization reviewing some recent results.
Implicit regularization for classification.
Following the same reasoning as in regression for the minimal norm interpolating solution (2.3) and the gradient method (2.5), it is natural to ask whether it is possible to prove the regularization properties of the gradient iteration applied to a margin loss function, i.e.
(3.6) |
where is a suitable step size. The above iteration is well defined for smooth loss functions such as the exponential or the logistic loss. However, for losses like exponential or logistic which do not admit any minimizer, the gradient descent iteration in (3.6) diverges. Interestingly, in a recent line of works (see [87, 66, 46] and references therein), convergence in direction of gradient descent has been proved for these loss functions, i.e.
In addition, rates of convergence for the normalized iterates , the angle gap , and the normalized margin gap were also proved, but are very slow. For example for the logistic and exponential losses the margin rate is of order (see [87]). Improved rates of order were given recently considering a variable step-size gradient descent version in [66]. Similar works include also (accelerated) mirror descent approaches [44, 45] that provide margin rates up to . Concerning the hinge loss, the natural extension of the above idea is to consider a subgradient iteration
(3.7) |
for some sequence of step-sizes and where is an element of the subgradient of the loss. In this case, the minimization problem
does have a solution and indeed the iteration in (3.7) converges to it. However, the solution minimizing the hinge loss error cannot be expected to be the max margin (min norm) solution in general, and thus the subgradient iteration in (3.7) does not provide regularization properties.
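A small sketch of this phenomenon, assuming the standard hinge loss and a constant step size: on the toy data below (an illustrative choice, not taken from the paper) the subgradient iteration started at zero reaches zero hinge loss after one step and then stops moving, at a point whose norm and direction differ from those of the minimum norm separating solution, which for this data is $(0.5, 0.5)$.

```python
import numpy as np

def hinge_subgradient_descent(X, y, gamma, n_iter):
    """Subgradient descent on sum_i max(0, 1 - y_i <w, x_i>), started at w = 0."""
    A = y[:, None] * X                          # rows y_i * x_i
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        active = (A @ w) < 1.0                  # points with positive hinge loss
        w = w + gamma * A[active].sum(axis=0)   # step along minus a subgradient
    return w

# Toy separable data; the min norm solution with margins >= 1 is (0.5, 0.5).
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_sub = hinge_subgradient_descent(X, y, gamma=0.5, n_iter=100)
print("subgradient limit:", w_sub, " norm:", np.linalg.norm(w_sub))
print("min norm solution: [0.5 0.5]  norm:", np.linalg.norm([0.5, 0.5]))
```

The point reached depends on the step size and initialization, since every vector with all margins at least one is a minimizer of the hinge loss error.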
Next, we focus on the hinge loss and provide two iterative regularization schemes via a diagonal principle. For ease of reading, we first present and describe the two main iterative methods and then the main convergence and stability results characterizing their regularization properties.
4 Iterative regularization for hinge loss via diagonal descent
In this section, we present two iterative regularization approaches based on the hinge loss. The first is given in Algorithm 1, and the second, given in Algorithm 2, is a variant that is faster in practice.
Let be a decreasing-to-zero sequence of positive numbers, and . For all , consider , and , such that for all and :
(4.1) | ||||
(4.2) | ||||
(4.3) |
Let a decreasing-to-zero sequence of positive numbers, and . For all , consider , , and , such that for all and :
(4.4) | ||||
(4.5) | ||||
(4.6) | ||||
(4.7) |
We add a few comments before providing a detailed derivation of the two procedures above. In both Algorithms 1 and 2, denotes a constant step-size and a vanishing-with-iterations parameter. Both procedures are based on simple and easy to implement iterations that require only matrix-vector multiplications and thresholding operations. The sequence represents a dual variable and is computed coordinate-wise via a simple projection operation (see (4.2) and (4.6)), while corresponds to a classical gradient (forward) step related to the square norm . We note that the difference between the two schemes is that in Algorithm 2 the sequence is computed via the auxiliary sequence instead of . The sequence equals extrapolated by the term , called the inertial term, which plays an important role in the convergence speed of Algorithm 2. Finally, corresponds to the primal sequence designed to approximate the min-norm solution . In the next section, we discuss the derivation of the above two procedures.
4.1 Diagonal methods via dual hinge loss
Algorithms 1 and 2 are based on a so-called diagonal regularization process [9] applied to a suitably defined dual problem. We next describe these ideas in some detail. We note in passing that, while diagonal approaches have been considered before for inverse problems [8, 59, 19, 9, 34, 21], we are not aware of their application to classification.
The basic idea of diagonal approaches is perhaps best explained by recalling that the solution of a penalized problem such as (3.5) converges to the minimal norm separating solution (3.4) as the regularization parameter goes to zero. The idea is to start from an optimization procedure for solving Problem (3.5), and then to modify it by letting the regularization parameter decrease at each iteration. In this way, the obtained iteration no longer converges to a solution of Problem (3.5), but rather directly to the minimal norm separating solution (3.4). To derive Algorithms 1 and 2, this basic idea is actually applied to the dual formulation of Problem (3.5).
To describe the above ideas more precisely, we begin by rewriting Problem (3.5) for the special case of hinge loss . In order to ease the notation, we define the matrix and consequently , with , we set , with , and we get that solving Problem (3.5) is equivalent to solving
(4.8) |
The objective function in (4.8) is the sum of a smooth term, the squared norm, and a convex nonsmooth one . Problems such as (4.8) are called structured composite convex minimization problems, and can be often solved efficiently by proximal-gradient methods [26], a class of first order methods splitting the contribution of the smooth and the nonsmooth part. At every iteration, the smooth part is activated through a gradient step, while for the nondifferentiable one the computation of the proximity operator is required. In order to implement a proximal gradient algorithm to solve problem (4.8) we would need the computation of the proximity operator of , which is not available in closed form and may be computationally expensive. Therefore, we consider the dual problem (see for example [10, Definition ]) associated to (4.8) (which is equivalent to (MN)) which is given by
(4.9) |
where denotes the Fenchel conjugate of defined in Section 1.1. Its computation is simplified by the fact that is a sum of separable functions, and can be written as , which implies
(4.10) |
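As a hedged illustration of how such a conjugate can be computed, assume the hinge loss has the standard form $\ell(a) = \max\{0, 1-a\}$ and is scaled by a constant $c > 0$ (for instance $c = 1/\lambda$; the exact scaling used in (4.8) is an assumption here). Splitting the supremum at $a = 1$ gives:

```latex
\begin{align*}
(c\,\ell)^*(u)
  &= \sup_{a \in \mathbb{R}} \bigl\{ ua - c \max\{0, 1 - a\} \bigr\} \\
  &= \max\Bigl\{ \sup_{a \ge 1} ua, \;\; \sup_{a < 1} \bigl[ (u + c)\,a - c \bigr] \Bigr\} \\
  &= \begin{cases}
       u, & \text{if } u \in [-c, 0], \\
       +\infty, & \text{otherwise.}
     \end{cases}
\end{align*}
% Consequently, for g(u) = u + \iota_{[-c,0]}(u) and any step \gamma > 0,
\begin{equation*}
\operatorname{prox}_{\gamma g}(v)
  = \operatorname*{arg\,min}_{u \in [-c, 0]} \Bigl\{ u + \tfrac{1}{2\gamma}(u - v)^2 \Bigr\}
  = P_{[-c, 0]}(v - \gamma).
\end{equation*}
```

The first supremum is finite only for $u \le 0$ and the second only for $u \ge -c$, and both equal $u$ when finite. Under these assumptions the conjugate of each summand is a linear function restricted to an interval, so its proximal operator reduces to a shift followed by a projection, which is what makes a dual approach computationally attractive.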
It is also important to write the dual problem associated with the min-norm problem (3.4), which is given by
(4.11) |
Indeed, it is possible to show (see Lemma A.6) that, for , the dual regularized Problem (4.9) converges point-wise to Problem (4.11). To pass from convergence properties in the dual space to the primal space, recall that, if strong duality holds (see e.g. [10, Section ]), the value of problem (4.9) is the same as the value of (4.8) and, for every , one can recover a solution of the primal problem (4.8), from a solution of the dual problem (4.9) via the formula [10, Section ],
(4.12) |
Problem (4.9) is another composite convex optimization problem, where , has a -Lipschitz gradient, and is the nonsmooth part, whose proximity operator can be easily computed, as we show below. We can then implement a diagonal proximal gradient iteration on the dual function defined in (4.9) as follows. For a given starting point , some step-size and a decreasing sequence , the (diagonal) proximal gradient iteration corresponding to Problem (4.9) is given by
(4.13) |
We add several comments. First, if is taken to be constant, then the above iteration solves Problem (4.9), provided the step size . Following the previous discussion, by letting decrease at each iteration we obtain a diagonal process solving Problem (4.11). Second, the computation of the proximity operator is simplified by the fact that is a sum of separable functions, as can be seen from (4.10). Indeed, this allows one to compute the proximal operator component-wise
and derive the following iteration
(4.14) |
Note that the proximal operator of , can be computed in closed form. Indeed, for any , ,
(4.15) | ||||
Finally, putting together all the above observations we derive Algorithm 1. Algorithm 2 is derived by considering a classic variation of the proximal gradient iteration in Algorithm 1, namely a so called inertial step. The latter corresponds to replacing in (4.14) with . For both Algorithm 1 and Algorithm 2 the last step corresponds to the dual-to-primal update .
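The structure of the two procedures (forward step on the smooth dual term, coordinate-wise projection, dual-to-primal update, and optional inertial extrapolation) can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's exact Algorithms 1 and 2: it assumes the primal problem $\tfrac12\|w\|^2 + \tfrac{1}{\lambda}\sum_i \max\{0, 1 - y_i\langle w, x_i\rangle\}$, whose dual is $\min_{\theta \in [0, 1/\lambda]^n} \tfrac12\|A^\top\theta\|^2 - \langle \mathbf{1}, \theta\rangle$ with $A$ the matrix of rows $y_i x_i$, and it uses the extrapolation weight $(k-1)/(k+\alpha)$ with $\alpha = 3$ as a hypothetical inertial choice. The function name, sign convention for the dual variable, and toy data are likewise illustrative.

```python
import numpy as np

def diagonal_dual_descent(X, y, lambdas, inertial=False, alpha=3.0):
    """Diagonal projected-gradient sketch on the (assumed) hinge-loss dual.

    lambdas: decreasing-to-zero sequence of positive regularization parameters.
    Returns the primal iterate w = A^T theta and the dual iterate theta.
    """
    A = y[:, None] * X                          # rows y_i * x_i
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # step size 1 / ||A||_op^2
    theta = np.zeros(A.shape[0])
    theta_prev = theta.copy()
    for k, lam in enumerate(lambdas, start=1):
        if inertial:
            # extrapolate the dual iterate with an inertial term
            u = theta + (k - 1) / (k + alpha) * (theta - theta_prev)
        else:
            u = theta
        # forward (gradient) step on 0.5 * ||A^T theta||^2 - <1, theta>
        z = u - gamma * (A @ (A.T @ u) - 1.0)
        theta_prev = theta
        # proximal step = coordinate-wise projection onto [0, 1 / lambda_k]
        theta = np.clip(z, 0.0, 1.0 / lam)
    return A.T @ theta, theta                   # dual-to-primal update

# Toy separable data; the min norm solution with margins >= 1 is (0.5, 0.5).
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lambdas = 1.0 / np.arange(1, 5001)              # lambda_k = 1 / k

w_plain, _ = diagonal_dual_descent(X, y, lambdas)
w_inertial, _ = diagonal_dual_descent(X, y, lambdas, inertial=True)
print(w_plain, w_inertial)
```

Letting the box $[0, 1/\lambda_k]$ grow along the iterations is what makes the scheme diagonal: with a fixed $\lambda$ the same loop would instead solve the penalized dual problem for that value of $\lambda$.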
Remark 4.1.
Notice that Algorithms 1 and 2 can be obtained by replacing with . In fact, our analysis also extends to considering general linearly parametrized models of the form , where denotes some feature mapping (possibly infinite dimensional) which may be specified explicitly, or via some kernel operator, i.e. . The use of this feature mapping allows one to consider non-linear classifiers for possibly non-linearly separable data (see for example [27, 89]). If we set , the penalized hinge-loss problem associated to the minimal norm separating problem and its corresponding dual are:
(4.16) | |||
(4.17) |
Since Algorithms 1 and 2 are designed from the dual problem (4.17), the information needed for the dual update is only the kernel evaluation at each data-point and not the one of (see e.g. [89]). In this case Algorithms 1 and 2 can be rewritten by replacing , with the operator . Finally, while the primal update may not be computable, the predictor can still be computed as via the dual iterates and the kernel evaluation.
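A kernelized variant can be sketched along the same lines. The code below is a hedged illustration, not the paper's algorithm: it assumes the dual $\min_{\theta \in [0, 1/\lambda]^n} \tfrac12 \theta^\top K_y \theta - \langle \mathbf{1}, \theta\rangle$ with $K_y = \operatorname{diag}(y)\,K\,\operatorname{diag}(y)$, a Gaussian kernel, and the predictor $f(x) = \sum_i \theta_i y_i k(x_i, x)$. All function names and the XOR-type data are illustrative choices.

```python
import numpy as np

def gaussian_kernel(U, V, sigma=1.0):
    """Gram matrix with entries exp(-||u - v||^2 / (2 sigma^2))."""
    sq = np.sum(U**2, axis=1)[:, None] + np.sum(V**2, axis=1)[None, :] - 2.0 * U @ V.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_diagonal_descent(K, y, lambdas):
    """Diagonal projected gradient on the (assumed) kernelized hinge-loss dual."""
    Ky = (y[:, None] * K) * y[None, :]          # Gram matrix of y_i * Phi(x_i)
    gamma = 1.0 / np.linalg.norm(Ky, 2)         # step size 1 / ||Ky||_op
    theta = np.zeros(len(y))
    for lam in lambdas:
        z = theta - gamma * (Ky @ theta - 1.0)  # forward step, kernel evaluations only
        theta = np.clip(z, 0.0, 1.0 / lam)      # projection onto [0, 1 / lambda_k]
    return theta

def predict(theta, y, K_new_train):
    """f(x) = sum_i theta_i y_i k(x_i, x) for each new point x."""
    return K_new_train @ (theta * y)

# XOR-type data: not linearly separable, but separable with a Gaussian kernel.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = gaussian_kernel(X, X)
theta = kernel_diagonal_descent(K, y, 1.0 / np.arange(1, 5001))
f_train = predict(theta, y, K)
print(np.sign(f_train), y * f_train)
```

Only the Gram matrix enters the dual update, so the feature map never has to be formed explicitly.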
We end this section with two remarks commenting on some related literature and then analyze the properties of Algorithm 1 and Algorithm 2 in the next section.
Remark 4.2 (Implicit regularization via homotopic subgradient).
Another implicit regularization approach for the hinge loss was recently studied in [63] and derived using a homotopic subgradient method for the primal problem (4.8). Our approach, in contrast, is based on a diagonal process on the dual problem (4.9) and, as discussed later, leads to faster convergence rates.
Remark 4.3 (Hard-SVM).
The dual formulation (4.9) is the one used for solving linear SVM problems (see [89]). Indeed, tackling the max-margin problem via its dual formulation (4.9) is popular due to its favorable structure, and there is a very rich literature on methods to solve it, including interior-point methods [18, 32] and decomposition methods (see e.g. [27, 89], [32, 97] and [74, 50, 56]). Compared to these methods, our diagonal approach enjoys good theoretical guarantees while providing a direct link to regularization methods.
5 Main results and convergence analysis
In this section we present and discuss the main results of this work, deferring the associated proofs to the Appendix. Before stating the main theorems, we discuss a key property of the sequence of dual problems. In particular, there exist some positive constants , and , such that, for all , each of the dual functions related to problem (4.9) satisfies the -Łojasiewicz condition in , i.e. for all and , it holds
(5.1) |
as we will show in Lemma A.9. The previous condition is a relaxation of strong convexity, and it is well-known to imply linear convergence of the standard Forward-Backward scheme (see e.g. [55, 14, 35, 48] and references therein). Theorem 5.1 extends these classical results to the diagonal setting. The value of is crucial, since it determines the constant appearing in the linear convergence bound in (5.4). As can be seen from (5.1), is independent of , and an explicit expression for can be given by
(5.2) |
where is the Hoffman constant (see e.g. [43, 36] for a definition) of a system of linear inequalities and equalities describing the set of minimizers of the dual objective function . The explicit computation of this constant is expensive, but an expression is available in closed form, and is given by (see e.g. [94, Lemma ]):
(5.3) |
where and .
We are now ready to state the main results. In Theorem 5.1 we state the convergence results for the sequence generated by Algorithm 1 to the minimal norm separating solution (see (MN)), and in Theorem 5.2 we state the convergence results for the inertial Algorithm 2.
Theorem 5.1.
Let the solution of (MN), and and the sequences generated by Algorithm 1. Then converges to . In addition, if is a solution of the associated dual problem (4.11) and , then for all , the following estimate holds true:
(5.4) |
where and is defined in (5.2). In addition, there exists some , such that for all , the following rates hold true for the angle and the margin gap (respectively):
(5.5) |
(5.6) |
Theorem 5.2.
Let the solution of (MN) and and the sequences generated by Algorithm 2 with . Then converges to . In particular, if is a solution of the associated dual problem (4.11) and , then for all , the following estimate holds true:
(5.7) |
where .
In addition, there exists some , such that for all , the following rates hold true for the angle and the margin gap (respectively):
(5.8) |
(5.9) |
We add several remarks discussing the results in Theorems 5.1 and 5.2 before comparing them to some recent related works and deriving their proofs.
Remark 5.1.
In Theorem 5.1 we derived the linear convergence of the sequence thanks to condition (5.1), as discussed above. Even if the inertial version should give better convergence than the basic iteration, Theorem 5.2 provides only a sublinear rate of convergence. We believe this is due to technical, rather than fundamental reasons. The numerical results in Section 6 suggest that inertial variants can indeed provide faster convergence, but proving linear rates of convergence for inertial variants is a challenging question and is an active area of research in the optimization literature, see e.g. the discussion in [35, 3].
Remark 5.2 (Error metrics).
Theorems 5.1 and 5.2 provide rates of convergence for the distance of the iterates to the minimal norm solution, as well as the angle gap and the margin gap of the normalized iterates to the max-margin solution , for Algorithms 1 and 2 (respectively). As mentioned in Section 3, since the original max-margin problem (MM) is a direction problem [27, 89], the margin and the angle gap are relevant quantities to measure the performance of the proposed methods, see [66, 87, 63, 46, 44, 45].
Remark 5.3 (Parameter choice).
In both Theorems 5.1 and 5.2, the requirement , where , allows one to deduce the bounds (5.4) and (5.7) for all . If this condition is not satisfied, the estimates in Theorems 5.1 and 5.2 still hold true asymptotically, since is decreasing. In addition, one can freely choose the rate at which decays to zero. In Section 6 we numerically evaluate the impact of different choices of on the performance of the method.
5.1 Comparison to other convergence results for implicit regularization in classification
We next compare the convergence results of Theorems 5.1 and 5.2 with existing results in the related literature. We begin by noting that Theorems 5.1 and 5.2 provide improved rates compared to those for classical perceptron variants [71] which are of order , see for example [75, Theorem ]. Margin rates similar to those in (5.6) and (5.9) have been derived for other optimization procedures applied to different losses and regularizers. For the iterates generated by gradient descent applied to exponentially-tailed losses (such as logistic or exponential loss) a margin rate of order is derived in [66, 87, 46]. For the iterates of the same algorithm with adaptive step size variants, the margin rates are of the order ([66, Theorem ]) or [25, 75]. In all these cases, the rates are worse than the ones we obtain in (5.6) and (5.9). The rates for the margin in Theorem 5.2 for Algorithm 2 match the ones in [44] (see Theorem ). They are slightly worse than those found in [45, Theorem ] which are of order , and are obtained considering a mirror-descent method on the smoothed margin for the exponential and logistic loss [58, 44, 45]. Finally, we compare to the results in [63] considering a different implicit regularization approach, based on the use of a homotopic subgradient method to minimize the primal penalized hinge loss. The rates given in Theorems 5.1 and 5.2 are considerably better than the ones in [63], which are approximately of order (see [63, Corollary , Lemma ]). None of the existing results provides linear rates, such as the ones we derive for Algorithm 1.
5.2 Stability
In Theorems 5.1 and 5.2 we established the regularization properties of Algorithms 1 and 2 in the sense of convergence to the minimal norm separating solution (MN) corresponding to the true labels . In practice, labels are typically corrupted by noise and regularization methods should provide stable solutions. In this section, we study the stability of Algorithm 1 by introducing a suitable notion of label noise (analogous results could also be derived for Algorithm 2 and are left for future study).
For classical inverse problems noise is measured with respect to some norm in the data space. In the context of classification, a possible noise model is to consider a fraction of labels to be wrong [1, 49]. Since the data are binary valued, a natural way to measure the discrepancy between correct outputs and mislabeled outputs is to assume there exists such that
(5.10) |
where is the Hamming distance defined as,
(5.11) |
The Hamming distance replaces the norm in the data space to quantify the label noise. Here, can be seen as the noise level. The constraint is natural since higher values would correspond to simply renaming the classes. The case corresponds to noiseless data, that is, where no mislabelling is present. Note that Assumption (5.10) implies that there is a set of indices with cardinality , such that
(5.12) |
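The label noise model can be simulated with a short sketch. It assumes that the Hamming distance in (5.11) is normalized by the number of samples, so that it measures the fraction of flipped labels; the helper names `hamming_distance` and `flip_labels` and the uniform choice of flipped indices are illustrative assumptions, not part of the paper.

```python
import random

def hamming_distance(y, y_noisy):
    """Fraction of positions where the two label vectors disagree
    (assumed normalization: divide by the number of samples)."""
    if len(y) != len(y_noisy):
        raise ValueError("label vectors must have the same length")
    return sum(a != b for a, b in zip(y, y_noisy)) / len(y)

def flip_labels(y, delta, seed=0):
    """Flip int(delta * n) labels in {-1, +1}, chosen uniformly at random.
    Returns the noisy labels and the set of flipped indices."""
    rng = random.Random(seed)
    n_flip = int(delta * len(y))
    flipped = set(rng.sample(range(len(y)), n_flip))
    noisy = [-yi if i in flipped else yi for i, yi in enumerate(y)]
    return noisy, flipped

y = [1, -1] * 10                      # 20 clean labels
y_noisy, flipped = flip_labels(y, delta=0.2)
print(hamming_distance(y, y_noisy), sorted(flipped))
```

On the flipped index set the noisy label is the negative of the true one, and elsewhere the two coincide, which is the situation described around (5.12).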
In the above setting, if is a solution obtained using labels with noise level , then the goal is to derive error estimates with respect to the true solution in terms of . The error estimates should decrease in so that the correct solution is recovered as the noise decreases. In the case of the iterative regularization procedure defined by Algorithm 1, this requires specifying a suitable choice for the stopping time. The following theorem provides such a choice and the corresponding error estimate.
Theorem 5.3 (Stability).
Let and be the solutions of problems (MN) and (4.11) respectively and . Let be a vector of noisy outputs satisfying Assumption (5.10). Finally, let and the sequences generated by Algorithm 1 applied to the data and , with . Let , with as defined in (5.2). Then for all , the following estimate holds true:
(5.13) |
In particular, for the stopping time , the following bound holds,
(5.14) | ||||
where , , .
The proof of the above result can be found in Section A.5 of Appendix A. We add two remarks to discuss the above result.
Remark 5.4 (Stopping time and stability).
The above result shows that the best stopping time choice arises from a trade-off between stability and convergence. More precisely, the error estimate in (5.13) is composed of three terms. The first two terms are related to the stability of the algorithm and are increasing along the iterations due to the presence of the label noise. The last term is related to the convergence of the algorithm already analyzed in Theorem 5.1 in the absence of noise. The best stopping time is derived by optimizing the bound (5.13). In this sense, this is an a priori choice. Deriving an appropriate a posteriori choice is an interesting question left to a future study. Here, we note that the optimal stopping time is larger when the noise level is smaller, whereas the corresponding error decreases with respect to the noise. Another interesting question would be to derive corresponding lower bounds.
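The trade-off described in this remark can be visualized with a toy computation. The bound used below is hypothetical: it only mimics the qualitative shape of (5.13), with a stability term growing linearly in the iteration count and proportionally to the noise level, plus a geometrically decaying convergence term. The constants `a`, `c`, `q` and the function names are assumptions for illustration and do not reproduce the constants of Theorem 5.3.

```python
def hypothetical_bound(k, delta, a=1.0, c=1.0, q=0.5):
    """Toy error bound: stability term (grows with k) + convergence term (decays with k).
    This is NOT the bound (5.13); it only shares its qualitative structure."""
    return a * delta * k + c * q ** k

def a_priori_stopping_time(delta, k_max=200):
    """Iteration index minimizing the toy bound for a given noise level."""
    return min(range(k_max + 1), key=lambda k: hypothetical_bound(k, delta))

noise_levels = [0.1, 0.01, 0.001]
stops = [a_priori_stopping_time(d) for d in noise_levels]
errors = [hypothetical_bound(k, d) for k, d in zip(stops, noise_levels)]
for d, k, e in zip(noise_levels, stops, errors):
    print(d, k, e)
```

In this toy setting the minimizing iteration grows as the noise level shrinks, while the value of the bound at that iteration decreases, matching the qualitative behavior noted above.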
Remark 5.5 (Noise model).
The classification noise model considered above is simple and inspired by the classic deterministic noise in inverse problems. It allows to take a first step towards understanding the stability property of iterative regularization in classification. We note that other, possibly more complex, noise models are possible. For example, stochastic noise could be considered, possibly considering so called margin conditions [60]. A more substantial development would be to consider random input data, as often done in machine learning. This is likely to require results from empirical process theory [16] and possibly different statistical notions of stability already used in machine learning [17].
6 Numerical results
In this section, we investigate numerically the properties of Algorithms 1 and 2. First, we analyze their convergence and stability on some synthetic datasets. Second, we study their performance on two benchmark datasets, and compare them to some recent related works.
6.1 Synthetic data-set
Following [87] and [63], we consider a solution vector defining the maximal margin separator and two pairs of support vectors , labeled with and , labeled with . We then generate data-points and assign them to the two classes, so that the support vectors do not change, i.e. we have a larger distance from than the points , , and , see Figure 2.
We test Algorithms 1 and 2 for iteration with regularization parameter for all . In Figure 3 we illustrate the convergence results in terms of the margin and angle gap, and the error of the difference (i.e. ), as found in Theorems 5.1 and 5.2.
In this toy example we can notice that while the theoretical worst case bounds for Algorithm 1 are better than the ones for Algorithm 2 as expressed in Theorems 5.1 and 5.2, this is not necessarily reflected in Figure 2. This is due to the pessimistic worst case bound found in Theorem 5.2 for the inertial Algorithm 2, rather than a numerical issue. Indeed this mismatch between theory and practice for inertial methods is commonly observed in similar settings and is an active area of research which is left for future study. A second remark on this example concerns the influence of the over-relaxation parameter on the convergence behavior of Algorithm 2. As one can notice in Figure 2, the choice of can highly affect the performance of Algorithm 2 in relation with the stopping time. This observation raises an interesting question about the tuning of the parameter which is left for future study (see also the discussion in [3]).
Second, we consider a data set of points in , that consists of two classes distributed independently and we split them equally in training and test data. In this case, the training points are not linearly separable and Algorithms 1 and 2 are implemented with a Gaussian kernel with parameter , see Remark 4.1. The total number of iterations (kernel evaluations) is set . In this experiment, we aim at illustrating convergence but also the stability of the proposed methods in the presence of the noise induced by mislabeling an amount of the training data. In particular, we are interested in the effect of the parameter on the convergence behavior, as well as on the stability performance of Algorithms 1 and 2. In these experiments, the convergence is measured in terms of the margin, while the stability is measured via the test error on the test data. In Figure 4, we plot the margin gap and the test error of Algorithms 1 and 2 with parameter with , for three different levels of noise ( percentage of flipped training labels) starting from (no noise), (moderate noise) and (strong noise). In Figure 5, we repeat the same experiment by choosing various orders of decay for , that is .
Several comments can be made. First, we note that larger initialization (e.g. in Figure 4) or slower decay rate (e.g. or in Figure 5) lead to slower margin convergence, but better generalization properties especially in the presence of errors (second and third rows in Figures 4 and 5). In addition, while Algorithm 1 seems more robust with respect to the various changes of , the situation is different for the inertial variant, Algorithm 2. In both Figures 4 and 5, for Algorithm 1, the behavior of the margin gap and the test error does not change radically unless the initialization is very large (e.g. in Figure 4) or the decay rate is very slow (e.g. or in Figure 5). On the other hand, Algorithm 2 seems more sensitive to the different choices of the regularization parameter both in terms of margin gap and test error.
In the noiseless case (first row in Figures 4 and 5) it seems that choosing small enough (or decaying moderately fast to zero) can be a good policy. Note however that choosing too small an initialization (or too fast a decay to zero) does not offer any significant advantage with respect to more moderate choices of (see in particular the blue, magenta and khaki lines in the first row of Figures 4 and 5). On the other hand, in the presence of errors (second and third rows in Figures 4 and 5), larger initialization (like or in Figure 4) and slower decay rate (as or in Figure 5) may offer a better trade-off between margin convergence and test error.
6.2 Real data-set
Finally, we test the proposed methods on the MNIST dataset (see [53]) and on the HTRU dataset (see [57, 76]); HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South) [51]. In particular we compare the performance of Algorithms 1 and 2, with some of the recent proposed methods for binary classification [66, 87], and [44, 45] in terms of margin convergence and test error. The method in [66, 87] is a normalized gradient descent with variable stepsize on the exponential loss and the one in [44, 45] is based on an accelerated mirror descent approach on the smoothed margin of the exponential loss. We stress that this comparison is indicative for the theoretical convergence properties of the tested methods and is not meant to be exhaustive. The results are reported in Figure 6.
All the data were standardized with zero mean and unit standard deviation and the binary labels of each dataset are set to be and . The MNIST dataset is restricted to a two-digits comparison of vs , vs and vs . For the HTRU dataset the training part was chosen by randomly picking of the whole dataset, while the test part is composed of the remaining (see e.g. [76]). All methods are implemented with the Gaussian kernel with parameter for MNIST digits and for the HTRU dataset. Each update uses only the kernel evaluation and thus all the schemes have the same computational complexity per iteration. The test error is measured by the standard zero-one loss.
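The preprocessing and evaluation steps mentioned here can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the population standard deviation, the guard for constant features, and the kernel parametrization `exp(-||x - z||^2 / (2 sigma^2))` are assumptions, since the text does not specify them.

```python
import math

def standardize(rows):
    """Rescale each feature (column) to zero mean and unit standard deviation.
    Constant features are only centered (assumed convention)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel with width sigma (assumed parametrization)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def zero_one_error(y_true, y_pred):
    """Fraction of misclassified points (zero-one loss)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

Z = standardize([[1.0, 10.0], [3.0, 10.0]])
print(Z, gaussian_kernel(Z[0], Z[1], sigma=1.0), zero_one_error([1, -1, 1, 1], [1, 1, 1, -1]))
```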
As a general remark Algorithm 2 has the best performance in terms of margin convergence and test error in the MNIST datasets. For HTRU, while Algorithm 2 still provides the fastest margin convergence, it performs slightly worse than the method in [45] in terms of test error, whereas Algorithm 1 and the method proposed in [66] seem to give better test error results.
7 Conclusion and possible future work
In this work, we propose and study iterative regularization for classification in machine learning. Considering the hinge loss function, we derive an iterative regularization method defined by a dual diagonal optimization approach and further consider its accelerated variant, see [34, 21]. We provide convergence results, as well as convergence rates, to the minimum norm separating solution. Moreover we derive stability results for a natural classification noise model.
Several further research directions can be explored. For example it would be interesting to consider other forms of regularization, extending to classification the results for linear inverse problems, see [62] and references therein. Moreover, it would be interesting to consider different data models including stochastic noise and random input data as often done in statistical learning theory [93]. Finally it would be very interesting to consider nonlinear models following [77, 65].
Acknowledgements: We acknowledge the financial support of the European Research Council (grant SLING 819789), the AFOSR (European Office of Aerospace Research and Development) project FA9550-18-1-7009 and FA8655-22-1-7034, the EU H2020-MSCA-RISE project NoMADS - DLV-777826, the H2020-MSCA-ITN Project Trade-OPT 2019; L. R. acknowledges the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216 and IIT. S.V. and L.R. acknowledge the support of the Ministry of Education, University and Research (PRIN 202244A7YL project "Gradient Flows and Non-Smooth Geometric Structures with Applications to Optimization and Machine Learning") is part of the Indam group "Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro applicazioni". This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.
Appendix A Appendix
The Appendix is organized as follows: In paragraph A.1 we recall some basic results concerning Tikhonov regularization and gradient descent for linear problems, as well as some basic facts on the max-margin problem (MM). Paragraph A.2 contains some auxiliary results that are needed for the proofs of the main Theorems 5.1, 5.2 and 5.3. In paragraphs A.3 and A.4 we analyze Algorithms 1 and 2 (respectively) and provide the proofs of Theorems 5.1 and 5.2. Finally in paragraph A.5 we provide the proof of Theorem 5.3 regarding the stability of Algorithm 1.
A.1 General Lemmas
In the following two lemmas we establish the regularization properties of Tikhonov regularization and gradient descent to the minimal norm interpolating solution, for general losses.
Lemma A.1 (Tikhonov).
Let be a continuous and convex function. For all , consider:
(A.1) |
1. If , then
2. If , then
Proof.
For the first point, let and . By the definition of and .
yielding
(A.2) |
which allows to deduce that the sequence is uniformly bounded. This implies that (up to a subsequence) converges to an element . By lower semi-continuity of , we have :
which shows that . In addition from the (weak) lower semi-continuity of the norm and relation (A.2), we have :
and since the last inequality holds for an arbitrary , we deduce that .
The second point can be proven by contrapositive. Let us assume the existence of some , such that . By following the arguments of the proof of the first point, we deduce the existence of a limit point , which allows to conclude.
∎
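The first point of Lemma A.1 can be checked numerically on a tiny example. The sketch below assumes that (A.1) is the usual Tikhonov problem, minimizing the loss plus a squared-norm penalty weighted by the regularization parameter over two, and uses a hypothetical convex loss `0.5 * (<a, w> - b)**2` on the plane whose minimizers form a line. For this loss the regularized minimizer has the closed form `b * a / (||a||^2 + lam)`.

```python
import math

a = (1.0, 2.0)   # hypothetical data vector
b = 5.0          # hypothetical target
a_sq = sum(ai * ai for ai in a)

def tikhonov_solution(lam):
    """Closed-form minimizer of 0.5*(<a,w> - b)^2 + (lam/2)*||w||^2."""
    scale = b / (a_sq + lam)
    return tuple(scale * ai for ai in a)

def objective_gradient(w, lam):
    """Gradient of the regularized objective, used to verify optimality."""
    residual = sum(ai * wi for ai, wi in zip(a, w)) - b
    return tuple(residual * ai + lam * wi for ai, wi in zip(a, w))

# Minimal-norm minimizer of the unregularized loss: projection of 0 onto {<a,w> = b}.
w_dagger = tuple(b * ai / a_sq for ai in a)

def dist(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

lams = [1.0, 0.1, 0.01, 0.001]
dists = [dist(tikhonov_solution(lam), w_dagger) for lam in lams]
print(dists)
```

The printed distances shrink as the regularization parameter decreases, consistent with convergence of the regularized solutions to the minimal-norm minimizer.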
Lemma A.2 (Gradient descent).
Let be a linear operator, with , and be a proper, convex function with -Lipschitz gradient, such that
(A.3) |
Let also , and consider the gradient iteration on the first argument of , i.e.
(A.4) |
Then
(A.5) |
Proof.
First of all, since , the set of minimizers of is non-empty (i.e. , such that ). By the standard gradient descent analysis (see for example [67, Paragraph ]), it holds that .
In addition, since , by using (A.4), and a recurrence argument, we have that t. By closedness of , it follows that . By definition of , we have that the set of minimizers is an affine space, hence it can be expressed as , for some suitable space , where denotes the projection of the element in and is equal to , by definition.
Let us show that . Indeed for any , we have , and , which gives . On the other hand, if and , we have that , which allows to conclude that . By definition of the minimal norm solution it holds that (see [31, Proposition and Theorem ]). By combining the facts , , and , it follows that . ∎
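The implicit bias stated in Lemma A.2 can be observed on a small underdetermined system. The sketch below is an illustration under assumptions: it uses the least-squares loss (one instance of the general losses covered by the lemma), a hypothetical 2-by-3 matrix `X`, zero initialization, and step size `1/L` with `L` the largest eigenvalue of `X X^T`. The minimal-norm interpolating solution was computed by hand as `X^T (X X^T)^{-1} y`.

```python
def matvec(X, w):
    """Compute X w."""
    return [sum(xij * wj for xij, wj in zip(row, w)) for row in X]

def rmatvec(X, r):
    """Compute X^T r."""
    return [sum(X[i][j] * r[i] for i in range(len(X))) for j in range(len(X[0]))]

def gradient_descent(X, y, w0, step, n_iter):
    """Plain gradient descent on 0.5 * ||X w - y||^2."""
    w = list(w0)
    for _ in range(n_iter):
        residual = [p - yi for p, yi in zip(matvec(X, w), y)]
        grad = rmatvec(X, residual)
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w

X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]      # hypothetical data, 2 equations, 3 unknowns
y = [2.0, 3.0]
w_min_norm = [1.0 / 3.0, 4.0 / 3.0, 5.0 / 3.0]
step = 1.0 / 3.0                             # 1/L, eigenvalues of X X^T are 1 and 3

w_from_zero = gradient_descent(X, y, [0.0, 0.0, 0.0], step, 200)
# (1, 1, -1) lies in the kernel of X, so this start keeps a kernel component forever.
w_from_kernel = gradient_descent(X, y, [1.0, 1.0, -1.0], step, 200)
print(w_from_zero, w_from_kernel)
```

Both runs interpolate the data, but only the run started at zero reaches the minimal-norm solution; the other one converges to that solution shifted by the kernel component of its starting point.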
The next lemma establishes the equivalence between the max-margin problem (MM) and the min-norm problem (MN). In fact the result holds true for general positively -homogeneous features, i.e. by replacing in (MM) and (MN), by the general margin:
(A.6) |
Lemma A.3 (Max margin & min norm).
Let be a family of positively -homogeneous functions for all and consider the following optimization problems:
(A.7) |
Proof.
First of all, from the definition of , see (A.6), we have that for all
(A.9) |
By making the change of variable and taking into consideration (A.9), we can rewrite the max margin problem related to (A.7) as
and using the homogeneity property of ,
(A.10) |
Then setting , (A.10) can be equivalently written as
(A.11) |
Since the above problem is still scale invariant by letting , (A.11) is equivalent to
which is equivalent to the min-norm problem related to (A.8) (in the sense that they have the same set of solutions). By taking into consideration all the changes of variables, it follows that and , as well as that and . ∎
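For the linear case, the equivalence in Lemma A.3 can be checked by brute force in two dimensions. The sketch below is an illustration under assumptions: the data set is hypothetical, the min-norm solution `w_mn = (0.5, 0.5)` was found by inspection for this data, and the max-margin direction is approximated by a grid search over the unit circle. In this linear setting the maximal margin should equal the reciprocal of the norm of the min-norm solution, and the two solutions should share the same direction.

```python
import math

# Hypothetical separable data in the plane.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 1.0), (-1.0, -3.0)]
y = [1, -1, 1, -1]

# Min-norm vector satisfying y_i <w, x_i> >= 1 for all i (by inspection for this data).
w_mn = (0.5, 0.5)

def margin_of_direction(theta):
    """Margin min_i y_i <u, x_i> of the unit vector u = (cos theta, sin theta)."""
    u = (math.cos(theta), math.sin(theta))
    return min(yi * (u[0] * xi[0] + u[1] * xi[1]) for xi, yi in zip(X, y))

n_grid = 3600
thetas = [2.0 * math.pi * j / n_grid for j in range(n_grid)]
best_theta = max(thetas, key=margin_of_direction)
best_margin = margin_of_direction(best_theta)

inv_norm = 1.0 / math.hypot(*w_mn)
print(best_margin, inv_norm, best_theta, math.atan2(w_mn[1], w_mn[0]))
```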
The following lemma bounds the margin gap and the angle gap in terms of the distance between a method’s iterates and the hard-margin solution .
Lemma A.4 (Lemma [63]: Bounds for angle and margin).
Let and , and be the min-norm solution as defined in (MN). Let be a sequence such that . Then the following estimates hold true :
(A.12) |
(A.13) |
Proof.
Since , by using the identity
(A.14) |
we find :
(A.15) | ||||
which allows us to prove the first point.
The next lemma is a general descent lemma that is classically used for proximal gradient methods applied to structured composite convex optimization problems, such as (4.9). The interested reader can find a proof in [13, Lemma ].
Lemma A.5.
Let , where and are convex lower-semi continuous functions and is continuously differentiable with -Lipschitz gradient. For all , consider the operator , such that . Then for all it holds:
(A.18) |
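A numerical sanity check of a descent inequality of this type is sketched below. Since (A.18) is not reproduced here, the check uses the form of the inequality known from [13] for step size `1/L`, namely `F(x) - F(T(y)) >= (L/2) * |T(y) - y|^2 + L * (y - x) * (T(y) - y)`, which may differ in presentation from (A.18). The one-dimensional composite function, with smooth part `0.5 * (x - 3)^2` and nonsmooth part `|x|`, and the constant `L = 2` are hypothetical choices.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.|."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

L = 2.0  # any constant at least as large as the Lipschitz constant (here 1) of the smooth gradient

def smooth_grad(x):
    return x - 3.0

def F(x):
    """Composite objective: smooth quadratic plus absolute value."""
    return 0.5 * (x - 3.0) ** 2 + abs(x)

def T(y):
    """Proximal gradient step with step size 1/L."""
    return soft_threshold(y - smooth_grad(y) / L, 1.0 / L)

def slack(x, y):
    """Left-hand side minus right-hand side of the checked inequality."""
    t = T(y)
    return F(x) - F(t) - (0.5 * L * (t - y) ** 2 + L * (y - x) * (t - y))

grid = [i / 4.0 - 5.0 for i in range(41)]   # points from -5 to 5 with spacing 0.25
worst_slack = min(slack(x, yv) for x in grid for yv in grid)
print(worst_slack)
```

Taking `x = y` in the inequality shows in particular that one proximal gradient step never increases the objective.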
A.2 Preliminary results
In this paragraph, we state some basic facts concerning the properties of the dual regularized problem (4.9), necessary for the analysis of Algorithms 1 and 2.
We first recall the objective function associated to the dual penalized hinge loss problem (4.9) (see also (4.10)), where we take the regularization parameter to be given by a sequence .
(A.19) |
and the dual problem of (3.4) (see (4.11)), that is
(A.20) |
Below we state some fundamental properties of the dual objective function , that will be useful for the convergence analysis of both Algorithms 1 and 2. We start by showing that the sequence of regularized dual functions is monotonically pointwise decreasing to .
Lemma A.6.
Proof of Lemma A.6.
Since is non-increasing, for every it follows that the function is non-increasing in .
In addition, by direct computation for all , it holds:
(A.22) |
Thus, for all , the function is non-increasing in and that ∎
The following lemma is well-known (see e.g. [83, Proposition 94]), and we recall it here for the sake of completeness. It plays a fundamental role in our analysis since it provides a bound of the distance of the primal iterates from the min norm solution (MN) in terms of the distance of the dual objective function from its minimum.
Lemma A.7.
Proof of Lemma A.7.
From the separability assumption (1) there exists some , such that . Let and let . Then by setting , we have . Thus we deduce the existence of an element , such that is continuous at (since is continuous and ). By [73, Corollary ] and the optimality condition for the min-norm separating solution , the sum rule for the subdifferential holds and thus we have
(A.25) |
From (A.25), there exists some such that or equivalently . Hence, by replacing , we obtain:
(A.26) | ||||
which shows that . In order to prove (A.24), let and .
Since and , by using the Fenchel-Young equality for and , we find:
(A.27) | ||||
By using the strong convexity of (with parameter ) and the convexity of , we conclude that:
(A.28) |
∎
As a consequence of Lemma A.6, the following Lemma provides some basic estimates for the sequence generated by Algorithm 1.
Lemma A.8.
Let and be the sequence generated by Algorithm 1. Then the following estimate holds for all :
(A.29) |
In addition, is non-increasing and
(A.30) |
Proof of Lemma A.8.
Let . Since has -Lipschitz gradient, by applying the Descent Lemma A.5 with , for all and it holds:
(A.31) |
which allows to deduce (A.30).
By choosing in (A.31) and using the non-increasing property of in (see Lemma A.6), we obtain:
(A.32) |
which shows that the function is non-increasing in . Since is also non-increasing in and convergent to , from (A.31) we also have that is bounded for all . In addition , which shows that is also bounded from below by zero, and therefore converges to a non-negative limit.
By adding and subtracting in (A.31), for all , we find:
(A.33) |
which by summing up to gives:
(A.34) |
The last relation allows to conclude that and since and , we have that . ∎
In the next lemma we prove that each of the regularized dual problems satisfies the Łojasiewicz condition (5.1) with a common constant , not depending on . The proof of Lemma A.9 (see below) is inspired by the analysis presented in [12, Lemma ] with some modifications. More precisely we will prove that there exist some positive constants , such that for all , satisfies the following growth condition:
(A.35) |
In fact, relation (A.35) can be met under the name quadratic growth (see e.g. [35]) and is equivalent to (5.1).
It is worth mentioning that we cannot directly apply the proof in [12, Lemma ], due to the possible unboundedness of , when . Indeed by applying directly [12, Lemma ] we can deduce the existence of , such that (A.35) holds true. However this is not sufficient for establishing linear convergence of the proposed scheme (see Proposition A.1), since, in general, may vanish asymptotically.
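To make the notion of quadratic growth concrete, the sketch below checks it on a function that is convex but not strongly convex. It assumes the generic form of the condition, `f(u) - min f >= (mu / 2) * dist(u, argmin f)^2`, which may differ from (A.35) in its constants and in the set on which it is required. The function `0.5 * (<a, u> - b)**2` and the data `a`, `b` are hypothetical; its minimizers form a line, and the growth constant is `||a||^2`.

```python
import math
import random

a = (1.0, 2.0)   # hypothetical data
b = 5.0
a_sq = sum(ai * ai for ai in a)

def f(u):
    """Convex, not strongly convex: constant along the direction (2, -1)."""
    return 0.5 * (a[0] * u[0] + a[1] * u[1] - b) ** 2

def project_onto_minimizers(u):
    """Euclidean projection onto the line {<a, u> = b}, the set of minimizers of f."""
    t = (a[0] * u[0] + a[1] * u[1] - b) / a_sq
    return (u[0] - t * a[0], u[1] - t * a[1])

def dist_to_minimizers(u):
    p = project_onto_minimizers(u)
    return math.hypot(u[0] - p[0], u[1] - p[1])

mu = a_sq        # growth constant; for this function the condition holds with equality
rng = random.Random(0)
points = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(1000)]
gaps = [f(u) - 0.5 * mu * dist_to_minimizers(u) ** 2 for u in points]
print(min(gaps), max(gaps))
```

The minimum value of this function is zero, so the printed gaps compare the function value directly with the quadratic lower bound.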
Lemma A.9.
Proof of Lemma A.9.
For all , the problem is equivalent to the following constrained optimization one:
(A.37) |
If , relation (A.35) holds, so without loss of generality, we assume that , so that .
For all , let . By the optimality conditions for problem (A.37), for all , it holds:
(A.38) |
Notice that is a polyhedral set since , with and .
By [12, Lemma ], there exist a unique vector and a scalar such that the following equivalence holds true:
(A.39) |
so that , where and .
According to Hoffman’s lemma [43] (see also [94, Lemma ]) for the polyhedral sets and , by setting , there exists some positive constant given by (5.3), such that
(A.40) |
It is important to stress that Hoffman’s error bound constant only depends on the matrices and (see e.g. [96, Remark ] and the associated references). In our setting this means that the constant only depends on the data-matrix .
By taking the squares in (A.40), we find:
(A.41) |
Let us now bound appropriately the two terms in the right-hand-side of (A.41).
For the first term, by developing the square we find:
(A.42) | ||||
where in the last inequality we used the optimality condition (A.38). In addition, since , from (A.42) it follows:
(A.43) |
Let us now provide an upper bound for the term . On the one hand we have:
(A.44) | ||||
where in the first inequality we used the Cauchy-Schwarz inequality and for the last one the optimality condition (A.38). On the other hand we find:
(A.45) | ||||
where in the first inequality we used the Cauchy-Schwarz inequality and in the last one the convexity of . By relations (A.44) and (A.45), it follows that:
(A.46) | ||||
where in the second inequality we used (A.43) and in the last one the fact that and the definition of the norm operator . By injecting relations (A.43) and (A.46) into (A.41), for all we find:
(A.47) | ||||
which shows that for all , satisfies the growth condition (A.35) in , with
A.3 Proof of Theorem 5.1
In this paragraph we provide the proof of Theorem 5.1, concerning Algorithm 1. We start with the following proposition which allows to deduce an upper bound for the gap of the dual objective function and its minimum value.
Proposition A.1.
Proof of Proposition A.1.
Let and , such that . Following the proof of Lemma A.8, by choosing in (A.31), we find:
(A.50) |
Since , for all , by neglecting the non-negative term and summing over relation (A.50), we deduce that for all it holds , therefore the sequence is bounded by .
On the other hand, by choosing in (A.31) and using the non-increasing property of (Lemma A.6), we find:
(A.51) |
which allows to conclude that the sequence .
By definition of (see Algorithm 1) and the characterization of the proximal operator, for all , we have or equivalently
(A.52) |
Hence (A.52) together with the contraction property of the operator for all , yields:
(A.53) |
By combining relations (A.51) and (A.53), we obtain:
(A.54) |
Since with and , by using (5.1) and Lemma A.9 we find:
(A.55) |
By adding and subtracting and using that and (see Lemma A.6), for all , we derive:
(A.56) |
By induction in (A.56), for all , it follows
(A.57) |
which allows to conclude the proof. ∎
Proof of Theorem 5.1.
Let such that . Since , Lemma A.7 yields
Lemma A.6 and the non-increasing property of proved in Lemma A.8 imply
We then derive from Proposition A.1 that
(A.58) |
from which (5.4) follows using the definition of and . The inequality and the bound (A.58) give
(A.59) | ||||
By setting
(A.60) |
for all , it holds . From Lemma A.4 we derive:
(A.61) |
which, together with (A.58), allows to conclude the proof of Theorem 5.1.
∎
A.4 Proof of Theorem 5.2
In this paragraph, we turn our attention to the convergence properties of Algorithm 2, hence the proof of Theorem 5.2. The analysis is based on discrete Lyapunov-energy techniques that have recently become very popular for studying inertial schemes like Algorithm 2. Our analysis follows the line of study adopted in a recent stream of papers such as [90, 22, 6, 7, 2, 3, 21] and their related references. The proof of Theorem 5.2 is based on the following proposition which provides some bounds for the dual objective function .
Proposition A.2.
Proof of Proposition A.2.
Let and the sequence generated by Algorithm 2 with . For all , let us define the following auxiliary sequences:
(A.64) | ||||
(A.65) |
We also introduce the following energy sequence, which will play a fundamental role in our analysis:
(A.66) | ||||
We will show that by tuning properly the parameters and , the sequence is non-increasing.
The Descent Lemma A.5 (with and ), implies
(A.67) | ||||
By multiplying (A.67) with and using the definition of we derive:
(A.68) | ||||
On the other hand, the Descent Lemma A.5 (with and ) yields:
(A.69) |
which by multiplying by (here ) and using the definition of , gives
(A.70) |
Next, by using the non-increasing property of (in particular , see Lemma A.6) and adding and subtracting on the left-hand side of (A.71) and using the definition of , we derive:
(A.72) | ||||
Since , we have and , for all , thus from the previous relation we get:
(A.73) |
By adding and subtracting on both sides of the previous inequality, we have
(A.74) | ||||
with .
By considering now the variation of the sequence and performing some basic algebraic computations we have:
(A.75) | ||||
By adding on both sides of (A.74) and using (A.75), we obtain:
(A.76) | ||||
Since , (A.76) and the definition of (A.66), yield:
(A.77) |
Thus, setting and (here notice that ), we get:
(A.78) |
which shows that the energy sequence is non-increasing. In addition by neglecting the non-negative terms in the definition of (A.66) and using its non-increasing property, we derive:
(A.79) |
Proof of Theorem 5.2.
Lemma A.7 and the non-increasing property of given in Lemma A.6, together with the upper bound (A.62) in Proposition A.2 for the dual iterates yield
(A.82) |
where as follows by (A.63). Using the definition of and in the bound (A.82) allows us to conclude the proof of the first part of Theorem 5.2. Regarding the margin and the angle gap rates, we proceed as in the proof of Theorem 5.1. By using the triangle inequality and (A.82), and setting
(A.83) |
we deduce that for all , we have . Therefore, thanks to Lemma A.4, the proof follows directly from the bound (A.82). ∎
A.5 Proof of Theorem 5.3
In this paragraph we provide the proof of Theorem 5.3 regarding the stability properties of Algorithm 1 in the presence of noise, as discussed in Section 5.2.
Proof of Theorem 5.3.
Without loss of generality let us assume that , i.e. the set of flipped labels consists of the first indices (notice that up to a re-indexation one can always retrieve this case). Let also and , be the pair of sequences generated by Algorithm 1 applied to the noisy and true data respectively.
By using the triangle inequality, the Algorithm 1 for and and the definition of the operator norm, for all , it holds:
(A.84) | ||||
The bound on the right-hand side of (A.84) consists of three parts: the first two are related to the stability properties of Algorithm 1, while the last one is related to its optimization properties. By using the convergence result (5.4) in Theorem 5.1, the bound (A.84) takes the following form:
(A.85) |
where and .
From the definition of the noise model , for all , the -th row of matrix can be expressed as follows
(A.86) |
By using the expression (A.86) the second term in the right-hand side of (A.84) can be bounded as follows:
(A.87) | ||||
where we used that , for all and . In addition since is generated by Algorithm 1, it satisfies the Fejér property with respect to any minimizer of , i.e. (this can be seen as an immediate consequence of Lemma A.8). Therefore from (A.87), for all , it follows:
(A.88) |
For the first term in the right-hand side of (A.84), by using Algorithm 1, we obtain:
(A.89) | ||||
where in the second equality we used the scaling property of the proximal operator [11, Theorem ] and in the first inequality its non-expansiveness property [10, Proposition ]. The second inequality is the triangle inequality and the last one follows from the non-expansiveness of the operator (since ).
By applying recursively relation (A.89), since , for all , it follows:
(A.90) |
where we used the notation .
By using the noise model , for all , the matrix can be expressed element-wise as follows
(A.91) |
By the definition of the Euclidean norm, and the expression (A.91), for any , the term can be bounded as follows:
(A.92) | ||||
where in the first inequality we used the convexity of the squared norm, in the second one the definition of and in the last one, the convention .
Therefore, by using (A.92) in (A.90), it follows that:
(A.93) | ||||
where we used the triangle inequality and the Fejér property of the sequence with respect to any minimizer of , i.e. (as an immediate consequence of Lemma A.8).
References
- [1] D. Angluin and P. Laird, Learning from noisy examples, Machine learning, 2 (1988), pp. 343–370.
- [2] V. Apidopoulos, J.-F. Aujol, and C. Dossal, Convergence rate of inertial forward–backward algorithm beyond Nesterov’s rule, Mathematical Programming, 180 (2020), pp. 137–156.
- [3] V. Apidopoulos, J.-F. Aujol, C. Dossal, and A. Rondepierre, Convergence rates of an inertial gradient descent algorithm under growth and flatness conditions, Mathematical Programming, 187 (2021), pp. 151–193.
- [4] V. Apidopoulos, N. Ginatta, and S. Villa, Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition, Journal of Global Optimization, (2022), pp. 1–27.
- [5] H. Attouch, Viscosity solutions of minimization problems, SIAM Journal on Optimization, 6 (1996), pp. 769–806.
- [6] H. Attouch, A. Cabot, Z. Chbani, and H. Riahi, Rate of convergence of inertial gradient dynamics with time-dependent viscous damping coefficient, Evolution Equations and Control Theory, 7 (2018), pp. 353–371.
- [7] H. Attouch, Z. Chbani, J. Peypouquet, and P. Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Mathematical Programming, 168 (2018), pp. 123–175.
- [8] A. Auslender, J. Crouzeix, and P. Fedit, Penalty-proximal methods in convex programming, Journal of Optimization Theory and Applications, 55 (1987), pp. 1–21.
- [9] M. Bahraoui and B. Lemaire, Convergence of diagonally stationary sequences in convex optimization, Set-Valued Analysis, 2 (1994), pp. 49–61.
- [10] H. H. Bauschke and P. L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, Springer Science & Business Media, 2011.
- [11] A. Beck, First-order methods in optimization, SIAM, 2017.
- [12] A. Beck and S. Shtern, Linearly convergent away-step conditional gradient for non-strongly convex functions, Mathematical Programming, 164 (2017), pp. 1–27.
- [13] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM journal on imaging sciences, 2 (2009), pp. 183–202.
- [14] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter, From error bounds to the complexity of first-order descent methods for convex functions, Mathematical Programming, 165 (2017), pp. 471–507.
- [15] L. Bottou and O. Bousquet, The tradeoffs of large-scale learning, Optimization for machine learning, (2011), pp. 351–368.
- [16] S. Boucheron, O. Bousquet, and G. Lugosi, Theory of classification: a survey of some recent advances, ESAIM: Probability and Statistics, 9 (2010), pp. 323–375.
- [17] O. Bousquet and A. Elisseeff, Stability and generalization, Journal of Machine Learning Research, 2 (2002), pp. 499–526.
- [18] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge university press, 2004.
- [19] R. Boyer, Quelques algorithmes diagonaux en optimisation convexe, PhD thesis, Université de Provence, 1974.
- [20] P. Brianzi, F. Di Benedetto, and C. Estatico, Preconditioned iterative regularization in Banach spaces, Computational Optimization and Applications, 54 (2013), pp. 263–282.
- [21] L. Calatroni, G. Garrigos, L. Rosasco, and S. Villa, Accelerated iterative regularization via dual diagonal descent, arXiv preprint arXiv:1912.12153, (2019).
- [22] A. Chambolle and C. Dossal, On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”, Journal of Optimization Theory and Applications, 166 (2015), pp. 968–982.
- [23] O. Chapelle, Training a support vector machine in the primal, Neural computation, 19 (2007), pp. 1155–1178.
- [24] L. Chizat and F. Bach, Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss, in Conference on Learning Theory, PMLR, 2020, pp. 1305–1338.
- [25] K. L. Clarkson, E. Hazan, and D. P. Woodruff, Sublinear optimization for machine learning, Journal of the ACM (JACM), 59 (2012), pp. 1–49.
- [26] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation, 4 (2005), pp. 1168–1200.
- [27] C. Cortes and V. Vapnik, Support-vector networks, Machine learning, 20 (1995), pp. 273–297.
- [28] N. Cristianini, J. Shawe-Taylor, et al., An introduction to support vector machines and other kernel-based learning methods, Cambridge university press, 2000.
- [29] A. Dieuleveut, A. Durmus, and F. Bach, Bridging the gap between constant step size stochastic gradient descent and Markov chains, The Annals of Statistics, 48 (2020), pp. 1348–1382.
- [30] A. Dontchev and F. Lempio, Difference methods for differential inclusions: A survey, SIAM Review, 34 (1992), pp. 263–294.
- [31] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of inverse problems, vol. 375, Springer Science & Business Media, 1996.
- [32] M. C. Ferris and T. S. Munson, Interior-point methods for massive support vector machines, SIAM Journal on Optimization, 13 (2002), pp. 783–804.
- [33] Y. Freund and R. E. Schapire, Large margin classification using the perceptron algorithm, in Proceedings of the eleventh annual conference on Computational learning theory, 1998, pp. 209–217.
- [34] G. Garrigos, L. Rosasco, and S. Villa, Iterative regularization via dual diagonal descent, Journal of Mathematical Imaging and Vision, 60 (2018), pp. 189–215.
- [35] G. Garrigos, L. Rosasco, and S. Villa, Convergence of the forward-backward algorithm: Beyond the worst case with the help of geometry, Mathematical Programming, (2023), pp. 937–996.
- [36] O. Güler, Foundations of optimization, vol. 258, Springer Science & Business Media, 2010.
- [37] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro, Characterizing implicit bias in terms of optimization geometry, in International Conference on Machine Learning, PMLR, 2018, pp. 1832–1841.
- [38] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, Implicit bias of gradient descent on linear convolutional networks, in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Curran Associates, Inc., 2018, pp. 9461–9471.
- [39] S. Gunasekar, B. Woodworth, and N. Srebro, Mirrorless mirror descent: A natural derivation of mirror descent, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 2305–2313.
- [40] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, Implicit regularization in matrix factorization, in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., vol. 30, Curran Associates, Inc., 2017.
- [41] M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better: Stability of stochastic gradient descent, in International Conference on Machine Learning, PMLR, 2016, pp. 1225–1234.
- [42] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, The entire regularization path for the support vector machine, Journal of Machine Learning Research, 5 (2004), pp. 1391–1415.
- [43] A. J. Hoffman, On approximate solutions of systems of linear inequalities, Journal of Research of the National Bureau of Standards, 49 (1952).
- [44] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky, Gradient descent follows the regularization path for general losses, in Conference on Learning Theory, 2020, pp. 2109–2136.
- [45] Z. Ji, N. Srebro, and M. Telgarsky, Fast margin maximization via dual acceleration, in International Conference on Machine Learning, PMLR, 2021, pp. 4860–4869.
- [46] Z. Ji and M. Telgarsky, The implicit bias of gradient descent on nonseparable data, in Conference on Learning Theory, 2019, pp. 1772–1798.
- [47] Q. Jin, X. Lu, and L. Zhang, Stochastic mirror descent method for linear ill-posed problems in Banach spaces, Inverse Problems, 39 (2023), p. 065010.
- [48] H. Karimi, J. Nutini, and M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition, in Machine Learning and Knowledge Discovery in Databases, P. Frasconi, N. Landwehr, G. Manco, and J. Vreeken, eds., Cham, 2016, Springer International Publishing, pp. 795–811.
- [49] M. J. Kearns, R. E. Schapire, and L. M. Sellie, Toward efficient agnostic learning, in Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp. 341–352.
- [50] S. Keerthi and E. Gilbert, Convergence of a generalized SMO algorithm for SVM classifier design, Machine Learning, 46 (2002), pp. 351–360.
- [51] M. J. Keith, A. Jameson, W. van Straten, M. Bailes, S. Johnston, M. Kramer, A. Possenti, S. D. Bates, N. D. R. Bhat, M. Burgay, S. Burke-Spolaor, N. D’Amico, L. Levin, P. L. McMahon, S. Milia, and B. W. Stappers, The High Time Resolution Universe Pulsar Survey – I. System configuration and initial discoveries, Monthly Notices of the Royal Astronomical Society, 409 (2010), pp. 619–627.
- [52] S. Kindermann, Optimal-order convergence of Nesterov acceleration for linear ill-posed problems, Inverse Problems, 37 (2021), p. 065002.
- [53] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
- [54] J. Lin, L. Rosasco, and D.-X. Zhou, Iterative regularization for learning with convex loss functions, Journal of Machine Learning Research, 17 (2016), pp. 1–38.
- [55] S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, in Les Équations aux Dérivées Partielles (Paris, 1962), Éditions du Centre National de la Recherche Scientifique, Paris, 1963, pp. 87–89.
- [56] J. Lopez and J. Dorronsoro, The convergence rate of linearly separable SMO, in Proceedings of the International Joint Conference on Neural Networks, 2013, pp. 1–7.
- [57] R. J. Lyon, B. Stappers, S. Cooper, J. M. Brooke, and J. D. Knowles, Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 459 (2016), pp. 1104–1123.
- [58] K. Lyu and J. Li, Gradient descent maximizes the margin of homogeneous neural networks, arXiv preprint arXiv:1906.05890, (2019).
- [59] B. Martinet, Perturbation des méthodes d’optimisation. applications, RAIRO. Analyse numérique, 12 (1978), pp. 153–171.
- [60] P. Massart and É. Nédélec, Risk bounds for statistical learning, The Annals of Statistics, (2006), pp. 2326–2366.
- [61] S. Matet, L. Rosasco, S. Villa, and B. C. Vu, Implicit regularization with strongly convex bias: Stability and acceleration, Analysis and Applications, (2023), pp. 165–191.
- [62] C. Molinari, M. Massias, L. Rosasco, and S. Villa, Iterative regularization for convex regularizers, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1684–1692.
- [63] D. Molitor, D. Needell, and R. Ward, Bias of homotopic gradient descent for the hinge loss, Applied Mathematics & Optimization, (2020), pp. 1–27.
- [64] N. Mücke, G. Neu, and L. Rosasco, Beating SGD saturation with tail-averaging and minibatching, Advances in Neural Information Processing Systems, 32 (2019), pp. 12568–12577.
- [65] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry, Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, 09–15 Jun 2019, pp. 4683–4692.
- [66] M. S. Nacson, J. Lee, S. Gunasekar, P. H. Savarese, N. Srebro, and D. Soudry, Convergence of gradient descent on separable data, arXiv preprint arXiv:1803.01905, (2018).
- [67] Y. Nesterov, Introductory lectures on convex optimization: A basic course, 2013.
- [68] A. Neubauer, On Nesterov acceleration for Landweber iteration of linear ill-posed problems, Journal of Inverse and Ill-posed Problems, 25 (2016), pp. 381–390.
- [69] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro, Geometry of optimization and implicit regularization in deep learning, arXiv preprint arXiv:1705.03071, (2017).
- [70] B. Neyshabur, R. Tomioka, and N. Srebro, In search of the real inductive bias: On the role of implicit regularization in deep learning, arXiv preprint arXiv:1412.6614, (2014).
- [71] A. B. Novikoff, On convergence proofs for perceptrons, tech. rep., Stanford Research Institute, Menlo Park, CA, 1963.
- [72] N. Pagliana and L. Rosasco, Implicit regularization of accelerated methods in Hilbert spaces, in Advances in Neural Information Processing Systems, 2019, pp. 14481–14491.
- [73] J. Peypouquet, Convex optimization in normed spaces: theory, methods and examples, Springer, 2015.
- [74] J. C. Platt, Fast training of support vector machines using sequential minimal optimization, in Advances in kernel methods: support vector learning, MIT Press, 1999, pp. 185–208.
- [75] A. Ramdas and J. Pena, Towards a deeper geometric, analytic and algorithmic understanding of margins, Optimization Methods and Software, 31 (2016), pp. 377–391.
- [76] M. Rando, L. Carratino, S. Villa, and L. Rosasco, Ada-BKB: Scalable Gaussian process optimization on continuous domains by adaptive discretization, in International Conference on Artificial Intelligence and Statistics, PMLR, 2022, pp. 7320–7348.
- [77] A. Rangamani, M. Xu, A. Banburski, Q. Liao, and T. Poggio, Dynamics and neural collapse in deep classifiers trained with the square loss, Center for Brains, Minds and Machines (CBMM) Memo, (2021).
- [78] G. Raskutti, M. J. Wainwright, and B. Yu, Early stopping and non-parametric regression: an optimal data-dependent stopping rule, The Journal of Machine Learning Research, 15 (2014), pp. 335–366.
- [79] L. Rosasco and S. Villa, Learning with incremental iterative regularization, in Advances in Neural Information Processing Systems, 2015, pp. 1630–1638.
- [80] L. Rosasco, S. Villa, and B. C. Vũ, Convergence of stochastic proximal gradient algorithm, Applied Mathematics and Optimization, (2020), pp. 891–917.
- [81] S. Rosset, J. Zhu, and T. Hastie, Boosting as a regularized path to a maximum margin classifier, Journal of Machine Learning Research, 5 (2004), pp. 941–973.
- [82] S. Rosset, J. Zhu, and T. J. Hastie, Margin maximizing loss functions, in Advances in neural information processing systems, 2004, pp. 1237–1244.
- [83] S. Salzo and S. Villa, Proximal gradient methods for machine learning and imaging, in Harmonic and applied analysis—from Radon transforms to machine learning, Appl. Numer. Harmon. Anal., Birkhäuser/Springer, Cham, 2021, pp. 149–244.
- [84] T. Schuster, B. Kaltenbacher, B. Hofmann, and K. S. Kazimierski, Regularization methods in Banach spaces, in Regularization Methods in Banach Spaces, de Gruyter, 2012.
- [85] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms, Cambridge university press, 2014.
- [86] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, Pegasos: Primal estimated sub-gradient solver for SVM, Mathematical programming, 127 (2011), pp. 3–30.
- [87] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, The implicit bias of gradient descent on separable data, J. Mach. Learn. Res., 19 (2018), pp. 2822–2878.
- [88] B. Stankewitz, N. Mücke, and L. Rosasco, From inexact optimization to learning via gradient concentration, arXiv preprint arXiv:2106.05397, (2021).
- [89] I. Steinwart and A. Christmann, Support vector machines, Springer Science & Business Media, 2008.
- [90] W. Su, S. Boyd, and E. J. Candes, A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, Journal of Machine Learning Research, 17 (2016), pp. 1–43.
- [91] A. N. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Math., 4 (1963), pp. 1035–1038.
- [92] A. N. Tikhonov and V. Arsenine, Solutions of ill-posed problems, John Wiley and Sons, 1977.
- [93] V. Vapnik, The nature of statistical learning theory, Springer science & business media, 2013.
- [94] P.-W. Wang and C.-J. Lin, Iteration complexity of feasible descent methods for convex optimization, The Journal of Machine Learning Research, 15 (2014), pp. 1523–1548.
- [95] Y. Yao, L. Rosasco, and A. Caponnetto, On early stopping in gradient descent learning, Constructive Approximation, 26 (2007), pp. 289–315.
- [96] C. Zǎlinescu, Sharp estimates for Hoffman’s constant for systems of linear inequalities and equalities, SIAM Journal on Optimization, 14 (2003), pp. 517–533.
- [97] G. Zanghirati and L. Zanni, A parallel solver for large quadratic programs in training support vector machines, Parallel Computing, 29 (2003), pp. 535–551. Parallel computing in numerical optimization.
- [98] T. Zhang and B. Yu, Boosting with early stopping: Convergence and consistency, The Annals of Statistics, 33 (2005), pp. 1538–1579.
- [99] Y. Zhang, On the acceleration of optimal regularization algorithms for linear ill-posed inverse problems, Calcolo, 60 (2023), p. 6.