The availability of large high-dimensional data sets has motivated an interest in the interplay between statistics and optimization, towards developing new, more efficient learning solutions. Indeed, while much theoretical work has classically been devoted to studying the statistical properties of estimators defined by variational schemes (e.g. Empirical Risk Minimization or Tikhonov regularization), and to the computational properties of optimization procedures that solve the corresponding minimization problems (see e.g. ), much less work has considered the integration of statistical and optimization aspects, see for example [15, 39, 25].
With the latter objective in mind, in this paper, we focus on so-called iterative regularization. This class of methods, which originated in a series of works in the mid-eighties [23, 26], is based on the observation that early termination of an iterative optimization scheme applied to an ill-posed problem has a regularization effect. A critical implication of this fact is that the number of iterations serves as a regularization parameter, hence linking modeling and computational aspects: computational resources are directly linked to the generalization properties in the data, rather than their raw amount. Further, iterative regularization algorithms have a built-in “warm restart” property which allows one to compute automatically a whole sequence (path) of solutions corresponding to different levels of regularization. This latter property is especially relevant for efficiently determining the appropriate regularization via model selection.
In machine learning, iterative regularization is often simply referred to as early stopping and is a well known “trick”, e.g. in training neural networks. Theoretical studies of iterative regularization in machine learning have mostly focused on the least squares loss function [11, 41, 7, 27]. Indeed, it is in this latter case that the connection to inverse problems can be made precise . Interestingly, early stopping with the square loss has been shown to be related to boosting  and also to be a special case of a large class of regularization approaches based on spectral filtering [19, 4]. The regularizing effect of early stopping for loss functions other than the least squares one has hardly been studied. Indeed, to the best of our knowledge the only papers considering related ideas are [3, 6, 20, 45], where early stopping is studied in the context of boosting algorithms.
This paper is a different step towards understanding how early stopping can be employed with general convex loss functions. Within a statistical learning setting, we consider convex loss functions and propose a new form of iterative regularization based on the subgradient method, or on gradient descent if the loss is smooth. The resulting algorithms provide iterative regularization alternatives to support vector machines or regularized logistic regression, and have the built-in property of computing the whole regularization path. Our primary contribution in this paper is theoretical. By integrating optimization and statistical results, we establish non-asymptotic bounds quantifying the generalization properties of the proposed method under standard regularity assumptions. Interestingly, our study shows that considering the last iterate leads to essentially the same results as considering averaging, or selecting the “best” iterate, as typically done in subgradient methods. From a technical point of view, considering a general convex loss requires different error decompositions than those for the square loss. Moreover, operator theoretic techniques need to be replaced by convex analysis and empirical process results. The error decomposition we consider accounts for the contribution of both optimization and statistics to the error, and could also be useful for other methods.
The rest of the paper is organized as follows. We begin in Section 2 by briefly recalling the supervised learning problem, and then introduce our learning algorithm and discuss its numerical realization. In Section 3, after discussing the assumptions that underlie our analysis, we present and discuss our main theorems, together with the general error decomposition, which is composed of three terms: the computational, sample and approximation errors. In Section 4, we estimate the computational error, while in Section 5 we develop sample error bounds and finally prove our main results.
2 Learning Algorithm
After briefly recalling the supervised learning problem, we introduce the algorithm we propose and give some comments on its numerical realization.
2.1 Problem Statement
In this paper we consider the problem of supervised learning. Let be a separable metric space, and let be a Borel probability measure on . Moreover, let be a so-called loss function, measuring the local error for and . The generalization error (or expected risk) associated to is given by
and is well defined for any measurable loss function and measurable function . We assume throughout that there exists a function that minimizes the expected error among all measurable functions . Roughly speaking, the goal of learning is to find an approximation of when the measure is known only through a sample of size independently and identically drawn according to . More precisely, given the goal is to design a computational procedure to efficiently estimate a function , an estimator, for which it is possible to derive an explicit probabilistic bound on the excess expected risk
We end this section with a remark and an example.
For several loss functions, it is possible to show that exists; see the example below. However, as will be seen in the following, the search for an estimator in practice is often restricted to some hypothesis space of measurable functions. In this case one should replace by . Interestingly, examples of hypothesis spaces are known for which , namely universal hypothesis spaces . In the following, we consider , with the understanding that it should be replaced by the infimum over , if the latter is not universal.
The following example gives several possible choices of loss functions.
The most classical example of a loss function is probably the square loss , . In this case, is the regression function, defined at every point as the expectation of the conditional distribution of given [17, 34]. Further examples include the absolute value loss , for which is the median of the conditional distribution, and more generally -loss functions , . Vapnik’s -insensitive loss , and its generalizations , provide yet other examples. For classification, i.e. , further examples of loss functions include the hinge loss , the logistic loss and the exponential loss . For all these examples can be computed, see e.g. , and measurability is easy to check.
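The loss functions named in this example can be written down directly. The sketch below is illustrative only: it assumes the convention that the first argument is the label `y` and the second is the prediction `a = f(x)`, and the function names and the default value of `eps` are hypothetical choices rather than notation from the text.

```python
import math

# Assumed convention: y is the label, a = f(x) is the real-valued prediction.

def square_loss(y, a):
    return (y - a) ** 2

def p_loss(y, a, p=1.0):
    # p = 1 gives the absolute value loss; p >= 1 keeps the loss convex in a
    return abs(y - a) ** p

def eps_insensitive_loss(y, a, eps=0.1):
    # Vapnik's epsilon-insensitive loss; eps = 0.1 is an arbitrary default
    return max(abs(y - a) - eps, 0.0)

# Classification losses, for labels y in {-1, +1}.

def hinge_loss(y, a):
    return max(0.0, 1.0 - y * a)

def logistic_loss(y, a):
    # log(1 + exp(-y a)), rearranged so that exp never overflows
    z = -y * a
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def exponential_loss(y, a):
    return math.exp(-y * a)
```

All of these are convex in the prediction `a`, which is the property the rest of the section relies on.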
2.2 Learning via Subgradient Methods with Early Stopping
To present the proposed learning algorithm we need a few preliminary definitions. Consider a reproducing kernel , that is a symmetric function, such that the matrix is positive semidefinite for any finite set of points in . Recall that a reproducing kernel defines a reproducing kernel Hilbert space (RKHS) as the completion of the linear span of the set with respect to the inner product . Moreover, assume the loss function to be measurable and convex in its second argument, so that the corresponding left derivative exists and is non-decreasing at every point. For a step size sequence , a stopping iteration and an initial value , we consider the iteration
Indeed, it is easy to see that , the subgradient of the empirical risk for . In the special case where the loss function is smooth, (2.1) reduces to the gradient descent algorithm. Since the subgradient method is not a descent algorithm, rather than the last iterate, the so-called Cesàro mean is often considered, corresponding, for , to the following weighted average
Alternatively, the best iterate is also often considered, which is defined for by
In what follows, we will consider the learning algorithms obtained considering these different choices.
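The three estimators discussed above (last iterate, weighted average, best iterate) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: it assumes the hinge loss, a zero initial function, a step size of the form `eta1 * t**(-theta)`, averaging weights proportional to the step sizes, and the best iterate chosen by empirical risk. All function and parameter names are hypothetical.

```python
def linear_kernel(x, xp):
    return sum(a * b for a, b in zip(x, xp))

def hinge(y, a):
    return max(0.0, 1.0 - y * a)

def hinge_left_derivative(y, a):
    # Left derivative of a -> max(0, 1 - y a) for y in {-1, +1}.
    margin = y * a
    if margin < 1.0 or (margin == 1.0 and y > 0):
        return -float(y)
    return 0.0

def subgradient_early_stopping(X, y, kernel, T, eta1=1.0, theta=0.5):
    """Run iterates f_1, ..., f_T, each stored as coefficients c with
    f(x) = sum_j c[j] * kernel(X[j], x). Returns (last, average, best)."""
    m = len(X)
    G = [[kernel(xi, xj) for xj in X] for xi in X]  # kernel matrix
    c = [0.0] * m                                   # assumed start: f_1 = 0
    avg, eta_sum = [0.0] * m, 0.0
    best, best_risk = list(c), float("inf")
    for t in range(1, T + 1):
        preds = [sum(G[i][j] * c[j] for j in range(m)) for i in range(m)]
        risk = sum(hinge(y[i], preds[i]) for i in range(m)) / m
        if risk < best_risk:                        # best iterate so far
            best, best_risk = list(c), risk
        eta = eta1 * t ** (-theta)                  # decaying step size
        avg = [avg[j] + eta * c[j] for j in range(m)]
        eta_sum += eta
        if t < T:                                   # subgradient step to f_{t+1}
            c = [c[j] - eta * hinge_left_derivative(y[j], preds[j]) / m
                 for j in range(m)]
    avg = [a / eta_sum for a in avg]                # weights eta_t / sum_k eta_k
    return c, avg, best

def predict(c, X, kernel, x):
    return sum(cj * kernel(xj, x) for cj, xj in zip(c, X))
```

Here the stopping iteration `T` plays the role of the regularization parameter: running the loop once up to a large `T` visits every intermediate estimator along the way.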
We note that classical results [5, 10, 9] on the subgradient method focus on how the iteration (2.1) can be used to minimize . In contrast to these studies, in the following we are interested in showing how iteration (2.1) can be used to define a statistical estimator, hence a learning algorithm to minimize the expected risk , rather than the empirical risk . We end with one remark.
Remark 2.2 (Early Stopping SVM and Kernel Perceptron).
If we consider the hinge loss function in (2.1), the corresponding algorithm is closely related to a batch (kernel) version of the perceptron [29, 1], where an entire pass over the data is done before updating the solution. Such an algorithm can also be seen as an early stopping version of Support Vector Machines . Interestingly, in this case the whole regularization path is computed incrementally, although sparsity could be lost. We defer to future work the study of the practical implications of these observations.
2.3 Numerical Realization
The simplest case to derive a numerical procedure from Algorithm 2.1 is when for some and is the associated inner product. In this case it is straightforward to see that for all , with
Beyond the linear kernel, it can be easily seen that given a finite dictionary
one can consider the kernel . In this case, it holds , for all , with
and . Finally, for a general kernel it is easy to prove by induction that for all , with
for and with . Indeed, the base case is straightforward to check and moreover by the inductive hypothesis
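The two ends of this discussion, a weight vector for the linear kernel and a coefficient vector for a general kernel, can be compared numerically. The sketch below is an illustration under assumptions: it uses the absolute value loss, a zero initial function, and an update of the form "previous value minus step size times the averaged left derivative"; the names `run_primal`, `run_coefficients` and `eta` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def left_derivative(y, a):
    # Left derivative in a of the absolute value loss |y - a| (illustrative choice).
    return -1.0 if a <= y else 1.0

def run_primal(X, y, T, eta):
    # Linear kernel: keep a weight vector w with f_t(x) = <w, x>.
    m, d = len(X), len(X[0])
    w = [0.0] * d
    for t in range(1, T):
        g = [left_derivative(y[j], dot(w, X[j])) for j in range(m)]
        step = eta(t) / m
        w = [w[k] - step * sum(g[j] * X[j][k] for j in range(m))
             for k in range(d)]
    return w

def run_coefficients(X, y, T, eta):
    # General kernel form: keep coefficients c with f_t(x) = sum_j c[j] K(x_j, x).
    # The kernel matrix G is built from the linear kernel so both runs are comparable.
    m = len(X)
    G = [[dot(xi, xj) for xj in X] for xi in X]
    c = [0.0] * m
    for t in range(1, T):
        preds = [dot(G[i], c) for i in range(m)]
        step = eta(t) / m
        c = [c[j] - step * left_derivative(y[j], preds[j]) for j in range(m)]
    return c
```

For the linear kernel the weight vector is recovered from the coefficients as the corresponding combination of the training points, so the primal form costs on the order of the sample size times the dimension per iteration, while the coefficient form needs the full kernel matrix.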
3 Main Results with Discussions
After presenting our main assumptions, in this section we state and discuss our main results.
Our learning rates will be stated under several conditions on the triple , which we describe and comment on next. We begin with a basic assumption.
We assume the kernel to be bounded, that is and moreover and . Furthermore, we consider the following growth condition for the left derivative . For some and constant it holds,
The boundedness conditions on and are fairly common [17, 34]. They could probably be weakened by considering a more involved analysis which is outside the scope of this paper. Interestingly, the growth condition on the left derivative of is weaker than assuming the loss, or its gradient, to be Lipschitz in its second entry, as often done both in learning theory [17, 34] and in optimization . We note that the growth condition (3.1) is implied by the requirement for the loss function to be Nemitski, as introduced in  (see also ). This latter condition, which is satisfied by most loss functions, is natural to provide a variational characterization of the learning problem.
The second assumption refines the above boundedness condition by considering a variance-expectation bound which quantifies a notion of noise in the measure with respect to the balls in , [17, 34].
We assume that there exist an exponent and a positive constant such that for any and , we have
Assumption 3.2 always holds true for , in which case will also depend on . In classification, the above condition can be related to the so called Tsybakov margin condition. The latter quantifies the intuition that a classification problem is hard if the conditional probability of given is close to for many input points. More precisely if we denote by the conditional probability for all and by the marginal probability on , then we say that satisfies the Tsybakov margin condition with exponent if there exists a constant such that for all
Interestingly, under the Tsybakov margin condition, Assumption 3.2 holds with and with depending only on .
The third condition is about the decay of a suitable notion of approximation error.
Let be a minimizer of:
The approximation error associated with the triple is defined by
We assume that for some and , the approximation error satisfies
The above assumption is standard when analyzing regularized empirical risk minimization and is related to the definition of interpolation spaces by means of the so-called K-functional . Interestingly, we will see in the following that it is also important when analyzing the approximation properties of the subgradient algorithm 2.1.
Finally, the last condition characterizes the capacity of a ball in the RKHS in terms of empirical covering numbers, and plays an essential role in sample error estimates. Recall that for a subset of a metric space , the covering number is defined by
Let be a set of functions on . The metric is defined on by
We assume that for some , , the covering numbers of the unit ball in with respect to satisfy
The smaller is the more stringent is the capacity assumption. As approaches we are essentially considering a capacity independent scenario. In what follows, we will briefly comment on the connection between the above assumption and other related assumptions. Recall that the capacity of the RKHS may be measured by various concepts: covering numbers of balls in , (dyadic) entropy numbers and decay of the eigenvalues of the integral operator given by where . For a subset of a metric space , its -th entropy number is defined by
First, the covering and entropy numbers are equivalent (see e.g. [34, Lemma 6.21]). Indeed, for the covering numbers satisfy
for some , if and only if the entropy numbers satisfy
for some . Second, it is shown in  that if the eigenvalues of the integral operator satisfy
for some constants and then the expectations of the random entropy numbers satisfy
for some constant . Hence, using the equivalence of covering and entropy numbers, one can be estimated from the eigenvalue decay of the integral operator . Last, since one has that for any is bounded by the uniform covering number of under the metric . Thus, the covering numbers can be estimated given the uniform smoothness of the kernel .
3.2 Finite Sample Bounds for General Convex Loss Functions
The following is our main result, providing a general finite sample bound for the iterative regularization induced by the subgradient method for convex loss functions when considering the last iterate.
The proof is deferred to Section 5 and is based on a novel error decomposition, discussed in Section 3.6, integrating statistical and optimization aspects. We illustrate the above result for Lipschitz loss functions, that is considering .
The above results give finite sample bounds on the excess risk, provided that a suitable stopping rule is considered. While the stopping rule in the above theorems is distribution dependent, a data-driven stopping rule can be given by hold-out cross validation and adaptively achieves the same bound. The proof of this latter result is straightforward using the techniques in  and is omitted. The obtained bounds directly yield strong consistency (almost sure convergence) using standard arguments. Interestingly, our analysis suggests that a decaying stepsize needs to be chosen to achieve meaningful error bounds. The stepsize choice can influence both the early stopping rule and the error bounds. More precisely, if the step size decreases fast enough , the stopping rule depends on the decay speed but the error bound does not. In this case the best possible choice for the early stopping rule is , that is in the case of Lipschitz loss functions. With this choice, if for example we take the limit , , we have that the stopping rule scales as whereas the corresponding finite sample bound is . A slower stepsize decay given by affects both the stopping rule and the error bounds, but the results in this regime worsen. A more detailed discussion of the obtained bounds in comparison to other learning algorithms is postponed to Section 3.5. Next we discuss the behavior of different variants of the proposed algorithm.
As mentioned before, in the subgradient method, when the goal is empirical risk minimization, the average or best iterates are often considered (see (2.2), (2.3)). It is natural to ask what the properties are of the estimators obtained with these latter choices, that is, when they are used as approximate minimizers of the expected, rather than the empirical, risk. The following theorem provides an answer.
The above result shows that, perhaps surprisingly, the behavior of the average and best iterates is essentially the same as that of the last iterate. Indeed, there is only a subtle difference between the upper bounds in Theorem 3.7 and Theorem 3.5, since the latter has an extra factor when . In the next section we consider the case where the loss is not only convex but also smooth.
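A data-driven stopping rule by hold-out validation, as mentioned above, can be sketched as follows. This is a minimal illustration and not the procedure analyzed in the text: it assumes a linear model, the hinge loss, a zero start and a step size `eta1 * t**(-theta)`, and the synthetic data, the split and all names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge(y, a):
    return max(0.0, 1.0 - y * a)

def hinge_left_derivative(y, a):
    margin = y * a
    if margin < 1.0 or (margin == 1.0 and y > 0):
        return -float(y)
    return 0.0

def holdout_early_stopping(train, val, T, eta1=1.0, theta=0.5):
    """Return (t_star, val_risks): the iteration among f_1, ..., f_T with the
    smallest hold-out risk, and the hold-out risk of every iterate."""
    m, d = len(train), len(train[0][0])
    w = [0.0] * d
    val_risks = []
    for t in range(1, T + 1):
        # hold-out risk of the current iterate f_t
        val_risks.append(sum(hinge(yv, dot(w, xv)) for xv, yv in val) / len(val))
        eta = eta1 * t ** (-theta)          # decaying step size
        grad = [0.0] * d
        for x, label in train:              # subgradient of the training risk
            s = hinge_left_derivative(label, dot(w, x)) / m
            grad = [g + s * xk for g, xk in zip(grad, x)]
        w = [wk - eta * g for wk, g in zip(w, grad)]
    t_star = min(range(T), key=lambda i: val_risks[i]) + 1
    return t_star, val_risks

def make_data(n, rng):
    # Synthetic two-dimensional points with noisy labels (illustrative only).
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        noisy = x[0] + 0.3 * rng.gauss(0.0, 1.0)
        data.append((x, 1 if noisy >= 0 else -1))
    return data

rng = random.Random(0)
train, val = make_data(40, rng), make_data(40, rng)
t_star, val_risks = holdout_early_stopping(train, val, T=50)
```

Because every iterate is produced by a single run, the whole list `val_risks` comes at the cost of one pass up to `T`, which is the computational appeal of using the iteration count as the regularization parameter.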
3.3 Finite Sample Bounds for Smooth Loss Functions
In this section, we additionally assume that is differentiable and is Lipschitz continuous with constant , i.e., for any and
For the logistic loss in binary classification, see Example 2.1, it is easy to prove that both and are Lipschitz continuous with constant , for all . With the above smoothness assumption, we prove the following convergence result.
The proof of this result will be given in Section 5. We can simplify the result by considering a Lipschitz loss function () and setting .
Under the assumptions of Theorem 3.8, let If is the integer part of , then for any , with confidence , we have
where the power indices and are defined as
and is a constant independent of or .
The finite sample bound obtained above is essentially the same as the best possible bound obtained for general convex loss functions. However, the important difference is that for smooth loss functions, a constant stepsize can be chosen, which allows one to considerably improve the stopping rule. Indeed, if, for example, we consider the limit , , we have that the stopping rule is , rather than , whereas the corresponding finite sample bound is again .
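The smooth case with a constant step size can be illustrated with the logistic loss. The sketch below is an assumption-laden example rather than the setting of the theorem: it uses a linear model, inputs of norm at most one, a zero start and the constant step size `eta = 1.0`; the names are hypothetical. With these choices the empirical risk does not increase along the iterations, which is a standard property of gradient descent on smooth convex functions and not a claim taken from the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logistic_loss(y, a):
    # log(1 + exp(-y a)), rearranged so that exp never overflows
    z = -y * a
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def logistic_derivative(y, a):
    # Derivative in a of log(1 + exp(-y a)), i.e. -y / (1 + exp(y a)).
    z = y * a
    if z >= 0:
        e = math.exp(-z)
        return -y * e / (1.0 + e)
    return -y / (1.0 + math.exp(z))

def gradient_descent(X, y, T, eta=1.0):
    """Constant step size gradient descent on the empirical logistic risk.
    Returns the final weights and the empirical risk of each iterate."""
    m, d = len(X), len(X[0])
    w = [0.0] * d
    risks = []
    for _ in range(T):
        preds = [dot(w, x) for x in X]
        risks.append(sum(logistic_loss(y[j], preds[j]) for j in range(m)) / m)
        grad = [sum(logistic_derivative(y[j], preds[j]) * X[j][k]
                    for j in range(m)) / m for k in range(d)]
        w = [wk - eta * gk for wk, gk in zip(w, grad)]
    return w, risks
```

Since the step size does not shrink, far fewer iterations are needed to reach a given level of regularization than with a decaying schedule, which is the improvement in the stopping rule described above.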
3.4 Iterative Regularization for Classification: Surrogate Loss Functions and Hinge Loss
We briefly discuss how the above results allow us to derive error bounds in binary classification. In this latter case and a natural choice for the loss function is the misclassification loss given by
for and , if , and otherwise. The corresponding generalization error, usually denoted by , is called the misclassification risk, since it can be shown to be the probability of the event . The minimizer of the misclassification error is the Bayes rule given by
The misclassification loss (3.10) is neither convex nor smooth and thus leads to intractable problems. Moreover, the search for a solution among binary valued functions is also infeasible. In practice, a convex (so-called surrogate) loss function is typically considered and a classifier is obtained by estimating a real function and then taking its sign defined as
The question arises of whether, and how, error bounds on the excess risk yield results on . Indeed, so-called comparison results are known relating these different error measures, see e.g. [17, 34] and references therein. We discuss in particular the case of the hinge loss function, see Example 2.1, since in this case for all measurable functions it holds that
Indeed, the hinge loss satisfies Assumption (3.1) with and, under the Tsybakov noise condition, Assumption (3.2). A misclassification error bound for the iterative regularization induced by the hinge loss can then be obtained as a corollary of Theorem 3.5, using the above facts. Below we provide a simplified result.
The proof of the above result is given in Section 5, whereas we comment on the obtained rates in the next section. We add one observation first. We note that, as illustrated by the next result, a regularization strategy different from early stopping can be considered, where the stopping rule is kept fixed while the step size is chosen in a distribution dependent way.
3.5 Comparison with Other Learning Algorithms
As mentioned before, iterative regularization has clear advantages from a computational point of view. The algorithm reduces to a simple first order method with typically low iteration cost and allows one to seamlessly compute the estimators corresponding to different regularization levels (the regularization path), a crucial fact since model selection needs to be performed. It is natural to compare the obtained statistical bounds with those for other learning algorithms. For general convex loss functions, the methods for which sharp bounds are available are penalized empirical risk minimization (Tikhonov regularization), i.e.
which reduces to
if no variance assumption is made () and in the capacity independent limit (). In contrast, from Theorem 3.5 for Lipschitz loss functions, we see that the bounds we obtain are of order with exponent
in the no variance and capacity independent limit. The obtained bounds are worse than the best ones available for Tikhonov regularization. However, the analysis of the latter does not take into account the optimization error and it is still an open question whether the best rate is preserved when such an error is incorporated. At this point we are prone to believe this gap to be a byproduct of our analysis rather than a fundamental fact, and addressing this point should be a subject of further work. Moreover, we note that our analysis allows one to derive error bounds for all Nemitski loss functions.
Beyond Tikhonov regularization, we can compare with the online regularization scheme for the hinge loss. The online learning algorithms with regularization sequence defined by
were studied in [43, 42]. Our results improve the results in [43, 42] in two aspects. The bound obtained in  is of the form while the bound in Theorem 3.11 is of type by substituting the expression for . Moreover, our results hold with high probability and promptly yield almost sure convergence, whereas the results in  are only in expectation. We note that, interestingly, sharp bounds for Lipschitz loss functions are derived in , although the obtained results do not take into account capacity and variance assumptions that could lead to large improvements.
We next compare with the previous results on iterative regularization. The only results available thus far have been obtained for the square loss, for which bounds have been first derived for gradient descent in , but only for a fixed design regression setting, and in  for a general statistical learning setting. While the bounds in  are suboptimal, they have later been improved in [4, 14, 27]. Interestingly, sharp error bounds have also been proved for iterative regularization induced by other, potentially faster, iterative techniques, including incremental gradient , conjugate gradient  and the so-called -method [4, 14], an accelerated gradient descent technique related to the Chebyshev method . The best obtained bounds are of order and can be shown to be optimal since they match a corresponding minimax lower bound . Holding not only for the square loss, but for general Nemitski loss functions, the bound obtained in Theorem 3.8 is of order , which is worse. In the capacity independent limit, the best available bound we obtain is of order , whereas the optimal bound is of order . Also in this case, the reason for the gap appears to be technical and should be further studied.
Finally, before giving the proof of our results in detail, in the next section, we discuss the general error decomposition underlying our approach, which highlights the interplay between statistics and optimization and could also be useful in other contexts.
3.6 Error Decomposition
Theorems 3.5 and 3.8 rely on a key error decomposition, that we derive next. The goal is to estimate the excess risk , and the starting point is to split the error by introducing a reference function ,
The above equation can be further developed by considering
Inspection of the above expression provides several insights. The first term is a computational error related to optimization. It quantifies the discrepancy between the empirical errors of the iterate defined by the subgradient method and that of the reference function. The last two terms are related to statistics. The second term is a sample error and can be studied using empirical process theory, provided that a bound on the norm of the iterates (and of the reference function) is available. Indeed, to get a sharper concentration the recentered quantity
Finally the last term suggests that a natural choice for the reference function is an almost minimizer
of the expected risk, having bounded norm, and for which the approximation level can be quantified. While there is a certain degree of freedom in the latter choice, in the following we will consider, the minimizer of (3.3). With this latter choice we can control
by given in Assumption 3.3. Indeed, other choices are possible, for example
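
The decomposition described in this section can be written out explicitly. The following is a sketch in notation assumed here for illustration, not taken from the text: $f_T$ denotes the iterate at the stopping time, $f_*$ the reference function, $f_\rho$ a minimizer of the expected risk $\mathcal{E}$, and $\mathcal{E}_{\mathbf z}$ the empirical risk. The second line follows from the first by adding and subtracting the empirical risks of $f_T$ and $f_*$.

```latex
\begin{align*}
\mathcal{E}(f_T) - \mathcal{E}(f_\rho)
 &= \bigl[\mathcal{E}(f_T) - \mathcal{E}(f_*)\bigr]
  + \bigl[\mathcal{E}(f_*) - \mathcal{E}(f_\rho)\bigr] \\
 &= \underbrace{\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_*)}_{\text{computational error}}
  + \underbrace{\bigl[\mathcal{E}(f_T) - \mathcal{E}_{\mathbf z}(f_T)\bigr]
  + \bigl[\mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}(f_*)\bigr]}_{\text{sample error}}
  + \underbrace{\mathcal{E}(f_*) - \mathcal{E}(f_\rho)}_{\text{approximation error}}.
\end{align*}
```

The first group involves only empirical quantities and is controlled by optimization arguments, the second compares expected and empirical risks and is controlled by concentration, and the third depends only on the choice of the reference function.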