The stochastic gradient method (SGM), often called stochastic gradient descent, has become an algorithm of choice in machine learning because of its simplicity and low computational cost, especially when dealing with big data sets.
Despite its widespread use, the generalization properties of the variants of SGM used in practice are relatively little understood. Most previous works consider generalization properties of SGM with only one pass over the data, see e.g.  or  and references therein, while in practice multiple passes are usually considered. The effect of multiple passes has been studied extensively for the optimization of an empirical objective , but its role for generalization is less clear. In practice, early stopping of the iterations, for example by monitoring the error on a hold-out set, is a strategy often used to regularize. Moreover, the step-size is typically tuned to obtain the best results. The study in this paper is a step towards grounding these commonly used heuristics theoretically.
Our starting points are a few recent works considering the generalization properties of different variants of SGM. A first series of results focuses on least squares, either with one pass [21, 20, 10] or with multiple (deterministic) passes over the data . In the former case it is shown that, in general, if only one pass over the data is considered, then the step-size needs to be tuned to ensure optimal results. In  it is shown that a universal step-size choice can be taken if multiple passes are considered. In this case, it is the stopping time that needs to be tuned.
In this paper, we are interested in general, possibly non-smooth, convex loss functions. The analysis for least squares heavily exploits properties of the loss and does not generalize to this broader setting. Here, our starting points are the results in [12, 11, 15] considering convex loss functions. In [12], early stopping of a (kernelized) batch subgradient method is analyzed, whereas in [11] the stability properties of SGM for smooth loss functions are considered in a general stochastic optimization setting and certain convergence results are derived. In [15], a more complex variant of SGM is analyzed and shown to achieve optimal rates.
Since we are interested in analyzing regularization and generalization properties of SGM, in this paper we consider a general non-parametric setting. In this setting, the effects of regularization are typically more evident since regularization can directly affect the convergence rates. In this context, the difficulty of a problem is characterized by an assumption on the approximation error. Under this condition, the need for regularization becomes clear. Indeed, in the absence of other constraints, the good performance of the algorithm relies on a bias-variance trade-off that can be controlled by suitably choosing the step-size and/or the number of passes. These latter parameters can be seen to act as regularization parameters. Here, we refer to the regularization as ‘implicit’, in the sense that it is achieved neither by penalization nor by adding explicit constraints. The two main variants of the algorithm are the same as in least squares: one pass over the data with a tuned step-size, or a fixed step-size choice with an appropriately tuned number of passes. While in principle optimal parameter tuning requires explicitly solving a bias-variance trade-off, in practice adaptive choices can be implemented by cross-validation. In this case, both algorithm variants achieve optimal results, but different computations are entailed. In the first case, multiple runs of single-pass SGM with different step-sizes need to be considered, whereas in the second case, early stopping is used. Experimental results, complementing the theoretical analysis, are given and provide further insights on the properties of the algorithms.
The rest of the paper is organized as follows. In Section 2, we describe the supervised learning setting and the algorithm, and in Section 3, we state and discuss our main results. The proofs are postponed to the supplementary material. In Section 4, we present some numerical experiments on real datasets.
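Notation. For notational simplicity, $[m]$ denotes $\{1, 2, \ldots, m\}$ for any $m \in \mathbb{N}$. The notation $a_k \lesssim b_k$ means that there exists a universal constant $C > 0$ such that $a_k \le C b_k$ for all $k \in \mathbb{N}$. Denote by $\lceil a \rceil$ the smallest integer greater than $a$ for any given $a \in \mathbb{R}$.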
2 Learning with SGM
In this section, we introduce the supervised learning problem and the SGM algorithm.
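Let $X$ be a probability space and $Y$ be a subset of $\mathbb{R}$. Let $\rho$ be a probability measure on $Z = X \times Y$. Given a measurable loss function $V : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, the associated expected risk $\mathcal{E}$ is defined as
$$\mathcal{E}(f) = \int_Z V(y, f(x)) \, d\rho(x, y).$$
The distribution $\rho$ is assumed to be fixed, but unknown, and the goal is to find a function minimizing the expected risk given a sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^m$ of size $m \in \mathbb{N}$ independently drawn according to $\rho$. Many classical examples of learning algorithms are based on empirical risk minimization, that is replacing the expected risk with the empirical risk $\mathcal{E}_{\mathbf{z}}$ defined as
$$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{j=1}^m V(y_j, f(x_j)).$$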
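In this paper, we consider spaces of functions which are linearly parameterized. Consider a possibly non-linear data representation/feature map $\Phi : X \to \mathcal{F}$, mapping the data space in $\mathbb{R}^p$, $p \le \infty$, or more generally in a (real separable) Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$. Then, for $w \in \mathcal{F}$ we consider functions of the form
$$f_w(x) = \langle w, \Phi(x) \rangle, \quad \forall x \in X. \tag{1}$$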
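Examples of the above setting include the case where we consider infinite dictionaries, $\phi_j : X \to \mathbb{R}$, $j = 1, 2, \ldots$, so that $\Phi(x) = (\phi_j(x))_{j=1}^{\infty}$ for all $x \in X$, and (1) corresponds to $f_w = \sum_{j=1}^{\infty} w^j \phi_j$. Also, this setting includes, and indeed is equivalent to considering, functions defined by a positive definite kernel $K : X \times X \to \mathbb{R}$, in which case $\Phi(x) = K(x, \cdot)$ for all $x \in X$, $\mathcal{F} = \mathcal{H}_K$ is the reproducing kernel Hilbert space associated with $K$, and (1) corresponds to the reproducing property
$$f_w(x) = \langle w, K(x, \cdot) \rangle, \quad \forall x \in X.$$
In the following, we assume the feature map to be measurable and define expected and empirical risks over functions of the form (1). For notational simplicity, we write $\mathcal{E}(f_w)$ as $\mathcal{E}(w)$, and $\mathcal{E}_{\mathbf{z}}(f_w)$ as $\mathcal{E}_{\mathbf{z}}(w)$.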
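To make the kernel case concrete, the following minimal sketch evaluates a function lying in the span of kernel sections using kernel evaluations only, which is what the reproducing property permits. The Gaussian kernel, the bandwidth `sigma`, and the names `gaussian_kernel` and `evaluate` are illustrative assumptions and are not part of the setting described above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def evaluate(coeffs, centers, X_new, sigma=1.0):
    """Evaluate f = sum_i coeffs[i] * K(centers[i], .) at the rows of X_new."""
    return gaussian_kernel(X_new, centers, sigma) @ coeffs

# Hypothetical toy data: four centers in the plane with arbitrary coefficients.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 2))
coeffs = np.array([0.5, -1.0, 0.25, 2.0])
gram = gaussian_kernel(centers, centers)
values = evaluate(coeffs, centers, centers)
```

The function is never represented through an explicit (possibly infinite dimensional) parameter vector; only the coefficients and the kernel are stored, which is the usual way a kernelized iteration is implemented.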
Stochastic Gradient Method.
For any fixed $y \in Y$, assume the univariate function $V(y, \cdot)$ on $\mathbb{R}$ to be convex, hence its left-hand derivative $V'_-(y, a)$ exists at every $a \in \mathbb{R}$ and is non-decreasing.
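Given a sample $\mathbf{z}$, the stochastic gradient method (SGM) is defined by $w_1 = 0$ and
$$w_{t+1} = w_t - \eta_t V'_-\big(y_{j_t}, \langle w_t, \Phi(x_{j_t}) \rangle\big) \Phi(x_{j_t}), \quad t = 1, \ldots, T,$$
for a non-increasing sequence of step-sizes $\{\eta_t > 0\}_{t \in \mathbb{N}}$ and a stopping rule $T \in \mathbb{N}$.
Here, $j_1, j_2, \ldots, j_T$ are independent and identically distributed (i.i.d.) random variables from the uniform distribution on $\{1, \ldots, m\}$, with $m$ the sample size (more precisely, $j_1, j_2, \ldots, j_T$ are conditionally independent given any $\mathbf{z}$). The (weighted) averaged iterates are defined by
$$\overline{w}_t = \sum_{k=1}^t \eta_k w_k / a_t, \quad a_t = \sum_{k=1}^t \eta_k, \quad t = 1, \ldots, T.$$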
Note that $T$ may be greater than $m$, indicating that we can use the sample more than once. We shall write $J(t)$ to mean $\{j_1, j_2, \ldots, j_t\}$, which will be also abbreviated as $J$ when there is no confusion.
The main purpose of the paper is to estimate the expected excess risk of the last iterate
$$\mathbb{E}_{\mathbf{z}, J}[\mathcal{E}(w_T)] - \inf_{w \in \mathcal{F}} \mathcal{E}(w),$$
or similarly the expected excess risk of the averaged iterate $\overline{w}_T$, and study how different parameter settings in (1) affect the estimates. Here, the expectation $\mathbb{E}_{\mathbf{z}, J}$ stands for taking the expectation with respect to $J$ (given any $\mathbf{z}$) first, and then the expectation with respect to $\mathbf{z}$.
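The iteration described in this section can be sketched in a few lines. The following is a minimal illustration, assuming a finite dimensional feature map (the rows of `X` stand in for the feature vectors), labels in {-1, +1}, and the hinge loss; the function names, the step-size-weighted form of the average, and the synthetic data are assumptions made for the example, not the implementation used in the experiments.

```python
import numpy as np

def hinge_left_derivative(y, a):
    """Left-hand derivative in a of the hinge loss max(0, 1 - y * a), y in {-1, +1}."""
    if y > 0:
        return -1.0 if y * a <= 1.0 else 0.0
    return 1.0 if y * a < 1.0 else 0.0

def sgm(X, y, step_size, T, seed=0):
    """Run T stochastic (sub)gradient steps starting from zero.

    Returns the iterate obtained after the T-th update and the
    step-size-weighted average of the iterates visited before each update.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)                    # the first iterate is zero
    weighted_sum = np.zeros(d)
    total_step = 0.0
    for t in range(1, T + 1):
        eta = step_size(t)
        weighted_sum += eta * w        # accumulate eta_t * w_t before updating
        total_step += eta
        j = rng.integers(m)            # index drawn uniformly from the sample
        w = w - eta * hinge_left_derivative(y[j], X[j] @ w) * X[j]
    return w, weighted_sum / total_step

# Hypothetical synthetic data: linearly separable labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.ones(5))
m = X.shape[0]
# T = 3 * m corresponds to three "passes" in the sense used in the text.
w_last, w_avg = sgm(X, y, step_size=lambda t: 1.0 / np.sqrt(m), T=3 * m)
train_accuracy = np.mean(np.sign(X @ w_last) == y)
```

Passing the step-size as a function of the iteration index makes it easy to switch between a constant and a decaying schedule, the two families of choices discussed in the rest of the paper.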
3 Implicit Regularization for SGM
In this section, we present and discuss our main results. We begin in Subsection 3.1 with a universal convergence result and then provide finite sample bounds for smooth loss functions in Subsection 3.2, and for non-smooth functions in Subsection 3.3. As corollaries of these results we derive different implicit regularization strategies for SGM.
We begin presenting a convergence result, involving conditions on both the step-sizes and the number of iterations. We need some basic assumptions.
Furthermore, the loss function is convex with respect to its second entry, and . Moreover, its left-hand derivative is bounded:
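The above conditions are common in statistical learning theory [19, 9]. For example, they are satisfied for the hinge loss $V(y, a) = |1 - ya|_+ = \max\{0, 1 - ya\}$ or the logistic loss $V(y, a) = \log(1 + e^{-ya})$ for all $a \in \mathbb{R}$, if $X$ is compact and $\Phi(\cdot)$ is continuous.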
The bounded derivative condition (5) is implied by the requirement on the loss function to be Lipschitz in its second entry, when is a bounded domain. Given these assumptions, the following result holds.
If Assumption 1 holds, then
provided the sequence and the stopping rule satisfy
As seen from the proof in the appendix, Conditions (A) and (B) arise from the analysis of suitable sample, computational, and approximation errors. Condition (B) is similar to the one required by stochastic gradient methods [3, 7, 6]. The difference is that here the limit is taken with respect to the number of points, but the number of passes on the data can be bigger than one.
Theorem 1 shows that in order to achieve consistency, the step-sizes and the running iterations need to be appropriately chosen. For instance, given sample points for SGM with one pass (we slightly abuse the term ‘one pass’ to mean as many iterations as there are sample points), i.e., , possible choices for the step-sizes are and for some . One can also fix the step-sizes a priori, and then run the algorithm with a suitable stopping rule .
These different parameter choices lead to different implicit regularization strategies as we discuss next.
3.2 Finite Sample Bounds for Smooth Loss Functions
In this subsection, we give explicit finite sample bounds for smooth loss functions, considering a suitable assumption on the approximation error.
The approximation error associated to the triplet is defined by
We assume that for some and , the approximation error satisfies
Intuitively, Condition (7) quantifies how hard it is to achieve the infimum of the expected risk. In particular, it is satisfied with $\beta = 1$ when the expected risk admits at least one minimizer in the hypothesis space. (The existence of at least one minimizer is met, for example, when the hypothesis space is compact or finite dimensional. In general, $\beta$ does not necessarily have to be 1, since the hypothesis space may be chosen as a general infinite dimensional space, for example in non-parametric regression.) More formally, the condition is related to classical terminologies in approximation theory, such as K-functionals and interpolation spaces [19, 9]. The following remark is important for later discussions.
Remark 1 (SGM and Implicit Regularization).
Assumption 2 is standard in statistical learning theory when analyzing Tikhonov regularization [9, 19]. Besides, it has been shown that Tikhonov regularization can achieve best performance by choosing an appropriate penalty parameter which depends on the unknown parameter [9, 19]. In other words, in Tikhonov regularization, the penalty parameter plays a role of regularization. In this view, our coming results show that SGM can implicitly implement a form of Tikhonov regularization by controlling the step-size and/or the number of passes.
A further assumption relates to the smoothness of the loss, and is satisfied for example by the logistic loss.
For all , is differentiable and is Lipschitz continuous with a constant , i.e.
The following result characterizes the excess risk of both the last and the average iterate for any fixed step-size and stopping time.
The proof of the above result follows more or less directly from combining ideas and results in [12, 11] and is postponed to the appendix. The constants in the bounds are omitted, but given explicitly in the proof. While the error bound for the weighted average looks more concise than the one for the last iterate, interestingly, both error bounds lead to similar generalization properties.
The error bounds are composed of three terms related to sample error, computational error, and approximation error. Balancing these three error terms to achieve the minimum total error bound leads to optimal choices for the step-sizes and total number of iterations In other words, both the step-sizes and the number of iterations can play the role of a regularization parameter. Using the above theorem, general results for step-size with some can be found in Proposition 3 from the appendix. Here, as corollaries we provide four different parameter choices to obtain the best bounds, corresponding to four different regularization strategies.
The first two corollaries correspond to fixing the step-sizes a priori and using the number of iterations as a regularization parameter. In the first result, the step-size is constant and depends on the number of sample points.
In the second result the step-sizes decay with the iterations.
In both of the above corollaries the step-sizes are fixed a priori, and the number of iterations becomes the regularization parameter controlling the total error. Ignoring the logarithmic factor, the dominating terms in the bounds (8), (10) are the sample and approximation errors, corresponding to the first and third terms of the right-hand side. Stopping too late may lead to a large sample error, while stopping too early may lead to a large approximation error. The ideal stopping time arises from a form of bias-variance trade-off and requires in general more than one pass over the data. Indeed, if we reformulate the results in terms of number of passes, we have that passes are needed for the constant step-size , while passes are needed for the decaying step-size . These observations suggest in particular that while both step-size choices achieve the same bounds, the constant step-size can have a computational advantage since it requires fewer iterations.
Note that one pass over the data suffices only in the limit case when , while in general it will be suboptimal, at least if the step-size is fixed. In fact, Theorem 2 suggests that optimal results could be recovered if the step-size is suitably tuned. The next corollaries show that this is indeed the case. The first result corresponds to a suitably tuned constant step-size.
The second result corresponds to tuning the decay rate for a decaying step-size.
The above two results confirm that good performances can be attained with only one pass over the data, provided the step-sizes are suitably chosen, that is using the step-size as a regularization parameter.
If we further assume that as often done in the literature, the convergence rates from Corollaries 1-4 are of order which are the same as those in, e.g., .
Finally, the following remark relates the above results to data-driven parameter tuning used in practice.
Remark 3 (Bias-Variance and Cross-Validation).
The above results show how the number of iterations/passes controls a bias-variance trade-off, and in this sense acts as a regularization parameter. In practice, the approximation properties of the algorithm are unknown and the question arises of how the parameter can be chosen. As it turns out, cross-validation can be used to achieve adaptively the best rates, in the sense that the rate in (9) is achieved by cross-validation or more precisely by hold-out cross-validation. These results follow by an argument similar to that in Chapter 6 from  and are omitted.
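The hold-out strategy mentioned in this remark can be sketched as follows. This is a minimal illustration under stated assumptions: a linear model on synthetic data, the hinge loss, a constant step-size, and validation error checked once per pass; the function names and the data are hypothetical and the sketch makes no claim about rates.

```python
import numpy as np

def hinge_loss(w, X, y):
    """Average hinge loss of the linear predictor w on (X, y)."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))

def sgm_early_stopping(X_tr, y_tr, X_val, y_val, eta, max_passes, seed=0):
    """Multiple-pass SGM with constant step-size; keep the iterate with the
    smallest hold-out error among those recorded at the end of each pass."""
    rng = np.random.default_rng(seed)
    m, d = X_tr.shape
    w = np.zeros(d)
    best_w, best_err, best_pass = w.copy(), hinge_loss(w, X_val, y_val), 0
    val_errors = []
    for p in range(1, max_passes + 1):
        for _ in range(m):                      # one "pass" means m iterations
            j = rng.integers(m)
            if y_tr[j] * (X_tr[j] @ w) < 1.0:   # a hinge subgradient is -y * x here
                w = w + eta * y_tr[j] * X_tr[j]
        err = hinge_loss(w, X_val, y_val)
        val_errors.append(err)
        if err < best_err:
            best_w, best_err, best_pass = w.copy(), err, p
    return best_w, best_pass, val_errors

# Hypothetical noisy classification data with an 80% / 20% split.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 10))
y = np.sign(X @ rng.normal(size=10) + 0.5 * rng.normal(size=250))
split = int(0.8 * len(y))
w_hat, n_passes, errs = sgm_early_stopping(
    X[:split], y[:split], X[split:], y[split:],
    eta=1.0 / np.sqrt(split), max_passes=20,
)
```

Here the number of passes is the only tuned quantity: the step-size is fixed in advance and the stopping time is selected from a single run, which is what distinguishes this strategy from tuning the step-size over several single-pass runs.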
3.3 Finite Sample Bounds for Non-smooth Loss Functions
Theorem 2 holds for smooth loss functions and it is natural to ask if a similar result holds for non-smooth losses such as the hinge loss. Indeed, analogous results hold, albeit current bounds are not as sharp.
Using the above result with concrete step-sizes as those for smooth loss functions, we have the following explicit error bounds and corresponding stopping rules.
From the above two corollaries, we see that the algorithm with constant step-size can stop earlier than the one with decaying step-size when while they have the same convergence rate, since Note that the bound in (11) is slightly worse than that in (9), see Section 3.4 for more discussion.
Similar to the smooth case, we also have the following results for SGM with one pass where regularization is realized by step-size.
3.4 Discussion and Proof Sketch
As mentioned in the introduction, the literature on theoretical properties of the iteration in Algorithm 1 is vast, both in learning theory and in optimization. A first line of works focuses on a single pass and convergence of the expected risk. Approaches in this sense include classical results in optimization (see  and references therein), but also approaches based on so-called “online to batch” conversion (see  and references therein). The latter are based on analyzing a sequential prediction setting and then on considering the averaged iterate to turn regret bounds in expected risk bounds. A second line of works focuses on multiple passes, but measures the quality of the corresponding iteration in terms of the minimization of the empirical risk. In this view, Algorithm 1 is seen as an instance of incremental methods for the minimization of objective functions that are sums of a finite, but possibly large, number of terms . These latter works, while interesting in their own right, do not yield any direct information on the generalization properties of considering multiple passes.
Here, we follow the approach in  advocating the combination of statistical and computational errors. The general proof strategy is to consider several intermediate steps to relate the expected risk of the empirical iteration to the minimal expected risk. The argument we sketch below is a simplified and less sharp version with respect to the one used in the actual proof, but it is easier to illustrate and still carries some important aspects which are useful for comparison with related results.
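Consider an intermediate element $\tilde{w} \in \mathcal{F}$ and decompose the excess risk as
$$\mathbb{E}\,\mathcal{E}(w_t) - \inf_{w \in \mathcal{F}} \mathcal{E}(w) = \mathbb{E}[\mathcal{E}(w_t) - \mathcal{E}_{\mathbf{z}}(w_t)] + \mathbb{E}[\mathcal{E}_{\mathbf{z}}(w_t) - \mathcal{E}_{\mathbf{z}}(\tilde{w})] + \mathbb{E}\,\mathcal{E}_{\mathbf{z}}(\tilde{w}) - \inf_{w \in \mathcal{F}} \mathcal{E}(w).$$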
The first term on the right-hand side is the generalization error of the iterate. The second term can be seen as a computational error. To discuss the last term, it is useful to consider a few different choices for . Assuming the empirical and expected risks to have minimizers and , a possibility is to set , this can be seen to be the choice made in . In this case, it is immediate to see that the last term is negligible since,
On the other hand, in this case the computational error depends on the norm which is in general hard to estimate. A more convenient choice is to set . A reasoning similar to the one above shows that the last term is still negligible and the computational error can still be controlled depending on . In a non-parametric setting, the existence of a minimizer is not ensured and corresponds to a limit case where there is small approximation error. Our approach is then to consider an almost minimizer of the expected risk with a prescribed accuracy. Following , we do this introducing Assumption (6) and choosing as the unique minimizer of , . Then the last term in the error decomposition can be upper bounded by the approximation error.
For the generalization error, the stability results from  provide sharp estimates for smooth loss functions and in the ‘capacity independent’ limit, that is under no assumptions on the covering numbers of the considered function space. For this setting, the obtained bound is optimal in the sense that it matches the best available bound for Tikhonov regularization [19, 9]. For the non-smooth case a standard argument based on Rademacher complexity can be used, and easily extended to be capacity dependent. However, the corresponding bound is not sharp and improvements are likely to hinge on deriving better norm estimates for the iterates. The question does not seem to be straightforward and is deferred to a future work.
The computational error for the averaged iterates can be controlled using classic arguments , whereas for the last iterate the arguments in [12, 18] are needed. Finally, Theorems 2, 3 result from estimating and balancing the various error terms with respect to the choice of the step-size and number of passes.
We conclude this section with some perspective on the results in the paper. We note that since the primary goal of this study was to analyze the implicit regularization effect of step-size and number of passes, we have considered a very simple iteration. However, it would be very interesting to consider more sophisticated, ‘accelerated’ iterations , and assess the potential advantages in terms of computational and generalization aspects. Similarly, we chose to keep the analysis in the paper relatively simple, but several improvements can be considered for example deriving high probability bounds and sharper error bounds under further assumptions. Some of these improvements are relatively straightforward, see e.g. , but others will require non-trivial extensions of results developed for Tikhonov regularization in the last few years. Finally, here we only referred to a simple cross-validation approach to parameter tuning, but it would clearly be very interesting to find ways to tune parameters online. A remarkable result in this direction is derived in , where it is shown that, in the capacity independent setting, adaptive online parameter tuning is indeed possible.
4 Numerical Simulations
We carry out some numerical simulations to illustrate our results (code: lcsl.github.io/MultiplePassesSGM). The experiments are executed 10 times each, on the benchmark datasets (from archive.ics.uci.edu/ml and www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) reported in Table 1, in which the Gaussian kernel bandwidths used by SGM and SIGM for each learning problem are also shown. In what follows, we name one pass SGM and multiple passes SGM as SGM and SIGM, respectively. Here, the loss function is the hinge loss. Experiments with the logistic loss have also been carried out, showing similar empirical results to those considering the hinge loss; the details are not included in this text due to space limits. The experimental platform is a server with 12 Intel Xeon E5-2620 v2 (2.10GHz) CPUs and 132 GB of RAM. Some of the experimental results, as specified in the following, have been obtained by running the experiments on subsets of the data samples chosen uniformly at random. In order to apply hold-out cross-validation, the training set is split in two parts: one for empirical risk minimization and the other for validation error computation (80% - 20%, respectively). All the samples are randomly shuffled at each repetition.
4.1 Regularization in SGM and SIGM
In this subsection, we illustrate four concrete examples showing different regularization effects of the step-size in SGM and the number of passes in SIGM. In all these four examples, we consider the Adult dataset with sample size .
In the first experiment, the SIGM step-size is fixed as . The test error computed with respect to the hinge loss at each pass is reported in Figure 1. Note that the minimum test error is reached for a number of passes smaller than 20, after which it significantly increases, a so-called overfitting regime. This result clearly illustrates the regularization effect of the number of passes. In the second experiment, we consider SIGM with decaying step-size ( and ). As shown in Figure 1, overfitting is not observed in the first 100 passes. In this case, the convergence to the optimal solution appears slower than that in the fixed step-size case.
In the last two experiments, we consider SGM and show that the step-size plays the role of a regularization parameter. For the fixed step-size case, i.e., , we perform SGM with different (logarithmically scaled). We plot the errors in Figure 2, showing that a large step-size () leads to overfitting, while a smaller one (e.g., ) is associated to oversmoothing. For the decaying step-size case, we fix , and run SGM with different . The errors are plotted in Figure 2, from which we see that the exponent has a regularization effect. In fact, a more ‘aggressive’ choice (e.g., , corresponding to a fixed step-size) leads to overfitting, while for a larger (e.g., ) we observe oversmoothing.
4.2 Accuracy and Computational Time Comparison
In this subsection, we compare SGM with cross-validation and SIGM with the benchmark algorithm LIBSVM [8], both in terms of accuracy and computational time. For SGM, with 30 parameter guesses, we use cross-validation to tune the step-size (either setting while tuning , or setting while tuning ). For SIGM, we use two kinds of step-size suggested by Section 3: and , or and using early stopping via cross-validation. The test errors with respect to the hinge loss, the test relative misclassification errors and the computational times are collected in Table 2.
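The step-size tuning protocol for one pass SGM can be sketched as follows: run one pass for each of 30 candidate constant step-sizes and keep the candidate with the smallest hold-out hinge loss. The grid range, the linear model, the synthetic data, and the function names are assumptions made for illustration; they are not the settings behind Table 2.

```python
import numpy as np

def hinge_loss(w, X, y):
    """Average hinge loss of the linear predictor w on (X, y)."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))

def one_pass_sgm(X, y, eta, seed=0):
    """One pass (as many iterations as sample points) with constant step-size eta."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(m):
        j = rng.integers(m)
        if y[j] * (X[j] @ w) < 1.0:     # a hinge subgradient is -y * x here
            w = w + eta * y[j] * X[j]
    return w

def tune_step_size(X_tr, y_tr, X_val, y_val, grid):
    """Return the step-size in grid with the smallest hold-out hinge loss."""
    errors = [hinge_loss(one_pass_sgm(X_tr, y_tr, eta), X_val, y_val) for eta in grid]
    return grid[int(np.argmin(errors))], errors

# Hypothetical noisy classification data with an 80% / 20% split.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 10))
y = np.sign(X @ rng.normal(size=10) + 0.5 * rng.normal(size=250))
split = int(0.8 * len(y))
grid = np.logspace(-4, 1, 30)           # 30 logarithmically spaced guesses (assumed range)
best_eta, errors = tune_step_size(X[:split], y[:split], X[split:], y[split:], grid)
```

Compared with early stopping, this strategy requires one full run per candidate, which is the computational difference between the two variants noted in the introduction.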
We start by comparing accuracies. The results in Table 2 indicate that SGM with constant and decaying step-sizes and SIGM with fixed step-size reach comparable test errors, which are in line with the LIBSVM baseline. Observe that SIGM with decaying step-size attains consistently higher test errors, a phenomenon already illustrated in Section 4.1.
We now compare the computational times for cross-validation. We see from Table 2 that the training times of SIGM and SGM, either with constant or decaying step-sizes, are roughly the same. We also observe that SGM and SIGM are faster than LIBSVM on relatively large datasets (Adult with n = 32562, and Ijcnn1 with n = 49990). Moreover, for small datasets (BreastCancer with n = 400, Adult with n = 1000, and Ijcnn1 with n = 1000), SGM and SIGM are comparable with or slightly slower than LIBSVM.
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. L. R. acknowledges the financial support of the Italian Ministry of Education, University and Research FIRB project RBFR12M3AC. The authors would like to thank Dr. Francesco Orabona for the fruitful discussions on this research topic, and Dr. Silvia Villa and the referees for their valuable comments.
[1] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
[2] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
[3] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[4] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.
[5] Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
[6] Stephen Boyd and Almir Mutapcic. Stochastic subgradient methods. Notes for EE364b, Stanford University, Winter 2007.
[7] Stephen Boyd, Lin Xiao, and Almir Mutapcic. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003.
[8] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[9] Felipe Cucker and Ding-Xuan Zhou. Learning Theory: an Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.
[10] Aymeric Dieuleveut and Francis Bach. Non-parametric stochastic approximation with large step sizes. arXiv preprint arXiv:1408.0361, 2014.
[11] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2016.
[12] Junhong Lin, Lorenzo Rosasco, and Ding-Xuan Zhou. Iterative regularization for learning with convex loss functions. The Journal of Machine Learning Research, To appear, 2016.
[13] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. The Journal of Machine Learning Research, 4:839–860, 2003.
[14] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[15] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
[16] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1621–1629, 2015.
[17] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
[18] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013.
[19] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[20] Pierre Tarres and Yuan Yao. Online learning as stochastic approximation of regularization paths: Optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716–5735, 2014.
[21] Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.
Appendix A Basic Lemmas
The following basic lemma is used several times in our proofs. Its proof follows from the convexity of and the fact that is bounded.
Under Assumption 1, for any and , we have
Taking the expectation of (12) with respect to the random variable , and noting that is independent from given , one can get the following result.
Under Assumption 1, for any fixed given any , assume that is independent of the random variable . Then we have
Appendix B Sample Errors
Note that our goal is to bound the excess generalization error, whereas the left-hand side of (13) is related to an empirical error. The difference between the generalization and empirical errors is a so-called sample error. To estimate this sample error, we introduce the following lemma, which gives a uniform upper bound for sample errors over a ball . Its proof is based on a standard symmetrization technique and Rademacher complexity, e.g. [1, 13]. For completeness, we provide a proof here.
Let be another training sample from , and assume that it is independent from We have
Let be independent random variables drawn from the Rademacher distribution, i.e. for . Using a standard symmetrization technique, for example in , we get
Using the Cauchy-Schwarz inequality, we reach
By Jensen’s inequality, we get
The desired result thus follows by introducing (4) to the above. Note that the above procedure also applies if we replace with . The proof is complete. ∎
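The Cauchy-Schwarz and Jensen steps of this argument can be checked numerically for the class of linear functions over a ball. The sketch below is only an illustration under stated assumptions: it ignores the loss function, uses a finite dimensional feature matrix `X`, a hypothetical radius `R`, and lets `kappa` denote the largest feature norm in the sample as a stand-in for the bound in (4).

```python
import numpy as np

def empirical_rademacher_linear(X, R, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w|| <= R} (1/m) sum_i sigma_i <w, x_i>."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher signs
    # By Cauchy-Schwarz, the supremum over the ball equals (R / m) * ||sum_i sigma_i x_i||.
    norms = np.linalg.norm(sigma @ X, axis=1)
    return R * norms.mean() / m

def jensen_bound(X, R):
    """Upper bound (R / m) * sqrt(sum_i ||x_i||^2), obtained via Jensen's inequality."""
    m = X.shape[0]
    return R * np.sqrt(np.sum(X**2)) / m

# Hypothetical sample of 100 feature vectors in dimension 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
estimate = empirical_rademacher_linear(X, R=2.0)
bound = jensen_bound(X, R=2.0)
kappa = np.max(np.linalg.norm(X, axis=1))
```

Since every feature norm is at most `kappa`, the Jensen bound is itself at most `R * kappa / sqrt(m)`, which is the familiar one-over-square-root-of-sample-size scaling of sample errors over a ball.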
The following lemma gives upper bounds on the iterated sequence.
Under Assumption 1, for any , we have
Using Lemma 1 with we have
Noting that and we thus get
Applying this inequality iteratively for and introducing with , one can get that
which leads to the desired result by taking square root on both sides. ∎
According to the above two lemmas, we can bound the sample errors as follows.
When the loss function is smooth, by Theorems 2.2 and 3.9 from , we can control the sample errors as follows.
Appendix C Excess Errors for Weighted Averages
Under Assumption 1, assume that there exists a non-decreasing sequence such that