1 Introduction
The stochastic gradient method (SGM), often called stochastic gradient descent, has become an algorithm of choice in machine learning, because of its simplicity and small computational cost, especially when dealing with big data sets [5]. Despite its widespread use, the generalization properties of the variants of SGM used in practice are relatively little understood. Most previous works consider generalization properties of SGM with only one pass over the data, see e.g. [14] or [15] and references therein, while in practice multiple passes are usually considered. The effect of multiple passes has been studied extensively for the optimization of an empirical objective [6], but its role for generalization is less clear. In practice, early stopping of the number of iterations, for example by monitoring a hold-out set error, is a strategy often used to regularize. Moreover, the stepsize is typically tuned to obtain the best results. The study in this paper is a step towards theoretically grounding these commonly used heuristics.
Our starting points are a few recent works considering the generalization properties of different variants of SGM. A first series of results focuses on least squares, either with one [21, 20, 10], or multiple (deterministic) passes over the data [16]. In the former case it is shown that, in general, if only one pass over the data is considered, then the stepsize needs to be tuned to ensure optimal results. In [16] it is shown that a universal stepsize choice can be taken if multiple passes are considered; in this case, it is the stopping time that needs to be tuned.
In this paper, we are interested in general, possibly nonsmooth, convex loss functions. The analysis for least squares heavily exploits properties of the loss and does not generalize to this broader setting. Here, our starting points are the results in [12, 11, 15] considering convex loss functions. In [12], early stopping of a (kernelized) batch subgradient method is analyzed, whereas in [11] the stability properties of SGM for smooth loss functions are considered in a general stochastic optimization setting and certain convergence results are derived. In [15], a more complex variant of SGM is analyzed and shown to achieve optimal rates.
Since we are interested in analyzing regularization and generalization properties of SGM, in this paper we consider a general nonparametric setting. In this latter setting, the effects of regularization are typically more evident, since regularization can directly affect the convergence rates. In this context, the difficulty of a problem is characterized by an assumption on the approximation error. Under this condition, the need for regularization becomes clear. Indeed, in the absence of other constraints, the good performance of the algorithm relies on a bias-variance trade-off that can be controlled by suitably choosing the stepsize and/or the number of passes. These latter parameters can be seen to act as regularization parameters. Here, we refer to the regularization as 'implicit', in the sense that it is achieved neither by penalization nor by adding explicit constraints. The two main variants of the algorithm are the same as in least squares: one pass over the data with tuned stepsize, or a fixed stepsize choice with the number of passes appropriately tuned. While in principle optimal parameter tuning requires explicitly solving a bias-variance trade-off, in practice adaptive choices can be implemented by cross-validation. In this case, both algorithm variants achieve optimal results, but different computations are entailed: in the first case, multiple single-pass SGM runs need to be considered with different stepsizes, whereas in the second case, early stopping is used. Experimental results, complementing the theoretical analysis, are given and provide further insights on the properties of the algorithms.
The rest of the paper is organized as follows. In Section 2, we describe the supervised learning setting and the algorithm, and in Section 3, we state and discuss our main results. The proofs are postponed to the supplementary material. In Section 4, we present some numerical experiments on real datasets.
Notation. For notational simplicity, denotes for any . The notation means that there exists a universal constant such that for all . Denote by the smallest integer greater than for any given .
2 Learning with SGM
In this section, we introduce the supervised learning problem and the SGM algorithm.
Learning Setting.
Let
be a probability space and
be a subset of . Let be a probability measure on . Given a measurable loss function , the associated expected risk is defined as
The distribution is assumed to be fixed, but unknown, and the goal is to find a function minimizing the expected risk, given a sample of size independently drawn according to . Many classical examples of learning algorithms are based on empirical risk minimization, that is, replacing the expected risk with the empirical risk defined as
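In standard notation (a hedged reconstruction, since the displayed formulas above are given only symbolically: we write $\ell$ for the loss, $\rho$ for the data distribution, $f$ for a candidate function, and $\mathbf z=\{(x_i,y_i)\}_{i=1}^n$ for the sample), the two risks would read:

```latex
% Expected risk of a measurable function f
\mathcal{E}(f) \;=\; \int \ell\big(y, f(x)\big)\, d\rho(x, y),
% and its empirical counterpart on the sample z
\widehat{\mathcal{E}}(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big).
```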
In this paper, we consider spaces of functions which are linearly parameterized. Consider a possibly nonlinear data representation/feature map , mapping the data space into , , or more generally into a (real separable) Hilbert space with inner product and norm . Then, for we consider functions of the form
(1) 
Examples of the above setting include the case where we consider infinite dictionaries, , , so that , for all and (1) corresponds to . Also, this setting includes, and indeed is equivalent to considering, functions defined by a positive definite kernel , in which case , for all , the reproducing kernel Hilbert space associated with , and (1) corresponds to the reproducing property
(2) 
In the following, we assume the feature map to be measurable and define expected and empirical risks over functions of the form (1). For notational simplicity, we write as , and as .
Stochastic Gradient Method.
For any fixed , assume the univariate function on to be convex; hence its left-hand derivative exists at every point and is non-decreasing.
Algorithm 1.
Given a sample , the stochastic gradient method (SGM) is defined by and
(3) 
for a non-increasing sequence of stepsizes and a stopping rule . Here, are independent and identically distributed (i.i.d.) random variables drawn from the uniform distribution on (more precisely, they are conditionally independent given any ). The (weighted) averaged iterates are defined by
Note that may be greater than , indicating that we can use the sample more than once. We shall write to mean , which will also be abbreviated as when there is no confusion.
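As a concrete illustration, the iteration (3) can be sketched in a few lines of NumPy for the special case of a linear feature map and the hinge loss (a hedged sketch: the function names, the plain rather than weighted averaging, and the choice of loss are ours, not the paper's):

```python
import numpy as np

def sgm(X, y, subgrad, eta, T, seed=0):
    """T iterations of SGM: at each step t, draw an index j_t uniformly
    at random with replacement (so T > n means multiple passes over the
    data) and take a subgradient step on the loss at the sampled point."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                    # w_1 = 0
    w_sum = np.zeros(d)                # accumulator for the averaged iterate
    for t in range(1, T + 1):
        j = rng.integers(n)            # j_t ~ Uniform{0, ..., n-1}
        g = subgrad(X[j] @ w, y[j]) * X[j]   # subgradient of the sampled loss
        w = w - eta(t) * g             # update (3) with stepsize eta_t
        w_sum += w
    return w, w_sum / T                # last iterate and plain average

def hinge_subgrad(a, y):
    """Left-hand derivative in a of the hinge loss max(0, 1 - y*a)."""
    return -y if y * a < 1.0 else 0.0
```

For instance, `sgm(X, y, hinge_subgrad, lambda t: 0.1, 200)` runs 200 iterations with the constant stepsize 0.1; the paper's weighted averages would instead reweight `w_sum` by the stepsizes.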
The main purpose of the paper is to estimate the expected excess risk of the last iterate , or similarly the expected excess risk of the averaged iterate , and to study how different parameter settings in (3) affect these estimates. Here, the expectation is taken first with respect to (given any ), and then with respect to .
3 Implicit Regularization for SGM
In this section, we present and discuss our main results. We begin in Subsection 3.1 with a universal convergence result and then provide finite sample bounds for smooth loss functions in Subsection 3.2, and for nonsmooth functions in Subsection 3.3. As corollaries of these results we derive different implicit regularization strategies for SGM.
3.1 Convergence
We begin by presenting a convergence result, involving conditions on both the stepsizes and the number of iterations. We need some basic assumptions.
Assumption 1.
There holds
(4) 
Furthermore, the loss function is convex with respect to its second entry, and . Moreover, its lefthand derivative is bounded:
(5) 
The above conditions are common in statistical learning theory [19, 9]. For example, they are satisfied by the hinge loss or the logistic loss for all , if is compact and is continuous. The bounded derivative condition (5) is implied by the requirement that the loss function be Lipschitz in its second entry, when is a bounded domain. Given these assumptions, the following result holds.
Theorem 1.
As seen from the proof in the appendix, Conditions (A) and (B) arise from the analysis of suitable sample, computational, and approximation errors. Condition (B) is similar to the one required by stochastic gradient methods [3, 7, 6]. The difference is that here the limit is taken with respect to the number of points, while the number of passes over the data can be bigger than one.
Theorem 1 shows that in order to achieve consistency, the stepsizes and the number of iterations need to be appropriately chosen. For instance, given sample points, for SGM with one pass (we slightly abuse the term 'one pass' to mean iterations), i.e., , possible choices for the stepsizes are and for some . One can also fix the stepsizes a priori, and then run the algorithm with a suitable stopping rule .
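The two families of stepsize schedules discussed here can be written as small helpers; note that the concrete forms below (a constant stepsize scaled by the sample size, and a polynomially decaying one) are our assumption for illustration, since the exact exponents appear only symbolically in the text:

```python
import math

def eta_constant(n, c=1.0):
    """Constant schedule: eta_t = c / sqrt(n), fixed a priori from the
    number of sample points n (hypothetical concrete form)."""
    return lambda t: c / math.sqrt(n)

def eta_decaying(c=1.0, theta=0.5):
    """Decaying schedule: eta_t = c / t**theta, with decay exponent theta
    (theta = 0 recovers a fixed stepsize)."""
    return lambda t: c / (t ** theta)
```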
These different parameter choices lead to different implicit regularization strategies as we discuss next.
3.2 Finite Sample Bounds for Smooth Loss Functions
In this subsection, we give explicit finite sample bounds for smooth loss functions, considering a suitable assumption on the approximation error.
Assumption 2.
The approximation error associated with the triplet is defined by
(6) 
We assume that for some and , the approximation error satisfies
(7) 
Intuitively, Condition (7) quantifies how hard it is to achieve the infimum of the expected risk. In particular, it is satisfied with when there exists such that . (The existence of at least one minimizer in is met, for example, when is compact or finite dimensional; in general, does not necessarily have to be 1, since the hypothesis space may be chosen as a general infinite dimensional space, for example in nonparametric regression.) More formally, the condition is related to classical notions in approximation theory, such as K-functionals and interpolation spaces [19, 9]. The following remark is important for later discussions.
Remark 1 (SGM and Implicit Regularization).
Assumption 2 is standard in statistical learning theory when analyzing Tikhonov regularization [9, 19]. Moreover, it has been shown that Tikhonov regularization can achieve the best performance by choosing an appropriate penalty parameter, which depends on the unknown parameter [9, 19]. In other words, in Tikhonov regularization the penalty parameter plays the role of a regularization parameter. In this view, our results below show that SGM can implicitly implement a form of Tikhonov regularization by controlling the stepsize and/or the number of passes.
A further assumption relates to the smoothness of the loss, and is satisfied for example by the logistic loss.
Assumption 3.
For all , is differentiable and is Lipschitz continuous with a constant , i.e.
The following result characterizes the excess risk of both the last and the average iterate for any fixed stepsize and stopping time.
The proof of the above result follows more or less directly from combining ideas and results in [12, 11] and is postponed to the appendix. The constants in the bounds are omitted, but given explicitly in the proof. While the error bound for the weighted average looks more concise than the one for the last iterate, interestingly, both error bounds lead to similar generalization properties.
The error bounds are composed of three terms, related to the sample error, the computational error, and the approximation error. Balancing these three terms to achieve the minimal total error bound leads to optimal choices for the stepsizes and the total number of iterations. In other words, both the stepsizes and the number of iterations can play the role of a regularization parameter. Using the above theorem, general results for stepsizes with some can be found in Proposition 3 in the appendix. Here, as corollaries, we provide four different parameter choices that attain the best bounds, corresponding to four different regularization strategies.
The first two corollaries correspond to fixing the stepsizes a priori and using the number of iterations as a regularization parameter. In the first result, the stepsize is constant and depends on the number of sample points.
Corollary 1.
In the second result the stepsizes decay with the iterations.
Corollary 2.
In both the above corollaries the stepsizes are fixed a priori, and the number of iterations becomes the regularization parameter controlling the total error. Ignoring the logarithmic factor, the dominating terms in the bounds (8), (10) are the sample and approximation errors, corresponding to the first and third terms of the right-hand side. Stopping too late may lead to a large sample error, while stopping too early may lead to a large approximation error. The ideal stopping time arises from a form of bias-variance trade-off and in general requires more than one pass over the data. Indeed, if we reformulate the results in terms of the number of passes, we have that passes are needed for the constant stepsize , while passes are needed for the decaying stepsize . These observations suggest in particular that, while both stepsize choices achieve the same bounds, the constant stepsize can have a computational advantage, since it requires fewer iterations.
Note that one pass over the data suffices only in the limit case when , while in general it will be suboptimal, at least if the stepsize is fixed. In fact, Theorem 2 suggests that optimal results could be recovered if the stepsize is suitably tuned. The next corollaries show that this is indeed the case. The first result corresponds to a suitably tuned constant stepsize.
Corollary 3.
The second result corresponds to tuning the decay rate for a decaying stepsize.
Corollary 4.
The above two results confirm that good performance can be attained with only one pass over the data, provided the stepsizes are suitably chosen, that is, using the stepsize as a regularization parameter.
Remark 2.
If we further assume that , as is often done in the literature, the convergence rates from Corollaries 1–4 are of order , the same as those in, e.g., [18].
Finally, the following remark relates the above results to datadriven parameter tuning used in practice.
Remark 3 (BiasVariance and CrossValidation).
The above results show how the number of iterations/passes controls a bias-variance trade-off, and in this sense acts as a regularization parameter. In practice, the approximation properties of the algorithm are unknown, and the question arises of how this parameter can be chosen. As it turns out, cross-validation can be used to adaptively achieve the best rates, in the sense that the rate in (9) is achieved by cross-validation, or more precisely by hold-out cross-validation. These results follow by an argument similar to that in Chapter 6 of [19] and are omitted.
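The hold-out strategy of Remark 3 can be sketched as follows, again for the hinge loss and a linear feature map (a hypothetical helper: the function names and the `patience` heuristic for detecting a validation-error plateau are our additions, not part of the paper's analysis):

```python
import numpy as np

def hinge_err(w, X, y):
    """Average hinge loss of the linear predictor w on (X, y)."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def sgm_early_stop(X_tr, y_tr, X_val, y_val, eta=0.1, max_passes=50,
                   patience=3, seed=0):
    """Multi-pass SGM with hold-out early stopping: run passes over the
    training data with a fixed stepsize and stop once the validation
    error has not improved for `patience` consecutive passes."""
    rng = np.random.default_rng(seed)
    n, d = X_tr.shape
    w = np.zeros(d)
    best_w, best_err, stall = w.copy(), np.inf, 0
    for _ in range(max_passes):               # one pass = n iterations
        for i in rng.integers(n, size=n):
            if y_tr[i] * (X_tr[i] @ w) < 1.0:  # hinge subgradient step
                w += eta * y_tr[i] * X_tr[i]
        err = hinge_err(w, X_val, y_val)
        if err < best_err - 1e-12:
            best_w, best_err, stall = w.copy(), err, 0
        else:
            stall += 1
            if stall >= patience:             # validation error plateaued
                break
    return best_w, best_err
```

The stopping pass found this way plays the role of the tuned stopping time in Corollaries 1 and 2.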
3.3 Finite Sample Bounds for Nonsmooth Loss Functions
Theorem 2 holds for smooth loss functions and it is natural to ask if a similar result holds for nonsmooth losses such as the hinge loss. Indeed, analogous results hold, albeit current bounds are not as sharp.
The proof of the above theorem is based on ideas from [12], where tools from Rademacher complexity [2, 13] are employed. We postpone the proof to the appendix.
Using the above result with concrete stepsizes as those for smooth loss functions, we have the following explicit error bounds and corresponding stopping rules.
Corollary 5.
Corollary 6.
From the above two corollaries, we see that the algorithm with constant stepsize can stop earlier than the one with decaying stepsize when , while they have the same convergence rate, since . Note that the bound in (11) is slightly worse than that in (9); see Section 3.4 for more discussion.
Similarly to the smooth case, we also have the following results for SGM with one pass, where regularization is realized by the stepsize.
Corollary 7.
3.4 Discussion and Proof Sketch
As mentioned in the introduction, the literature on theoretical properties of the iteration in Algorithm 1 is vast, both in learning theory and in optimization. A first line of works focuses on a single pass and convergence of the expected risk. Approaches in this sense include classical results in optimization (see [14] and references therein), but also approaches based on so-called "online to batch" conversion (see [15] and references therein). The latter are based on analyzing a sequential prediction setting and then considering the averaged iterate to turn regret bounds into expected risk bounds. A second line of works focuses on multiple passes, but measures the quality of the corresponding iteration in terms of the minimization of the empirical risk. In this view, Algorithm 1 is seen as an instance of incremental methods for the minimization of objective functions that are sums of a finite, but possibly large, number of terms [4]. These latter works, while interesting in their own right, do not yield any direct information on the generalization properties of considering multiple passes.
Here, we follow the approach in [5] advocating the combination of statistical and computational errors. The general proof strategy is to consider several intermediate steps to relate the expected risk of the empirical iteration to the minimal expected risk. The argument we sketch below is a simplified and less sharp version with respect to the one used in the actual proof, but it is easier to illustrate and still carries some important aspects which are useful for comparison with related results.
Consider an intermediate element and decompose the excess risk as
The first term on the right-hand side is the generalization error of the iterate. The second term can be seen as a computational error. To discuss the last term, it is useful to consider a few different choices for . Assuming the empirical and expected risks to have minimizers and , a possibility is to set ; this can be seen to be the choice made in [11]. In this case, it is immediate to see that the last term is negligible since,
and hence,
On the other hand, in this case the computational error depends on the norm , which is in general hard to estimate. A more convenient choice is to set . A reasoning similar to the one above shows that the last term is still negligible and the computational error can still be controlled, depending on . In a nonparametric setting, the existence of a minimizer is not ensured and corresponds to a limit case where the approximation error is small. Our approach is then to consider an almost-minimizer of the expected risk with a prescribed accuracy. Following [12], we do this by introducing the approximation error (6) and choosing as the unique minimizer of , . Then the last term in the error decomposition can be upper bounded by the approximation error.
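In generic notation (our own symbols, not necessarily the paper's: $\mathcal E$ for the expected risk, $\mathcal E_{\mathbf z}$ for the empirical risk, $w_t$ for the iterate, and $\tilde w$ for the intermediate element), the decomposition sketched above reads:

```latex
\mathcal{E}(w_t) - \inf_{w \in H} \mathcal{E}(w)
  = \underbrace{\big[\mathcal{E}(w_t) - \mathcal{E}_{\mathbf z}(w_t)\big]}_{\text{generalization error}}
  + \underbrace{\big[\mathcal{E}_{\mathbf z}(w_t) - \mathcal{E}_{\mathbf z}(\tilde w)\big]}_{\text{computational error}}
  + \underbrace{\big[\mathcal{E}_{\mathbf z}(\tilde w) - \inf_{w \in H} \mathcal{E}(w)\big]}_{\text{depends on the choice of } \tilde w}
```

When $\tilde w$ minimizes $\mathcal E$, the last bracket has non-positive expectation, since $\mathbb{E}[\mathcal{E}_{\mathbf z}(\tilde w)] = \mathcal{E}(\tilde w)$, which is the sense in which it is negligible.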
For the generalization error, the stability results from [11] provide sharp estimates for smooth loss functions and in the ‘capacity independent’ limit, that is under no assumptions on the covering numbers of the considered function space. For this setting, the obtained bound is optimal in the sense that it matches the best available bound for Tikhonov regularization [19, 9]. For the nonsmooth case a standard argument based on Rademacher complexity can be used, and easily extended to be capacity dependent. However, the corresponding bound is not sharp and improvements are likely to hinge on deriving better norm estimates for the iterates. The question does not seem to be straightforward and is deferred to a future work.
The computational error for the averaged iterates can be controlled using classic arguments [6], whereas for the last iterate the arguments in [12, 18] are needed. Finally, Theorems 2, 3 result from estimating and balancing the various error terms with respect to the choice of the stepsize and number of passes.
We conclude this section with some perspective on the results in the paper. We note that, since the primary goal of this study was to analyze the implicit regularization effect of the stepsize and the number of passes, we have considered a very simple iteration. However, it would be very interesting to consider more sophisticated, 'accelerated' iterations [17], and assess the potential advantages in terms of computational and generalization aspects. Similarly, we chose to keep the analysis in the paper relatively simple, but several improvements can be considered, for example deriving high-probability bounds and sharper error bounds under further assumptions. Some of these improvements are relatively straightforward, see e.g. [12], but others will require nontrivial extensions of results developed for Tikhonov regularization in the last few years. Finally, here we only referred to a simple cross-validation approach to parameter tuning, but it would clearly be very interesting to find ways to tune parameters online. A remarkable result in this direction is derived in [15], where it is shown that, in the capacity independent setting, adaptive online parameter tuning is indeed possible.
4 Numerical Simulations
We carry out some numerical simulations to illustrate our results (code: lcsl.github.io/MultiplePassesSGM). The experiments are executed 10 times each on the benchmark datasets reported in Table 1 (sources: archive.ics.uci.edu/ml and www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), in which the Gaussian kernel bandwidths used by SGM and SIGM for each learning problem are also shown. In what follows, we refer to one-pass SGM and multiple-pass SGM as SGM and SIGM, respectively. The loss function is the hinge loss (experiments with the logistic loss have also been carried out, showing similar empirical results; the details are omitted due to space limits). The experimental platform is a server with 12 Intel Xeon E5-2620 v2 (2.10GHz) CPUs and 132 GB of RAM. Some of the experimental results, as specified in the following, have been obtained by running the experiments on subsets of the data samples chosen uniformly at random. In order to apply hold-out cross-validation, the training set is split in two parts: one for empirical risk minimization and the other for validation error computation (80% and 20%, respectively). All the samples are randomly shuffled at each repetition.
4.1 Regularization in SGM and SIGM
In this subsection, we illustrate four concrete examples showing different regularization effects of the stepsize in SGM and the number of passes in SIGM. In all these four examples, we consider the Adult dataset with sample size .
In the first experiment, the SIGM stepsize is fixed as . The test error computed with respect to the hinge loss at each pass is reported in Figure 1. Note that the minimum test error is reached for a number of passes smaller than 20, after which it significantly increases, a so-called overfitting regime. This result clearly illustrates the regularization effect of the number of passes. In the second experiment, we consider SIGM with decaying stepsize ( and ). As shown in Figure 1, overfitting is not observed in the first 100 passes. In this case, the convergence to the optimal solution appears slower than in the fixed-stepsize case.
In the last two experiments, we consider SGM and show that the stepsize plays the role of a regularization parameter. For the fixed-stepsize case, i.e., , we perform SGM with different (logarithmically scaled). We plot the errors in Figure 2, showing that a large stepsize () leads to overfitting, while a smaller one (e.g., ) is associated with oversmoothing. For the decaying-stepsize case, we fix and run SGM with different . The errors are plotted in Figure 2, from which we see that the exponent has a regularization effect. In fact, a more 'aggressive' choice (e.g., , corresponding to a fixed stepsize) leads to overfitting, while for a larger (e.g., ) we observe oversmoothing.
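The regularization-by-stepsize experiment for one-pass SGM can be sketched as a simple validation sweep over logarithmically spaced constant stepsizes (a hypothetical helper with the hinge loss and a linear model; the grid bounds and function names are our own choices, not the paper's):

```python
import numpy as np

def one_pass_sgm(X, y, eta, seed=0):
    """One pass (n iterations) of SGM with constant stepsize eta,
    using the hinge loss on a linear model."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.integers(len(y), size=len(y)):
        if y[i] * (X[i] @ w) < 1.0:        # hinge subgradient step
            w += eta * y[i] * X[i]
    return w

def tune_stepsize(X_tr, y_tr, X_val, y_val, etas):
    """Run one-pass SGM for each candidate stepsize and pick the one
    with the smallest validation hinge error."""
    def err(w):
        return np.mean(np.maximum(0.0, 1.0 - y_val * (X_val @ w)))
    errs = [err(one_pass_sgm(X_tr, y_tr, eta)) for eta in etas]
    best = int(np.argmin(errs))
    return etas[best], errs[best]

# Logarithmically spaced stepsize guesses, as in the experiments
etas = np.logspace(-3, 1, 30)
```

Small stepsizes in the grid correspond to the oversmoothing regime and large ones to overfitting; the sweep selects the trade-off by validation error.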
4.2 Accuracy and Computational Time Comparison
In this subsection, we compare SGM with cross-validation and SIGM against the benchmark algorithm LIBSVM [8], both in terms of accuracy and computational time. For SGM, with 30 parameter guesses, we use cross-validation to tune the stepsize (either setting while tuning , or setting while tuning ). For SIGM, we use the two kinds of stepsize suggested by Section 3: and , or and , using early stopping via cross-validation. The test errors with respect to the hinge loss, the test relative misclassification errors, and the computational times are collected in Table 2.
We first compare accuracies. The results in Table 2 indicate that SGM with constant and decaying stepsizes and SIGM with fixed stepsize reach comparable test errors, in line with the LIBSVM baseline. Observe that SIGM with decaying stepsize attains consistently higher test errors, a phenomenon already observed in Section 4.1.
We now compare the computational times for cross-validation. We see from Table 2 that the training times of SIGM and SGM, either with constant or decaying stepsizes, are roughly the same. We also observe that SGM and SIGM are faster than LIBSVM on relatively large datasets (Adult with n = 32562, and Ijcnn1 with n = 49990), while for small datasets (BreastCancer with n = 400, Adult with n = 1000, and Ijcnn1 with n = 1000) they are comparable with or slightly slower than LIBSVM.
Acknowledgments
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF1231216. L. R. acknowledges the financial support of the Italian Ministry of Education, University and Research FIRB project RBFR12M3AC. The authors would like to thank Dr. Francesco Orabona for the fruitful discussions on this research topic, and Dr. Silvia Villa and the referees for their valuable comments.
References
 [1] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
 [2] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
 [3] Dimitri P Bertsekas. Nonlinear Programming. Athena scientific, 1999.
 [4] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.
 [5] Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
 [6] Stephen Boyd and Almir Mutapcic. Stochastic subgradient methods. Notes for EE364b, Stanford University, Winter 2007.
 [7] Stephen Boyd, Lin Xiao, and Almir Mutapcic. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003.
 [8] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
 [9] Felipe Cucker and Ding-Xuan Zhou. Learning Theory: an Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.
 [10] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step sizes. arXiv preprint arXiv:1408.0361, 2014.
 [11] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2016.
 [12] Junhong Lin, Lorenzo Rosasco, and Ding-Xuan Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, to appear, 2016.
 [13] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. The Journal of Machine Learning Research, 4:839–860, 2003.
 [14] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
 [15] Francesco Orabona. Simultaneous model selection and optimization through parameterfree stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
 [16] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1621–1629, 2015.
 [17] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
 [18] Ohad Shamir and Tong Zhang. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013.
 [19] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science+Business Media, 2008.
 [20] Pierre Tarres and Yuan Yao. Online learning as stochastic approximation of regularization paths: Optimality and almostsure convergence. IEEE Transactions on Information Theory, 60(9):5716–5735, 2014.
 [21] Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.
Appendix A Basic Lemmas
The following basic lemma, which will be used several times, is useful for our proofs. Its proof follows from the convexity of and the fact that is bounded.
Lemma 1.
Under Assumption 1, for any and , we have
(12) 
Proof.
Taking the expectation of (12) with respect to the random variable , and noting that is independent of given , one can get the following result.
Lemma 2.
Under Assumption 1, for any fixed , given any , assume that is independent of the random variable . Then we have
(13) 
Appendix B Sample Errors
Note that our goal is to bound the excess generalization error, whereas the left-hand side of (13) is related to an empirical error. The difference between the generalization and empirical errors is a so-called sample error. To estimate this sample error, we introduce the following lemma, which gives a uniform upper bound on the sample errors over a ball . Its proof is based on a standard symmetrization technique and Rademacher complexity, e.g. [1, 13]. For completeness, we provide a proof here.
Proof.
Let be another training sample drawn from , and assume that it is independent of . We have
Let be independent random variables drawn from the Rademacher distribution, i.e. for . Using a standard symmetrization technique, for example in [13], we get
With (5), by applying Talagrand’s contraction lemma, see e.g. [1], we derive
Using the Cauchy-Schwarz inequality, we reach
By Jensen’s inequality, we get
The desired result thus follows by plugging (4) into the above. Note that the above procedure also applies if we replace with . The proof is complete. ∎
The following lemma gives upper bounds on the iterated sequence.
Lemma 4.
Under Assumption 1, for any , we have
Proof.
Using Lemma 1 with we have
Noting that and we thus get
Applying this inequality iteratively for and introducing with , one can get that
which leads to the desired result by taking the square root of both sides. ∎
According to the above two lemmas, we can bound the sample errors as follows.
When the loss function is smooth, by Theorems 2.2 and 3.9 from [11], we can control the sample errors as follows.
Appendix C Excess Errors for Weighted Averages
Lemma 7.
Under Assumption 1, assume that there exists a nondecreasing sequence such that