# Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives

Stochastic gradient descent (SGD) has been widely studied in the literature from different angles, and is commonly employed for solving many big data machine learning problems. However, the averaging technique, which combines all iterative solutions into a single solution, is still under-explored. While some increasingly weighted averaging schemes have been considered in the literature, existing works are mostly restricted to strongly convex objective functions and the convergence of optimization error. It remains unclear how these averaging schemes affect the convergence of both optimization error and generalization error (two equally important components of testing error) for non-strongly convex objectives, including non-convex problems. In this paper, we fill the gap by comprehensively analyzing the increasingly weighted averaging on convex, strongly convex and non-convex objective functions in terms of both optimization error and generalization error. In particular, we analyze a family of increasingly weighted averaging, where the weight for the solution at iteration t is proportional to t^α (α > 0). We show how α affects the optimization error and the generalization error, and exhibit the trade-off caused by α. Experiments have demonstrated this trade-off and the effectiveness of polynomially increased weighted averaging compared with other averaging schemes for a wide range of problems including deep learning.


## 1 Introduction

In machine learning, the task of learning a predictive model is usually formulated as the following empirical risk minimization (ERM) problem:

$$x_* = \arg\min_{x\in\Omega} F_S(x) = \frac{1}{n}\sum_{i=1}^n f(x; z_i), \qquad (1)$$

where $f(x;z)$ is a loss function of the model $x$ on a data point $z$, $\Omega$ is a closed convex set, and $S = \{z_1,\dots,z_n\}$ denotes a set of observed data points that are sampled from an underlying unknown distribution $P$ with support on $\mathcal{Z}$. For a function $f$, $\nabla f(x)$ denotes a subgradient of $f$ at $x$.

To solve the ERM problem, stochastic gradient descent (SGD) is usually employed, which updates the solution according to

$$x_{t+1} = \Pi_\Omega\big[x_t - \eta_t \nabla f(x_t; z_{i_t})\big], \qquad (2)$$

for $t = 1,\dots,T$, where $i_t \in \{1,\dots,n\}$ is randomly sampled, $\eta_t$ is the step size, and $\Pi_\Omega$ is a projection operator, i.e., $\Pi_\Omega[x] = \arg\min_{y\in\Omega}\|y - x\|$, where $\|\cdot\|$ denotes the Euclidean norm. After all the iterations, a solution is output as the prediction model.
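To make the update concrete, here is a minimal sketch of projected SGD. The quadratic loss used in the example, the ball-shaped choice of $\Omega$, and the $\eta_t = \eta_1/\sqrt{t}$ schedule are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def project_ball(x, radius):
    # Euclidean projection onto the ball {x : ||x|| <= radius},
    # a simple example of the operator Pi_Omega.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def sgd(grad_f, data, x0, step0, num_iters, radius, rng):
    # Projected SGD: x_{t+1} = Pi_Omega[x_t - eta_t * grad f(x_t; z_{i_t})]
    x = x0.copy()
    for t in range(1, num_iters + 1):
        z = data[rng.integers(len(data))]   # sample i_t uniformly at random
        eta = step0 / np.sqrt(t)            # eta_t = eta_1 / sqrt(t)
        x = project_ball(x - eta * grad_f(x, z), radius)
    return x
```

Running it on the (hypothetical) loss $f(x;z) = \tfrac{1}{2}\|x - z\|^2$ drives the iterate toward the sample mean.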

SGD has been widely studied in the literature from different angles and is commonly employed for solving many big data machine learning problems. However, the averaging technique that combines all iterative solutions into a single solution is still under-explored. Most studies simply output the last solution or adopt uniform averaging, which computes a uniformly averaged solution based on all iterative solutions, i.e., $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$. Nonetheless, both approaches can lead to unsatisfactory performance despite their benefits. The last solution may converge faster in terms of optimization error in practice, but is less stable Bottou (2010); Hardt et al. (2015). Uniform averaging can improve the stability of the output solution, but may slow down the convergence of optimization error in some cases Hazan et al. (2007); Hazan and Kale (2014). Although some non-uniform averaging schemes have been considered in the literature Rakhlin et al. (2011); Shamir and Zhang (2013); Lacoste-Julien et al. (2012), most of their analysis focuses on the convergence of optimization error. Their impact on the trade-off between optimization error and generalization error is unclear.

In order to understand this trade-off, we need to consider the following risk minimization problem, which is of utmost interest:

$$\min_{x\in\Omega} F(x) = \mathbb{E}_{z\sim P}[f(x;z)]. \qquad (3)$$

Regarding an iterative algorithm $A$ that outputs a solution $x_S$ by solving (1), the central concern is its generalization performance, which can be measured by $\mathbb{E}_{A,S}[F(x_S)]$, where the expectation is taken over all randomness in the algorithm $A$ and the training dataset $S$. To analyze the generalization performance (referred to as the testing error hereafter) of a random solution $x_S$, we use the following decomposition of the testing error:

$$\mathbb{E}_{A,S}[F(x_S)] = \mathbb{E}_S[F_S(x_*)] + \underbrace{\mathbb{E}_S\mathbb{E}_A\big[F_S(x_S) - F_S(x_*)\big]}_{\varepsilon_{\mathrm{opt}}} + \underbrace{\mathbb{E}_{A,S}\big[F(x_S) - F_S(x_S)\big]}_{\varepsilon_{\mathrm{gen}}},$$

where $\varepsilon_{\mathrm{opt}}$ denotes the optimization error of the algorithm and $\varepsilon_{\mathrm{gen}}$ denotes the generalization error of the solution $x_S$.

Most existing analysis of non-uniform averaging is restricted to the convergence of optimization error for strongly convex objectives. This paper aims to fill the gap by comprehensively analyzing a polynomially increased weighted averaging (PIWA) scheme where the weight of the solution of iteration $t$ is proportional to $t^\alpha$ ($\alpha > 0$). We analyze SGD with PIWA for general convex, strongly convex and non-convex objective functions in terms of both optimization error and generalization error. We prove that SGD with PIWA has the optimal convergence rate in terms of optimization error in both the general convex and strongly convex cases, i.e., $O(1/\sqrt{T})$ for the convex case and $O(1/T)$ for the strongly convex case. For the non-convex case, we employ PIWA in a stagewise algorithm, which uses SGD with the averaging for solving a convex subproblem at each stage. We establish a convergence rate in terms of the optimization error for a family of weakly convex functions that satisfy the Polyak-Łojasiewicz condition. Moreover, we analyze the generalization error of SGD with PIWA following the analysis framework in Hardt et al. (2015) that uses the uniform stability tool. We show that SGD with PIWA may have smaller generalization error than SGD using the last solution. We also show how $\alpha$ affects the optimization error and the generalization error, and thus exhibit the trade-off caused by $\alpha$. We have also conducted extensive experiments on convex, strongly convex and non-convex functions. The experimental results demonstrate the trade-off caused by $\alpha$ and the effectiveness of PIWA compared with other commonly used averaging schemes.

## 2 Related work

Many works analyze SGD with uniform averaging Polyak (1990); Polyak and Juditsky (1992); Zinkevich (2003); Bottou (2010); Chen et al. (2019). Uniform averaging can help to improve stability in terms of generalization Hardt et al. (2015). It can also help to attain an optimal convergence rate in the convex case Polyak and Juditsky (1992); Zinkevich (2003). However, as it gives equal weight to every solution, it can actually slow down the convergence in many cases, since later solutions are usually more accurate than earlier ones Chen et al. (2019); Shamir and Zhang (2013). For example, in the strongly convex case, the uniformly averaged solution has an $O(\log(T)/T)$ convergence rate Hazan et al. (2007); Hazan and Kale (2014); Shamir and Zhang (2013), which is suboptimal.

In order to improve the convergence rate for the optimization of strongly convex functions, many non-uniform averaging schemes have been proposed Rakhlin et al. (2011); Shamir and Zhang (2013); Lacoste-Julien et al. (2012). Rakhlin et al. (2011) considered suffix averaging, which takes the average of a constant fraction of the last solutions. However, suffix averaging cannot be updated online and is thus computationally expensive. Shamir and Zhang (2013) proposed polynomial-decay averaging for minimizing strongly convex functions. Lacoste-Julien et al. (2012) considered a simple polynomially increased weighted averaging, where the weight for the solution of the $t$-th iteration is proportional to $t$. SGD with these averaging schemes has been shown to achieve the optimal convergence rate of $O(1/T)$ for minimizing a strongly convex objective.

However, these existing works restrict their attention to minimizing a strongly convex objective and only analyze the optimization error. Thus, the theory for these non-uniform averaging schemes in convex and non-convex cases is lacking, and their impact on the generalization error is unclear. It should be mentioned that the exponential moving averaging technique has also been widely used Kingma and Ba (2014); Zhang et al. (2015), which maintains a moving average by $\bar{x}_t = \beta\bar{x}_{t-1} + (1-\beta)x_t$ with $\beta < 1$. However, as the weights for previous solutions decay exponentially, its performance is close to that of the last solution. What is more, we are not aware of any existing theoretical guarantee on the performance of moving averaging.

Studies on the non-convex case used to analyze a randomly sampled solution Ghadimi and Lan (2013); Yan et al. (2018); Davis and Drusvyatskiy (2018). Recently, stagewise algorithms have enabled the use of averaging for a class of non-convex functions, namely weakly convex functions Chen et al. (2019); Davis and Grimmer (2017). These stagewise algorithms construct a convex objective function as a subproblem for each stage. By solving these subproblems in a stagewise manner, they can guarantee convergence for the original problem. Chen et al. (2019) use uniform averaging, so their method may also suffer from slow convergence in practice. In Davis and Grimmer (2017), the weight of the solution at each iteration of a stage is fixed rather than tunable. Hence, it may not achieve the best trade-off between the optimization error and the generalization error.

Hardt et al. (2015) have shown that uniform averaging can improve stability and hence generalization. Yuan et al. (2018) have derived a bound on the generalization error for a stagewise algorithm with uniform averaging. The analysis of the generalization error of SGD in this work can be considered an extension of these works Hardt et al. (2015); Yuan et al. (2018) that analyzes the impact of PIWA on the generalization error.

## 3 Preliminaries

A function $f$ is $G$-Lipschitz continuous if $|f(x) - f(y)| \le G\|x - y\|$ for all $x, y$, i.e., $\|\nabla f(x)\| \le G$, and is $L$-smooth if it is differentiable and its gradient is $L$-Lipschitz continuous.

A function $f$ is $\lambda$-strongly convex for $\lambda > 0$ if for all $x, y \in \Omega$,

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\lambda}{2}\|x - y\|^2.$$

A non-convex function $f$ is called $\rho$-weakly convex for $\rho > 0$ if $f(x) + \frac{\rho}{2}\|x\|^2$ is convex.

For the analysis of the generalization error, we will use the uniform stability tool Bousquet and Elisseeff (2002). The definition of uniform stability is given below.

A randomized algorithm $A$ is called $\epsilon$-uniformly stable if for all datasets $S, S'$ that differ in at most one example, the following holds:

$$\sup_z\,\mathbb{E}_A\big[f(A(S); z) - f(A(S'); z)\big] \le \epsilon,$$

where $A(S)$ denotes the random solution returned by algorithm $A$ based on the dataset $S$.

The relation between uniform stability and generalization error is given in the following lemma (Bousquet and Elisseeff (2002)). If $A$ is $\epsilon$-uniformly stable, we have $\big|\mathbb{E}_{S,A}[F_S(A(S)) - F(A(S))]\big| \le \epsilon$.

Therefore, in order to compare the testing error of different randomized algorithms, it suffices to analyze their convergence in terms of optimization error and their uniform stability.

## 4 Main Theoretical Results

In this section, we analyze SGD with PIWA in terms of both optimization error and generalization error. We denote the algorithm by SGD-PIWA and present it in Algorithm 1. We particularly use $w_t = t^\alpha$, where $\alpha \ge 0$, to control the averaging weight of the solution at the $t$-th iteration. It should be noticed that SGD with uniform averaging is a special case of this algorithm obtained by taking $\alpha = 0$, and the averaging scheme proposed in Lacoste-Julien et al. (2012) is also a special case obtained by setting $\alpha = 1$. The step size at the $t$-th iteration will differ for different classes of functions, as exhibited later.

It is notable that the final averaged solution can be computed online by updating an averaged sequence $\{\bar{x}_t\}$:

$$\bar{x}_T = \left(\Big(\sum_{t=1}^{T-1} w_t\Big)\bar{x}_{T-1} + w_T x_T\right)\Bigg/\left(\sum_{t=1}^{T-1} w_t + w_T\right).$$
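As a sanity check, this recursion can be implemented in a few lines. The sketch below (with hypothetical scalar iterates) is not from the paper:

```python
import numpy as np

def piwa_average(iterates, alpha):
    # Online increasingly weighted average with weights w_t = t^alpha.
    # Maintains x_bar_T = ((sum_{t<T} w_t) * x_bar_{T-1} + w_T * x_T)
    #                     / (sum_{t<T} w_t + w_T).
    x_bar, weight_sum = None, 0.0
    for t, x in enumerate(iterates, start=1):
        w = t ** alpha
        if x_bar is None:
            x_bar = np.array(x, dtype=float)
        else:
            x_bar = (weight_sum * x_bar + w * np.asarray(x)) / (weight_sum + w)
        weight_sum += w
    return x_bar
```

With $\alpha = 0$ this recovers the uniform average; larger $\alpha$ shifts the weight toward later iterates.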

Before showing the trade-off of $\alpha$, we emphasize that most proofs of convergence of optimization error in existing works on averaging schemes first use the following Jensen's inequality to upper bound the objective gap:

$$F_S(\bar{x}_T) - F_S(x_*) \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha F_S(x_t) - F_S(x_*), \qquad (4)$$

and then further upper bound the right hand side (RHS) of (4). In this way, the objective gap, i.e., $F_S(\bar{x}_T) - F_S(x_*)$, is relaxed twice.

However, it may not be precise to focus only on the effect of $\alpha$ on the twice-relaxed bound of $F_S(\bar{x}_T) - F_S(x_*)$. Instead, to investigate the benefit of PIWA on the convergence of optimization error, we propose to additionally inspect the first relaxation, i.e., the RHS of (4). Specifically, we present the following lemma to illustrate how $\alpha$ affects the RHS of (4).

Assume $F_S(x_1) \ge F_S(x_2) \ge \cdots \ge F_S(x_T)$. The function

$$H_T(\alpha) = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha F_S(x_t) \qquad (5)$$

is non-increasing in $\alpha$ for $\alpha \ge 0$.

We can see that, under the condition that the sequence of solutions yields non-increasing objective values, a larger $\alpha$ in PIWA makes $H_T(\alpha)$ smaller, which indicates that the RHS of (4) is smaller. The assumption may not hold in practice, but, to some degree, it explains the effect of $\alpha$ on the first relaxation of the objective gap.
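The lemma can be illustrated numerically. The objective values below are a hypothetical non-increasing sequence, chosen only for illustration:

```python
def weighted_objective(values, alpha):
    # H_T(alpha) = sum_t t^alpha * F_S(x_t) / sum_t t^alpha
    weights = [(t + 1) ** alpha for t in range(len(values))]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# A hypothetical non-increasing sequence of objective values.
vals = [1.0, 0.5, 0.3, 0.25, 0.2]
hs = [weighted_objective(vals, a) for a in (0.0, 0.5, 1.0, 2.0)]
# As the lemma predicts, H_T(alpha) is non-increasing in alpha.
assert all(h1 >= h2 for h1, h2 in zip(hs, hs[1:]))
```

At $\alpha = 0$ the value is the plain average $0.45$; increasing $\alpha$ shifts weight toward the later, smaller values and the weighted objective decreases.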

In the subsequent subsections, we provide convergence analysis of PIWA in terms of optimization and generalization error under different conditions, i.e., the general convex, strongly convex and non-convex cases. We reveal how $\alpha$ affects the upper bounds of $\varepsilon_{\mathrm{opt}}$ and $\varepsilon_{\mathrm{gen}}$, and how $\alpha$ causes a trade-off between them.

### 4.1 General Convex Case

#### 4.1.1 Optimization Error

In the general convex case, we need the following assumptions.

###### Assumption 1

(i) $f(x; z)$ is a convex function in terms of $x$ for any $z$;
(ii) $\|\nabla f(x; z)\| \le G$ for any $x \in \Omega$ and any $z$;
(iii) there exists $D > 0$ such that $\|x - x_*\| \le D$ for any $x \in \Omega$.

Based on Assumption 1, we have the following theorem. Suppose Assumption 1 holds. By setting $\eta_t = \eta_1/\sqrt{t}$, we have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] \le \begin{cases}\dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{(2\alpha+1)T^{1/2}}, & 0 \le \alpha < 1/2,\\[2mm] \dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{T^{1/2}}, & \alpha \ge 1/2.\end{cases}$$

Remark. The convergence rate for different values of $\alpha$ is of the same order $O(1/\sqrt{T})$, which is optimal Polyak and Juditsky (1992); Zinkevich (2003). One may notice that a larger $\alpha$ yields a worse constant in the convergence bound. It does not, however, necessarily indicate a worse optimization error in light of Lemma 4. In practice, there is a trade-off between different values of $\alpha$.

#### 4.1.2 Generalization Error

In this subsection, we analyze the uniform stability of SGD-PIWA. Our analysis closely follows the route in Hardt et al. (2015). By bounding $\mathbb{E}[\|\bar{x}_T - \bar{x}'_T\|]$, we bound the generalization error, where $\bar{x}_T$ is learned by SGD-PIWA on a dataset $S$ and $\bar{x}'_T$ is learned by SGD-PIWA on a dataset $S'$ differing from $S$ in at most one example.

Similar to Theorem 3.8 in Hardt et al. (2015), the following lemma provides a bound on the deviation between the two sequences $\{x_t\}$ and $\{x'_t\}$ of SGD-PIWA run on $S$ and $S'$ separately.

Assume that the loss function $f(\cdot; z)$ is convex, $G$-Lipschitz and $L$-smooth for every $z$. Suppose we run SGD-PIWA with step sizes $\eta_t \le 2/L$ for $T$ steps. Then,

$$\mathbb{E}\big[\|x_T - x'_T\|\big] \le \frac{2G}{n}\sum_{t=1}^{T-1}\eta_t. \qquad (6)$$

Suppose Assumption 1 holds, and further assume $f(\cdot; z)$ is $G$-Lipschitz and $L$-smooth for every $z$. Set $\eta_t = \eta_1/\sqrt{t}$. Then Algorithm 1 has uniform stability of

$$\epsilon_{\mathrm{stab}} \le \frac{4\eta_1 G^2(\alpha+1)(T+1)^{\alpha+1.5}}{n(\alpha+1.5)T^{\alpha+1}}. \qquad (7)$$

Remark. When $T$ is large, $(T+1)^{\alpha+1.5}/T^{\alpha+1} \approx T^{1/2}$, so the bound behaves like $\frac{4\eta_1 G^2\sqrt{T}}{n}\cdot\frac{\alpha+1}{\alpha+1.5}$. We can then see that the bound on the generalization error is increasing in $\alpha$. Taking $\alpha = 0$, which reduces PIWA to uniform averaging, gives the smallest generalization error. Even if $\alpha > 0$, the generalization error of PIWA is still smaller than that of the last solution Hardt et al. (2015), since the factor $\frac{\alpha+1}{\alpha+1.5}$ is bounded by $1$. It is therefore smaller than that of the last solution even when $\alpha$ is very large.
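Numerically, the $\frac{\alpha+1}{\alpha+1.5}$ factor in (7) behaves exactly as the remark describes. A quick check (illustrative, not from the paper):

```python
def stability_factor(alpha):
    # The (alpha + 1) / (alpha + 1.5) factor from the stability bound (7):
    # it increases with alpha but never exceeds 1, the corresponding
    # constant for the last solution.
    return (alpha + 1.0) / (alpha + 1.5)

factors = [stability_factor(a) for a in (0.0, 1.0, 2.0, 10.0, 100.0)]
assert all(f1 < f2 for f1, f2 in zip(factors, factors[1:]))  # increasing in alpha
assert all(f < 1.0 for f in factors)                         # below last-solution constant
```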

### 4.2 Strongly Convex Case

In this subsection, we analyze our algorithm for strongly convex objective functions. In this case, we need the following assumptions:

###### Assumption 2

(i) $F_S(x)$ is a $\lambda$-strongly convex function;
(ii) $\|\nabla f(x; z)\| \le G$ for any $x \in \Omega$ and any $z$.

#### 4.2.1 Optimization Error

Suppose Assumption 2 holds. By setting $\eta_t = \frac{1}{\lambda t}$, we have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] \le \begin{cases}\dfrac{G^2}{\lambda T}\big(1 + \log(T)\big), & \alpha = 0,\\[2mm] \dfrac{(\alpha+1)^2 G^2}{\alpha\lambda T}, & 0 < \alpha < 1,\\[2mm] \dfrac{(\alpha+1)^2 G^2 (T+1)^\alpha}{\alpha\lambda T^{\alpha+1}}, & \alpha \ge 1.\end{cases}$$

Remark. When $\alpha = 0$, the algorithm degenerates to uniform averaging with an $O(\log(T)/T)$ convergence rate. By taking $\alpha > 0$, the order is improved to $O(1/T)$. The algorithm in Lacoste-Julien et al. (2012) is a special case with $\alpha = 1$, while we have generalized it to any $\alpha > 0$. Note that when $\alpha$ is close to $0$, although the order is $O(1/T)$, the convergence is not significantly better than with $\alpha = 0$ because of the $\alpha$ in the denominator.

#### 4.2.2 Generalization Error

We can then bound the generalization error in the following theorem. Suppose Assumption 2 holds, and further assume that $f(\cdot; z)$ is $G$-Lipschitz and $L$-smooth. Then by taking $\eta_t = \frac{1}{\lambda t}$ and a suitable constant $t_0$, Algorithm 1 has uniform stability of

$$\epsilon_{\mathrm{stab}} \le \frac{t_0}{n} + \frac{4G^2(\alpha+1)}{\lambda n}. \qquad (8)$$

Remark. This result is similar to Theorem 3.10 in Hardt et al. (2015). Differently, this result depends on $\alpha$: the bound on the generalization error grows linearly in $\alpha$. Again, we see that the generalization error of PIWA is worse than that of uniform averaging. Thus, we cannot take $\alpha$ to be very large in experiments.

### 4.3 Non-Convex Case

In this section, we analyze the optimization and generalization error of SGD-PIWA in the non-convex case. We need the following assumptions:

###### Assumption 3

(i) $\|\nabla f(x; z)\| \le G$ for any $x \in \Omega$ and any $z$;
(ii) $f(x; z)$ is $\rho$-weakly convex for any $z$;
(iii) $f(x; z)$ is $L$-smooth;
(iv) for an initial solution $x_0$, there exists $\epsilon_0 > 0$ such that $F_S(x_0) - F_S(x_*) \le \epsilon_0$;
(v) $F_S$ satisfies the PL condition, i.e., there exists $\mu > 0$ such that

$$2\mu\big(F_S(x) - F_S(x_*)\big) \le \|\nabla F_S(x)\|^2, \quad \forall x \in \Omega. \qquad (9)$$

These conditions have been used in Karimi et al. (2016); Lei et al. (2017); Bassily et al. (2018); Reddi et al. (2016). The algorithm we use for the non-convex objective function is shown in Algorithm 2, where $F_k(x) = F_S(x) + \frac{\gamma}{2}\|x - x_{k-1}\|^2$ with $\gamma > \rho$ so that each $F_k$ is convex. The algorithm uses a stagewise training strategy: the objective function for each stage is the sum of the loss function and an $\ell_2$ regularization term centered at the output of the previous stage. At each stage, the objective function is optimized by SGD-PIWA. The step size within a stage is set to a constant and is decreased after each stage.
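The stagewise scheme can be sketched as follows. Algorithm 2's exact stage lengths, step-size schedule and regularization weight are not reproduced here, so the choices below (shrinking the step size as $\eta_0/(k+1)$, growing the stage length linearly, a fixed $\gamma$) are illustrative assumptions:

```python
import numpy as np

def sgd_piwa(grad, data, x0, eta, T, alpha, rng):
    # Inner loop: SGD at a constant step size eta, with PIWA averaging
    # (weights w_t = t^alpha) maintained online.
    x, x_bar, wsum = x0.copy(), x0.copy(), 0.0
    for t in range(1, T + 1):
        z = data[rng.integers(len(data))]
        x = x - eta * grad(x, z)
        w = t ** alpha
        x_bar = (wsum * x_bar + w * x) / (wsum + w)
        wsum += w
    return x_bar

def stagewise_piwa(grad, data, x0, eta0, T0, alpha, gamma, stages, rng):
    # Stagewise scheme (a sketch): each stage minimizes the regularized
    # objective F_S(x) + (gamma/2) * ||x - x_ref||^2 by SGD-PIWA, where
    # x_ref is the averaged output of the previous stage.
    x_ref = x0.copy()
    for k in range(stages):
        reg_grad = lambda x, z, ref=x_ref: grad(x, z) + gamma * (x - ref)
        x_ref = sgd_piwa(reg_grad, data, x_ref,
                         eta0 / (k + 1), T0 * (k + 1), alpha, rng)
    return x_ref
```

On a simple quadratic objective each stage contracts the distance to the minimizer, matching the remark that the constraint ball can be shrunk from stage to stage.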

#### 4.3.1 Optimization Error

We need the following lemmas.

If $F_S$ satisfies the PL condition, then for any $x \in \Omega$ we have

$$\|x - x_*\|^2 \le \frac{2}{\mu}\big(F_S(x) - F_S(x_*)\big). \qquad (10)$$

Remark. Please refer to Bolte et al. (2017); Karimi et al. (2016) for a proof.

We first have the following lemma for each stage of the algorithm. To this end, we let $x^k_t$ denote the $t$-th solution at the $k$-th stage.

Suppose Assumption 3 holds. Apply SGD-PIWA to $F_k$ with a constant step size $\eta_k$, and further assume that the solution at each iteration of the $k$-th stage is projected onto a Euclidean ball of radius $D_k$ centered at $x_{k-1}$. Then with probability at least $1 - \delta$, we have

$$F_k(\bar{x}_k) - F_k(x_*) \le \frac{1}{\sum_{t=1}^{T_k} t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^{T_k} t^\alpha\big(F_k(x^k_t) - F_k(x_*)\big)\right] \le \frac{2(\alpha+1)\epsilon_{k-1}}{\eta_k\mu T} + \frac{\eta_k\hat{G}^2}{2} + \frac{2(\alpha+1)\hat{G}D_k\sqrt{2\ln(1/\delta)}}{\sqrt{T}},$$

where $x^k_t$ is the solution of the $t$-th iteration at stage $k$, and $\hat{G}$ is an upper bound on the (sub)gradient norm of $F_k$ over the ball, which exists.

Remark. Combining the above result with the preceding lemma, we can see that $\|\bar{x}_k - x_*\|$ is also bounded, which indicates that if the optimal point is within the constraint ball of the initial solution, then after this stage the optimal point is still within a constraint ball of the averaged solution with high probability. In the following theorem, we shrink the bounded ball after each stage.

Suppose Assumption 3 holds and $F_S$ is $\rho$-weakly convex. Then, with suitable settings of $\eta_k$, $T_k$ and $D_k$ across stages, after $K$ stages, with high probability we have

$$F_S(\bar{x}_K) - F_S(x_*) \le \epsilon.$$

Remark. It is easy to see that the total iteration complexity is of the same order as in Yuan et al. (2018), which uses uniform averaging.

#### 4.3.2 Generalization Error

We now establish the uniform stability in the following theorem. Suppose the step size within the last stage satisfies $\eta_t \le c/t$ for a constant $c$, and let $T$ denote the number of iterations of the last stage. Under the same assumptions and settings as the preceding theorem, we have

$$\epsilon_{\mathrm{stab}} \le \frac{S_{K-1}}{n} + \frac{1 + \frac{1}{Lc}}{n-1}\Big(2(\alpha+1)cL^2\Big)^{\frac{1}{1+Lc}}\,T^{\frac{Lc}{1+Lc}},$$

where $S_{K-1}$ denotes the number of iterations before the last stage.

Remark. We apply a conditional analysis for the non-convex objective function similar to that in Hardt et al. (2015). In particular, we condition on the event that the differing example is first used within the last stage, and prove the bound under this event. We can see the bound is increasing in $\alpha$.

## 5 Experiments

In this section, we demonstrate the effectiveness of PIWA. First, we compare PIWA with different averaging baselines for convex, strongly convex and non-convex objective functions. The baselines are the last solution, uniform averaging and moving averaging. The goal is to show the desirable generalization performance of PIWA. Second, to reveal why PIWA achieves better generalization performance, we show that there is a trade-off of $\alpha$ between optimization and generalization error by comparing different variants of PIWA. We set various values of $\alpha$ to show how it affects optimization and generalization error for convex, strongly convex and non-convex objective functions.

Settings. For a dataset $S$, each example is $z_i = (a_i, b_i)$, where $a_i$ is the feature vector and $b_i \in \{+1, -1\}$ is the label. The hinge loss $f(x; z_i) = \max(0, 1 - b_i\, x^\top a_i)$ is used for the general convex case. For the strongly convex case, we add an $\ell_2$ norm regularization, i.e., $\frac{\lambda}{2}\|x\|^2$, to the above objective. For the non-convex case, we learn a ResNet20 network with the softmax loss He et al. (2016). We compare PIWA with the last solution, uniform averaging and exponential moving averaging.
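For concreteness, the convex and strongly convex objectives described above can be written as follows (a sketch; the exact regularization placement in the paper's experiments may differ):

```python
import numpy as np

def hinge_loss(x, a, b):
    # f(x; (a, b)) = max(0, 1 - b * <a, x>)  -- the general convex case.
    return max(0.0, 1.0 - b * float(a @ x))

def hinge_subgrad(x, a, b, lam=0.0):
    # A subgradient of the hinge loss, plus lam * x for the
    # l2-regularized (strongly convex) case.
    g = -b * a if b * float(a @ x) < 1.0 else np.zeros_like(x)
    return g + lam * x
```

With `lam = 0` this is the plain convex objective; `lam > 0` makes it $\lambda$-strongly convex.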

Datasets. For the convex and strongly convex cases, we perform experiments on covtype, splice, svmguide1 and skin-nonskin from the LIBSVM binary classification datasets Chang and Lin (2011). For covtype and skin-nonskin, as no testing sets are provided, we randomly sampled 50,000 data points as testing data. For the other datasets used in the convex/strongly convex experiments, we use the training/testing split provided by LIBSVM Chang and Lin (2011). For the non-convex case, we use the CIFAR-10 and CIFAR-100 datasets Krizhevsky et al. (2014).

Parameters. In the convex and strongly convex cases, the initial step size is tuned over a grid of values, and in the strongly convex case the regularization weight $\lambda$ is also tuned. In the convex and strongly convex cases, we sample one instance at each iteration. In the non-convex case, the initial step size and $\gamma$ are tuned. The number of iterations is set to 40k for the first stage and 20k for the second stage, the same as in He et al. (2016). The step size is decayed by a constant factor of 10 after each stage. For the exponential moving average, the decay coefficient is tuned following Kingma and Ba (2014); Zhang et al. (2015). The batch size used in the non-convex case is 128. For experiments on all three cases, $\alpha$ is tuned over several values.

Results. The experimental results are shown in Figure 1, Figure 2 and Figure 3. For the convex case (Figure 1) and the strongly convex case (Figure 2), we plot the curves of the objective values and the testing error. Note that the testing error in the experiments refers to the misclassification rate. For the non-convex case (Figure 3), we plot the curves of the training error and the testing error. For all three figures, the first two columns compare PIWA with the three baselines. The last two columns compare variants of PIWA with different values of $\alpha$ to demonstrate the trade-off of $\alpha$.

From the first two columns of the three figures, we can see that PIWA often achieves the best testing error among all the averaging schemes, even though PIWA does not always outperform the other averaging baselines in training. In addition, PIWA tends to make the output solution stable, since the curves of PIWA rarely fluctuate dramatically. In contrast, the other baselines often return unstable solutions. In particular, the curves of the last-solution method often exhibit sharp fluctuations, especially in the convex and strongly convex settings, where we sample one instance per iteration. In the non-convex setting, where we use a batch size of 128, the last-solution method remains more stable.

The last two columns of the three figures show the trade-off of $\alpha$ between the optimization and generalization error. Specifically, a larger $\alpha$ usually leads to faster training convergence, but it may make the solution more unstable. Moreover, a larger $\alpha$ often leads to a larger testing error, even if the training performance is better.

## 6 Conclusion

In this paper, we have comprehensively analyzed SGD with PIWA in terms of both optimization error and generalization error for convex, strongly convex and non-convex problems. We have shown in theory why PIWA with a proper $\alpha$ can improve the optimization error. We have also shown that in PIWA, a larger $\alpha$ usually leads to a worse generalization error. Thus, there is a trade-off caused by $\alpha$ between the optimization error and the generalization error. Experiments on benchmark datasets have demonstrated this trade-off and the effectiveness of PIWA compared with other averaging schemes.

## References

• R. Bassily, M. Belkin, and S. Ma (2018) On exponential convergence of sgd in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564. Cited by: §4.3.
• J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter (2017) From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165 (2), pp. 471–507. Cited by: §4.3.1.
• L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §1, §2.
• O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of machine learning research 2 (Mar), pp. 499–526. Cited by: §3, §3.
• C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 27. Cited by: §5.
• Z. Chen, Z. Yuan, J. Yi, B. Zhou, E. Chen, and T. Yang (2019) Universal stagewise learning for non-convex problems with convergence on averaged solutions. In International Conference on Learning Representations, Cited by: §2, §2.
• D. Davis and D. Drusvyatskiy (2018) Stochastic subgradient method converges at the rate O(k^{-1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988. Cited by: §2.
• D. Davis and B. Grimmer (2017) Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arXiv preprint arXiv:1707.03505. Cited by: §2.
• S. Ghadimi and G. Lan (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. Cited by: §2.
• M. Hardt, B. Recht, and Y. Singer (2015) Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240.
• E. Hazan, A. Agarwal, and S. Kale (2007) Logarithmic regret algorithms for online convex optimization. Machine Learning 69 (2-3), pp. 169–192. Cited by: §1, §2.
• E. Hazan and S. Kale (2014) Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research 15 (1), pp. 2489–2512. Cited by: §1, §2.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §5, §5.
• H. Karimi, J. Nutini, and M. Schmidt (2016) Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Cited by: §4.3.1, §4.3.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2, §5.
• A. Krizhevsky, V. Nair, and G. Hinton (2014) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html. Cited by: §5.
• S. Lacoste-Julien, M. Schmidt, and F. Bach (2012) A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002. Cited by: §1, §2, §4.2.1, §4.
• L. Lei, C. Ju, J. Chen, and M. I. Jordan (2017) Non-convex finite-sum optimization via scsg methods. In Advances in Neural Information Processing Systems, pp. 2348–2358. Cited by: §4.3.
• B. T. Polyak and A. B. Juditsky (1992) Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30 (4), pp. 838–855. Cited by: §2, §4.1.1.
• B. T. Polyak (1990) A new method of stochastic approximation type. Avtomatika i telemekhanika, pp. 98–107. Cited by: §2.
• A. Rakhlin, O. Shamir, and K. Sridharan (2011) Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647. Cited by: §1, §2.
• S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola (2016) Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323. Cited by: §4.3.
• O. Shamir and T. Zhang (2013) Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In International Conference on Machine Learning, pp. 71–79. Cited by: §1, §2, §2.
• Y. Yan, T. Yang, Z. Li, Q. Lin, and Y. Yang (2018) A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13–19, 2018, Stockholm, Sweden, pp. 2955–2961. Cited by: §2.
• Z. Yuan, Y. Yan, R. J. Jin, and T. Y. Yang (2018) Stagewise training accelerates convergence of testing error over sgd. arXiv preprint arXiv:1812.03934v3. Cited by: §2, §4.3.1.
• S. Zhang, A. E. Choromanska, and Y. LeCun (2015) Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pp. 685–693. Cited by: §2, §5.
• M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936. Cited by: §2, §4.1.1.

## Appendix A Proof of Lemma 2

Proof. To show that $H_T(\alpha)$ is non-increasing in $\alpha$, it suffices to show that $H_T'(\alpha) \le 0$. We have

$$H_T'(\alpha) = \frac{\Big(\sum_{t=1}^T t^\alpha \ln t\, F_S(x_t)\Big)\sum_{t=1}^T t^\alpha - \Big(\sum_{t=1}^T t^\alpha F_S(x_t)\Big)\sum_{t=1}^T t^\alpha \ln t}{\Big(\sum_{t=1}^T t^\alpha\Big)^2} \le 0, \qquad (11)$$

where the inequality follows from Chebyshev's sum inequality together with the assumption that $F_S(x_1) \ge F_S(x_2) \ge \cdots \ge F_S(x_T)$.

## Appendix B Proof of Theorem 1

Proof. We have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[F_S(x_t) - F_S(x_*)\big] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big], \qquad (12)$$

where $z_{i_t}$ denotes the data point sampled at the $t$-th iteration.

According to standard analysis of SGD on convex functions, we have

$$\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big] \le \mathbb{E}\big[\nabla f(x_t; z_{i_t})^\top(x_t - x_*)\big] \le \mathbb{E}\left[\frac{\|x_t - x_*\|^2}{2\eta_t} - \frac{\|x_{t+1} - x_*\|^2}{2\eta_t} + \frac{\eta_t G^2}{2}\right]. \qquad (13)$$

Multiplying both sides by $w_t = t^\alpha$ and summing from $t = 1$ to $T$, we have

$$\begin{aligned}
\sum_{t=1}^T w_t\,\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big] &\le \sum_{t=1}^T\left[\frac{w_t}{2\eta_t}\mathbb{E}\big[\|x_t - x_*\|^2\big] - \frac{w_{t-1}}{2\eta_{t-1}}\mathbb{E}\big[\|x_t - x_*\|^2\big] + \frac{w_{t-1}}{2\eta_{t-1}}\mathbb{E}\big[\|x_t - x_*\|^2\big] - \frac{w_t}{2\eta_t}\mathbb{E}\big[\|x_{t+1} - x_*\|^2\big] + \frac{w_t\eta_t G^2}{2}\right]\\
&= \sum_{t=1}^T\left(\frac{w_t}{2\eta_t} - \frac{w_{t-1}}{2\eta_{t-1}}\right)\mathbb{E}\big[\|x_t - x_*\|^2\big] + \frac{w_0}{2\eta_0}\mathbb{E}\big[\|x_1 - x_*\|^2\big] - \frac{w_T}{2\eta_T}\mathbb{E}\big[\|x_{T+1} - x_*\|^2\big] + \sum_{t=1}^T\frac{w_t\eta_t G^2}{2}\\
&\le \left(\frac{w_T}{2\eta_T} - \frac{w_0}{2\eta_0}\right)D^2 + \sum_{t=1}^T\frac{w_t\eta_t G^2}{2} = \frac{T^{\alpha+1/2}}{2\eta_1}D^2 + \frac{\eta_1 G^2}{2}\sum_{t=1}^T t^{\alpha-1/2},
\end{aligned} \qquad (14)$$

where we let $w_0 = 0$, since $w_0$ and $\eta_0$ exist only in the analysis and are not used in the algorithm.

We then obtain the claimed bound by dividing both sides by $\sum_{t=1}^T t^\alpha$ and using the following standard calculus facts, which will be used frequently in this paper:

$$\begin{aligned}
&\sum_{s=1}^S s^\alpha \ge \int_0^S x^\alpha\,dx = \frac{1}{\alpha+1}S^{\alpha+1}, &&\forall\alpha > 0, &&(5\text{-}1)\\
&\sum_{s=1}^S s^\alpha \le \int_1^{S+1} x^\alpha\,dx \le \frac{(S+1)^{\alpha+1}}{\alpha+1}, &&\forall\alpha > 0, &&(5\text{-}2)\\
&\sum_{s=1}^S s^{\alpha-1} \le S\cdot S^{\alpha-1} = S^\alpha, &&\forall\alpha \ge 1, &&(5\text{-}3)\\
&\sum_{s=1}^S s^{\alpha-1} \le \int_0^S x^{\alpha-1}\,dx = \frac{S^\alpha}{\alpha}, &&\forall\,0 < \alpha < 1, &&(5\text{-}4)\\
&\sum_{s=1}^S s^{-1} \le \ln S + 1. && &&(5\text{-}5)
\end{aligned} \qquad (15)$$

Plugging (5-1), (5-3) and (5-4) into (14), we get

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] \le \begin{cases}\dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{(2\alpha+1)T^{1/2}}, & 0 \le \alpha < 1/2,\\[2mm] \dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{T^{1/2}}, & \alpha \ge 1/2.\end{cases} \qquad (16)$$

## Appendix C Proof of Theorem 2

Proof. $\{x_t\}$ and $\{x'_t\}$ are two sequences generated by SGD using two datasets $S$ and $S'$ that differ in only one location, starting from the same initial point. $\bar{x}_T$ and $\bar{x}'_T$ are the weighted averages of these two sequences, respectively. We have

$$\bar{x}_T - \bar{x}'_T = \frac{1}{\sum_{t=1}^T w_t}\sum_{t=1}^T w_t(x_t - x'_t). \qquad (17)$$

Applying Jensen’s inequality, we have

$$\mathbb{E}\big[\|\bar{x}_T - \bar{x}'_T\|\big] \le \frac{1}{\sum_{t=1}^T w_t}\sum_{t=1}^T w_t\,\mathbb{E}\big[\|x_t - x'_t\|\big]. \qquad (18)$$

From Theorem 3.8 in Hardt et al. (2015), we have

$$\mathbb{E}\big[\|x_t - x'_t\|\big] \le \frac{2G}{n}\sum_{i=1}^{t-1}\eta_i. \qquad (19)$$

Then,

$$\mathbb{E}\big[\|\bar{x}_T - \bar{x}'_T\|\big] \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T\left[t^\alpha\,\frac{2G}{n}\sum_{i=1}^{t-1}\eta_i\right] \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T\left[t^\alpha\,\frac{4G\eta_1}{n}(t-1)^{1/2}\right] \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T\frac{4G\eta_1}{n}\,t^{1/2+\alpha} \le \frac{4G\eta_1(\alpha+1)(T+1)^{\alpha+3/2}}{n(\alpha+3/2)T^{\alpha+1}}, \qquad (20)$$

where the last inequality is from (5-1) and (5-2).

Then, by the $G$-Lipschitz continuity of $f$, it follows that for any fixed $z$,

$$\mathbb{E}\big[|f(\bar{x}_T; z) - f(\bar{x}'_T; z)|\big] \le \frac{4\eta_1 G^2(\alpha+1)(T+1)^{\alpha+1.5}}{n(\alpha+1.5)T^{\alpha+1}}. \qquad (21)$$

Since this bound holds for all $z$, $S$ and $S'$, we obtain the bound on the uniform stability $\epsilon_{\mathrm{stab}}$.

## Appendix D Proof of Theorem 3

Proof. We have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[F_S(x_t) - F_S(x_*)\big] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big]. \qquad (22)$$

By standard analysis of SGD on strongly convex functions, we have

$$\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big] \le \mathbb{E}\left[\Big(\frac{1}{2\eta_t} - \frac{\lambda}{2}\Big)\|x_t - x_*\|^2 - \frac{1}{2\eta_t}\|x_{t+1} - x_*\|^2 + \frac{\eta_t G^2}{2}\right]. \qquad (23)$$

Multiplying both sides by $w_t = t^\alpha$ and summing from $t = 1$ to $T$, we have

 T∑t=1wtE[f(xt;zit)−f(x∗;zit)]≤T∑t=1[wtE[∥xt−x∗