1 Introduction
In machine learning, the task of learning a predictive model is usually formulated as the following empirical risk minimization (ERM) problem:
(1) $\min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w}) := \frac{1}{n}\sum_{i=1}^{n} f(\mathbf{w};\mathbf{z}_i),$
where $f(\mathbf{w};\mathbf{z})$ is a loss function of the model $\mathbf{w}$ on a data point $\mathbf{z}$, $\mathcal{W}\subseteq\mathbb{R}^d$ is a closed convex set, and $S=\{\mathbf{z}_1,\dots,\mathbf{z}_n\}$ denotes a set of observed data points that are sampled from an underlying unknown distribution $\mathcal{P}$ with support on $\mathcal{Z}$. For a function $h$, $\partial h(\mathbf{w})$ denotes a subgradient of $h$ at $\mathbf{w}$. To solve the ERM problem, stochastic gradient descent (SGD) is usually employed, which updates the solution according to
(2) $\mathbf{w}_{t+1} = \Pi_{\mathcal{W}}\big[\mathbf{w}_t - \eta_t\,\partial f(\mathbf{w}_t;\mathbf{z}_{i_t})\big]$
for $t = 1, \dots, T$, where $i_t \in \{1,\dots,n\}$ is randomly sampled, $\eta_t$ is the step size, and $\Pi_{\mathcal{W}}$ is a projection operator, i.e., $\Pi_{\mathcal{W}}[\mathbf{u}] = \arg\min_{\mathbf{w}\in\mathcal{W}}\|\mathbf{w}-\mathbf{u}\|^2$, where $\|\cdot\|$ denotes the Euclidean norm. After all the iterations, a solution $\hat{\mathbf{w}}_T$ is output as the prediction model.
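As a concrete illustration of the update (2), the following is a minimal sketch of projected SGD. The quadratic loss, the synthetic data, the ball constraint, and all function names are illustrative stand-ins, not the paper's setup:

```python
import numpy as np

def project_ball(u, radius=1.0):
    # Euclidean projection onto the ball {w : ||w|| <= radius},
    # a simple instance of the projection operator in (2).
    norm = np.linalg.norm(u)
    return u if norm <= radius else u * (radius / norm)

def sgd(grad, w0, n, T, eta=0.1, seed=0):
    # Projected SGD: w_{t+1} = Pi_W[w_t - eta_t * g(w_t; z_{i_t})],
    # with a decaying step size eta_t = eta / sqrt(t).
    rng = np.random.default_rng(seed)
    w = w0.copy()
    iterates = []
    for t in range(1, T + 1):
        i = rng.integers(n)  # sample one data index uniformly at random
        w = project_ball(w - eta / np.sqrt(t) * grad(w, i))
        iterates.append(w.copy())
    return iterates

# Toy ERM instance: least squares on synthetic data.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2, 0.1, -0.4])
X = rng.normal(size=(100, 5))
y = X @ w_true + 0.1 * rng.normal(size=100)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]  # per-example gradient
iters = sgd(grad, np.zeros(5), n=100, T=500)
```

The list of iterates returned here is exactly what the averaging schemes discussed below operate on.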
SGD has been widely studied in the literature from different angles, and is commonly employed for solving many big-data machine learning problems. However, the averaging technique that combines all iterative solutions into a single solution is still underexplored. Most studies simply output the last solution or adopt uniform averaging, which computes a uniformly averaged solution based on all iterative solutions, i.e., $\hat{\mathbf{w}}_T = \frac{1}{T}\sum_{t=1}^{T}\mathbf{w}_t$. Nonetheless, both approaches can be unsatisfactory despite their respective benefits. The last solution can have faster convergence in terms of optimization error in practice, but is less stable Bottou (2010); Hardt et al. (2015). Uniform averaging can improve the stability of the output solution, but may slow down the convergence of the optimization error in some cases Hazan et al. (2007); Hazan and Kale (2014). Although some nonuniform averaging schemes have been considered in the literature Rakhlin et al. (2011); Shamir and Zhang (2013); Lacoste-Julien et al. (2012), most of their analysis focuses on the effect on the convergence of the optimization error. Their impact on the tradeoff between the optimization error and the generalization error is unclear.
In order to understand this tradeoff, we need to consider the following risk minimization problem, which is of ultimate interest:
(3) $\min_{\mathbf{w}\in\mathcal{W}} F(\mathbf{w}) := \mathbb{E}_{\mathbf{z}\sim\mathcal{P}}\big[f(\mathbf{w};\mathbf{z})\big].$
Regarding an iterative algorithm that outputs a solution $\hat{\mathbf{w}}$ by solving (1), the central concern is its generalization performance, which can be measured by $\mathbb{E}[F(\hat{\mathbf{w}})]$, where the expectation is taken over all randomness in the algorithm and the training dataset $S$. To analyze the generalization performance (referred to as the testing error hereafter) of a random solution $\hat{\mathbf{w}}$, we use the following decomposition of the testing error:
$\mathbb{E}[F(\hat{\mathbf{w}})] = \mathbb{E}[F_S(\hat{\mathbf{w}})] + \mathbb{E}[F(\hat{\mathbf{w}}) - F_S(\hat{\mathbf{w}})],$
where the first term relates to the optimization error of the algorithm and $\mathbb{E}[F(\hat{\mathbf{w}}) - F_S(\hat{\mathbf{w}})]$ denotes the generalization error of the solution $\hat{\mathbf{w}}$.
Most existing analyses of nonuniform averaging are restricted to the convergence of the optimization error for strongly convex objectives. This paper aims to fill the gap by comprehensively analyzing a polynomially increased weighted averaging (PIWA) scheme in which the weight of the solution of iteration $t$ is proportional to $t^\alpha$ ($\alpha \ge 0$). We analyze SGD with PIWA for general convex, strongly convex and nonconvex objective functions in terms of both optimization error and generalization error. We prove that SGD with PIWA has the optimal convergence rate in terms of optimization error in both the general convex and strongly convex cases, i.e., $O(1/\sqrt{T})$ for the convex case and $O(1/T)$ for the strongly convex case. For the nonconvex case, we employ PIWA in a stagewise algorithm, which uses SGD with the averaging for solving a convex subproblem at each stage. We establish a convergence rate in terms of the optimization error for a family of weakly convex functions that satisfy the Polyak–Łojasiewicz condition. Moreover, we analyze the generalization error of SGD with PIWA following the analysis framework in Hardt et al. (2015) that uses the uniform stability tool. We show that SGD with PIWA may have a smaller generalization error than SGD using the last solution. We also show how $\alpha$ affects the optimization error and the generalization error, and thus exhibit the tradeoff caused by $\alpha$. We have also conducted extensive experiments on convex, strongly convex and nonconvex functions. The experimental results demonstrate the tradeoff caused by $\alpha$ and the effectiveness of PIWA compared with other commonly used averaging schemes.
2 Related work
Many works analyze SGD with uniform averaging Polyak (1990); Polyak and Juditsky (1992); Zinkevich (2003); Bottou (2010); Chen et al. (2019). Uniform averaging can help improve stability in terms of generalization Hardt et al. (2015). It can also yield an optimal convergence rate in the convex case Polyak and Juditsky (1992); Zinkevich (2003). However, as it gives equal weight to every solution, it can actually slow down convergence in many cases, since later solutions are usually more accurate than earlier solutions Chen et al. (2019); Shamir and Zhang (2013). For example, in the strongly convex case, the uniformly averaged solution has an $O(\log T/T)$ convergence rate Hazan et al. (2007); Hazan and Kale (2014); Shamir and Zhang (2013), which is suboptimal.
In order to improve the convergence rate for optimizing strongly convex functions, many nonuniform averaging schemes have been proposed Rakhlin et al. (2011); Shamir and Zhang (2013); Lacoste-Julien et al. (2012). Rakhlin et al. (2011) considered suffix averaging, which averages a constant fraction of the last solutions. However, suffix averaging cannot be updated online and is thus computationally expensive. Shamir and Zhang (2013) proposed polynomial-decay averaging for minimizing strongly convex functions. Lacoste-Julien et al. (2012) considered a simple polynomially increased weighted averaging, where the weight for the solution of the $t$-th iteration is proportional to $t$. SGD with these averaging schemes has been shown to achieve the optimal convergence rate of $O(1/T)$ for minimizing a strongly convex objective.
However, these existing works restrict their attention to minimizing a strongly convex objective and only analyze the optimization error. Thus, the theory for these nonuniform averaging schemes in the convex and nonconvex cases is lacking, and their impact on the generalization error is unclear. It should be mentioned that the exponential moving averaging technique has also been widely used Kingma and Ba (2014); Zhang et al. (2015); it maintains a moving average by $\mathbf{v}_t = \beta\mathbf{v}_{t-1} + (1-\beta)\mathbf{w}_t$ with $\beta \in (0,1)$. However, as the weights for previous solutions decay exponentially, its performance is close to that of the last solution. What is more, we are not aware of any existing theoretical guarantee on the performance of moving averaging.
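To make the exponential moving averaging (EMA) behavior concrete, here is a minimal sketch on a hypothetical scalar sequence (the drifting sequence and the function name are illustrative, not from any cited work):

```python
import numpy as np

def ema(iterates, beta=0.9):
    # Exponential moving average: v_t = beta * v_{t-1} + (1 - beta) * w_t.
    # The weight on w_{t-k} is roughly (1 - beta) * beta**k, so it decays
    # exponentially and the average concentrates on recent iterates.
    v = iterates[0]
    for w in iterates[1:]:
        v = beta * v + (1 - beta) * w
    return v

ws = np.linspace(10.0, 0.0, 101)  # a sequence drifting toward 0
print(ema(ws, beta=0.9))          # near the tail of the sequence
print(ws.mean())                  # uniform average sits at 5.0
```

The contrast between the two printed values illustrates the point in the text: EMA tracks the last solutions, while uniform averaging keeps substantial weight on early iterates.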
Studies on the nonconvex case used to analyze a randomly sampled solution Ghadimi and Lan (2013); Yan et al. (2018); Davis and Drusvyatskiy (2018). Recently, stagewise algorithms have enabled the use of averaging for a class of nonconvex functions, namely weakly convex functions Chen et al. (2019); Davis and Grimmer (2017). These stagewise algorithms construct a convex objective function as a subproblem for each stage. By solving these subproblems in a stagewise manner, convergence can be guaranteed for the original problem. Chen et al. (2019) use uniform averaging, so their method may also suffer from slow convergence in practice. Davis and Grimmer (2017) use a fixed weighting of the solutions within each stage. Hence, it may not achieve the best tradeoff between the optimization error and the generalization error.
Hardt et al. (2015) have shown that uniform averaging can improve generalization stability. Yuan et al. (2018) derived a bound on the generalization error for a stagewise algorithm with uniform averaging. The analysis of the generalization error of SGD in this work can be considered an extension of these works Hardt et al. (2015); Yuan et al. (2018), as we analyze the impact of PIWA on the generalization error.
3 Preliminaries
A function $h$ is $G$-Lipschitz continuous if $|h(\mathbf{u}) - h(\mathbf{v})| \le G\|\mathbf{u} - \mathbf{v}\|$ for all $\mathbf{u}, \mathbf{v}$, and is $L$-smooth if it is differentiable and its gradient is $L$-Lipschitz continuous.
A function $h$ is $\mu$-strongly convex for $\mu > 0$ if for all $\mathbf{u}, \mathbf{v}$,
$h(\mathbf{u}) \ge h(\mathbf{v}) + \partial h(\mathbf{v})^{\top}(\mathbf{u} - \mathbf{v}) + \frac{\mu}{2}\|\mathbf{u} - \mathbf{v}\|^2.$
A nonconvex function $h$ is called $\rho$-weakly convex for $\rho > 0$ if $h(\mathbf{w}) + \frac{\rho}{2}\|\mathbf{w}\|^2$ is convex.
For the analysis of the generalization error, we will use the uniform stability tool Bousquet and Elisseeff (2002). The definition of uniform stability is given below.
A randomized algorithm $A$ is called $\epsilon$-uniformly stable if for all datasets $S, S'$ that differ in at most one example, the following holds:
$\sup_{\mathbf{z}} \mathbb{E}_A\big[f(A(S); \mathbf{z}) - f(A(S'); \mathbf{z})\big] \le \epsilon,$
where $A(S)$ denotes the random solution returned by algorithm $A$ based on the dataset $S$.
The relation between uniform stability and generalization error is given in the following lemma (Bousquet and Elisseeff (2002)): if $A$ is $\epsilon$-uniformly stable, then $\big|\mathbb{E}_{S,A}[F_S(A(S)) - F(A(S))]\big| \le \epsilon$.
Therefore, in order to compare the testing error of different randomized algorithms, it suffices to analyze their convergence in terms of optimization error and their uniform stability.
4 Main Theoretical Results
In this section, we analyze SGD with PIWA in terms of both optimization and generalization error. We denote the algorithm by SGD-PIWA and present it in Algorithm 1. We particularly use $t^\alpha$ with $\alpha \ge 0$ to control the averaging weight of the solution at the $t$-th iteration. It should be noticed that SGD with uniform averaging is a special case of this algorithm obtained by taking $\alpha = 0$, and the averaging scheme proposed in Lacoste-Julien et al. (2012) is also a special case obtained by setting $\alpha = 1$. The step size $\eta_t$ at the $t$-th iteration will differ for different classes of functions, as exhibited later.
It is notable that the final averaged solution $\hat{\mathbf{w}}_T = \sum_{t=1}^{T} \frac{t^\alpha}{\sum_{s=1}^{T} s^\alpha}\mathbf{w}_t$ can be computed online by updating an averaged sequence $\hat{\mathbf{w}}_t$:
$\hat{\mathbf{w}}_t = (1 - \rho_t)\hat{\mathbf{w}}_{t-1} + \rho_t \mathbf{w}_t, \quad \text{where } \rho_t = \frac{t^\alpha}{\sum_{s=1}^{t} s^\alpha}.$
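The online update above can be sketched as follows; the iterate sequence is synthetic and the function name is ours, but the recurrence is checked against the direct weighted sum it is supposed to compute:

```python
import numpy as np

def piwa_online(iterates, alpha):
    # Online PIWA: maintain the running weighted average of w_1..w_t with
    # weights proportional to t**alpha, without storing past iterates.
    w_bar, weight_sum = None, 0.0
    for t, w in enumerate(iterates, start=1):
        rho = float(t) ** alpha
        weight_sum += rho
        if w_bar is None:
            w_bar = w.copy()
        else:
            # w_bar_t = (1 - rho/W_t) * w_bar_{t-1} + (rho/W_t) * w_t
            w_bar += (rho / weight_sum) * (w - w_bar)
    return w_bar

rng = np.random.default_rng(0)
ws = [rng.normal(size=3) for _ in range(50)]
alpha = 2.0
direct = sum(t**alpha * w for t, w in enumerate(ws, 1)) \
         / sum(t**alpha for t in range(1, 51))
online = piwa_online(ws, alpha)
assert np.allclose(direct, online)
```

Setting `alpha = 0` recovers uniform averaging, and `alpha = 1` recovers the scheme of Lacoste-Julien et al. (2012).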
Before showing the tradeoff of $\alpha$, we emphasize that most proofs of convergence of the optimization error in existing works on averaging schemes use the following Jensen's inequality to first upper bound the objective gap:
(4) $F_S(\hat{\mathbf{w}}_T) - \min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w}) \le \sum_{t=1}^{T} \frac{t^\alpha}{\sum_{s=1}^{T} s^\alpha}\Big(F_S(\mathbf{w}_t) - \min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w})\Big),$
and then further upper bound the right-hand side (RHS) of (4). In this way, the objective gap, i.e., $F_S(\hat{\mathbf{w}}_T) - \min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w})$, is relaxed twice.
However, it may not be precise to focus only on the effect of $\alpha$ on this two-time relaxation. Instead, to investigate the benefit of PIWA on the convergence of the optimization error, we propose to additionally inspect the first relaxation, i.e., the RHS of (4). Specifically, we present the following lemma to illustrate how $\alpha$ affects the RHS of (4).
Assume $F_S(\mathbf{w}_1) \ge F_S(\mathbf{w}_2) \ge \cdots \ge F_S(\mathbf{w}_T)$. The function
(5) $g(\alpha) = \sum_{t=1}^{T} \frac{t^\alpha}{\sum_{s=1}^{T} s^\alpha} F_S(\mathbf{w}_t)$
is nonincreasing in $\alpha$ for $\alpha \ge 0$.
We can see that, under the condition that the sequence of solutions yields nonincreasing objective values, a larger $\alpha$ in PIWA makes $g(\alpha)$ smaller, which indicates that the RHS of (4) is smaller. The assumption may not hold in practice but, to some degree, it explains the effect of $\alpha$ on the first relaxation of the objective gap.
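The lemma can be illustrated numerically; the nonincreasing sequence of objective values below is a synthetic stand-in:

```python
import numpy as np

def g(alpha, f_vals):
    # Weighted average of objective values with weights proportional to t**alpha,
    # i.e., the function in (5).
    t = np.arange(1, len(f_vals) + 1)
    w = t**alpha
    return np.dot(w, f_vals) / w.sum()

f_vals = 1.0 / np.arange(1, 101)  # a nonincreasing sequence of objective values
alphas = [0.0, 0.5, 1.0, 2.0, 4.0]
vals = [g(a, f_vals) for a in alphas]
assert all(x >= y for x, y in zip(vals, vals[1:]))  # g is nonincreasing in alpha
```

Intuitively, increasing $\alpha$ shifts weight toward later (smaller) objective values, which is exactly why $g$ decreases.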
In the subsequent subsections, we provide a convergence analysis of PIWA in optimization and generalization error under different conditions, i.e., the general convex, strongly convex and nonconvex cases. We reveal how $\alpha$ affects the upper bounds and how $\alpha$ causes a tradeoff between the two errors.
4.1 General Convex Case
4.1.1 Optimization Error
In the general convex case, we need the following assumptions.
Assumption 1
(i) $f(\mathbf{w}; \mathbf{z})$ is a convex function in terms of $\mathbf{w}$ for any $\mathbf{z}$;
(ii) $\|\partial f(\mathbf{w}; \mathbf{z})\| \le G$ for any $\mathbf{w} \in \mathcal{W}$ and $\mathbf{z}$;
(iii) there exists $D > 0$ such that $\|\mathbf{w} - \mathbf{w}'\| \le D$ for any $\mathbf{w}, \mathbf{w}' \in \mathcal{W}$.
Based on Assumption 1, we have the following theorem. Suppose Assumption 1 holds. By setting the step size $\eta_t \propto 1/\sqrt{T}$, we obtain an $O(1/\sqrt{T})$ upper bound on $\mathbb{E}[F_S(\hat{\mathbf{w}}_T)] - \min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w})$, with a constant that grows with $\alpha$.
Remark. The convergence rate for different values of $\alpha$ is of the same order $O(1/\sqrt{T})$, which is optimal Polyak and Juditsky (1992); Zinkevich (2003). One may notice that a larger $\alpha$ yields a worse convergence bound. It, however, does not necessarily indicate a worse optimization error in light of Lemma 4. In practice, there will be a tradeoff among different values of $\alpha$.
4.1.2 Generalization Error
In this subsection, we analyze the uniform stability of SGD-PIWA. Our analysis closely follows the route in Hardt et al. (2015). By bounding $\mathbb{E}\|\hat{\mathbf{w}}_T - \hat{\mathbf{w}}_T'\|$, we bound the generalization error, where $\hat{\mathbf{w}}_T$ is learned by SGD-PIWA on a dataset $S$ and $\hat{\mathbf{w}}_T'$ is learned by SGD-PIWA on a dataset $S'$ that differs from $S$ in at most one example.
Similar to Theorem 3.8 in Hardt et al. (2015), the following lemma provides a bound on the deviation between the two sequences $\{\mathbf{w}_t\}$ and $\{\mathbf{w}_t'\}$ of SGD-PIWA run on $S$ and $S'$ separately.
Assume that the loss function $f(\cdot; \mathbf{z})$ is convex, $G$-Lipschitz and $L$-smooth for every $\mathbf{z}$. Suppose we run SGD-PIWA with step sizes $\eta_t \le 2/L$ for $T$ steps. Then,
(6) 
Suppose Assumption 1 holds, and further assume $f(\cdot; \mathbf{z})$ is a Lipschitz and smooth function. Set the step size $\eta_t$ as in Theorem 1. Then Algorithm 1 has a uniform stability of
(7) 
Remark. When $T$ is large, $\frac{\sum_{t=1}^{T} t^{\alpha+1}}{T\sum_{t=1}^{T} t^{\alpha}} \approx \frac{\alpha+1}{\alpha+2}$. Then we can see that the bound on the generalization error is increasing in $\alpha$. Taking $\alpha = 0$, which reduces PIWA to uniform averaging, gives the smallest generalization error. Even if $\alpha > 0$, the generalization error of PIWA is still smaller than that of the last solution Hardt et al. (2015), since $\frac{\alpha+1}{\alpha+2}$ is bounded by $1$. It is therefore smaller than that of the last solution even when $\alpha$ is very large.
4.2 Strongly Convex Case
In this subsection, we are going to analyze our algorithm for the strongly convex objective function. In this case, we need the following assumptions:
Assumption 2
(i) $f(\mathbf{w}; \mathbf{z})$ is a $\mu$-strongly convex function in $\mathbf{w}$ for any $\mathbf{z}$;
(ii) $\|\partial f(\mathbf{w}; \mathbf{z})\| \le G$ for any $\mathbf{w} \in \mathcal{W}$ and $\mathbf{z}$.
4.2.1 Optimization Error
Suppose Assumption 2 holds. By setting $\eta_t \propto 1/(\mu t)$, we obtain an $O\big(1/(\alpha T)\big)$ bound on $\mathbb{E}[F_S(\hat{\mathbf{w}}_T)] - \min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w})$ for any $\alpha > 0$.
Remark. When $\alpha = 0$, the algorithm degenerates to uniform averaging with an $O(\log T/T)$ convergence rate. By taking $\alpha > 0$, the order is improved to $O(1/T)$. The algorithm in Lacoste-Julien et al. (2012) is a special case with $\alpha = 1$, while we have generalized it to any $\alpha > 0$. Note that when $\alpha$ is close to $0$, although the order is $O(1/T)$, the convergence is not significantly better than that of uniform averaging because of the $\alpha$ in the denominator.
4.2.2 Generalization Error
Then we can obtain the bound on the generalization error in the following theorem. Suppose Assumption 2 holds, and further assume that $f(\cdot; \mathbf{z})$ is Lipschitz and smooth. Then, with the same step-size setting as above, Algorithm 1 has a uniform stability of
(8) 
Remark. This result is similar to Theorem 3.10 in Hardt et al. (2015). Differently, this result depends on $\alpha$: the bound on the generalization error is polynomial in $\alpha$. Again, we see that the generalization error of PIWA is worse than that of uniform averaging. Thus, we cannot take $\alpha$ to be very large in experiments.
4.3 NonConvex Case
In this section, we analyze the optimization and generalization error of SGD-PIWA in the nonconvex case. We need the following assumptions:
Assumption 3
(i) $f(\mathbf{w}; \mathbf{z})$ is $G$-Lipschitz continuous for any $\mathbf{w}$ and $\mathbf{z}$;
(ii) $f(\mathbf{w}; \mathbf{z})$ is $\rho$-weakly convex for any $\mathbf{z}$;
(iii) $f(\cdot; \mathbf{z})$ is smooth;
(iv) for an initial solution $\mathbf{w}_0$, there exists $\epsilon_0 > 0$ such that $F_S(\mathbf{w}_0) - \min_{\mathbf{w}} F_S(\mathbf{w}) \le \epsilon_0$;
(v) $F_S$ satisfies the PL condition, i.e., there exists $\mu > 0$ such that
(9) $\frac{1}{2}\|\nabla F_S(\mathbf{w})\|^2 \ge \mu\big(F_S(\mathbf{w}) - \min_{\mathbf{w}'} F_S(\mathbf{w}')\big).$
These conditions have been used in Karimi et al. (2016); Lei et al. (2017); Bassily et al. (2018); Reddi et al. (2016). The algorithm we use for the nonconvex objective function is shown in Algorithm 2, where the objective of stage $s$ is $F_S(\mathbf{w}) + \frac{1}{2\gamma}\|\mathbf{w} - \mathbf{w}_{s-1}\|^2$ with $\mathbf{w}_{s-1}$ the output of the previous stage. The algorithm uses a stagewise training strategy: the objective function for each stage is the sum of the loss function and an $\ell_2$ regularization term referring to the output of the previous stage. At each stage, the objective function is optimized by SGD-PIWA. The step size within a stage is set to a constant, and is decreased after each stage.
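The stagewise strategy just described can be sketched as follows. This is only a minimal illustration under assumed settings (toy stochastic objective, our own step-size halving and function names), not the paper's Algorithm 2 with its exact constants:

```python
import numpy as np

def sgd_piwa(grad, w0, T, eta, alpha, rng):
    # Inner solver: SGD with PIWA averaging (weights proportional to t**alpha),
    # with the running average maintained online.
    w, w_bar, weight_sum = w0.copy(), w0.copy(), 0.0
    for t in range(1, T + 1):
        w = w - eta * grad(w, rng)
        rho = float(t) ** alpha
        weight_sum += rho
        w_bar += (rho / weight_sum) * (w - w_bar)
    return w_bar

def stagewise_sgd_piwa(stoch_grad, w0, stages=5, T=200, eta0=0.1,
                       gamma=1.0, alpha=2.0, seed=0):
    # Stagewise training: at stage s, run SGD-PIWA on the regularized
    # objective F_S(w) + (1/(2*gamma)) * ||w - w_ref||^2, where w_ref is the
    # averaged output of the previous stage; the constant step size within a
    # stage is decreased (here halved) after each stage.
    rng = np.random.default_rng(seed)
    w_ref = w0.copy()
    for s in range(stages):
        reg_grad = lambda w, r, w_ref=w_ref: stoch_grad(w, r) + (w - w_ref) / gamma
        w_ref = sgd_piwa(reg_grad, w_ref, T, eta0 / (2**s), alpha, rng)
    return w_ref

# Toy nonconvex stochastic objective: noisy gradient of 0.5*(||w||^2 - 1)^2,
# whose minimizers form the unit circle.
stoch_grad = lambda w, r: 2 * w * (w @ w - 1.0) + 0.1 * r.normal(size=w.size)
w = stagewise_sgd_piwa(stoch_grad, np.array([2.0, 0.0]))
```

Each stage's quadratic term makes the subproblem convex when the loss is weakly convex, which is the mechanism the convergence analysis relies on.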
4.3.1 Optimization Error
We need the following lemmas.
If $F_S$ satisfies the PL condition, then for any $\mathbf{w}$ we have
(10) $\mathrm{dist}(\mathbf{w}, \mathcal{W}^*)^2 \le \frac{2}{\mu}\big(F_S(\mathbf{w}) - \min_{\mathbf{w}'} F_S(\mathbf{w}')\big),$
where $\mathcal{W}^*$ denotes the set of optimal solutions.
We first have the following lemma for each stage of the algorithm. To this end, we let $\mathbf{w}_t^s$ denote the $t$-th solution at the $s$-th stage. Suppose Assumption 3 holds, and apply SGD-PIWA to the stage objective with a constant step size; further assume that the solution of each iteration in the $s$-th stage is projected onto a Euclidean ball centered at the stage's initial solution. Then, with high probability, the objective gap of the averaged stage solution is bounded, where the bound involves an upper bound on the stage objective values, which exists and can be set accordingly.
Remark. Combining the above result with Lemma 4.3.1, we can see that the distance from the averaged solution to the optimal point is also bounded, which indicates that if the optimal point is within the constraint ball of the initial solution, then after this stage the optimal point is still within a constraint ball around the averaged solution with high probability. In the following theorem, we make this bounding ball smaller after each stage.
Suppose Assumption 3 holds, and $f(\cdot; \mathbf{z})$ is $\rho$-weakly convex. Then, by appropriately setting the stagewise step sizes and the number of iterations per stage, after $S$ stages, with high probability we have
Remark. It is easy to see that the total iteration complexity is of the same order as in Yuan et al. (2018), which uses uniform averaging.
4.3.2 Generalization Error
We now establish the uniform stability in the following theorem. Under the same assumptions and settings as Theorem 4.3.1, we have
Remark. We apply a conditional analysis for the nonconvex objective function similar to that in Hardt et al. (2015). In particular, we condition on the event that the differing example is only used within the last stage, and prove the bound under this event. We can see the bound is increasing in $\alpha$.
5 Experiments
In this section, we demonstrate the effectiveness of PIWA. First, we compare PIWA with different averaging baselines for convex, strongly convex and nonconvex objective functions. The baselines are the last solution, uniform averaging and moving averaging. The goal is to show the desirable generalization performance of PIWA. Second, to reveal why PIWA achieves better generalization performance, we show that there is a tradeoff of $\alpha$ between the optimization and generalization error by comparing different variants of PIWA. We set various values of $\alpha$ to show how it affects the optimization and generalization error for convex, strongly convex and nonconvex objective functions.
Settings. For a dataset $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector and $y_i \in \{+1, -1\}$ is the label. The hinge loss $f(\mathbf{w}; \mathbf{x}, y) = \max(0, 1 - y\mathbf{w}^{\top}\mathbf{x})$ is used for the general convex case. For the strongly convex case, we add an $\ell_2$-norm regularization, i.e., $\frac{\lambda}{2}\|\mathbf{w}\|^2$, into the above objective. For the nonconvex case, we learn a ResNet-20 network with the softmax loss He et al. (2016). We compare PIWA with the last solution, uniform averaging and exponential moving averaging. Datasets. For the convex and strongly convex cases, we perform experiments on covtype, splice, svmguide1 and skin-nonskin from the LIBSVM binary classification datasets Chang and Lin (2011). For covtype and skin-nonskin, as no testing sets were provided, we randomly sampled 50,000 data points as testing data. For the other datasets used in the convex/strongly convex experiments, we use the training/testing protocol provided with the LIBSVM data Chang and Lin (2011). For the nonconvex case, we use the CIFAR-10 and CIFAR-100 datasets Krizhevsky et al. (2014).
Parameters. In the convex and strongly convex cases, the initial step size is tuned over a grid of values; in the strongly convex case, the regularization weight $\lambda$ is likewise tuned. In the convex and strongly convex cases, we sample one instance at each iteration. In the nonconvex case, the initial step size and $\gamma$ are tuned over grids. The number of iterations is set to 40k for the first stage and 20k for the second stage, which is the same as in He et al. (2016). The step size is decayed by a constant factor of 10 after each stage. For the exponential moving average, the decay coefficient is tuned following Kingma and Ba (2014); Zhang et al. (2015). The batch size used in the nonconvex case is 128. For the experiments on all three cases, $\alpha$ is tuned over a grid.
Results. The experimental results are shown in Figures 1, 2 and 3. For the convex case (Figure 1) and the strongly convex case (Figure 2), we plot the curves of the objective values and the testing error. Note that the testing error in the experiments refers to the misclassification rate. For the nonconvex case (Figure 3), we plot the curves of the training error and the testing error. In all three figures, the first two columns compare PIWA with the three baselines. The last two columns compare variants of PIWA with different values of $\alpha$ to demonstrate the tradeoff of $\alpha$.
From the first two columns of the three figures, we can see that PIWA often achieves the best testing error among all the averaging schemes, even though PIWA does not always outperform the other averaging baselines in training. In addition, PIWA tends to make the output solution stable, since the curves of PIWA rarely fluctuate dramatically. In contrast, the other baselines often return unstable solutions. In particular, the curves of the last-solution method often exhibit sharp fluctuations, especially in the convex and strongly convex settings, where we sample one instance per iteration. By comparison, we use a batch size of 128 in the nonconvex setting, where the last-solution method stays more stable.
The last two columns of the three figures show the tradeoff of $\alpha$ between the optimization and generalization error. Specifically, a larger $\alpha$ usually leads to faster training convergence, but it may make the solution more unstable. On the other hand, a larger $\alpha$ often leads to larger testing error, even when the training performance is better.
6 Conclusion
In this paper, we have comprehensively analyzed SGD with PIWA in terms of both optimization error and generalization error for convex, strongly convex and nonconvex problems. We have shown in theory why PIWA with a proper $\alpha$ can improve the optimization error. We have also shown that in PIWA a larger $\alpha$ usually leads to a worse generalization error. Thus, there is a tradeoff caused by $\alpha$ between optimization error and generalization error. Experiments on benchmark datasets have demonstrated this tradeoff and the effectiveness of PIWA compared with other averaging schemes.
References
Bassily et al. (2018). On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564.
Bolte et al. (2017). From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165(2), pp. 471–507.
Bottou (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186.
Bousquet and Elisseeff (2002). Stability and generalization. Journal of Machine Learning Research 2, pp. 499–526.
Chang and Lin (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 27.
Chen et al. (2019). Universal stagewise learning for non-convex problems with convergence on averaged solutions. In International Conference on Learning Representations.
Davis and Drusvyatskiy (2018). Stochastic subgradient method converges at the rate O(k^{-1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988.
Davis and Grimmer (2017). Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arXiv preprint arXiv:1707.03505.
Ghadimi and Lan (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), pp. 2341–2368.
Hardt et al. (2015). Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240.
Hazan et al. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning 69(2–3), pp. 169–192.
Hazan and Kale (2014). Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research 15(1), pp. 2489–2512.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Karimi et al. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky et al. (2014). The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html.
Lacoste-Julien et al. (2012). A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002.
Lei et al. (2017). Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2348–2358.
Polyak (1990). A new method of stochastic approximation type. Avtomatika i Telemekhanika, pp. 98–107.
Polyak and Juditsky (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4), pp. 838–855.
Rakhlin et al. (2011). Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647.
Reddi et al. (2016). Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323.
Shamir and Zhang (2013). Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In International Conference on Machine Learning, pp. 71–79.
Yan et al. (2018). A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), pp. 2955–2961.
Yuan et al. (2018). Stagewise training accelerates convergence of testing error over SGD. arXiv preprint arXiv:1812.03934v3.
Zhang et al. (2015). Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pp. 685–693.
Zinkevich (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936.
Appendix A Proof of Lemma 2
Proof. To show that $g(\alpha)$ is nonincreasing in $\alpha$, it suffices to show that $g'(\alpha) \le 0$.
(11) 
where the inequality follows from the assumption that $F_S(\mathbf{w}_1) \ge \cdots \ge F_S(\mathbf{w}_T)$.
Appendix B Proof of Theorem 1
Proof. We have
(12) 
where $\mathbf{z}_{i_t}$ denotes the data point sampled in the $t$-th iteration.
According to standard analysis of SGD on convex functions, we have
(13) 
Multiplying both sides by $t^\alpha$ and taking the summation from $t = 1$ to $T$, we have
(14) 
where the introduced auxiliary variables only appear in the analysis and are not used in the algorithm.
Then we can obtain the claimed bound by dividing both sides by $\sum_{t=1}^{T} t^\alpha$ and using the following standard calculus facts, which will be used frequently in this paper:
(15) $\frac{T^{\alpha+1}}{\alpha+1} \le \sum_{t=1}^{T} t^{\alpha} \le \frac{(T+1)^{\alpha+1}}{\alpha+1}.$
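These integral-comparison bounds on $\sum_{t=1}^T t^\alpha$ can be sanity-checked numerically with a short script (a check of the stated inequalities, not part of the proof):

```python
# Check: T^(a+1)/(a+1) <= sum_{t=1}^T t^a <= (T+1)^(a+1)/(a+1) for a >= 0,
# which follows by comparing the sum with the integral of x^a.
for alpha in (0.0, 0.5, 1.0, 3.0):
    for T in (1, 10, 1000):
        s = sum(t**alpha for t in range(1, T + 1))
        assert T**(alpha + 1) / (alpha + 1) <= s <= (T + 1)**(alpha + 1) / (alpha + 1)
```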
Plugging (51), (53) and (54) into (14), we get
(16) 
Appendix C Proof of Theorem 2
Proof. $\{\mathbf{w}_t\}$ and $\{\mathbf{w}_t'\}$ are two sequences generated by SGD using two datasets that differ in only one example, starting from the same initial point. $\hat{\mathbf{w}}_T$ and $\hat{\mathbf{w}}_T'$ are the weighted averages of these two sequences, respectively.
(17) 
Applying Jensen’s inequality, we have
(18) 
From Theorem 3.8 in Hardt et al. (2015), we have
(19) 
Then,
(20) 
where the last inequality follows from (51) and (52).
Then, by the Lipschitz continuity of $f$, it follows that for any fixed $\mathbf{z}$,
(21) 
Since this bound holds for all $S$, $S'$ and $\mathbf{z}$, we obtain the claimed bound on the uniform stability.
Appendix D Proof of Theorem 3
Proof. We have
(22) 
By the standard analysis of SGD on strongly convex functions, we have
(23) 
Multiplying both sides by $t^\alpha$ and taking the summation from $t = 1$ to $T$, we have