1 Introduction
Modern machine learning models are typically trained with iterative stochastic first-order methods [7, 32, 12, 24, 11, 6]. Stochastic gradient descent (SGD) and related methods such as Adagrad [7] or Adam [12] compute the gradient with respect to one or a mini-batch of training examples in each iteration and take a descent step using this gradient. Since these methods use only a small part of the data in each iteration, they are the preferred way for training models on large datasets. However, in order to converge to the solution, these methods require the step-size to decay to zero as the number of iterations grows. This implies that the gradient descent procedure takes smaller steps as the training progresses. Consequently, these methods result in slow sublinear rates of convergence. Specifically, if $k$ is the number of iterations, then SGD-like methods achieve convergence rates of $O(1/k)$ and $O(1/\sqrt{k})$ for strongly-convex and convex functions respectively [16]. In practice, these methods are augmented with some form of momentum or acceleration [20, 18] that results in faster empirical convergence [28]. Recently, there has been some theoretical analysis of such acceleration in the stochastic setting [5]. Other related work includes algorithms specifically designed to achieve an accelerated rate of convergence in the stochastic setting [1, 13, 8].
Another recent trend in the literature has been to use variance-reduction techniques [24, 11, 6] that exploit the finite-sum structure of the loss function in machine-learning applications. These methods do not require the step-size to decay to zero and are able to achieve the optimal rate of convergence. However, they require additional bookkeeping [24, 6] or need to compute the full gradient periodically [11], both of which are difficult in the context of training complex models on large datasets.

In this paper, we take further advantage of the optimization properties specific to modern machine learning models. In particular, we make use of the fact that models such as non-parametric regression or over-parameterized deep neural networks are expressive enough to fit or interpolate the training dataset completely [33, 15]. For an SGD-like algorithm, this implies that the gradient with respect to each training example converges to zero at the optimal solution. This property of interpolation is also true for boosting [23] and for simple linear classifiers on separable data. For example, the perceptron algorithm [22] was first shown to converge to the optimal solution under a linear separability assumption on the data [19]. This assumption implies that the linear perceptron is able to fit the complete dataset.

There has been some related work that takes advantage of the interpolation property in order to obtain faster rates of convergence for SGD [25, 15, 4]. Specifically, Schmidt and Le Roux [25] assume a strong growth condition on the stochastic gradients. This condition relates the norms of the stochastic gradients to that of the full gradient. Under this assumption, they prove that constant step-size SGD can attain the same convergence rates as full gradient descent in both the strongly-convex and convex cases. Other related work has used the strong growth condition to prove convergence rates for incremental gradient methods [27, 29]. Ma et al. [15] show that, under weaker conditions, SGD with a constant step-size results in linear convergence for strongly-convex functions. They also investigate the effect of batch-size on the convergence and theoretically justify the linear-scaling rule used for training deep learning models in practice [10]. Recently, Cevher and Vũ showed the linear convergence of proximal stochastic gradient descent under a weaker growth condition for restricted strongly convex functions [4]. They also analyse the effect of an additive error term on the convergence rate.

In contrast to the above-mentioned work, we first show that the strong growth condition (SGC) [25] implies that SGD with a constant step-size and Nesterov momentum [18] achieves the accelerated convergence rate of the deterministic setting for both strongly-convex and convex functions (Section 3). Our result gives some theoretical justification behind the empirical success of using Nesterov acceleration with SGD [28]. In Section 4, we prove that under the SGC, constant step-size SGD is able to find a first-order stationary point as efficiently as deterministic gradient descent. To the best of our knowledge, this is the first work to study accelerated and non-convex rates under the SGC. Next, we relax the strong growth condition to a more practical weak growth condition (WGC). In Section 5, we prove that the weak growth condition is sufficient to obtain the optimal convergence of constant step-size SGD for smooth strongly-convex and convex functions. To demonstrate the applicability of our growth conditions in practice, we first show that for models interpolating the data, the WGC is satisfied for all smooth loss functions with a finite-sum structure (Section 6.1). Furthermore, we prove that functions satisfying the WGC and the Polyak-Łojasiewicz inequality [21] also satisfy the SGC. Under additional assumptions, we show that the SGC is also satisfied by the squared-hinge loss. This result enables us to prove a mistake bound for $k$ iterations of the stochastic perceptron algorithm using the squared-hinge loss (Section 7). Finally, in Section 8, we evaluate our claims with experiments on synthetic and real datasets.
2 Background
In this section, we give the required background and set up the necessary notation. Our aim is to minimize a differentiable function $f$. Depending on the context, this function can be strongly-convex, convex or non-convex. We assume that we have access to noisy gradients of $f$ and use stochastic gradient descent (SGD) in order to minimize it. The SGD update rule at iteration $k$ can be written as:
$$ w_{k+1} = w_k - \eta_k \left( \nabla f(w_k) + \xi_k \right). $$
Here, $w_k$ and $w_{k+1}$ are the SGD iterates, $\xi_k$ is the gradient noise and $\eta_k$ is the step-size at iteration $k$. We assume that the gradients are unbiased, implying that $\mathbb{E}\left[\xi_k \mid w_k\right] = 0$ for all $k$.
While most of our results apply to general SGD methods, a subset of our results rely on the function having a finite-sum structure, meaning that $f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$. In the context of supervised machine learning, given a training dataset of $n$ points $\{(x_i, y_i)\}_{i=1}^{n}$, the term $f_i(w)$ corresponds to the loss function for point $i$ when the model parameters are equal to $w$. Here, $x_i$ and $y_i$ refer to the feature vector and label for point $i$ respectively. Common choices of the loss function include the squared loss, where $f_i(w) = \left(\langle w, x_i \rangle - y_i\right)^2$; the hinge loss, where $f_i(w) = \max\left(0, 1 - y_i \langle w, x_i \rangle\right)$; and the squared-hinge loss, where $f_i(w) = \left[\max\left(0, 1 - y_i \langle w, x_i \rangle\right)\right]^2$. The finite-sum setting includes both simple models such as logistic regression or least squares and more complex models like non-parametric regression and deep neural networks.
In the finite-sum setting, SGD consists of choosing a point and its corresponding loss function (typically uniformly) at random and evaluating the gradient with respect to that function. It then performs a gradient descent step:
$$ w_{k+1} = w_k - \eta_k \nabla f_{i_k}(w_k), $$
where $f_{i_k}$ is the random loss function selected at iteration $k$. The unbiasedness property is automatically satisfied in this case, i.e., $\mathbb{E}_{i}\left[\nabla f_i(w)\right] = \nabla f(w)$ for all $w$. Note that in this case, the random selection of points for computing the gradient is the source of the noise $\xi_k$. In order to converge to the optimum, SGD requires the step-size $\eta_k$ to decrease with $k$; specifically at a rate of $O(1/\sqrt{k})$ for convex functions and at a rate of $O(1/k)$ for strongly-convex functions. Decreasing the step-size with $k$ results in sublinear rates of convergence for SGD.
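As an illustration (not code from the paper), a minimal sketch of the finite-sum SGD loop described above: an index is sampled uniformly at random and a gradient step is taken on the corresponding loss. The helper `grad_fi` and the least-squares usage below are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of constant step-size SGD in the finite-sum setting.
# `grad_fi(w, i)` is assumed to return the gradient of the i-th loss at w.
import numpy as np

def sgd(grad_fi, w0, n, step_size, num_iters, rng=None):
    """Run SGD for `num_iters` iterations with a constant step-size."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = w0.copy()
    for _ in range(num_iters):
        i = rng.integers(n)                  # uniformly sampled index
        w = w - step_size * grad_fi(w, i)    # gradient step on the sampled loss
    return w

# Hypothetical usage: least-squares losses f_i(w) = (<w, x_i> - y_i)^2 on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
grad_fi = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]
w_hat = sgd(grad_fi, np.zeros(5), n=100, step_size=0.01, num_iters=5000)
```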
In order to derive convergence rates, we need to make additional assumptions about the function $f$ [16]. Beyond differentiability, our results assume that the function satisfies some or all of the following common assumptions. For all points $w$ and $v$, and for constants $f^*$, $\mu$, and $L$:
(Bounded below) $\quad f(w) \geq f^*$
(Convexity) $\quad f(v) \geq f(w) + \langle \nabla f(w), v - w \rangle$
($\mu$ Strong-convexity) $\quad f(v) \geq f(w) + \langle \nabla f(w), v - w \rangle + \frac{\mu}{2} \left\| v - w \right\|^2$
($L$ Smoothness) $\quad f(v) \leq f(w) + \langle \nabla f(w), v - w \rangle + \frac{L}{2} \left\| v - w \right\|^2$
Note that some of our results in Section 6 rely on the finite-sum structure, and we explicitly state when we need this additional assumption.
In this paper, we consider the case where the model is able to interpolate or fit the labelled training data completely. This is true for expressive models such as non-parametric regression and over-parametrized deep neural networks. For common loss functions that are lower-bounded by zero, interpolating the data results in zero training loss. Interpolation also implies that the gradient with respect to each point converges to zero at the optimum. Formally, in the finite-sum setting, if the function $f$ is minimized at $w^*$, i.e., if $\nabla f(w^*) = 0$, then for all functions $f_i$, $\nabla f_i(w^*) = 0$.
The strong growth condition (SGC) relates the rate at which the stochastic gradients shrink to that of the full gradient. Formally, for any point $w$ and noise random variable $\xi$, the function $f$ satisfies the strong growth condition with constant $\rho$ if
$$ \mathbb{E}_{\xi} \left\| \nabla f(w) + \xi \right\|^2 \leq \rho \left\| \nabla f(w) \right\|^2. \qquad (1)$$
Equivalently, in the finite-sum setting,
$$ \frac{1}{n} \sum_{i=1}^{n} \left\| \nabla f_i(w) \right\|^2 \leq \rho \left\| \nabla f(w) \right\|^2. \qquad (2)$$
For this inequality to hold, if $\nabla f(w) = 0$, then $\nabla f_i(w) = 0$ for all $i$. Thus, functions satisfying the SGC necessarily satisfy the interpolation property above. Schmidt and Le Roux [25] derive optimal convergence rates for constant step-size SGD under this condition for both convex and strongly-convex functions. In the next section, we show that the SGC implies the accelerated rate of convergence for constant step-size SGD with Nesterov momentum.
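To make the condition concrete, the following sketch (not from the paper; all dataset sizes and values are illustrative) numerically evaluates the ratio $\mathbb{E}_i \|\nabla f_i(w)\|^2 / \|\nabla f(w)\|^2$ for the squared-hinge loss on a toy linearly separable dataset; the SGC asks for this ratio to be bounded by a constant $\rho$ at every point.

```python
# Empirically estimating the SGC ratio for the squared-hinge loss on toy separable data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)              # labels are separable by construction

def grad_fi(w, i):
    """Gradient of the squared-hinge loss f_i(w) = max(0, 1 - y_i <w, x_i>)^2."""
    margin = 1.0 - y[i] * (X[i] @ w)
    if margin <= 0:
        return np.zeros_like(w)
    return -2.0 * margin * y[i] * X[i]

def sgc_ratio(w):
    grads = np.array([grad_fi(w, i) for i in range(n)])
    full_grad = grads.mean(axis=0)
    return np.mean(np.sum(grads ** 2, axis=1)) / np.sum(full_grad ** 2)

w = rng.normal(size=d)
print("empirical SGC ratio at a random point:", sgc_ratio(w))
```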
3 SGD with Nesterov acceleration under the strong growth condition
We first describe constant step-size SGD with Nesterov acceleration. The algorithm maintains three sequences ($w_k$, $y_k$, $v_k$) that are updated in each iteration [17]. Specifically, it consists of the following update rules:
$$ y_k = \alpha_k v_k + (1 - \alpha_k) w_k \qquad (3)$$
$$ w_{k+1} = y_k - \eta \left( \nabla f(y_k) + \xi_k \right) \qquad (4)$$
$$ v_{k+1} = \beta_k v_k + (1 - \beta_k) y_k - \gamma_k \left( \nabla f(y_k) + \xi_k \right) \qquad (5)$$
Here, $\eta$ is the constant step-size for the SGD step and $\alpha_k$, $\beta_k$, $\gamma_k$ are tunable parameters to be set according to the properties of $f$.
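As an illustration, the sketch below implements one iteration of the three-sequence scheme in Equations (3)-(5). The parameter schedules $\alpha_k$, $\beta_k$, $\gamma_k$, the step-size $\eta$, and the toy usage are placeholders, not the settings prescribed by Theorems 1 and 2.

```python
# A minimal sketch of one stochastic Nesterov acceleration step (Equations 3-5).
import numpy as np

def acc_sgd_step(w, v, stochastic_grad, eta, alpha_k, beta_k, gamma_k):
    """One accelerated SGD step; stochastic_grad(y) returns an unbiased gradient at y."""
    y = alpha_k * v + (1.0 - alpha_k) * w                     # extrapolation point, Eq. (3)
    g = stochastic_grad(y)                                    # noisy gradient at y
    w_next = y - eta * g                                      # SGD step, Eq. (4)
    v_next = beta_k * v + (1.0 - beta_k) * y - gamma_k * g    # Eq. (5)
    return w_next, v_next

# Hypothetical usage on a toy quadratic with additive Gaussian gradient noise:
w = v = np.zeros(3)
grad = lambda y: y + 0.01 * np.random.default_rng(0).normal(size=y.shape)
w, v = acc_sgd_step(w, v, grad, eta=0.1, alpha_k=0.5, beta_k=0.9, gamma_k=0.1)
```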
In order to derive a convergence rate for the above algorithm under the SGC, we first observe that a form of the SGC is satisfied in the case of coordinate descent [30]. In this case, we choose a coordinate (typically at random) and perform a gradient descent step with respect to that coordinate. The notion of a coordinate in this case is analogous to that of an individual loss function in the finite-sum case. For coordinate descent, a zero gradient at the optimal solution implies that the partial derivative with respect to each coordinate is also equal to zero. This is analogous to the SGC in the finite-sum case, although we note that the results in this section do not require the finite-sum assumption.
We use this analogy formally in order to extend the proof of Nesterov's accelerated coordinate descent [17] and derive convergence rates for the above algorithm under the SGC. This enables us to prove the following theorems (with proofs in Appendices B.1.1 and B.1.3) in both the strongly-convex and convex settings.
Theorem 1 (Strongly convex).
Under $L$-smoothness and $\mu$ strong-convexity, if $f$ satisfies the SGC with constant $\rho$, then SGD with Nesterov acceleration and the following choice of parameters,
results in the following convergence rate:
Theorem 2 (Convex).
Under $L$-smoothness and convexity, if $f$ satisfies the SGC with constant $\rho$, then SGD with Nesterov acceleration and the following choice of parameters,
results in the following convergence rate:
The above theorems show that constant step-size SGD with Nesterov momentum achieves the accelerated rate of convergence, up to a factor involving the SGC constant $\rho$, for both strongly-convex and convex functions.
4 SGD under the strong growth condition
In this section, we show that the SGC results in an improvement over the $O(1/\sqrt{k})$ rate for SGD in the non-convex setting [9]. In particular, we show that under the strong growth condition, constant step-size SGD is able to find a first-order stationary point as efficiently as deterministic gradient descent. We prove the following theorem (with the proof in Appendix B.2).
Theorem 3 (NonConvex).
Under $L$-smoothness, if $f$ satisfies the SGC with constant $\rho$, then SGD with a constant step-size attains the following convergence rate:
The above theorem shows that under the SGC, SGD with a constant step-size can attain the optimal $O(1/k)$ rate for non-convex functions, matching deterministic gradient descent. To the best of our knowledge, this is the first such result for non-convex functions under interpolation-like conditions. Under these conditions, constant step-size SGD has a better convergence rate than algorithms that have recently been proposed to improve on SGD [2, 3]. Our results also provide some theoretical justification for the effectiveness of SGD on non-convex over-parameterized models like deep neural networks.
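As an illustration of the quantity controlled by Theorem 3, the sketch below runs constant step-size SGD on a small non-convex finite-sum problem and tracks $\min_k \|\nabla f(w_k)\|^2$. The sigmoid-type loss, dataset, and step-size are stand-ins, not experiments from the paper.

```python
# Tracking the smallest squared gradient norm along a constant step-size SGD run.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

def grad_fi(w, i):
    """Gradient of the non-convex sigmoid loss f_i(w) = 1 / (1 + exp(y_i <w, x_i>))."""
    s = 1.0 / (1.0 + np.exp(y[i] * (X[i] @ w)))
    return -s * (1.0 - s) * y[i] * X[i]

def full_grad(w):
    return np.mean([grad_fi(w, i) for i in range(n)], axis=0)

w, eta = rng.normal(size=d), 0.5       # illustrative constant step-size
min_sq_norm = np.inf
for k in range(2000):
    min_sq_norm = min(min_sq_norm, np.sum(full_grad(w) ** 2))
    w -= eta * grad_fi(w, rng.integers(n))
print("min_k ||grad f(w_k)||^2 =", min_sq_norm)
```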
5 Weak growth condition
In this section, we relax the strong growth condition to a more practical condition which we refer to as the weak growth condition (WGC). Formally, if the function $f$ is $L$-smooth and is minimized at $w^*$, then it satisfies the WGC with constant $\rho$ if, for all points $w$ and noise random variable $\xi$,
$$ \mathbb{E}_{\xi} \left\| \nabla f(w) + \xi \right\|^2 \leq 2 \rho L \left[ f(w) - f(w^*) \right]. \qquad (6)$$
Equivalently, in the finite-sum setting,
$$ \frac{1}{n} \sum_{i=1}^{n} \left\| \nabla f_i(w) \right\|^2 \leq 2 \rho L \left[ f(w) - f(w^*) \right]. \qquad (7)$$
In the above condition, notice that at the minimizer $w^*$ the right-hand side is equal to zero, implying that $\nabla f_i(w^*) = 0$ for all points $i$. Thus, the WGC implies the interpolation property explained in Section 2.
5.1 Relation between WGC and SGC
In this section, we relate the two growth conditions. We first prove that the SGC implies the WGC with the same constant $\rho$ without any additional assumptions, formally showing that the WGC is indeed weaker than the corresponding SGC. For the converse, a function satisfying the WGC satisfies the SGC with a worse constant if it also satisfies the Polyak-Łojasiewicz (PL) inequality [21]. The above relations are captured by the following proposition, proved in Appendix B.5.
Proposition 1.
If $f$ is $L$-smooth, satisfies the WGC with constant $\rho$ and the PL inequality with constant $\mu$, then it satisfies the SGC with constant $\frac{\rho L}{\mu}$.
Conversely, if $f$ is $L$-smooth and satisfies the SGC with constant $\rho$, then it also satisfies the WGC with the same constant $\rho$.
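The chain of inequalities behind both directions is short. A sketch is given below (the formal proof is in Appendix B.5); it uses the PL inequality $\|\nabla f(w)\|^2 \geq 2\mu \left[f(w) - f(w^*)\right]$ and the standard smoothness consequence $\|\nabla f(w)\|^2 \leq 2L \left[f(w) - f(w^*)\right]$, both of which are assumptions of the proposition rather than statements from the text above.

```latex
% Forward direction: WGC + PL imply the SGC with constant \rho L / \mu.
\begin{align*}
\mathbb{E}_{i} \left\| \nabla f_i(w) \right\|^2
  \;\leq\; 2 \rho L \left[ f(w) - f(w^*) \right]
  \;\leq\; 2 \rho L \cdot \frac{\left\| \nabla f(w) \right\|^2}{2 \mu}
  \;=\; \frac{\rho L}{\mu} \left\| \nabla f(w) \right\|^2 .
\end{align*}
% Converse: SGC and L-smoothness imply the WGC with the same constant \rho.
\begin{align*}
\mathbb{E}_{i} \left\| \nabla f_i(w) \right\|^2
  \;\leq\; \rho \left\| \nabla f(w) \right\|^2
  \;\leq\; 2 \rho L \left[ f(w) - f(w^*) \right] .
\end{align*}
```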
5.2 SGD under the weak growth condition
Using the WGC, we obtain the following convergence rates for SGD with a constant step-size.
Theorem 4 (Stronglyconvex).
Under $L$-smoothness and $\mu$ strong-convexity, if $f$ satisfies the WGC with constant $\rho$, then SGD with a constant step-size achieves the following rate:
Theorem 5 (Convex).
Under $L$-smoothness and convexity, if $f$ satisfies the WGC with constant $\rho$, then SGD with a constant step-size and iterate averaging achieves the following rate:
Here, $\bar{w}_k$ is the averaged iterate after $k$ iterations.
In these cases, the WGC is sufficient to show that constant step-size SGD can attain the deterministic rates up to a factor of $\rho$. Since this condition is weaker than the corresponding strong growth condition, our results subsume the SGC results [25]. In the next section, we characterize the functions satisfying the growth conditions in practice.
6 Growth conditions in practice
In this section, we give examples of functions that satisfy the weak and strong growth conditions. In Section 6.1, we first show that for models interpolating the data, the WGC is satisfied by all smooth loss functions with a finite-sum structure. In Section 6.2, we show that the SGC is satisfied by the squared-hinge loss under additional assumptions.
6.1 Functions satisfying WGC
To characterize the functions satisfying the WGC, we first prove the following proposition (with the proof in Appendix B.6):
Proposition 2.
If the function $f$ has a finite-sum structure for a model that interpolates the data, and $L_{\max}$ is the maximum smoothness constant amongst the functions $f_i$, then for all $w$,
$$ \frac{1}{n} \sum_{i=1}^{n} \left\| \nabla f_i(w) \right\|^2 \leq 2 L_{\max} \left[ f(w) - f(w^*) \right]. \qquad (8)$$
Comparing the above equation to Equation 7, we see that any smooth finite-sum problem under interpolation satisfies the WGC with $\rho = L_{\max}/L$. The WGC is thus satisfied by common loss functions such as the squared and squared-hinge losses. For these loss functions, if the overall function is strongly-convex, Theorem 4 implies that SGD with a constant step-size results in linear convergence. This matches the recently proved result of Ma et al. [15], whereas Theorem 5 allows us to generalize their result beyond strongly-convex functions.
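The proposition follows by combining the $L_{\max}$-smoothness of each $f_i$ with the fact that, under interpolation, $w^*$ minimizes every individual $f_i$; a sketch of this chain of inequalities is given below (the formal proof is in Appendix B.6).

```latex
% Each f_i is L_max-smooth, so ||\nabla f_i(w)||^2 <= 2 L_max [ f_i(w) - min_v f_i(v) ].
% Under interpolation, w^* minimizes every f_i, so min_v f_i(v) = f_i(w^*).
% Averaging over i then gives Equation (8):
\begin{align*}
\frac{1}{n} \sum_{i=1}^{n} \left\| \nabla f_i(w) \right\|^2
  \;\leq\; \frac{2 L_{\max}}{n} \sum_{i=1}^{n} \left[ f_i(w) - f_i(w^*) \right]
  \;=\; 2 L_{\max} \left[ f(w) - f(w^*) \right] .
\end{align*}
```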
6.2 Functions satisfying SGC
We now show that under additional assumptions on the data, the squared-hinge loss also satisfies the SGC. We first assume that the data is linearly separable with a margin equal to $\tau$, implying that there exists a unit-norm classifier $w^*$ such that $y \left\langle w^*, x \right\rangle \geq \tau$ for all $x \in \mathcal{S}$. Here, $\mathcal{S}$ is the support of the distribution of the features $x$. Note that the above assumption implies the existence of a linear classifier that attains zero squared-hinge loss on all the points. In addition to this, we assume that the features have a finite support, meaning that the set $\mathcal{S}$ is finite and has a cardinality equal to $c$. Under these assumptions, we prove the following lemma in Appendix B.7.
Lemma 1.
For linearly separable data with margin $\tau$ and a finite support of size $c$, the squared-hinge loss satisfies the SGC with a constant that depends on the margin $\tau$ and the support size $c$.
In the next section, we use the above lemma to prove a mistake bound for the perceptron algorithm using the squared-hinge loss.
7 Implication for Faster Perceptron
In this section, we use the strong growth property of the squared-hinge loss in order to prove a bound on the number of mistakes made by the perceptron algorithm [22] when using the squared-hinge loss. The perceptron algorithm is used for training a linear classifier for binary classification and is guaranteed to converge for linearly separable data [19]. It can be viewed as stochastic gradient descent on the loss $f(w) = \mathbb{E}\left[\max\left(0, -y \left\langle w, x \right\rangle\right)\right]$.
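As an illustration (not code from the paper), the classical perceptron written as SGD with unit step-size on the loss above: on a mistake, the (sub)gradient of $\max(0, -y\langle w, x\rangle)$ is $-y x$, so the update adds $y x$ to $w$.

```python
# The perceptron as SGD with step-size 1 on the loss max(0, -y <w, x>).
import numpy as np

def perceptron(X, y, num_epochs=10):
    """X: (n, d) features, y: (n,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in np.random.permutation(len(y)):
            if y[i] * (X[i] @ w) <= 0:       # mistake: nonzero (sub)gradient
                w += y[i] * X[i]             # SGD step on max(0, -y <w, x>)
    return w
```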
The common way to characterize the performance of a perceptron is by bounding the number of mistakes (in the binary classification setting) it makes after $k$ iterations of the algorithm. In other words, we care about the expected number of points that the learned classifier misclassifies. Assuming linear separability of the data and that $\|x\| \leq 1$ for all points, the perceptron achieves a mistake bound of $O(1/\tau^2)$ [19].
In this paper, we consider a modified perceptron algorithm using the squared-hinge function as the loss. Note that since we assume the data to be linearly separable, a linear classifier is able to fit all the training data. Since the squared-hinge loss function is smooth, the conditions of Proposition 2 are satisfied, which implies that it satisfies the WGC. Also observe that since we assume that $\|x\| \leq 1$, the smoothness constant $L_{\max}$ is bounded by a constant. Using these facts with Theorem 5, and assuming that we start the optimization from $w_0 = 0$, SGD with a constant step-size obtains the following convergence rate,
$$ \mathbb{E}\left[ f(\bar{w}_k) \right] = O\!\left( \frac{1}{\tau^2 \, k} \right). $$
To see this, recall that separability with margin $\tau$ guarantees a classifier $w^*$ with $\|w^*\| \leq 1/\tau$ whose loss is equal to zero at the optimum, implying that $f(w^*) = 0$ and $\|w_0 - w^*\|^2 \leq 1/\tau^2$.
The above result gives us a bound on the training loss. We use the following lemma (proved using the Markov inequality in Appendix B.8) to relate the mistake bound to the training loss.
Lemma 2.
If $f_i$ represents the loss on the point $i$, then the probability of making a mistake on point $i$ satisfies $\mathbb{P}\left[\text{mistake on point } i\right] \leq \mathbb{E}\left[ f_i(w) \right]$.
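The reasoning is a direct application of Markov's inequality to the squared-hinge loss; a sketch of our reading of the argument (formalized in Appendix B.8) is given below.

```latex
% A mistake on point i means y_i <w, x_i> <= 0, which forces the squared-hinge
% loss f_i(w) = max(0, 1 - y_i <w, x_i>)^2 to be at least 1. Since f_i(w) >= 0,
% Markov's inequality gives
\begin{align*}
\mathbb{P}\left[\text{mistake on point } i\right]
  \;=\; \mathbb{P}\left[ f_i(w) \geq 1 \right]
  \;\leq\; \mathbb{E}\left[ f_i(w) \right] .
\end{align*}
```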
Combining the above results, we obtain a mistake bound of $O\!\left(\frac{1}{\tau^2 k}\right)$ after $k$ iterations when using the squared-hinge loss on linearly separable data. We thus recover the standard results for the stochastic perceptron.
Note that for a finite amount of data (when the expectation is with respect to a discrete distribution), if we use batch accelerated gradient descent (which is not one of the stochastic gradient algorithms studied in this paper, and for which no growth condition is needed), we obtain a mistake bound that decreases as $O(1/k^2)$. This improves on existing mistake bounds that scale as $O(1/k)$ [26, 31]. Note that both sets of algorithms have the same dependence on the margin $\tau$, but this deterministic accelerated method would require evaluating $n$ gradients in each iteration.
From Lemma 1, we know that the squared-hinge loss satisfies the SGC. Under the same conditions as above, combining this lemma with the result of Theorem 2 yields an accelerated rate of convergence for the training loss. Using the result from Lemma 2, this translates into a mistake bound of the order $O(1/k^2)$ while only requiring one gradient per iteration. Hence, the use of acceleration leads to an improved dependence on the number of iterations $k$, but requires the additional assumptions of Lemma 1 and has a worse dependence on the margin $\tau$.
8 Experiments
Figure 1: Accelerated SGD with a line-search to estimate the relevant constant is stable and results in good performance.

In this section, we empirically validate our theoretical results. For the first set of experiments (Figures 1(a)-1(d)), we generate a synthetic binary classification dataset of $n$ points in dimension $d$. We ensure that the data is linearly separable with a margin $\tau$, thus satisfying the interpolation property for training a linear classifier. We seek to minimize the finite-sum squared-hinge loss, $f(w) = \frac{1}{n} \sum_{i=1}^{n} \left[\max\left(0, 1 - y_i \left\langle w, x_i \right\rangle\right)\right]^2$. In Figure 1, we vary the margin $\tau$ and plot the logarithm of the loss against the number of effective passes (one pass is equal to $n$ iterations of SGD) over the data. In all of our experiments, we estimate the value of the smoothness parameter $L_{\max}$ as the maximum eigenvalue of the Gram matrix of the features.

We evaluate the performance of constant step-size SGD with and without acceleration. Since the squared-hinge loss satisfies the WGC (Proposition 2), we use SGD with a constant step-size based on the estimate of $L_{\max}$ (denoted as SGD(T) in the plots); we note that a different constant step-size led to consistently better results than the value suggested by Theorem 5. For Nesterov acceleration, we experimented with the dependence of the SGC constant $\rho$ on the margin $\tau$. We found a setting of $\rho$ that results in consistently stable yet fast convergence across different choices of $\tau$. We thus use the corresponding step-size and set the tunable parameters in the update Equations (3)-(5) as specified by Theorem 2. We denote this variant of accelerated SGD as AccSGD(T) in the subsequent plots.
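The sketch below illustrates the kind of synthetic setup described above: linearly separable data with a margin $\tau$, with $L_{\max}$ estimated from the Gram matrix of the features. The dataset sizes, the margin value, and the step-size choice are hypothetical and do not reproduce the paper's exact experimental settings.

```python
# Synthetic margin-separable data and a Gram-matrix estimate of the smoothness constant.
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 1000, 20, 0.1                         # illustrative values
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # put features on the unit sphere
y = np.sign(X @ w_star)
keep = np.abs(X @ w_star) >= tau                  # keep only points respecting the margin
X, y = X[keep], y[keep]

# Largest eigenvalue of the Gram matrix (computed via X^T X, which shares the nonzero spectrum).
L_max = np.linalg.eigvalsh(X.T @ X).max()
step_size = 1.0 / L_max                           # illustrative constant step-size for SGD(T)
print(f"kept {len(y)} points, L_max = {L_max:.3f}, step-size = {step_size:.3g}")
```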
In addition, we propose a line-search heuristic to dynamically estimate the constant used by accelerated SGD. Our heuristic is inspired by the line-search used in SAG [24] and can be described as follows: we start with an initial estimate and, in each iteration, we halve the estimate whenever a prescribed condition is not satisfied. We denote this variant as AccSGD(LS) in the plots (a structural sketch of the heuristic is given after the observations below).

In each of Figures 1(a)-1(d), we make the following observations: (i) SGD(T) results in reasonably slow convergence. This observation is in line with other SGD methods that use a similar constant step-size [24]. (ii) AccSGD(T) with the above choice of $\rho$ is consistently stable and, as suggested by the theory, it results in faster convergence as compared to SGD. (iii) AccSGD(LS) either matches or outperforms AccSGD(T). We plan to investigate better line-search methods for both SGD [24] and AccSGD [14] in the future. (iv) For larger values of the margin $\tau$ (Figures 1(a)-1(b)), the training loss becomes equal to zero, verifying the interpolation property.
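A structural sketch of the halving heuristic follows. The constant being estimated and the condition it is checked against are not specified above, so both are left as placeholders here; this is not the paper's implementation.

```python
# Per-iteration halving heuristic: keep the running estimate when the supplied check
# holds, otherwise halve it (mirroring the description in the text above).
def update_estimate(estimate, condition):
    """condition: a callable returning True when the current estimate is acceptable."""
    return estimate if condition(estimate) else 0.5 * estimate

# Hypothetical usage inside the accelerated SGD loop (`check` is a placeholder predicate):
# estimate = initial_estimate
# for k in range(num_iterations):
#     estimate = update_estimate(estimate, lambda e: check(w, e))
#     ...  # accelerated SGD step using the current estimate
```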
The next set of experiments (Figure 2) considers binary classification on the CovType (http://osmot.cs.cornell.edu/kddcup) and Protein (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets) datasets. For these, we train a linear classifier using radial basis function (non-parametric) features. Non-parametric regression models of this form are capable of interpolating the data [15] and thus satisfy our assumptions. We subsample random points from the datasets and use the squared-hinge loss as above. Note that we do not have a good estimate of $\rho$ in this case and only compare the performance of SGD(T) and AccSGD(LS).

From Figures 2(a) and 2(b), we make the following observations: (i) In Figure 2(a), both variants have similar performance. (ii) In Figure 2(b), AccSGD(LS) leads to considerably faster convergence as compared to SGD(T). (iii) Accelerated SGD in conjunction with our line-search heuristic is stable across datasets. These experiments show that in cases where the interpolation property is satisfied, both SGD and accelerated SGD with a constant step-size can result in good empirical performance.
9 Conclusion
In this paper, we showed that under interpolation, the stochastic gradients of common loss functions satisfy specific growth conditions. Under these conditions, we proved that constant step-size SGD (with and without Nesterov acceleration) can achieve the convergence rates of the corresponding deterministic settings. These are the first results achieving optimal rates in the accelerated and non-convex settings under interpolation-like conditions. We used these results to demonstrate the fast convergence of the stochastic perceptron algorithm employing the squared-hinge loss. We showed that both SGD and accelerated SGD with a constant step-size can lead to good empirical performance when the interpolation property is satisfied. As opposed to determining the step-size and the schedule for annealing it, as required by current SGD-like methods, our results imply that under interpolation we only need to automatically determine the constant step-size for SGD. In the future, we hope to develop line-search techniques for automatically determining this step-size for both the accelerated and non-accelerated variants.
10 Acknowledgements
We acknowledge support from the European Research Council (grant SEQUOIA 724063) and the CIFAR program on Learning with Machines and Brains. We also thank Nicolas Flammarion for discussions related to this work.
References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
[3] Yair Carmon, Oliver Hinder, John C Duchi, and Aaron Sidford. Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. arXiv preprint arXiv:1705.02766, 2017.
[4] Volkan Cevher and Bằng Công Vũ. On the linear convergence of the stochastic gradient method with constant step-size. Optimization Letters, pages 1–11, 2018.
 [5] Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. On acceleration with noisecorrupted gradients. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018, pages 1018–1027, 2018.
[6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
 [7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [8] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Unregularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540–2548, 2015.
[9] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[10] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [11] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
 [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
[14] Jun Liu, Jianhui Chen, and Jieping Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556. ACM, 2009.
 [15] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern overparametrized learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018, pages 3331–3340, 2018.
 [16] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
[17] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
 [18] Yurii Nesterov et al. Gradient methods for minimizing composite objective function, 2007.
 [19] Albert B Novikoff. On convergence proofs for perceptrons. Technical report, STANFORD RESEARCH INST MENLO PARK CA, 1963.
 [20] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 [21] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
 [22] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
 [23] Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics, 26(5):1651–1686, 1998.
 [24] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(12):83–112, 2017.
 [25] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
[26] Negar Soheili and Javier Pena. A primal–dual smooth perceptron–von Neumann algorithm. In Discrete Geometry and Optimization, pages 303–320. Springer, 2013.
 [27] Mikhail V Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
 [28] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
 [29] Paul Tseng. An incremental gradient (projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
 [30] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[31] Adams Wei Yu, Fatma Kilinc-Karzan, and Jaime Carbonell. Saddle points and accelerated perceptron algorithms. In International Conference on Machine Learning, pages 1827–1835, 2014.
 [32] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 [33] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Incorporating additive error for Nesterov acceleration
For this section, we assume an additive error in the strong growth condition, implying that the following inequality is satisfied for all $w$ and some $\sigma^2 \geq 0$:
$$ \mathbb{E}_{\xi} \left\| \nabla f(w) + \xi \right\|^2 \leq \rho \left\| \nabla f(w) \right\|^2 + \sigma^2. $$
In this case, we have the following counterparts of Theorems 1 and 2:
Theorem 6 (Strongly convex).
Under $L$-smoothness and $\mu$ strong-convexity, if $f$ satisfies the SGC with constant $\rho$ and an additive error $\sigma^2$, then SGD with Nesterov acceleration and the following choice of parameters,
results in the following convergence rate:
Theorem 7 (Convex).
Under $L$-smoothness and convexity, if $f$ satisfies the SGC with constant $\rho$ and an additive error $\sigma^2$, then SGD with Nesterov acceleration and the following choice of parameters,
results in the following convergence rate:
Appendix B Proofs
B.1 Proofs for SGD with Nesterov Acceleration
Recall the update equations for SGD with Nesterov acceleration:
$$ y_k = \alpha_k v_k + (1 - \alpha_k) w_k, \qquad w_{k+1} = y_k - \eta \left( \nabla f(y_k) + \xi_k \right), \qquad v_{k+1} = \beta_k v_k + (1 - \beta_k) y_k - \gamma_k \left( \nabla f(y_k) + \xi_k \right). $$
Since the stochastic gradients are unbiased, we obtain
$$ \mathbb{E}_{\xi_k} \left[ \nabla f(y_k) + \xi_k \right] = \nabla f(y_k). \qquad (10)$$
For the proof, we consider the more general strong growth condition with an additive error $\sigma^2$:
$$ \mathbb{E}_{\xi} \left\| \nabla f(w) + \xi \right\|^2 \leq \rho \left\| \nabla f(w) \right\|^2 + \sigma^2. \qquad (11)$$
We choose the tunable parameters of the algorithm such that the relations in Equations (12)-(16) are satisfied.
We now prove the following lemma, assuming that the function $f$ is $L$-smooth and $\mu$ strongly-convex.