Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic setting for both convex and strongly-convex functions. In the non-convex setting, this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent. Under interpolation, we also show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate in both the strongly-convex and convex settings. Under additional assumptions, the above results enable us to prove an O(1/k^2) mistake bound for k iterations of a stochastic perceptron algorithm using the squared-hinge loss. Finally, we validate our theoretical findings with experiments on synthetic and real datasets.


1 Introduction

Modern machine learning models are typically trained with iterative stochastic first-order methods [7, 32, 12, 24, 11, 6]. Stochastic gradient descent (SGD) and related methods such as Adagrad [7] or Adam [12] compute the gradient with respect to one or a mini-batch of training examples in each iteration and take a descent step using this gradient. Since these methods use only a small part of the data in each iteration, they are the preferred way to train models on large datasets. However, in order to converge to the solution, these methods require the step-size to decay to zero as the number of iterations grows. This implies that the gradient descent procedure takes smaller steps as training progresses, and consequently these methods attain only slow, sub-linear rates of convergence. Specifically, if k is the number of iterations, then SGD-like methods achieve a convergence rate of O(1/k) and O(1/√k) for strongly-convex and convex functions respectively [16]. In practice, these methods are augmented with some form of momentum or acceleration [20, 18] that results in faster empirical convergence [28]. Recently, there has been some theoretical analysis of the use of such acceleration in the stochastic setting [5]. Other related work includes algorithms specifically designed to achieve an accelerated rate of convergence in the stochastic setting [1, 13, 8].

Another recent trend in the literature has been to use variance-reduction techniques [24, 11, 6] that exploit the finite-sum structure of the loss function in machine-learning applications. These methods do not require the step-size to decay to zero and are able to achieve the optimal rate of convergence. However, they require additional bookkeeping [24, 6] or need to compute the full gradient periodically [11], both of which are difficult in the context of training complex models on large datasets.

In this paper, we take further advantage of the optimization properties specific to modern machine learning models. In particular, we make use of the fact that models such as non-parametric regression or over-parameterized deep neural networks are expressive enough to fit or interpolate the training dataset completely [33, 15]. For an SGD-like algorithm, this implies that the gradient with respect to each training example converges to zero at the optimal solution. This interpolation property also holds for boosting [23] and for simple linear classifiers on separable data. For example, the perceptron algorithm [22] was first shown to converge to the optimal solution under a linear separability assumption on the data [19]. This assumption implies that the linear perceptron is able to fit the complete dataset.

There has been some related work that takes advantage of the interpolation property in order to obtain faster rates of convergence for SGD [25, 15, 4]. Specifically, Schmidt and Le Roux [25] assume a strong growth condition on the stochastic gradients. This condition relates the norms of the stochastic gradients to that of the full gradient. Under this assumption, they prove that constant step-size SGD can attain the same convergence rates as full gradient descent in both the strongly-convex and convex cases. Other related work has used the strong growth condition to prove convergence rates for incremental gradient methods [27, 29]. Ma et al. [15] show that under weaker conditions, SGD with a constant step-size results in linear convergence for strongly-convex functions. They also investigate the effect of batch-size on convergence and theoretically justify the linear-scaling rule used for training deep learning models in practice [10]. Recently, Cevher and Vũ showed the linear convergence of proximal stochastic gradient descent under a weaker growth condition for restricted strongly convex functions [4]. They also analyse the effect of an additive error term on the convergence rate.

In contrast to the above mentioned work, we first show that the strong growth condition (SGC) [25] implies that SGD with a constant step-size and Nesterov momentum [18] achieves the accelerated convergence rate of the deterministic setting for both strongly-convex and convex functions (Section 3). Our result gives some theoretical justification for the empirical success of using Nesterov acceleration with SGD [28]. In Section 4, we prove that under the SGC, constant step-size SGD is able to find a first-order stationary point as efficiently as deterministic gradient descent. To the best of our knowledge, this is the first work to study accelerated and non-convex rates under the SGC. Next, we relax the strong growth condition to a more practical weak growth condition (WGC). In Section 5, we prove that the weak growth condition is sufficient to obtain the optimal convergence of constant step-size SGD for smooth strongly-convex and convex functions.

To demonstrate the applicability of our growth conditions in practice, we first show that for models interpolating the data, the WGC is satisfied by all smooth loss functions with a finite-sum structure (Section 6.1). Furthermore, we prove that functions satisfying the WGC and the Polyak-Łojasiewicz inequality [21] also satisfy the SGC. Under additional assumptions, we show that the SGC is also satisfied by the squared-hinge loss. This result enables us to prove an O(1/k²) mistake bound for k iterations of the stochastic perceptron algorithm using the squared-hinge loss (Section 7). Finally, in Section 8, we evaluate our claims with experiments on synthetic and real datasets.

2 Background

In this section, we give the required background and set up the necessary notation. Our aim is to minimize a differentiable function f. Depending on the context, this function can be strongly-convex, convex or non-convex. We assume that we have access to noisy gradients of the function and use stochastic gradient descent (SGD) for k iterations in order to minimize it. The SGD update rule in iteration k can be written as $w_{k+1} = w_k - \eta_k \nabla f(w_k, z_k)$. Here, $w_k$ and $w_{k+1}$ are the SGD iterates, $z_k$ is the gradient noise and $\eta_k$ is the step-size at iteration k. We assume that the gradients are unbiased, implying that $\mathbb{E}_z[\nabla f(w, z)] = \nabla f(w)$ for all w.

While most of our results apply to general SGD methods, a subset of our results rely on the function having a finite-sum structure, meaning that $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$. In the context of supervised machine learning, given a training dataset of n points, the term $f_i(w)$ corresponds to the loss function for the point i when the model parameters are equal to w. Here, $x_i$ and $y_i$ refer to the feature vector and label for point i respectively. Common choices of the loss function include the squared loss, where $f_i(w) = (\langle x_i, w \rangle - y_i)^2$; the hinge loss, where $f_i(w) = \max(0, 1 - y_i \langle x_i, w \rangle)$; and the squared-hinge loss, where $f_i(w) = \max(0, 1 - y_i \langle x_i, w \rangle)^2$. The finite-sum setting includes both simple models such as logistic regression or least squares and more complex models like non-parametric regression and deep neural networks.

In the finite-sum setting, SGD consists of choosing a point and its corresponding loss function (typically uniformly) at random and evaluating the gradient with respect to that function. It then performs a gradient descent step: $w_{k+1} = w_k - \eta_k \nabla f_{i_k}(w_k)$, where $f_{i_k}$ is the random loss function selected at iteration k. The unbiasedness property is automatically satisfied in this case, i.e. $\mathbb{E}_i[\nabla f_i(w)] = \nabla f(w)$ for all w. Note that in this case, the random selection of points for computing the gradient is the source of the noise z. In order to converge to the optimum, SGD requires the step-size $\eta_k$ to decrease with k; specifically at a rate of O(1/√k) for convex functions and at a rate of O(1/k) for strongly-convex functions. Decreasing the step-size with k results in sub-linear rates of convergence for SGD.
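As a concrete illustration, the finite-sum SGD loop above can be written in a few lines. This is a minimal sketch on a realizable (interpolating) least-squares problem; the function names and constants are ours and purely illustrative:

```python
import numpy as np

def sgd_finite_sum(grad_fi, w0, n, eta, num_iters, seed=0):
    """Constant step-size SGD on f(w) = (1/n) * sum_i f_i(w)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_iters):
        i = rng.integers(n)          # pick a point uniformly at random
        w = w - eta * grad_fi(w, i)  # descent step on the sampled loss
    return w

# Illustrative interpolating problem with squared loss f_i(w) = (x_i^T w - y_i)^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = rng.standard_normal(5)
y = X @ w_true                       # labels are realizable, so interpolation holds
grad = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]
w_hat = sgd_finite_sum(grad, np.zeros(5), n=50, eta=0.02, num_iters=10000, seed=1)
```

Under interpolation the fixed step-size suffices: the iterate approaches `w_true` without any decay schedule, which is exactly the behaviour the growth conditions below formalize.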

In order to derive convergence rates, we need to make additional assumptions about the function f [16]. Beyond differentiability, our results assume that the function satisfies some or all of the following common assumptions. For all points v and w, and for constants $f^*$, μ and L:

$$
\begin{aligned}
f(w) &\ge f^* && \text{(Bounded below)} \\
f(v) &\ge f(w) + \langle \nabla f(w), v - w \rangle && \text{(Convexity)} \\
f(v) &\ge f(w) + \langle \nabla f(w), v - w \rangle + \tfrac{\mu}{2}\|v - w\|^2 && \text{($\mu$ strong-convexity)} \\
f(v) &\le f(w) + \langle \nabla f(w), v - w \rangle + \tfrac{L}{2}\|v - w\|^2 && \text{($L$-smoothness)}
\end{aligned}
$$

Note that some of our results in Section 6 rely on the finite-sum structure and we explicitly state when we need this additional assumption.

In this paper, we consider the case where the model is able to interpolate or fit the labelled training data completely. This is true for expressive models such as non-parametric regression and over-parametrized deep neural networks. For common loss functions that are lower-bounded by zero, interpolating the data results in zero training loss. Interpolation also implies that the gradient with respect to each point converges to zero at the optimum. Formally, in the finite-sum setting, if the function f is minimized at $w^*$, i.e., if $\nabla f(w^*) = 0$, then $\nabla f_i(w^*) = 0$ for all functions $f_i$.

The strong growth condition (SGC) we use relates the rate at which the stochastic gradients shrink to that of the full gradient. Formally, for any point w and noise random variable z, the function f satisfies the strong growth condition with constant ρ if,

$$
\mathbb{E}_z \|\nabla f(w, z)\|^2 \le \rho\, \|\nabla f(w)\|^2. \tag{1}
$$

Equivalently, in the finite-sum setting,

$$
\mathbb{E}_i \|\nabla f_i(w)\|^2 \le \rho\, \|\nabla f(w)\|^2. \tag{2}
$$

For this inequality to hold, if $\nabla f(w^*) = 0$, then $\nabla f_i(w^*) = 0$ for all i. Thus, functions satisfying the SGC necessarily satisfy the above interpolation property. Schmidt and Le Roux [25] derive optimal convergence rates for constant step-size SGD under the above condition for both convex and strongly-convex functions. In the next section, we show that the SGC implies the accelerated rate of convergence for constant step-size SGD with Nesterov momentum.
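To make the condition concrete, one can empirically estimate the smallest admissible ρ by sampling points and measuring the ratio in Equation 2. The following is a minimal sketch on an interpolating least-squares problem; the dataset and all names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 4))
w_star = rng.standard_normal(4)
y = X @ w_star                        # realizable labels: the model interpolates

def per_example_grads(w):
    """Gradients of f_i(w) = (x_i^T w - y_i)^2, stacked into an (n, d) array."""
    return 2.0 * (X @ w - y)[:, None] * X

# Measure E_i ||grad f_i(w)||^2 / ||grad f(w)||^2 at random points; under
# interpolation the SGC says this ratio stays bounded by a finite constant rho.
ratios = []
for _ in range(1000):
    w = w_star + rng.standard_normal(4)
    g = per_example_grads(w)
    full = g.mean(axis=0)             # grad f(w) = (1/n) sum_i grad f_i(w)
    ratios.append(np.mean(np.sum(g * g, axis=1)) / np.sum(full * full))
rho_hat = max(ratios)                 # empirical lower bound on rho
```

By Jensen's inequality every measured ratio is at least 1, so any valid SGC constant satisfies ρ ≥ 1.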

3 SGD with Nesterov acceleration under the strong growth condition

We first describe constant step-size SGD with Nesterov acceleration. The algorithm maintains three sequences $(w_k, \zeta_k, v_k)$ updated in each iteration [17]. Specifically, it consists of the following update rules:

$$
\begin{aligned}
w_{k+1} &= \zeta_k - \eta \nabla f(\zeta_k, z_k) && (3) \\
\zeta_k &= \alpha_k v_k + (1 - \alpha_k) w_k && (4) \\
v_{k+1} &= \beta_k v_k + (1 - \beta_k)\zeta_k - \gamma_k \eta \nabla f(\zeta_k, z_k). && (5)
\end{aligned}
$$

Here, η is the constant step-size for the SGD step, and $\alpha_k$, $\beta_k$ and $\gamma_k$ are tunable parameters to be set according to the properties of f.
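The update rules (3)-(5) translate directly into code. The sketch below uses generic parameter schedules; the constant parameters in the demo are arbitrary stable choices of ours, not the schedules prescribed by Theorems 1 and 2:

```python
import numpy as np

def acc_sgd(stoch_grad, w0, eta, alpha, beta, gamma, num_iters):
    """SGD with Nesterov acceleration, following update rules (3)-(5).

    stoch_grad(w, k): an unbiased stochastic gradient at w;
    alpha(k), beta(k), gamma(k): the tunable parameter schedules.
    """
    w, v = w0.copy(), w0.copy()
    for k in range(num_iters):
        zeta = alpha(k) * v + (1 - alpha(k)) * w                      # (4)
        g = stoch_grad(zeta, k)
        w = zeta - eta * g                                            # (3)
        v = beta(k) * v + (1 - beta(k)) * zeta - gamma(k) * eta * g   # (5)
    return w

# Demo on f(w) = 0.5 * ||w||^2 with an exact gradient and constant parameters.
w_out = acc_sgd(lambda w, k: w, np.ones(2), eta=0.5,
                alpha=lambda k: 0.5, beta=lambda k: 0.9,
                gamma=lambda k: 1.0, num_iters=100)
```

For this simple quadratic the joint (w, v) dynamics are a fixed linear map with spectral radius below one, so the iterates contract to the minimizer.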

In order to derive a convergence rate for the above algorithm under the SGC, we first observe that a form of the SGC is satisfied in the case of coordinate descent [30]. In this case, we choose a coordinate (typically at random) and perform a gradient descent step with respect to that coordinate. The notion of a coordinate in this case is analogous to that of an individual loss function in the finite sum case. For coordinate descent, a zero gradient at the optimal solution implies that the partial derivative with respect to each coordinate is also equal to zero. This is analogous to the SGC in the finite-sum case, although we note the results in this section do not require the finite-sum assumption.

We use this analogy formally in order to extend the proof of Nesterov’s accelerated coordinate descent [17] to derive convergence rates for the above algorithm when using the SGC. This enables us to prove the following theorems (with proofs in Appendices B.1.1 and B.1.3) in both the strongly-convex and convex settings.

Theorem 1 (Strongly convex).

Under L-smoothness and μ strong-convexity, if f satisfies the SGC with constant ρ, then SGD with Nesterov acceleration and the following choice of parameters,

$$
\begin{aligned}
\gamma_k &= \frac{1}{\sqrt{\mu\eta/\rho}}; \quad \beta_k = 1 - \sqrt{\mu\eta/\rho} \\
b_{k+1} &= \sqrt{\mu}\,\bigl(1 - \sqrt{\mu\eta/\rho}\bigr)^{(k+1)/2} \\
a_{k+1} &= \frac{1}{\bigl(1 - \sqrt{\mu\eta/\rho}\bigr)^{(k+1)/2}} \\
\alpha_k &= \frac{\gamma_k \beta_k b_{k+1}^2 \eta}{\gamma_k \beta_k b_{k+1}^2 \eta + a_k^2}; \quad \eta = \frac{1}{\rho L}
\end{aligned}
$$

results in the following convergence rate:

$$
\mathbb{E}f(w_{k+1}) - f(w^*) \le \left(1 - \sqrt{\frac{\mu}{\rho^2 L}}\right)^{k} \left[f(w_0) - f(w^*) + \frac{\mu}{2}\|w_0 - w^*\|^2\right].
$$
Theorem 2 (Convex).

Under L-smoothness and convexity, if f satisfies the SGC with constant ρ, then SGD with Nesterov acceleration and the following choice of parameters,

$$
\begin{aligned}
\gamma_k &= \frac{\frac{1}{\rho} + \sqrt{\frac{1}{\rho^2} + 4\gamma_{k-1}^2}}{2} \\
a_{k+1} &= \gamma_k \sqrt{\eta\rho} \\
\alpha_k &= \frac{\gamma_k \eta}{\gamma_k \eta + a_k^2}; \quad \eta = \frac{1}{\rho L}
\end{aligned}
$$

results in the following convergence rate:

$$
\mathbb{E}f(w_{k+1}) - f(w^*) \le \frac{2\rho^2 L}{k^2}\,\|w_0 - w^*\|^2.
$$

The above theorems show that constant step-size SGD with Nesterov momentum achieves the accelerated rate of convergence, up to a factor depending on ρ, for both strongly-convex and convex functions.

In Appendix A, we consider the SGC with an extra additive error term, resulting in the following condition: $\mathbb{E}_z\|\nabla f(w, z)\|^2 \le \rho\|\nabla f(w)\|^2 + \sigma^2$. We analyse the rate of convergence of the above algorithm under this modified condition and obtain a dependence on the additive noise σ² similar to that of Cohen et al. [5].

4 SGD under the strong growth condition

In this section, we show that the SGC yields an improvement over the O(1/√k) rate for SGD in the non-convex setting [9]. In particular, we show that under the strong growth condition, constant step-size SGD is able to find a first-order stationary point as efficiently as deterministic gradient descent. We prove the following theorem (with the proof in Appendix B.2).

Theorem 3 (Non-Convex).

Under L-smoothness, if f satisfies the SGC with constant ρ, then SGD with a constant step-size η = 1/(ρL) attains the following convergence rate:

$$
\min_{i = 0, 1, \ldots, k-1} \mathbb{E}\left[\|\nabla f(w_i)\|^2\right] \le \frac{2\rho L}{k}\left[f(w_0) - f^*\right].
$$

The above theorem shows that under the SGC, SGD with a constant step-size attains the deterministic O(1/k) rate for non-convex functions. To the best of our knowledge, this is the first such result for non-convex functions under interpolation-like conditions. Under these conditions, constant step-size SGD has a better convergence rate than algorithms recently proposed to improve on SGD [2, 3]. Our results also provide some theoretical justification for the effectiveness of SGD on non-convex over-parameterized models like deep neural networks.

5 Weak growth condition

In this section, we relax the strong growth condition to a more practical condition which we refer to as the weak growth condition (WGC). Formally, if the function f is L-smooth and is minimized at $w^*$, then it satisfies the WGC with constant ρ if, for all points w and noise random variable z,

$$
\mathbb{E}_z \|\nabla f(w, z)\|^2 \le 2\rho L \left[f(w) - f(w^*)\right]. \tag{6}
$$

Equivalently, in the finite-sum setting,

$$
\mathbb{E}_i \|\nabla f_i(w)\|^2 \le 2\rho L \left[f(w) - f(w^*)\right]. \tag{7}
$$

In the above condition, notice that at $w = w^*$ the right-hand side is zero, implying that $\nabla f(w^*, z) = 0$ for all z. Thus, the WGC implies the interpolation property explained in Section 2.

5.1 Relation between WGC and SGC

In this section, we relate the two growth conditions. We first prove that the SGC implies the WGC with the same constant ρ without any additional assumptions, formally showing that the WGC is indeed weaker than the corresponding SGC. For the converse, a function satisfying the WGC satisfies the SGC with a worse constant if it also satisfies the Polyak-Łojasiewicz (PL) inequality [21]. These relations are captured by the following proposition, proved in Appendix B.5.

Proposition 1.

If f is L-smooth, satisfies the WGC with constant ρ and satisfies the PL inequality with constant μ, then it satisfies the SGC with constant ρL/μ.

Conversely, if f is L-smooth and satisfies the SGC with constant ρ, then it also satisfies the WGC with the same constant ρ.
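The forward direction can be sketched in one line: combining the WGC with the PL inequality $f(w) - f(w^*) \le \frac{1}{2\mu}\|\nabla f(w)\|^2$ gives

$$
\mathbb{E}_z\|\nabla f(w, z)\|^2 \;\le\; 2\rho L \left[f(w) - f(w^*)\right] \;\le\; \frac{\rho L}{\mu}\,\|\nabla f(w)\|^2,
$$

which is the SGC with constant ρL/μ. The converse follows similarly from the standard smoothness bound $\|\nabla f(w)\|^2 \le 2L\left[f(w) - f(w^*)\right]$.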

5.2 SGD under the weak growth condition

Using the WGC, we obtain the following convergence rates for SGD with a constant step-size.

Theorem 4 (Strongly-convex).

Under L-smoothness and μ strong-convexity, if f satisfies the WGC with constant ρ, then SGD with a constant step-size η = 1/(ρL) achieves the following rate:

$$
\mathbb{E}\|w_{k+1} - w^*\|^2 \le \left(1 - \frac{\mu}{\rho L}\right)^{k+1} \|w_0 - w^*\|^2.
$$
Theorem 5 (Convex).

Under L-smoothness and convexity, if f satisfies the WGC with constant ρ, then SGD with a constant step-size and iterate averaging achieves the following rate:

$$
\mathbb{E}[f(\bar{w}_k)] - f(w^*) \le \frac{4L(1+\rho)\,\|w_0 - w^*\|^2}{k}.
$$

Here, $\bar{w}_k = \frac{1}{k}\sum_{i=1}^{k} w_i$ is the averaged iterate after k iterations.

The proofs for Theorems 4 and 5 are deferred to Appendices B.3 and B.4 respectively.

In these cases, the WGC is sufficient to show that constant step-size SGD can attain the deterministic rates up to a factor depending on ρ. Since this condition is weaker than the corresponding strong growth condition, our results subsume the SGC results [25]. In the next section, we characterize the functions satisfying the growth conditions in practice.

6 Growth conditions in practice

In this section, we give examples of functions that satisfy the weak and strong growth conditions. In Section 6.1, we first show that for models interpolating the data, the WGC is satisfied by all smooth functions with a finite-sum structure. In Section 6.2, we show that the SGC is satisfied by the squared-hinge loss under additional assumptions.

6.1 Functions satisfying WGC

To characterize the functions satisfying the WGC, we first prove the following proposition (with the proof in Appendix B.6):

Proposition 2.

If the function $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ has a finite-sum structure for a model that interpolates the data, and $L_{\max}$ is the maximum smoothness constant amongst the functions $f_i$, then for all w,

$$
\mathbb{E}_i \|\nabla f_i(w)\|^2 \le 2 L_{\max} \left[f(w) - f(w^*)\right]. \tag{8}
$$

Comparing the above equation to Equation 7, we see that any smooth finite-sum problem under interpolation satisfies the WGC with $\rho = L_{\max}/L$. The WGC is thus satisfied by common loss functions such as the squared and squared-hinge losses. For these loss functions, if f is in addition strongly-convex, then Theorem 4 implies that SGD with the constant step-size $\eta = 1/L_{\max}$ results in linear convergence. This matches the recently proved result of Ma et al. [15], whereas Theorem 5 allows us to generalize their result beyond strongly-convex functions.
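Proposition 2 is easy to check numerically for the squared loss, whose per-example smoothness constant is $2\|x_i\|^2$ for $f_i(w) = (x_i^\top w - y_i)^2$. A minimal sketch under interpolation (the dataset is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 6))
w_star = rng.standard_normal(6)
y = X @ w_star                                # interpolation: every f_i vanishes at w_star

L_max = 2.0 * np.max(np.sum(X * X, axis=1))   # max smoothness among the f_i

# Verify E_i ||grad f_i(w)||^2 <= 2 * L_max * (f(w) - f(w*)) at random points.
worst_slack = -np.inf
for _ in range(200):
    w = w_star + rng.standard_normal(6)
    r = X @ w - y
    lhs = np.mean(np.sum((2.0 * r[:, None] * X) ** 2, axis=1))  # E_i ||grad f_i||^2
    gap = np.mean(r ** 2)                                       # f(w) - f(w*)
    worst_slack = max(worst_slack, lhs - 2.0 * L_max * gap)
```

The slack is never positive: the check is exactly the bound $\|\nabla f_i(w)\|^2 = 4 r_i^2 \|x_i\|^2 \le 2 L_{\max} r_i^2$ averaged over i.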

6.2 Functions satisfying SGC

We now show that under additional assumptions on the data, the squared-hinge loss also satisfies the SGC. We first assume that the data is linearly separable with a margin equal to τ, implying that there exists a unit-norm classifier u such that $y\,\langle x, u \rangle \ge \tau$ for all $x \in \mathrm{supp}(x)$. Here, $\mathrm{supp}(x)$ is the support of the distribution of the features x. Note that the above assumption implies the existence of a classifier $w^*$ with $\|w^*\| = 1/\tau$ such that $y\,\langle x, w^* \rangle \ge 1$. In addition to this, we assume that the features have a finite support, meaning that the set $\mathrm{supp}(x)$ is finite and has a cardinality equal to c. Under these assumptions, we prove the following lemma in Appendix B.7.

Lemma 1.

For linearly separable data with margin τ and a finite support of size c, the squared-hinge loss satisfies the SGC with the constant $\rho = c/\tau^2$.

In the next section, we use the above lemma to prove a mistake bound for the perceptron algorithm using the squared-hinge loss.

7 Implication for Faster Perceptron

In this section, we use the strong growth property of the squared-hinge loss in order to prove a bound on the number of mistakes made by the perceptron algorithm [22] using a squared-hinge loss. The perceptron algorithm is used for training a linear classifier for binary classification and is guaranteed to converge for linearly separable data [19]. It can be considered as stochastic gradient descent on the loss $f(w) = \mathbb{E}_{x,y}\left[\max(0, -y\,\langle x, w \rangle)\right]$.

The common way to characterize the performance of a perceptron is to bound the probability of a mistake (in the binary classification setting) after k iterations of the algorithm. In other words, we care about the quantity $P\left(y\,\langle x, w_k \rangle \le 0\right)$. Assuming linear separability of the data and that $\|x\| \le 1$ for all points x, the perceptron achieves a mistake bound of $1/\tau^2$ [19].

In this paper, we consider a modified perceptron algorithm using the squared-hinge function as the loss. Note that since we assume the data to be linearly separable, a linear classifier is able to fit all the training data. Since the squared-hinge loss function is smooth, the conditions of Proposition 2 are satisfied, which implies that it satisfies the WGC with $\rho = L_{\max}/L$. Also observe that since we assume that $\|x\| \le 1$, the smoothness constants are bounded as $L \le L_{\max} \le 2$. Using these facts with Theorem 5 and assuming that we start the optimization at $w_0 = 0$, we obtain the following convergence rate using SGD with a constant step-size,

$$
\mathbb{E}[f(\bar{w}_k)] \le \frac{8}{\tau^2 k}.
$$

To see this, recall that $\|w^*\| = 1/\tau$ and the loss is equal to zero at the optimum, implying that $\|w_0 - w^*\|^2 = 1/\tau^2$.
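The modified perceptron of this section is simply constant step-size SGD on the squared-hinge loss. A minimal sketch on an illustrative separable dataset (the data construction and all constants are ours, not the paper's):

```python
import numpy as np

def squared_hinge_perceptron(X, y, eta, num_iters, seed=0):
    """Constant step-size SGD on f_i(w) = max(0, 1 - y_i x_i^T w)^2, from w0 = 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        i = rng.integers(n)
        m = y[i] * (X[i] @ w)
        if m < 1:                         # gradient is nonzero only inside the margin
            w = w - eta * 2.0 * (m - 1.0) * y[i] * X[i]
    return w

# Separable toy data with a margin: shift the first coordinate by the label.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X[:, 0] + 1e-12)
X[:, 0] += y                                        # now y * x_0 >= 1 before scaling
X = X / np.linalg.norm(X, axis=1, keepdims=True)    # enforce ||x|| <= 1
w = squared_hinge_perceptron(X, y, eta=0.25, num_iters=20000, seed=1)
mistake_rate = np.mean(y * (X @ w) <= 0)
```

Since each example has $\|x\| \le 1$, the per-example smoothness is at most 2 and the step-size 0.25 is a conservative constant choice; the training mistake rate drops towards zero without any step-size decay.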

The above result gives us a bound on the training loss. We use the following lemma (proved using the Markov inequality in Appendix B.8) to relate the mistake bound to the training loss.

Lemma 2.

If f(w, x, y) represents the loss on the point (x, y), then

$$
P\left(y\, x^\top w \le 0\right) \le \mathbb{E}_{x,y}\, f(w, x, y).
$$

Combining the above results, we obtain a mistake bound of $O(1/(\tau^2 k))$ when using the squared-hinge loss on linearly separable data. We thus recover the standard results for the stochastic perceptron.

Note that for a finite amount of data (when the expectation is with respect to a discrete distribution), if we use batch accelerated gradient descent (which is not one of the stochastic gradient algorithms studied in this paper, and for which no growth condition is needed), we obtain a mistake bound that decreases as $O(1/k^2)$. This improves on existing mistake bounds that scale as $O(1/k)$ [26, 31]. Note that both sets of algorithms have the same dependence on the margin τ, but this deterministic accelerated method would require evaluating n gradients on each iteration.

From Lemma 1, we know that the squared-hinge loss satisfies the SGC with $\rho = c/\tau^2$. Under the same conditions as above, this lemma along with the result of Theorem 2 gives us the following bound:

$$
\mathbb{E} f(w_{k+1}) \le \frac{2 c^2}{\tau^6 k^2}.
$$

Using the result from Lemma 2, this yields a mistake bound of the order $O(c^2/(\tau^6 k^2))$ while only requiring one gradient per iteration. Hence, the use of acceleration leads to an improved, novel $O(1/k^2)$ dependence on the number of iterations, but requires the additional assumptions of Lemma 1 and has a worse dependence on the margin τ.

8 Experiments

In this section, we empirically validate our theoretical results. For the first set of experiments (Figures 1(a)-1(d)), we generate a synthetic binary classification dataset with n points in dimension d. We ensure that the data is linearly separable with a margin τ, thus satisfying the interpolation property for training a linear classifier. We seek to minimize the finite-sum squared-hinge loss $f(w) = \frac{1}{n}\sum_{i=1}^{n} \max(0, 1 - y_i \langle x_i, w \rangle)^2$. In Figure 1, we vary the margin τ and plot the logarithm of the loss against the number of effective passes (one pass is equal to n iterations of SGD) over the data. In all of our experiments, we estimate the value of the smoothness parameter L as the maximum eigenvalue of the Gram matrix of the data.

We evaluate the performance of constant step-size SGD with and without acceleration. Since the squared-hinge loss satisfies the WGC with $\rho = L_{\max}/L$ (Proposition 2), we use SGD with a constant step-size $\eta = 1/L_{\max}$¹ (denoted as SGD(T) in the plots). For Nesterov acceleration, we experimented with how the constant ρ in the SGC should depend on the margin τ, and found a setting of ρ in terms of τ that results in consistently stable but fast convergence across different choices of τ. We thus use the step-size $\eta = 1/(\rho L)$ and set the tunable parameters in the update Equations 3-5 as specified by Theorem 2. We denote this variant of accelerated SGD as Acc-SGD(T) in the subsequent plots.

¹Note that this step-size led to consistently better results as compared to the smaller step-size suggested by Theorem 5.

In addition, we propose a line-search heuristic to dynamically estimate the value of the step-size. Our heuristic is inspired by the line-search used in SAG [24] and can be described as follows: we start with an initial estimate and, in each iteration, we halve the estimate whenever a sufficient-decrease condition on the sampled loss is not satisfied. We denote this variant as Acc-SGD(LS) in the plots.

In each of the Figures 1(a)-1(d), we make the following observations: (i) SGD(T) results in reasonably slow convergence. This observation is in line with other SGD methods using $1/L_{\max}$ as the step-size [24]. (ii) Acc-SGD(T) is consistently stable and, as suggested by the theory, results in faster convergence compared to SGD. (iii) Acc-SGD(LS) either matches or outperforms Acc-SGD(T). We plan to investigate better line-search methods for both SGD [24] and Acc-SGD [14] in the future. (iv) For larger values of the margin τ (Figures 1(a), 1(b)), the training loss becomes equal to zero, verifying the interpolation property.

The next set of experiments (Figure 2) considers binary classification on the CovType and Protein datasets. For these, we train a linear classifier using radial basis function (non-parametric) features. Non-parametric regression models of this form are capable of interpolating the data [15] and thus satisfy our assumptions. We subsample random points from each dataset and use the squared-hinge loss as above. Note that we do not have a good estimate of the margin τ in this case and therefore only compare the performance of SGD(T) and Acc-SGD(LS).

From Figures 2(a) and 2(b), we make the following observations: (i) In Figure 2(a), both variants have similar performance. (ii) In Figure 2(b), Acc-SGD(LS) leads to considerably faster convergence compared to SGD(T). (iii) Accelerated SGD in conjunction with our line-search heuristic is stable across datasets. These experiments show that in cases where the interpolation property is satisfied, both SGD and accelerated SGD with a constant step-size can result in good empirical performance.

9 Conclusion

In this paper, we showed that under interpolation, the stochastic gradients of common loss functions satisfy specific growth conditions. Under these conditions, we proved that constant step-size SGD (with and without Nesterov acceleration) can achieve the convergence rates of the corresponding deterministic settings. These are the first results achieving optimal rates in the accelerated and non-convex settings under interpolation-like conditions. We used these results to demonstrate the fast convergence of the stochastic perceptron algorithm employing the squared-hinge loss. We showed that both SGD and accelerated SGD with a constant step-size can lead to good empirical performance when the interpolation property is satisfied. As opposed to determining both a step-size and a schedule for annealing it, as current SGD-like methods require, our results imply that under interpolation we only need to automatically determine a single constant step-size for SGD. In the future, we hope to develop line-search techniques for automatically determining this step-size for both the accelerated and non-accelerated variants.

10 Acknowledgements

We acknowledge support from the European Research Council (grant SEQUOIA 724063) and the CIFAR program on Learning with Machines and Brains. We also thank Nicolas Flammarion for discussions related to this work.

References

• [1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
• [2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
• [3] Yair Carmon, Oliver Hinder, John C Duchi, and Aaron Sidford. Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. arXiv preprint arXiv:1705.02766, 2017.
• [4] Volkan Cevher and Bằng Công Vũ. On the linear convergence of the stochastic gradient method with constant step-size. Optimization Letters, pages 1–11, 2018.
• [5] Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. On acceleration with noise-corrupted gradients. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1018–1027, 2018.
• [6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
• [7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
• [8] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540–2548, 2015.
• [9] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
• [10] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
• [11] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
• [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• [13] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
• [14] Jun Liu, Jianhui Chen, and Jieping Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 547–556. ACM, 2009.
• [15] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 3331–3340, 2018.
• [16] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
• [17] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
• [18] Yurii Nesterov et al. Gradient methods for minimizing composite objective function, 2007.
• [19] Albert B Novikoff. On convergence proofs for perceptrons. Technical report, Stanford Research Institute, Menlo Park, CA, 1963.
• [20] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
• [21] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
• [22] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
• [23] Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics, 26(5):1651–1686, 1998.
• [24] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
• [25] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
• [26] Negar Soheili and Javier Pena. A primal–dual smooth perceptron–von neumann algorithm. In Discrete Geometry and Optimization, pages 303–320. Springer, 2013.
• [27] Mikhail V Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
• [28] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
• [29] Paul Tseng. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
• [30] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
• [31] Adams Wei Yu, Fatma Kilinc-Karzan, and Jaime Carbonell. Saddle points and accelerated perceptron algorithms. In International Conference on Machine Learning, pages 1827–1835, 2014.
• [33] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Appendix A Incorporating additive error for Nesterov acceleration

For this section, we assume an additive error in the strong growth condition, implying that the following inequality is satisfied for all $w$:

$$\mathbb{E}_z\|\nabla f(w,z)\|^2 \;\le\; \rho\,\|\nabla f(w)\|^2 + \sigma^2$$
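To make this condition concrete, the following sketch (our illustration, not an experiment from the paper; the problem sizes and the trial value of $\rho$ are arbitrary) numerically estimates the smallest additive error $\sigma^2$ for which the condition holds at a set of random points on a finite-sum least-squares problem:

```python
import numpy as np

# Illustration (assumed setup, not from the paper): for the finite-sum
# least-squares objective f(w) = (1/2n) * ||Aw - b||^2, estimate the smallest
# additive error sigma^2 such that
#     E_z ||grad f(w, z)||^2 <= rho * ||grad f(w)||^2 + sigma^2
# holds at 100 random points w, for a fixed trial value of rho.
rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def full_grad(w):
    # gradient of (1/2n) * ||Aw - b||^2
    return A.T @ (A @ w - b) / n

def expected_sq_stoch_grad(w):
    # E_z ||grad f(w, z)||^2 for z uniform over the n examples,
    # where grad f_i(w) = a_i * (a_i^T w - b_i)
    per_example = A * (A @ w - b)[:, None]
    return np.mean(np.sum(per_example**2, axis=1))

rho = 2.0  # arbitrary trial constant
gaps = [expected_sq_stoch_grad(w) - rho * np.linalg.norm(full_grad(w))**2
        for w in rng.normal(size=(100, d))]
sigma_sq = max(max(gaps), 0.0)  # smallest sigma^2 making the bound hold here
```

Under exact interpolation ($b = Aw^*$ for some $w^*$), both sides vanish at $w^*$; the additive $\sigma^2$ absorbs the gap away from interpolation.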

In this case, we have the counterparts of Theorems 1 and 2 as follows:

Theorem 6 (Strongly convex).

Under $L$-smoothness and $\mu$ strong-convexity, if $f$ satisfies the SGC with constant $\rho$ and an additive error $\sigma^2$, then SGD with Nesterov acceleration, with the following choice of parameters,

\begin{align*}
\gamma_k &= \frac{1}{\sqrt{\mu\eta\rho}}; & \beta_k &= 1 - \sqrt{\frac{\mu\eta}{\rho}}; & \eta &= \frac{1}{\rho L} \\
b_{k+1} &= \frac{\sqrt{\mu}}{\left(1 - \sqrt{\mu\eta/\rho}\right)^{(k+1)/2}}; & a_{k+1} &= \frac{1}{\left(1 - \sqrt{\mu\eta/\rho}\right)^{(k+1)/2}}; & \alpha_k &= \frac{\gamma_k \beta_k b_{k+1}^2 \eta}{\gamma_k \beta_k b_{k+1}^2 \eta + a_k^2}
\end{align*}

results in the following convergence rate:

$$\mathbb{E}[f(w_{k+1})] - f(w^*) \;\le\; \left(1 - \sqrt{\frac{\mu\eta}{\rho}}\right)^{k}\left[f(x_0) - f(w^*) + \frac{\mu}{2}\|x_0 - w^*\|^2\right] + \frac{\sigma^2\sqrt{\eta}}{\sqrt{\rho\mu}}$$
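As a quick sanity check (our addition, with arbitrary values of $\mu$, $L$, $\rho$), one can verify numerically that the Theorem 6 parameter choices satisfy the recursions $\beta_k = 1 - \gamma_k\mu\eta$, $a_{k+1} = \gamma_k\sqrt{\eta\rho}\,b_{k+1}$, and $b_{k+1} = b_k/\sqrt{\beta_k}$ used in the proof of Lemma 3:

```python
import math

# Check that the constant parameter choices of Theorem 6 satisfy the
# recursions used in the proof (mu, L, rho are arbitrary test values).
mu, L, rho = 0.5, 10.0, 3.0
eta = 1.0 / (rho * L)
delta = math.sqrt(mu * eta / rho)        # per-step contraction factor
gamma = 1.0 / math.sqrt(mu * eta * rho)
beta = 1.0 - delta

def a(k):
    # a_k = (1 - delta)^(-k/2)
    return (1.0 - delta) ** (-k / 2)

def b(k):
    # b_k = sqrt(mu) * (1 - delta)^(-k/2)
    return math.sqrt(mu) * (1.0 - delta) ** (-k / 2)

for k in range(1, 20):
    assert math.isclose(beta, 1.0 - gamma * mu * eta)                       # Eq. 14, tight
    assert math.isclose(a(k + 1), gamma * math.sqrt(eta * rho) * b(k + 1))  # Eq. 15
    assert math.isclose(b(k + 1), b(k) / math.sqrt(beta))                   # Eq. 16, tight
```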
Theorem 7 (Convex).

Under $L$-smoothness and convexity, if $f$ satisfies the SGC with constant $\rho$ and an additive error $\sigma^2$, then SGD with Nesterov acceleration, with the following choice of parameters,

$$\gamma_k = \frac{\frac{1}{\rho} + \sqrt{\frac{1}{\rho^2} + 4\gamma_{k-1}^2}}{2}; \qquad a_{k+1} = \gamma_k\sqrt{\eta\rho}; \qquad \alpha_k = \frac{\gamma_k\eta}{\gamma_k\eta + a_k^2}; \qquad \eta = \frac{1}{\rho L}$$

results in the following convergence rate:

$$\mathbb{E}[f(w_{k+1})] - f(w^*) \;\le\; \frac{2\rho}{k^2\eta}\|x_0 - w^*\|^2 + \frac{k\sigma^2\eta}{\rho}$$
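The recursion for $\gamma_k$ implies $\gamma_k \approx k/(2\rho)$ for large $k$, which is what produces the leading $O(1/k^2)$ term. A short numerical check (our addition; the constant $\rho$ and the initialization $\gamma_0 = 1/\rho$ are assumptions):

```python
import math

# Iterate the Theorem 7 recursion for gamma_k and confirm that it grows
# linearly, at the asymptotic rate of 1/(2*rho) per step.
rho = 3.0
gamma = 1.0 / rho                 # assumed initialization gamma_0 = 1/rho
gammas = [gamma]
for _ in range(1, 200):
    gamma = (1.0 / rho + math.sqrt(1.0 / rho**2 + 4.0 * gamma**2)) / 2.0
    gammas.append(gamma)

# For large k, gamma_k ~ k / (2*rho); the ratio below approaches 1.
ratio = gammas[-1] / (len(gammas) / (2.0 * rho))
```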

The above theorems are proved in Appendices B.1.1 and B.1.3, respectively.

Appendix B Proofs

B.1 Proofs for SGD with Nesterov Acceleration

Recall the update equations for SGD with Nesterov acceleration:

\begin{align*}
w_{k+1} &= \zeta_k - \eta\,\nabla f(\zeta_k, z_k) \tag{3} \\
\zeta_k &= \alpha_k v_k + (1-\alpha_k) w_k \tag{4} \\
v_{k+1} &= \beta_k v_k + (1-\beta_k)\zeta_k - \gamma_k\eta\,\nabla f(\zeta_k, z_k) \tag{5}
\end{align*}
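As a minimal runnable sketch of these updates (our illustration: the constant values of $\alpha$, $\beta$, $\gamma$ and the step size below are simplifying assumptions, not the schedules of Theorems 6 and 7), consider an interpolating least-squares problem, where $\sigma^2 = 0$:

```python
import numpy as np

# Run the three-sequence update on an interpolating least-squares problem
# (b = A @ w_true, so zero loss is attainable). The fixed alpha, beta, gamma
# and the conservative step size are illustrative choices only.
rng = np.random.default_rng(1)
n, d = 100, 20
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)                  # interpolation holds by construction

eta = 0.5 / np.max(np.sum(A**2, axis=1))    # below each per-example curvature
alpha, beta, gamma = 0.5, 0.9, 1.0          # assumed constant parameters

w = np.zeros(d)
v = np.zeros(d)
for _ in range(2000):
    zeta = alpha * v + (1 - alpha) * w      # extrapolation point zeta_k
    i = rng.integers(n)                     # sample one example z_k
    g = A[i] * (A[i] @ zeta - b[i])         # stochastic gradient at zeta_k
    w = zeta - eta * g                      # w_{k+1}
    v = beta * v + (1 - beta) * zeta - gamma * eta * g   # v_{k+1}

initial_loss = 0.5 * np.mean(b**2)          # objective value at w = 0
final_loss = 0.5 * np.mean((A @ w - b)**2)
```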
Since the stochastic gradients are unbiased, we have

\begin{align*}
\mathbb{E}_z[\nabla f(y,z)] &= \nabla f(y) \tag{10}
\end{align*}

For the proof, we consider the more general strong growth condition with an additive error $\sigma^2$:

\begin{align*}
\mathbb{E}_z\|\nabla f(w,z)\|^2 &\le \rho\,\|\nabla f(w)\|^2 + \sigma^2 \tag{11}
\end{align*}

We choose the parameters $\gamma_k$, $\alpha_k$, $\beta_k$, $a_k$, $b_k$ such that the following equations are satisfied:

\begin{align*}
\gamma_k &= \frac{1}{\rho}\left[1 + \frac{\beta_k(1-\alpha_k)}{\alpha_k}\right] \tag{12} \\
\alpha_k &= \frac{\gamma_k\beta_k b_{k+1}^2\eta}{\gamma_k\beta_k b_{k+1}^2\eta + a_k^2} \tag{13} \\
\beta_k &\ge 1 - \gamma_k\mu\eta \tag{14} \\
a_{k+1} &= \gamma_k\sqrt{\eta\rho}\, b_{k+1} \tag{15} \\
b_{k+1} &\le \frac{b_k}{\sqrt{\beta_k}} \tag{16}
\end{align*}

We now prove the following lemma, assuming that the function $f$ is $L$-smooth and $\mu$ strongly-convex.

Lemma 3.

Assume that the function $f$ is $L$-smooth and $\mu$ strongly-convex and satisfies the strong growth condition in Equation 11. Then, using the updates in Equations 3-5 and setting the parameters according to Equations 12-16, if $\eta \le \frac{1}{\rho L}$, the following relation holds:

$$b_{k+1}^2\gamma_k^2\left[\mathbb{E} f(w_{k+1}) - f^*\right] \;\le\; \frac{a_0^2}{\rho\eta}\left[f(x_0) - f^*\right] + \frac{b_0^2}{2\rho\eta}\|x_0 - w^*\|^2 + \frac{\sigma^2\eta}{\rho}\sum_{i=0}^{k}\gamma_i^2 b_{i+1}^2$$
Proof.
Let $r_{k+1} = \|v_{k+1} - w^*\|$. Then, using Equation 5,

\begin{align*}
r_{k+1}^2 &= \|\beta_k v_k + (1-\beta_k)\zeta_k - w^* - \gamma_k\eta\,\nabla f(\zeta_k, z_k)\|^2 \\
&= \|\beta_k v_k + (1-\beta_k)\zeta_k - w^*\|^2 + \gamma_k^2\eta^2\|\nabla f(\zeta_k, z_k)\|^2 + 2\gamma_k\eta\,\langle w^* - \beta_k v_k - (1-\beta_k)\zeta_k,\, \nabla f(\zeta_k, z_k)\rangle
\end{align*}

Taking expectation with respect to $z_k$ and using Equations 10 and 11,

\begin{align*}
\mathbb{E}[r_{k+1}^2] &= \|\beta_k v_k + (1-\beta_k)\zeta_k - w^*\|^2 + \gamma_k^2\eta^2\,\mathbb{E}\|\nabla f(\zeta_k, z_k)\|^2 + 2\gamma_k\eta\,\langle w^* - \beta_k v_k - (1-\beta_k)\zeta_k,\, \nabla f(\zeta_k)\rangle \\
&\le \|\beta_k(v_k - w^*) + (1-\beta_k)(\zeta_k - w^*)\|^2 + \gamma_k^2\eta^2\rho\,\|\nabla f(\zeta_k)\|^2 + 2\gamma_k\eta\,\langle w^* - \beta_k v_k - (1-\beta_k)\zeta_k,\, \nabla f(\zeta_k)\rangle + \gamma_k^2\eta^2\sigma^2 \\
&\le \beta_k\|v_k - w^*\|^2 + (1-\beta_k)\|\zeta_k - w^*\|^2 + \gamma_k^2\eta^2\rho\,\|\nabla f(\zeta_k)\|^2 + 2\gamma_k\eta\,\langle w^* - \beta_k v_k - (1-\beta_k)\zeta_k,\, \nabla f(\zeta_k)\rangle + \gamma_k^2\eta^2\sigma^2 && \text{(by convexity of $\|\cdot\|^2$)} \\
&= \beta_k r_k^2 + (1-\beta_k)\|\zeta_k - w^*\|^2 + \gamma_k^2\eta^2\rho\,\|\nabla f(\zeta_k)\|^2 + 2\gamma_k\eta\,\langle \beta_k(\zeta_k - v_k) + w^* - \zeta_k,\, \nabla f(\zeta_k)\rangle + \gamma_k^2\eta^2\sigma^2 \\
&= \beta_k r_k^2 + (1-\beta_k)\|\zeta_k - w^*\|^2 + \gamma_k^2\eta^2\rho\,\|\nabla f(\zeta_k)\|^2 + 2\gamma_k\eta\left[\frac{\beta_k(1-\alpha_k)}{\alpha_k}\langle \nabla f(\zeta_k),\, w_k - \zeta_k\rangle + \langle \nabla f(\zeta_k),\, w^* - \zeta_k\rangle\right] + \gamma_k^2\eta^2\sigma^2 && \text{(from Equation 4)}
\end{align*}

By convexity and strong-convexity, $\langle\nabla f(\zeta_k), w_k - \zeta_k\rangle \le f(w_k) - f(\zeta_k)$ and $\langle\nabla f(\zeta_k), w^* - \zeta_k\rangle \le f^* - f(\zeta_k) - \frac{\mu}{2}\|\zeta_k - w^*\|^2$, so

\begin{align*}
\mathbb{E}[r_{k+1}^2] \le \beta_k r_k^2 + (1-\beta_k)\|\zeta_k - w^*\|^2 + \gamma_k^2\eta^2\rho\,\|\nabla f(\zeta_k)\|^2 + 2\gamma_k\eta\left[\frac{\beta_k(1-\alpha_k)}{\alpha_k}\big(f(w_k) - f(\zeta_k)\big) + f^* - f(\zeta_k) - \frac{\mu}{2}\|\zeta_k - w^*\|^2\right] + \gamma_k^2\eta^2\sigma^2 \tag{18}
\end{align*}

By Lipschitz continuity of the gradient,

\begin{align*}
f(w_{k+1}) - f(\zeta_k) &\le \langle\nabla f(\zeta_k),\, w_{k+1} - \zeta_k\rangle + \frac{L}{2}\|w_{k+1} - \zeta_k\|^2 \\
&= -\eta\,\langle\nabla f(\zeta_k),\, \nabla f(\zeta_k, z_k)\rangle + \frac{L\eta^2}{2}\|\nabla f(\zeta_k, z_k)\|^2
\end{align*}

Taking expectation with respect to $z_k$ and using Equations 10 and 11,

\begin{align*}
\mathbb{E}[f(w_{k+1})] - f(\zeta_k) \le -\eta\|\nabla f(\zeta_k)\|^2 + \frac{L\rho\eta^2}{2}\|\nabla f(\zeta_k)\|^2 + \frac{L\eta^2\sigma^2}{2} = \left[-\eta + \frac{L\rho\eta^2}{2}\right]\|\nabla f(\zeta_k)\|^2 + \frac{L\eta^2\sigma^2}{2}
\end{align*}

If $\eta \le \frac{1}{\rho L}$,

\begin{align*}
\mathbb{E}[f(w_{k+1})] - f(\zeta_k) \le -\frac{\eta}{2}\|\nabla f(\zeta_k)\|^2 + \frac{L\eta^2\sigma^2}{2}
\;\implies\;
\|\nabla f(\zeta_k)\|^2 \le \frac{2}{\eta}\big[f(\zeta_k) - \mathbb{E} f(w_{k+1})\big] + L\eta\sigma^2 \tag{19}
\end{align*}
From Equations 18 and 19,

\begin{align*}
\mathbb{E}[r_{k+1}^2] &\le \beta_k r_k^2 + (1-\beta_k)\|\zeta_k - w^*\|^2 + 2\gamma_k^2\eta\rho\big[f(\zeta_k) - \mathbb{E} f(w_{k+1})\big] + 2\gamma_k\eta\left[\frac{\beta_k(1-\alpha_k)}{\alpha_k}\big(f(w_k) - f(\zeta_k)\big) + f^* - f(\zeta_k) - \frac{\mu}{2}\|\zeta_k - w^*\|^2\right] + \gamma_k^2\eta^2\sigma^2 + L\gamma_k^2\eta^3\rho\sigma^2 \\
&\le \beta_k r_k^2 + (1-\beta_k)\|\zeta_k - w^*\|^2 + 2\gamma_k^2\eta\rho\big[f(\zeta_k) - \mathbb{E} f(w_{k+1})\big] + 2\gamma_k\eta\left[\frac{\beta_k(1-\alpha_k)}{\alpha_k}\big(f(w_k) - f(\zeta_k)\big) + f^* - f(\zeta_k) - \frac{\mu}{2}\|\zeta_k - w^*\|^2\right] + 2\gamma_k^2\eta^2\sigma^2 && \text{(since $\eta \le \tfrac{1}{\rho L}$)} \\
&= \beta_k r_k^2 + \|\zeta_k - w^*\|^2\big[(1-\beta_k) - \gamma_k\mu\eta\big] + f(\zeta_k)\left[2\gamma_k^2\eta\rho - 2\gamma_k\eta\cdot\frac{\beta_k(1-\alpha_k)}{\alpha_k} - 2\gamma_k\eta\right] - 2\gamma_k^2\eta\rho\,\mathbb{E} f(w_{k+1}) + 2\gamma_k\eta f^* + \left[2\gamma_k\eta\cdot\frac{\beta_k(1-\alpha_k)}{\alpha_k}\right] f(w_k) + 2\gamma_k^2\eta^2\sigma^2
\end{align*}

Since $\beta_k \ge 1 - \gamma_k\mu\eta$ and $\gamma_k = \frac{1}{\rho}\left(1 + \frac{\beta_k(1-\alpha_k)}{\alpha_k}\right)$, the coefficients of $\|\zeta_k - w^*\|^2$ and $f(\zeta_k)$ are non-positive and zero respectively, so

\begin{align*}
\mathbb{E}[r_{k+1}^2] &\le \beta_k r_k^2 - 2\gamma_k^2\eta\rho\,\mathbb{E} f(w_{k+1}) + 2\gamma_k\eta f^* + \left[2\gamma_k\eta\cdot\frac{\beta_k(1-\alpha_k)}{\alpha_k}\right] f(w_k) + 2\gamma_k^2\eta^2\sigma^2
\end{align*}

Multiplying by $b_{k+1}^2$,

\begin{align*}
b_{k+1}^2\,\mathbb{E}[r_{k+1}^2] &\le b_{k+1}^2\beta_k r_k^2 - 2b_{k+1}^2\gamma_k^2\eta\rho\,\mathbb{E} f(w_{k+1}) + 2b_{k+1}^2\gamma_k\eta f^* + \left[2b_{k+1}^2\gamma_k\eta\cdot\frac{\beta_k(1-\alpha_k)}{\alpha_k}\right] f(w_k) + 2b_{k+1}^2\gamma_k^2\eta^2\sigma^2
\end{align*}

Since $b_{k+1}^2\beta_k \le b_k^2$, $b_{k+1}^2\gamma_k^2\eta\rho = a_{k+1}^2$, and $\gamma_k\eta\cdot\frac{\beta_k(1-\alpha_k)}{\alpha_k} = \frac{a_k^2}{b_{k+1}^2}$,

\begin{align*}
b_{k+1}^2\,\mathbb{E}[r_{k+1}^2] &\le b_k^2 r_k^2 - 2a_{k+1}^2\,\mathbb{E} f(w_{k+1}) + 2b_{k+1}^2\gamma_k\eta f^* + 2a_k^2 f(w_k) + 2b_{k+1}^2\gamma_k^2\eta^2\sigma^2
\end{align*}