1 Introduction
In this work we study the complexity of stochastic gradient descent (SGD) for solving unconstrained optimization problems of the form
(1) $\min_{x \in \mathbb{R}^d} f(x)$,
where $f \colon \mathbb{R}^d \to \mathbb{R}$ is possibly nonconvex and satisfies the following smoothness and regularity condition.
Assumption 1. The function $f$ is bounded from below by an infimum $f^{\inf} \in \mathbb{R}$, differentiable, and $L$-smooth: its gradient $\nabla f$ is $L$-Lipschitz, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$.
Motivating this problem is perhaps unnecessary. Indeed, the training of modern deep learning models reduces to nonconvex optimization problems, and the state-of-the-art methods for solving them are all variants of SGD (Sun, 2019; Goodfellow et al., 2016). SGD is a randomized first-order method performing iterations of the form
(2) $x_{k+1} = x_k - \gamma_k\, g(x_k)$,
where $g(x)$ is an unbiased estimator of the gradient $\nabla f(x)$
(i.e., $\mathbb{E}[g(x)] = \nabla f(x)$), and $\gamma_k > 0$ is an appropriately chosen learning rate. Since $f$ can have many local minima and/or saddle points, solving (1) to global optimality is intractable (Nemirovsky and Yudin, 1983; Vavasis, 1995). However, the problem becomes tractable if one scales down the requirements on the point of interest from global optimality to some relaxed version thereof, such as stationarity or local optimality. In this paper we are interested in the fundamental problem of finding an $\epsilon$-stationary point, i.e., we wish to find a random vector $\hat{x} \in \mathbb{R}^d$ for which $\mathbb{E}[\|\nabla f(\hat{x})\|] \le \epsilon$, where the expectation is over the randomness of the algorithm.

1.1 Modelling stochasticity
Since unbiasedness alone is not enough to conduct a complexity analysis of SGD, it is necessary to impart further assumptions on the connection between the stochastic gradient $g(x)$ and the true gradient $\nabla f(x)$. The most commonly used assumptions take the form of various structured bounds on the second moment of $g(x)$. We argue (see Section 3) that the bounds proposed in the literature are often too strong and unrealistic, as they do not fully capture how randomness in $g(x)$ arises in practice. Indeed, existing bounds are primarily constructed in order to facilitate analysis, and their match with reality often takes the back seat. In order to obtain meaningful theoretical insights into the workings of SGD, it is very important to model this randomness both correctly, so that the assumptions we impart are provably satisfied, and accurately, so as to obtain bounds that are as tight as possible.
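To make iteration (2) concrete, here is a minimal SGD loop on a toy one-dimensional finite-sum problem; the objective, data, and stepsize below are illustrative choices of ours, not taken from the paper.

```python
import random

# Toy finite sum: f(x) = (1/n) * sum_i (x - b_i)^2, so grad f_i(x) = 2 * (x - b_i).
b = [1.0, 2.0, 3.0, 4.0]
n = len(b)

def stochastic_grad(x):
    # Unbiased estimator g(x): pick one index uniformly; E[g(x)] = f'(x).
    i = random.randrange(n)
    return 2.0 * (x - b[i])

def sgd(x0, gamma, num_iters, seed=0):
    # Iteration (2): x_{k+1} = x_k - gamma * g(x_k), with a constant learning rate.
    random.seed(seed)
    x = x0
    for _ in range(num_iters):
        x -= gamma * stochastic_grad(x)
    return x

x_final = sgd(x0=0.0, gamma=0.05, num_iters=2000)
# The minimizer of f is the mean of b (2.5); with a constant stepsize, SGD
# settles in a noise ball around it rather than converging exactly.
```

With a fixed stepsize the iterates hover near the minimizer, which is exactly the behaviour the second-moment bounds discussed below are meant to quantify.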
1.2 Sources of stochasticity
Practical applications of SGD typically involve the training of supervised machine learning models via empirical risk minimization (Shalev-Shwartz and Ben-David, 2014), which leads to optimization problems of a finite-sum structure:
(3) $f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$.
In a single-machine setup, $n$ is the number of training data points, and $f_i(x)$ represents the loss of model $x$ on data point $i$. In this setting, data access is expensive, and $g(x)$ is typically constructed via subsampling techniques such as minibatching (Dekel et al., 2012) and importance sampling (Needell et al., 2016). In the rather general arbitrary sampling paradigm (Gower et al., 2019), one may choose an arbitrary random subset $S \subseteq \{1, \dots, n\}$ of examples, and $g(x)$ is subsequently assembled from the information stored in the gradients $\nabla f_i(x)$ for $i \in S$ only. This leads to formulas of the form
(4) $g(x) = \frac{1}{n} \sum_{i=1}^{n} v_i \nabla f_i(x)$,
where $v_1, \dots, v_n$ are appropriately defined random variables ensuring unbiasedness.
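A small numerical sketch of estimator (4) with single-element sampling follows; the gradient values and probabilities are hypothetical, chosen only to illustrate how the random variables $v_i$ enforce unbiasedness.

```python
import random

# Finite-sum gradient at a fixed point: grad f_i = c_i, full gradient = mean(c) = 3.0.
c = [1.0, 2.0, 3.0, 6.0]
n = len(c)

def estimate(weights, trials, seed):
    # Single-element arbitrary sampling: pick index j with probability q_j and
    # return v_j * grad f_j with v_j = 1 / (n * q_j), which is unbiased for mean(c).
    random.seed(seed)
    q = [w / sum(weights) for w in weights]
    total = 0.0
    for _ in range(trials):
        j = random.choices(range(n), weights=q)[0]
        total += c[j] / (n * q[j])
    return total / trials

uniform_est = estimate([1, 1, 1, 1], trials=50000, seed=0)
importance_est = estimate(c, trials=1000, seed=0)  # q_j proportional to c_j
# Both estimates are close to the full gradient 3.0; for this scalar example,
# the importance-sampled estimator even has zero variance.
```

The comparison hints at why the choice of sampling distribution matters, which is the subject of Section 6.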
In a distributed setting, $n$ corresponds to the number of machines (e.g., the number of mobile devices in federated learning) and $f_i(x)$ represents the loss of model $x$ on all the training data stored on machine $i$. In this setting, communication is expensive, and modern gradient-type methods therefore rely on various randomized gradient compression mechanisms such as quantization (Gupta et al., 2015), sparsification (Wangni et al., 2018), and dithering (Alistarh et al., 2017). Given an appropriately chosen (unbiased) randomized compression map $\mathcal{C}$, the local gradients $\nabla f_i(x)$ are first compressed to $\mathcal{C}_i(\nabla f_i(x))$, where $\mathcal{C}_i$ is an independent instantiation of $\mathcal{C}$ sampled by machine $i$ in each iteration, and subsequently communicated to a master node, which performs aggregation (Khirirat et al., 2018). This gives rise to SGD with stochastic gradient of the form
(5) $g(x) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{C}_i(\nabla f_i(x))$.
In many applications, each $f_i$ has a finite-sum structure of its own, reflecting the empirical risk over the training data stored on that device. In such situations, it is often assumed that compression is applied not to exact gradients, but to stochastic gradients arising from subsampling (Gupta et al., 2015; Ben-Nun and Hoefler, 2019; Horváth et al., 2019). This further complicates the structure of the stochastic gradient.
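As an illustration of the compression maps used in (5), the sketch below implements random-$k$ sparsification, a standard unbiased compressor (the $d/k$ scaling is what makes it unbiased); the test vector and parameters are arbitrary choices of ours.

```python
import random

def rand_k_sparsify(v, k, rng):
    # Unbiased random sparsification: keep k of the d coordinates uniformly at
    # random and scale them by d/k, so that E[C(v)] = v.
    d = len(v)
    kept = rng.sample(range(d), k)
    out = [0.0] * d
    for i in kept:
        out[i] = v[i] * d / k
    return out

rng = random.Random(0)
v = [1.0, -2.0, 3.0, 0.5]
trials = 40000
avg = [0.0] * len(v)
for _ in range(trials):
    cv = rand_k_sparsify(v, k=2, rng=rng)
    avg = [a + x / trials for a, x in zip(avg, cv)]
# avg approaches v coordinate-wise, confirming unbiasedness of the compressor.
```

Each machine would apply such a map to its local gradient before communication, trading extra variance for cheaper messages.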
2 Contributions
The highly specific and elaborate structure of the stochastic gradient used in practice, such as that coming from subsampling as in (4) or compression as in (5), raises questions about appropriate theoretical modelling of its second moment. As we shall explain in Section 3, existing approaches do not offer a satisfactory treatment. Indeed, as we show through a simple example, none of the existing assumptions are satisfied even in the simple scenario of subsampling from the sum of two quadratic functions (see Proposition 3).
Our work is motivated by the need for more accurate modelling of the stochastic gradient in nonconvex optimization problems, which we argue leads to a more accurate and informative analysis of SGD in the nonconvex world, a problem of the highest importance in modern deep learning.
The key contributions of our work are:
Inspired by recent developments in the analysis of SGD in the (strongly) convex setting (Richtárik and Takáč, 2017; Gower et al., 2018, 2019), we propose a new assumption, which we call expected smoothness (ES), for modelling the second moment of the stochastic gradient, specifically focusing on nonconvex problems (see Section 4). In particular, we assume that there exist constants $A, B, C \ge 0$ such that
$\mathbb{E}[\|g(x)\|^2] \le 2A\,(f(x) - f^{\inf}) + B\,\|\nabla f(x)\|^2 + C$ for all $x \in \mathbb{R}^d$.
We show in Section 4.3 that (ES) is the weakest, and hence the most general, among all assumptions in the existing literature we are aware of (see Figure 1), including bounded variance (BV) (Ghadimi and Lan, 2013), maximal strong growth (MSG) (Schmidt and Roux, 2013), expected strong growth (ESG) (Vaswani et al., 2019), relaxed growth (RG) (Bottou et al., 2018), and gradient confusion (GC) (Sankararaman et al., 2019), which we review in Section 3.

Moreover, we prove that unlike existing assumptions, which typically implicitly assume that stochasticity comes from perturbation (see Section 4.4), (ES) automatically holds under standard and weak assumptions made on the loss function in settings such as subsampling (see Section 4.5) and compression (see Section 4.6). In this sense, (ES) is not an assumption but an inequality which provably holds and can be used to accurately and more precisely capture the convergence of SGD. For instance, to the best of our knowledge, while the combination of gradient compression and subsampling is not covered by any prior analysis of SGD for nonconvex objectives, our results can be applied to this setting.

We recover the optimal $O(\epsilon^{-4})$ rate for general smooth nonconvex problems and an $O(1/K)$ rate under the PL condition (see Section 5). Moreover, our rates are informative enough for us to be able to deduce, for the first time in the literature on nonconvex SGD, importance sampling probabilities and formulas for the optimal minibatch size (see Section 6).

3 Existing Models of Stochastic Gradient
Ghadimi and Lan (2013) analyze SGD under the assumption that $f$ is lower bounded and that the stochastic gradients are unbiased and have bounded variance:
(BV) $\mathbb{E}[\|g(x) - \nabla f(x)\|^2] \le \sigma^2$.
Note that due to unbiasedness, this is equivalent to
(6) $\mathbb{E}[\|g(x)\|^2] \le \|\nabla f(x)\|^2 + \sigma^2$.
With an appropriately chosen constant stepsize $\gamma$, their results imply an $O(\epsilon^{-4})$ rate of convergence.
In the context of finite-sum problems with uniform sampling, where at each step an index $i$ is sampled uniformly at random and the stochastic gradient estimator used is $g(x) = \nabla f_i(x)$, Schmidt and Roux (2013) introduced the maximal strong growth condition, which requires the inequality
(MSG) $\|\nabla f_i(x)\|^2 \le M\, \|\nabla f(x)\|^2$
to hold almost surely for some $M > 0$. They use (MSG) to prove linear convergence of SGD for strongly convex objectives.
Vaswani et al. (2019) assume (MSG) to hold in expectation rather than uniformly, leading to the expected strong growth condition:
(ESG) $\mathbb{E}[\|g(x)\|^2] \le M\, \|\nabla f(x)\|^2$,
and prove that SGD converges to an $\epsilon$-stationary point in $O(\epsilon^{-2})$ steps. This assumption is quite strong and necessitates interpolation: if $\nabla f(x) = 0$, then $g(x) = 0$ almost surely. This is typically not true in the distributed, finite-sum case where the functions $f_i$ can be very different; see, e.g., (McMahan et al., 2017). Bottou et al. (2018) consider the relaxed growth condition, which is a version of (ESG) featuring an additive constant:
(RG) $\mathbb{E}[\|g(x)\|^2] \le \alpha\, \|\nabla f(x)\|^2 + \beta$.
In view of (6), (RG) can also be seen as a slight generalization of (BV). Using (RG), Bottou et al. (2018) proved convergence for nonconvex objectives to a neighborhood of a stationary point whose size is a function of $\beta$. Unfortunately, (RG) is quite difficult to verify in practice and can be shown not to hold for some simple problems, as the following proposition shows.

Proposition 3. There is a simple finite-sum minimization problem for which (RG) is not satisfied.

The proof of this proposition and all subsequent proofs are relegated to the supplementary material.
In a recent development, Sankararaman et al. (2019) postulate a gradient confusion bound for finite-sum problems and SGD with uniform single-element sampling. This bound requires the existence of $\eta > 0$ such that
(GC) $\langle \nabla f_i(x), \nabla f_j(x) \rangle \ge -\eta$
holds for all $i \neq j$ (and all $x \in \mathbb{R}^d$). For general nonconvex objectives, they prove convergence to a neighborhood only. For functions satisfying the PL condition (Assumption 5.2), they prove linear convergence to a neighborhood of a stationary point.
Lei et al. (2019) analyze SGD for objectives $f$ of the form
(7) $f(x) = \mathbb{E}_{\xi \sim \mathcal{D}}[f(x; \xi)]$.
They assume that $f(\cdot\,; \xi)$ is nonnegative and that its gradient is almost surely Hölder-continuous in $x$, and then use $g(x) = \nabla f(x; \xi)$ for $\xi$ sampled i.i.d. from $\mathcal{D}$. Specialized to the case of smoothness, their assumption reads
(SS) $\|\nabla f(x; \xi)\|^2 \le 2L\, f(x; \xi)$
almost surely for all $x \in \mathbb{R}^d$. We term this condition sure-smoothness (SS). They establish a sample complexity of $O(\epsilon^{-4})$ up to logarithmic factors. Unfortunately, their results do not recover full gradient descent, and their analysis is not easily extendable to compression and subsampling.
4 ES in the Nonconvex World
In this section, we first briefly review the notion of expected smoothness as recently proposed in several contexts different from ours, and use this development to motivate our definition of expected smoothness (ES) for nonconvex problems. We then proceed to show that (ES) is the weakest of all previous assumptions modelling the behaviour of the second moment of the stochastic gradient for nonconvex problems reviewed in Section 3, thus substantiating Figure 1. Finally, we show how (ES) provides a correct and accurate model for the behaviour of the stochastic gradient arising not only from classical perturbation, but also from subsampling and compression.
4.1 Brief history of expected smoothness: from convex quadratic to convex optimization
Our starting point is the work of Richtárik and Takáč (2017) who, motivated by the desire to obtain deeper insights into the workings of the sketch-and-project algorithms developed by Gower and Richtárik (2015), study the behaviour of SGD applied to a reformulation of a consistent linear system as a stochastic convex quadratic optimization problem of the form
(8) $\min_{x \in \mathbb{R}^d} \; \mathbb{E}_{S \sim \mathcal{D}}[f_S(x)]$,
where each $f_S$ is a convex quadratic. The above problem encodes a linear system in the sense that $f$ is nonnegative and equal to zero if and only if $x$ solves the linear system. The distribution $\mathcal{D}$ behind the randomness in their reformulation (8) plays the role of a parameter which can be tuned in order to target specific properties of SGD, such as convergence rate or cost per iteration. The stochastic gradient satisfies the identity $\mathbb{E}[\|\nabla f_S(x)\|^2] = 2 f(x)$, which plays a key role in the analysis. Since $f_S(x^*) = 0$ almost surely for any minimizer $x^*$ (which suggests that their problem is overparameterized), the above identity can be written in the equivalent form
(9) $\mathbb{E}[\|\nabla f_S(x)\|^2] = 2\,(f(x) - f(x^*))$.
Eq. (9) is the first instance of the expected smoothness property/inequality we are aware of. Using tools from matrix analysis, Richtárik and Takáč (2017) are able to obtain identities for the expected iterates of SGD. Kovalev et al. (2018) study the same method for spectral distributions and for some of these establish even stronger identities, suggesting that property (9) has the capacity to enable a perfectly precise mean-square analysis of SGD in this setting.
Expected smoothness as an inequality was later used to analyze the JacSketch method (Gower et al., 2018), a general variance-reduced SGD method that includes the widely-used SAGA algorithm (Defazio et al., 2014) as a special case. By carefully considering and optimizing the expected smoothness constants, Gower et al. (2018) obtain the currently best-known convergence rate for SAGA. Assuming strong convexity and the existence of a global minimizer $x^*$, their assumption in our language reads
(CES) $\mathbb{E}[\|g(x) - g(x^*)\|^2] \le 2A\,(f(x) - f(x^*))$,
where $A > 0$, $g(x)$ is a stochastic gradient, and the expectation is with respect to the randomness embedded in $g$. We refer to the above condition as convex expected smoothness (CES), as it provides a good model of the stochastic gradient of convex objectives. (CES) was subsequently used by Gower et al. (2019) to analyze SGD for quasi-strongly convex functions, which allowed the authors to study a wide array of subsampling strategies with great accuracy and to provide the first formulas for the optimal minibatch size for SGD in the strongly convex regime. These rates are tight up to absolute (non-problem-specific) constants in the setting of convex stochastic optimization (Nguyen et al., 2019).
4.2 Expected smoothness for nonconvex optimization
Given the utility of (CES), it is natural to ask: can we extend (CES) beyond convexity? The first problem we face is that $x^*$ is ill-defined for nonconvex optimization problems, which may not have any global minima. In fact, Gower et al. (2019) use (CES) only through the following direct consequence:
(10) $\mathbb{E}[\|g(x)\|^2] \le 4A\,(f(x) - f(x^*)) + 2\sigma^2$,
where $\sigma^2 = \mathbb{E}[\|g(x^*)\|^2]$. We thus propose to remove $x^*$: we merely ask for a global lower bound $f^{\inf}$ on the function rather than a global minimizer, dispense with the interpretation of $\sigma^2$ as the variance at the optimum, and merely ask for the existence of some such constant. This yields the new condition
(11) $\mathbb{E}[\|g(x)\|^2] \le 2A\,(f(x) - f^{\inf}) + C$
for some $A, C \ge 0$.
While (11) may be satisfactory for the analysis of convex problems, it does not enable us to easily recover the convergence of full gradient descent, or of SGD under strong growth, in the case of nonconvex objectives. As we shall see in further sections, the fix is to add a third term to the bound, which finally leads to our (ES) assumption.

Assumption 4.2 (Expected smoothness). The second moment of the stochastic gradient satisfies
(ES) $\mathbb{E}[\|g(x)\|^2] \le 2A\,(f(x) - f^{\inf}) + B\,\|\nabla f(x)\|^2 + C$
for some $A, B, C \ge 0$ and all $x \in \mathbb{R}^d$.
In the rest of this section, we turn to the generality of Assumption 4.2 and its use in correct and accurate modelling of sources of stochasticity arising in practice.
4.3 Expected smoothness as the weakest assumption
As discussed in Section 3, assumptions on the stochastic gradients abound in the literature on SGD. If we hope for a correct and tighter theory, then we should ask that we at least recover the convergence of SGD under those assumptions. Our next result (Theorem 4.3), stated and proved formally in the supplementary material, shows exactly this: each of the conditions reviewed in Section 3 implies (ES).
4.4 Perturbation
One of the simplest models of stochasticity is the case of additive zero-mean noise with bounded variance, that is,
$g(x) = \nabla f(x) + \xi$,
where $\xi$ is a random variable satisfying $\mathbb{E}[\xi] = 0$ and $\mathbb{E}[\|\xi\|^2] \le \sigma^2$. Because the bounded variance condition (BV) is clearly satisfied, Theorem 4.3 implies that our Assumption 4.2 is also satisfied. While this model can be useful for modelling noise artificially injected into the full gradient (Ge et al., 2015; Fang et al., 2019), it is unreasonably strong for practical sources of noise: indeed, as we saw in Proposition 3, it does not hold for subsampling with just two functions. It is furthermore unable to model rather simple sources of multiplicative noise, such as those arising from gradient compression operators.
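The additive perturbation model can be sketched as follows; the quadratic objective and noise level are illustrative choices of ours.

```python
import random

def grad(x):
    # f(x) = x^2 / 2, so grad f(x) = x.
    return x

def noisy_grad(x, sigma, rng):
    # Additive perturbation model: g(x) = grad f(x) + xi with E[xi] = 0, E[xi^2] = sigma^2.
    return grad(x) + rng.gauss(0.0, sigma)

rng = random.Random(0)
sigma, x = 0.5, 2.0
samples = [noisy_grad(x, sigma, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# mean is close to grad f(x) = 2.0 and var is close to sigma^2 = 0.25,
# so (BV) holds by construction for this noise model.
```

Note that the noise here is independent of $x$, which is precisely what makes this model too strong for subsampling or compression noise.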
4.5 Subsampling
Now consider $f$ with the finite-sum structure (3). In order to develop a general theory of SGD for a wide array of subsampling strategies, we follow the stochastic reformulation formalism pioneered by Richtárik and Takáč (2017) and Gower et al. (2018), in the form proposed in Gower et al. (2019). Given a sampling vector $v = (v_1, \dots, v_n) \in \mathbb{R}^n$ drawn from some user-defined distribution $\mathcal{D}$ (where a sampling vector is one such that $\mathbb{E}[v_i] = 1$ for all $i$), we define the random function $f_v(x) = \frac{1}{n} \sum_{i=1}^{n} v_i f_i(x)$. Noting that $\mathbb{E}[f_v(x)] = f(x)$, we reformulate (3) as the stochastic optimization problem
(12) $\min_{x \in \mathbb{R}^d} \; \mathbb{E}_{v \sim \mathcal{D}}[f_v(x)]$,
where we assume access only to unbiased estimates of $\nabla f$ through the stochastic realizations
(13) $\nabla f_v(x) = \frac{1}{n} \sum_{i=1}^{n} v_i \nabla f_i(x)$.
That is, given the current point $x_k$, we sample $v_k \sim \mathcal{D}$ and set $g(x_k) = \nabla f_{v_k}(x_k)$. We will now show that (ES) is satisfied under very mild and natural assumptions on the functions $f_i$ and the sampling vectors $v$. In that sense, (ES) is not an additional assumption; it is an inequality that is automatically satisfied.
Assumption 4.5. Each $f_i$ is bounded from below by $f_i^{\inf}$ and is $L_i$-smooth: that is, for all $x, y \in \mathbb{R}^d$ we have $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|$.
To show that Assumption 4.2 is an automatic consequence of Assumption 4.5, we rely on the following crucial lemma.
Let $f$ be a function for which Assumption 1 is satisfied. Then for all $x \in \mathbb{R}^d$ we have
(14) $\|\nabla f(x)\|^2 \le 2L\,(f(x) - f^{\inf})$.
This lemma shows up in several recent works and is often used in conjunction with other assumptions such as bounded variance (Li and Orabona, 2019) and convexity (Stich and Karimireddy, 2019). Lei et al. (2019) also use a version of it to prove the convergence of SGD for nonconvex objectives, and we compare our results against theirs in Section 5.
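Inequality (14) can be checked numerically; the two test functions below (a quadratic, for which the lemma is tight, and the softplus function, with $L = 1/4$ and infimum $0$) are our own illustrative choices.

```python
import math

def check(f, grad, L, f_inf, xs):
    # Verify ||grad f(x)||^2 <= 2L * (f(x) - f_inf) pointwise (with float slack).
    return all(grad(x) ** 2 <= 2.0 * L * (f(x) - f_inf) + 1e-12 for x in xs)

xs = [x / 10.0 for x in range(-100, 101)]

# Quadratic f(x) = (L/2) x^2 with L = 3: the lemma holds with equality.
ok_quad = check(lambda x: 1.5 * x * x, lambda x: 3.0 * x, L=3.0, f_inf=0.0, xs=xs)

# Softplus f(x) = log(1 + e^x): L = 1/4 and inf f = 0 (approached as x -> -inf).
ok_soft = check(lambda x: math.log1p(math.exp(x)),
                lambda x: 1.0 / (1.0 + math.exp(-x)), L=0.25, f_inf=0.0, xs=xs)
```

The quadratic case shows the constant $2L$ in (14) cannot be improved in general.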
Proposition 4.5. Suppose that Assumption 4.5 holds and that $\mathbb{E}[v_i^2]$ is finite for all $i$. Then $g(x) = \nabla f_v(x)$ is an unbiased estimator of $\nabla f(x)$, and Assumption 4.2 holds with constants $A$, $B$, and $C$ determined by the smoothness constants $L_i$, the lower bounds $f_i^{\inf}$, and the second moments $\mathbb{E}[v_i^2]$.
The condition that $\mathbb{E}[v_i^2]$ is finite is a very mild condition on the distribution $\mathcal{D}$ and is satisfied by virtually all practical subsampling schemes in the literature. However, the generality of Proposition 4.5 comes at a cost: the bounds are too pessimistic. By making more specific (and practical) choices of the sampling distribution $\mathcal{D}$, we can get much tighter bounds. We do this by considering some representative sampling distributions next, without aiming to be exhaustive.
Sampling with replacement. An $n$-sided die is rolled a total of $\tau$ times, and the number of times side $i$ shows up is recorded as $S_i$. We can then define
(15) $v_i = \frac{S_i}{\tau q_i}$,
where $q_i > 0$ is the probability that the $i$-th side of the die comes up and $\sum_{i=1}^{n} q_i = 1$. In this case, the number of stochastic gradients queried is always $\tau$.
Independent sampling without replacement. We generate a random subset $S \subseteq \{1, \dots, n\}$ and define
(16) $v_i = \frac{\mathbb{1}(i \in S)}{p_i}$,
where $p_i = \mathrm{Prob}(i \in S) > 0$. We assume that each index $i$ is included in $S$ with probability $p_i$ independently of all the others. In this case, the number of stochastic gradients queried is not fixed, but has expectation $\sum_{i=1}^{n} p_i$.
$\tau$-nice sampling without replacement. This is similar to the previous sampling, but we generate the random subset $S$ by choosing uniformly from all subsets of size $\tau$, for an integer $\tau \in \{1, \dots, n\}$. We define $v_i$ as in (16), and it is easy to see that $p_i = \tau/n$ for all $i$.
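The three sampling distributions above can be sketched as follows; the problem size, die probabilities, and minibatch size are arbitrary illustrative choices. The empirical means confirm the defining property $\mathbb{E}[v_i] = 1$.

```python
import random

n, tau = 5, 2
q = [0.1, 0.1, 0.2, 0.3, 0.3]   # die probabilities (sum to 1), sampling with replacement
p = [tau / n] * n                # inclusion probabilities for the other two schemes

def with_replacement(rng):
    # Roll an n-sided die tau times; v_i = (number of times i came up) / (tau * q_i).
    counts = [0] * n
    for _ in range(tau):
        counts[rng.choices(range(n), weights=q)[0]] += 1
    return [counts[i] / (tau * q[i]) for i in range(n)]

def independent_without_replacement(rng):
    # Include each i independently with probability p_i; v_i = 1{i in S} / p_i.
    return [(1.0 if rng.random() < p[i] else 0.0) / p[i] for i in range(n)]

def nice_sampling(rng):
    # S is a uniformly random subset of size tau; here p_i = tau / n for all i.
    S = set(rng.sample(range(n), tau))
    return [n / tau if i in S else 0.0 for i in range(n)]

def mean_v(sampler, trials=20000, seed=0):
    rng = random.Random(seed)
    acc = [0.0] * n
    for _ in range(trials):
        acc = [a + vi / trials for a, vi in zip(acc, sampler(rng))]
    return acc
# For each scheme, mean_v(scheme) is close to [1, 1, 1, 1, 1], i.e. E[v_i] = 1,
# which is exactly what makes the estimator (13) unbiased.
```

All three schemes are unbiased; they differ in the second moments $\mathbb{E}[v_i^2]$, and hence in the (ES) constants they induce.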
These sampling distributions were considered in the context of SGD for convex objective functions in (Gorbunov et al., 2019; Gower et al., 2019). We show next that Assumption 4.2 is satisfied for these distributions with much better constants than the generic Proposition 4.5 would suggest.
Suppose that Assumptions 1 and 4.5 hold. Then:
(i) For independent sampling with replacement, Assumption 4.2 is satisfied with constants $A$, $B$, and $C$ depending on the minibatch size $\tau$, the probabilities $q_i$, and the smoothness constants $L_i$.
(ii) For independent sampling without replacement, Assumption 4.2 is satisfied with constants depending on the inclusion probabilities $p_i$ and the constants $L_i$.
(iii) For $\tau$-nice sampling without replacement, Assumption 4.2 is satisfied with constants depending on $\tau$, $n$, and the constants $L_i$.
4.6 Compression
We now further show that our framework is general enough to capture the convergence of stochastic gradient quantization and compression schemes. Consider the finite-sum problem (3) and let $g_1(x), \dots, g_n(x)$ be stochastic gradients such that $\mathbb{E}[g_i(x)] = \nabla f_i(x)$. We construct an estimator $g(x)$ via
(17) $g(x) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{C}_i(g_i(x))$,
where the compression operators $\mathcal{C}_i$ are sampled independently for all $i$ and across all iterations. Clearly, this generalizes (5). We consider the following class of compression operators.

Assumption 4.6. We say that a stochastic operator $\mathcal{C} \colon \mathbb{R}^d \to \mathbb{R}^d$ is an $\Omega$-compression operator if for all $v \in \mathbb{R}^d$,
(18) $\mathbb{E}[\mathcal{C}(v)] = v$ and $\mathbb{E}[\|\mathcal{C}(v)\|^2] \le \Omega\, \|v\|^2$.
Assumption 4.6 is mild and is satisfied by many compression operators in the literature, including random dithering (Alistarh et al., 2017), random sparsification, block quantization (Horváth et al., 2019), and others. The next proposition then shows that if the stochastic gradients $g_i$ themselves satisfy Assumption 4.2 with respect to their respective functions $f_i$, then $g$ also satisfies Assumption 4.2.
Suppose that a stochastic gradient estimator $g$ is constructed via (17), where each $\mathcal{C}_i$ is a compressor satisfying Assumption 4.6. Suppose further that each stochastic gradient $g_i$ satisfies $\mathbb{E}[g_i(x)] = \nabla f_i(x)$ and Assumption 4.2 with constants $A_i, B_i, C_i$. Then there exist constants $A, B, C \ge 0$ such that $g$ satisfies Assumption 4.2.
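A minimal check of Assumption 4.6 for a concrete compressor: keeping a single uniformly chosen coordinate, scaled by $d$, is unbiased and satisfies (18) with $\Omega = d$. This particular compressor and test vector are our own illustration.

```python
import random

def rand_one(v, rng):
    # Keep one uniformly chosen coordinate and scale it by d; then E[C(v)] = v.
    d = len(v)
    i = rng.randrange(d)
    out = [0.0] * d
    out[i] = d * v[i]
    return out

rng = random.Random(0)
v = [1.0, -2.0, 0.5]
d = len(v)
norm_sq = sum(x * x for x in v)   # ||v||^2 = 5.25
trials = 30000
second_moment = sum(sum(x * x for x in rand_one(v, rng))
                    for _ in range(trials)) / trials
# E||C(v)||^2 = d * ||v||^2 exactly for this compressor, so (18) holds with
# Omega = d; the Monte Carlo estimate above should be close to d * ||v||^2.
```

Stronger compression (larger $\Omega$) saves more communication at the price of larger second moments, and hence larger (ES) constants.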
5 SGD in the Nonconvex World
5.1 General convergence theory
Our main convergence result relies on the following key lemma.

Lemma 5.1. Suppose that Assumptions 1 and 4.2 are satisfied, and choose a constant stepsize $\gamma > 0$ such that $\gamma L B \le 1$. Then a suitably weighted sum of the terms $\mathbb{E}[\|\nabla f(x_k)\|^2]$, $k = 0, \dots, K-1$, is bounded in terms of $f(x_0) - f^{\inf}$, $\gamma$, $L$, and the constants $A$ and $C$, with appropriately chosen weights $w_k$.
Lemma 5.1 bounds a weighted sum of stochastic gradients over the entire run of the algorithm. This idea of weighting different iterates has been used in the analysis of SGD in the convex case (Rakhlin et al., 2012; Shamir and Zhang, 2013; Stich, 2019) typically with the goal of returning a weighted average of the iterates at the end. In contrast, we only use the weighting to facilitate the proof.
Theorem 5.1. Suppose that Assumptions 1 and 4.2 hold, and that the stepsize $\gamma$ satisfies $\gamma L B \le 1$. Then, letting $\delta_0 = f(x_0) - f^{\inf}$, the minimum of $\mathbb{E}[\|\nabla f(x_k)\|^2]$ over the first $K$ iterations is bounded by a quantity that scales with $\delta_0/(\gamma K)$ and $\gamma L C$, inflated by a factor that may grow exponentially in $K$ when $A > 0$.
While the bound of Theorem 5.1 shows possible exponential blowup, we can show that by carefully controlling the stepsize we can nevertheless attain an $\epsilon$-stationary point given $O(\epsilon^{-4})$ stochastic gradient evaluations. This dependence is in fact optimal for SGD without extra assumptions, such as second-order smoothness of $f$ or special structure of the stochastic gradient noise (Drori and Shamir, 2019). We use a stepsize similar to that of Ghadimi and Lan (2013).
Corollary 5.1. Fix $\epsilon > 0$ and choose the stepsize $\gamma$ appropriately as a function of $\epsilon$, $L$, $A$, $B$, $C$, and $\delta_0 = f(x_0) - f^{\inf}$. Then, provided that the number of iterations satisfies
(19) $K = O\!\left( \frac{\delta_0 L B}{\epsilon^2} + \frac{\delta_0 L\, (A \delta_0 + C)}{\epsilon^4} \right)$,
we have $\min_{0 \le k < K} \mathbb{E}[\|\nabla f(x_k)\|^2] \le \epsilon^2$.
As a start, the iteration complexity given by (19) recovers full gradient descent: plugging in $B = 1$ and $A = C = 0$ shows that we require a total of $O\big(L (f(x_0) - f^{\inf})/\epsilon^2\big)$ iterations to reach an $\epsilon$-stationary point. This is the standard rate of convergence for gradient descent on nonconvex objectives (Beck, 2017), up to absolute (non-problem-specific) constants.
Plugging in $A = C = 0$ and taking $B$ to be any nonnegative constant recovers the fast convergence of SGD under strong growth (ESG). Our bounds are similar to those of Lei et al. (2019), but improve upon them by recovering full gradient descent, assuming smoothness only in expectation, and attaining the optimal rate without logarithmic terms.
5.2 Convergence under the PL condition
One of the popular generalizations of strong convexity in the literature is the Polyak-Łojasiewicz (PL) condition (Karimi et al., 2016; Lei et al., 2019). We first define this condition and then establish convergence of SGD for functions satisfying it and our (ES) assumption. In the rest of this section, we denote $\delta_k = \mathbb{E}[f(x_k)] - f^{\inf}$, where $x_k$ is the $k$-th iterate of SGD.

Assumption 5.2 (PL condition). We say that a differentiable function $f$ satisfies the Polyak-Łojasiewicz condition with constant $\mu > 0$ if for all $x \in \mathbb{R}^d$,
$\|\nabla f(x)\|^2 \ge 2\mu\,(f(x) - f^{\inf})$.
We rely on the following lemma, in which we use the stepsize sequence recently introduced by Stich (2019), but without iterate averaging, as averaging in general may not make sense for nonconvex models.

Lemma 5.2. Consider a nonnegative sequence $(r_k)_{k \ge 0}$ satisfying the recursion
(20) $r_{k+1} \le (1 - a \gamma_k)\, r_k + c\, \gamma_k^2$
for all $k$, where $a, c \ge 0$ and the stepsizes satisfy $\gamma_k \le 1/b$ for all $k$, for some $b \ge a$. Fix the horizon $K$. Then the stepsizes $\gamma_k$ can be chosen, constant at first and decreasing thereafter, so that $r_K$ decays at an $O(1/K)$ rate, up to a term decaying exponentially in $K$.
Using the stepsize scheme of Lemma 5.2, we can show that SGD finds a globally optimal solution at an $O(1/K)$ rate, where $K$ is the total number of iterations.
Theorem 5.2. Suppose that Assumptions 1, 4.2, and 5.2 hold, and that SGD is run for $K$ iterations with the stepsize sequence of Lemma 5.2. Then $\mathbb{E}[f(x_K)] - f^{\inf}$ decays at an $O(1/K)$ rate,
where the constants are governed by $\kappa = L/\mu$, the condition number of $f$, and by a stochastic condition number $\rho$ determined by the ratio of the expected smoothness constant $A$ to the PL constant $\mu$.
The next corollary recovers the $O(1/\epsilon)$ convergence rate for strongly convex functions, which is the optimal dependence on the accuracy $\epsilon$ (Nguyen et al., 2019).

Corollary 5.2. In the same setting as Theorem 5.2, fix $\epsilon > 0$. Then $\mathbb{E}[f(x_K)] - f^{\inf} \le \epsilon$ as long as the number of iterations $K$ is of order $1/\epsilon$, with a multiplicative dependence on the condition numbers $\kappa$ and $\rho$.

While the dependence on $\epsilon$ is optimal, the situation is different when we consider the dependence on problem constants, and in particular on the condition numbers: Corollary 5.2 shows a possibly multiplicative dependence on $\kappa \rho$. This is different for objectives where we assume convexity, and we show this next. It is known that the PL condition implies the quadratic functional growth (QFG) condition (Necoara et al., 2019), and is in fact equivalent to it for convex and smooth objectives (Karimi et al., 2016). We will adopt this assumption in conjunction with the convexity of $f$ for our next result.
We say that a convex function $f$ satisfies the quadratic functional growth condition if
(21) $f(x) - f^* \ge \frac{\mu}{2}\, \|x - [x]^*\|^2$
for all $x \in \mathbb{R}^d$, where $f^*$ is the minimum value of $f$ and $[x]^*$ is the projection of $x$ onto the set of minima $\mathcal{X}^*$.
There are only a handful of results under QFG (Drusvyatskiy and Lewis, 2018; Necoara et al., 2019; Sun, 2019), none of which is for SGD with smooth losses. For QFG in conjunction with convexity and expected smoothness, we can prove convergence in function values similar to Theorem 5.2.
The resulting theorem allows much larger stepsizes than those permitted in Nguyen et al. (2019). Hence, it improves upon the results of Nguyen et al. (2019) in the context of finite-sum problems where the individual functions $f_i$ are smooth but possibly nonconvex, and the average $f$ is strongly convex.
The following straightforward corollary of Theorem 5.2 shows that when convexity is assumed, we can get a dependence on the sum of the condition numbers rather than their product. This is a significant difference from the nonconvex setting, and it is not known whether it is an artifact of our analysis or an inherent difference.
6 Importance Sampling and Optimal Minibatch Size
As an example application of our results, we consider importance sampling: choosing the sampling distribution to maximize convergence speed. We consider independent sampling with replacement with minibatch size $\tau$. Plugging the bound on $A$ from Proposition 4.5 into the sample complexity from Corollary 5.1 yields
(22) an iteration complexity whose dependence on the sampling probabilities $q_1, \dots, q_n$ enters through the constant $A$.
Optimizing (22) over the probabilities yields the sampling distribution
(23) $q_i = \frac{L_i}{\sum_{j=1}^{n} L_j}$.
The same sampling distribution has appeared in the literature before (Zhao and Zhang, 2015; Needell et al., 2016), and our work is the first to give it justification for SGD on nonconvex objectives. Plugging the distribution (23) into (22) and considering the total number of stochastic gradient evaluations $\tau K$, we obtain an expression involving the average smoothness constant $\bar{L} = \frac{1}{n} \sum_{i=1}^{n} L_i$, which is minimized by an explicit choice of the minibatch size $\tau$. Similar expressions for importance sampling and the optimal minibatch size can be obtained for other sampling distributions, as in (Gower et al., 2019).
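Assuming (23) takes the form $q_i = L_i / \sum_j L_j$ (the importance-sampling distribution also used by Zhao and Zhang (2015) and Needell et al. (2016)), the following hypothetical snippet forms it for squared losses $f_i(x) = (a_i^\top x - b_i)^2$, whose smoothness constants are $L_i = 2\|a_i\|^2$.

```python
# Squared losses f_i(x) = (a_i^T x - b_i)^2 are L_i-smooth with L_i = 2 * ||a_i||^2.
# The data rows below are hypothetical.
rows = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]]
L = [2.0 * sum(a * a for a in row) for row in rows]   # [2.0, 50.0, 8.0]
total = sum(L)
q = [Li / total for Li in L]                          # importance probabilities, sum to 1
# q up-weights the rows with large smoothness constants, which would otherwise
# dominate the variance of uniform subsampling.
```

Rows with large norms (hence large $L_i$) are sampled more often, exactly the effect tested empirically in Section 7.1.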
7 Experiments
7.1 Linear regression with a nonconvex regularizer
We first consider a linear regression problem with nonconvex regularization to test the importance sampling scheme given in Section 6:
(24) $f(x) = \frac{1}{n} \sum_{i=1}^{n} (a_i^\top x - b_i)^2 + \lambda \sum_{j=1}^{d} \frac{x_j^2}{1 + x_j^2}$,
where the data $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$ are synthetically generated and $\lambda > 0$ is the regularization parameter. We fix the problem dimensions, $\lambda$, and the initialization $x_0$. We sample minibatches of size $\tau$ with replacement and use a stepsize that depends on the number of iterations $K$ and on the constant $A$ from Proposition 4.5. Similar to Needell and Ward (2017), we illustrate the utility of importance sampling by sampling the rows $a_i$ from zero-mean Gaussians of different variances, without normalizing them. Since the smoothness constants $L_i$ then vary substantially across rows, we can expect importance sampling to outperform uniform sampling in this case. However, when we normalize the rows, the two methods should not be very different, and Figure 2 (of a single evaluation run) shows this.
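A toy reimplementation of objective (24) follows; the data, $\lambda$, stepsize, and the exact form of the regularizer $x_j^2/(1+x_j^2)$ are assumptions made for illustration and do not reproduce the paper's experimental setup.

```python
import random

def reg(t):
    # Assumed nonconvex regularizer term t^2 / (1 + t^2); its gradient is below.
    return t * t / (1.0 + t * t)

def reg_grad(t):
    return 2.0 * t / (1.0 + t * t) ** 2

def loss(x, A, b, lam):
    n = len(A)
    fit = sum((sum(ai * xi for ai, xi in zip(row, x)) - bi) ** 2
              for row, bi in zip(A, b)) / n
    return fit + lam * sum(reg(xj) for xj in x)

def stoch_grad(x, A, b, lam, rng):
    # Uniform single-element subsampling of the data-fit term; the (cheap)
    # regularizer gradient is computed exactly.
    i = rng.randrange(len(A))
    r = sum(ai * xi for ai, xi in zip(A[i], x)) - b[i]
    return [2.0 * r * A[i][j] + lam * reg_grad(x[j]) for j in range(len(x))]

rng = random.Random(0)
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
lam, gamma = 0.1, 0.01
x = [0.0, 0.0]
initial_loss = loss(x, A, b, lam)
for _ in range(3000):
    g = stoch_grad(x, A, b, lam, rng)
    x = [xj - gamma * gj for xj, gj in zip(x, g)]
final_loss = loss(x, A, b, lam)
# SGD drives the loss well below its initial value on this tiny instance.
```

Swapping the uniform index draw for the importance probabilities of Section 6 is a one-line change via `random.choices`.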
7.2 Logistic regression with a nonconvex regularizer
Table 1. Fitted constants in the regularized logistic regression problem, comparing the predicted (ES) constants (per Proposition 4.5) with the fitted (ES) and (RG) constants, together with the residual (mean square error) of each fit; see Section 14.1 in the supplementary material for more discussion of this table.

We now consider the regularized logistic regression problem from Tran-Dinh et al. (2019) with the aim of testing the fit of our Assumption 4.2 compared to other assumptions. The problem has the same form as (24), but with the logistic loss $\log(1 + \exp(-b_i\, a_i^\top x))$ for given $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, 1\}$. We run experiments on a dataset from LIBSVM (Chang and Lin, 2011). We fix the regularization parameter $\lambda$ and run SGD with a stepsize chosen as in the previous experiment. We use uniform sampling with replacement and measure the average squared stochastic gradient norm every five iterations, in addition to the loss and the squared gradient norm. We then run nonnegative linear least squares regression to fit the data for expected smoothness (ES), and compare to relaxed growth (RG). We also compare with theoretically estimated constants for (ES). The result is in Table 1, where we see a tight fit between our theory and the observed behaviour. The experimental setup and estimation details are explained more thoroughly in Section 14.1 in the supplementary material.
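The estimation procedure can be sketched as follows: given measured triples $(\mathbb{E}\|g\|^2,\, f - f^{\inf},\, \|\nabla f\|^2)$, fit nonnegative constants in (ES) by nonnegative least squares. The projected-gradient solver and synthetic data below are our own illustrative stand-ins for the paper's estimation code.

```python
def fit_es(samples, steps=30000, lr=0.005):
    # Projected gradient descent for nonnegative least squares over (2A, B, C):
    # minimize sum_j (2A * df_j + B * gn_j + C - y_j)^2 subject to 2A, B, C >= 0,
    # where y = E||g||^2, df = f - f_inf, gn = ||grad f||^2.
    a = b = c = 0.0
    m = len(samples)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for y, df, gn in samples:
            r = a * df + b * gn + c - y
            ga += 2.0 * r * df / m
            gb += 2.0 * r * gn / m
            gc += 2.0 * r / m
        a = max(a - lr * ga, 0.0)
        b = max(b - lr * gb, 0.0)
        c = max(c - lr * gc, 0.0)
    return a, b, c

# Synthetic check: triples generated exactly from (ES) with 2A = 1.5, B = 0.5, C = 0.2.
samples = [(1.5 * df + 0.5 * gn + 0.2, df, gn)
           for df in (0.1, 0.5, 1.0, 2.0) for gn in (0.0, 0.3, 1.0, 4.0)]
two_a, b_fit, c_fit = fit_es(samples)
# (two_a, b_fit, c_fit) recovers approximately (1.5, 0.5, 0.2).
```

On real measurements the fit is of course not exact; the residual then quantifies how well (ES), versus (RG), explains the observed second moments.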
References
 QSGD: CommunicationEfficient SGD via Gradient Quantization and Encoding. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1709–1720. Cited by: §1.2, §4.6.
 FirstOrder Methods in Optimization. edition, Society for Industrial and Applied Mathematics, Philadelphia, PA. External Links: Document Cited by: §5.1.
 Demystifying Parallel and Distributed Deep Learning: An InDepth Concurrency Analysis. ACM Comput. Surv. 52 (4). External Links: ISSN 03600300, Document Cited by: §1.2.
 Optimization Methods for LargeScale Machine Learning. SIAM Review 60 (2), pp. 223–311. External Links: Document Cited by: §2, §3.

LIBSVM: A library for support vector machines
. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 27. Cited by: §7.2.  SAGA: A Fast Incremental Gradient Method with Support for NonStrongly Convex Composite Objectives. In Proceedings of the 27th International Conference on Neural Information Processing Systems  Volume 1, NIPS’14, Cambridge, MA, USA, pp. 1646–1654. Cited by: §4.1.
 Optimal Distributed Online Prediction Using MiniBatches. J. Mach. Learn. Res. 13 (null), pp. 165–202. External Links: ISSN 15324435 Cited by: §1.2.
 The Complexity of Finding Stationary Points with Stochastic Gradient Descent. arXiv preprint arXiv:1910.01845. Cited by: §5.1.
 Error Bounds, Quadratic Growth, and Linear Convergence of Proximal Methods. Mathematics of Operations Research 43 (3), pp. 919–948. Cited by: §5.2.
 Sharp Analysis for Nonconvex SGD Escaping from Saddle Points. In Proceedings of the ThirtySecond Conference on Learning Theory, A. Beygelzimer and D. Hsu (Eds.), Proceedings of Machine Learning Research, Vol. 99, Phoenix, USA, pp. 1192–1234. Cited by: §4.4.

Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition
. In Proceedings of The 28th Conference on Learning Theory, P. Grünwald, E. Hazan, and S. Kale (Eds.), Proceedings of Machine Learning Research, Vol. 40, Paris, France, pp. 797–842. Cited by: §4.4.  Stochastic First and ZerothOrder Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. External Links: Document Cited by: §2, §3, §5.1.
 Deep Learning. The MIT Press. External Links: ISBN 0262035618 Cited by: §1.
 A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent. arXiv preprint arXiv:1905.11261. Cited by: §4.5.
 Stochastic quasigradient methods: variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632. Cited by: §2, §4.1, §4.5.
 SGD: General Analysis and Improved Rates. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5200–5209. Cited by: §1.2, §2, §4.1, §4.2, §4.5, §4.5, §6.
 Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications 36 (4), pp. 1660–1690. Cited by: §4.1.
 Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15, pp. 1737–1746. Cited by: §1.2, §1.2.
 Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115. Cited by: §1.2, §4.6.
 Linear Convergence of Gradient and ProximalGradient Methods Under the PolyakLojasiewicz Condition. In European Conference on Machine Learning and Knowledge Discovery in Databases  Volume 9851, ECML PKDD 2016, Berlin, Heidelberg, pp. 795–811. Cited by: §5.2, §5.2.
 Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573. Cited by: §1.2.
 Stochastic spectral and conjugate descent methods. In Advances in Neural Information Processing Systems, Vol. 31, pp. 3358–3367. Cited by: §4.1.

 Stochastic Gradient Descent for Nonconvex Learning Without Bounded Gradient Assumptions. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–7. ISSN 2162-2388. Cited by: §3, §4.5, §5.1, §5.2.
 On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 983–992. Cited by: §4.5.

 Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 1273–1282. Cited by: §3.
 Linear Convergence of First Order Methods for Non-Strongly Convex Optimization. Mathematical Programming 175 (1–2), pp. 69–107. ISSN 0025-5610. Cited by: §13.1, §5.2, §5.2.
 Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming 155 (1), pp. 549–573. ISSN 1436-4646. Cited by: §1.2, §6.
 Batched Stochastic Gradient Descent with Weighted Sampling. In Approximation Theory XV: San Antonio 2016, G. E. Fasshauer and L. L. Schumaker (Eds.), Cham, pp. 279–306. Cited by: §7.1.
 Problem Complexity and Method Efficiency in Optimization. Wiley, New York. Cited by: §1.
 Tight Dimension Independent Lower Bound on the Expected Convergence Rate for Diminishing Step Sizes in SGD. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3660–3669. Cited by: §4.1, §5.2, §5.2.
 Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, Madison, WI, USA, pp. 1571–1578. ISBN 978-1-4503-1285-1. Cited by: §5.1.
 Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory. arXiv preprint arXiv:1706.01108. Cited by: §2, §4.1, §4.1, §4.5.
 The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent. arXiv preprint arXiv:1904.06963. Cited by: §2, §3.
 Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition. arXiv preprint arXiv:1308.6370. Cited by: §2, §3.
 Understanding machine learning: from theory to algorithms. Cambridge University Press. Cited by: §1.2.
 Stochastic Gradient Descent for Non-Smooth Optimization: Convergence Results and Optimal Averaging Schemes. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, pp. I-71–I-79. Cited by: §5.1.
 The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication. arXiv preprint arXiv:1909.05350. Cited by: §4.5.
 Unified Optimal Analysis of the (Stochastic) Gradient Method. arXiv preprint arXiv:1907.04232. Cited by: §11.2, §12.1, §5.1, §5.2.
 Optimization for deep learning: theory and algorithms. arXiv preprint arXiv:1912.08957. Cited by: §1, §5.2.
 Hybrid Stochastic Gradient Descent Algorithms for Stochastic Nonconvex Optimization. arXiv preprint arXiv:1905.05920. Cited by: §7.2.

 Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 1195–1204. Cited by: §2, §3.
 Complexity Issues in Global Optimization: A Survey. In Handbook of Global Optimization. Nonconvex Optimization and Its Applications, R. Horst and P. M. Pardalos (Eds.), Vol. 2. Cited by: §1.
 Gradient Sparsification for Communication-Efficient Distributed Optimization. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 1299–1309. Cited by: §1.2.
 Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15, pp. 1–9. Cited by: §6.
8 Basic Facts and Notation
We will use the following facts from probability theory: if $X$ is a random vector and $c$ is a constant vector, then
(25) $\mathbb{E}\left[\left\|X - \mathbb{E}\left[X\right]\right\|^{2}\right] = \mathbb{E}\left[\left\|X\right\|^{2}\right] - \left\|\mathbb{E}\left[X\right]\right\|^{2}$,
(26) $\mathbb{E}\left[\left\|X - c\right\|^{2}\right] = \mathbb{E}\left[\left\|X - \mathbb{E}\left[X\right]\right\|^{2}\right] + \left\|\mathbb{E}\left[X\right] - c\right\|^{2}$.
A consequence of (26) is the following:
(27) $\mathbb{E}\left[\left\|X - \mathbb{E}\left[X\right]\right\|^{2}\right] \leq \mathbb{E}\left[\left\|X - c\right\|^{2}\right]$.
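These identities can be sanity-checked numerically. The sketch below verifies (25)–(27) for an illustrative discrete random vector; all specific values are assumptions made for the example.

```python
import numpy as np

# Hypothetical discrete random vector X taking value xs[k] with probability ps[k].
xs = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]])  # possible outcomes of X
ps = np.array([0.2, 0.5, 0.3])                        # their probabilities
c = np.array([1.0, 1.0])                              # an arbitrary constant vector

mean = ps @ xs                                 # E[X]
var = ps @ np.sum((xs - mean) ** 2, axis=1)    # E||X - E[X]||^2
second_moment = ps @ np.sum(xs ** 2, axis=1)   # E||X||^2

# (25): E||X - E[X]||^2 = E||X||^2 - ||E[X]||^2
assert np.isclose(var, second_moment - np.sum(mean ** 2))

# (26): E||X - c||^2 = E||X - E[X]||^2 + ||E[X] - c||^2
lhs = ps @ np.sum((xs - c) ** 2, axis=1)
assert np.isclose(lhs, var + np.sum((mean - c) ** 2))

# (27): consequently, E||X - E[X]||^2 <= E||X - c||^2 for every constant c
assert var <= lhs + 1e-12
```

The identities hold for any distribution; the discrete case above is just the easiest to compute exactly.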
We will also make use of the following facts from linear algebra: for any $a, b \in \mathbb{R}^{d}$ and any $t > 0$,
(28) $\left\|a + b\right\|^{2} \leq (1 + t)\left\|a\right\|^{2} + \left(1 + t^{-1}\right)\left\|b\right\|^{2}$,
(29) $\left\|a + b\right\|^{2} \leq 2\left\|a\right\|^{2} + 2\left\|b\right\|^{2}$.
For vectors $a_{1}, \ldots, a_{n}$, all in $\mathbb{R}^{d}$, the convexity of the squared norm and a trivial application of Jensen's inequality yield the following inequality:
(30) $\left\|\frac{1}{n}\sum_{i=1}^{n} a_{i}\right\|^{2} \leq \frac{1}{n}\sum_{i=1}^{n}\left\|a_{i}\right\|^{2}$.
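Assuming (28) and (29) are the standard Young-type inequalities $\|a+b\|^{2} \leq (1+t)\|a\|^{2} + (1+t^{-1})\|b\|^{2}$ and the special case $t = 1$, and (30) is Jensen's inequality for the squared norm, a quick numerical check (all vectors are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(5), rng.standard_normal(5)
t = 0.7  # any positive parameter works

sq = lambda v: np.sum(v ** 2)  # squared Euclidean norm

# (28): ||a + b||^2 <= (1 + t) ||a||^2 + (1 + 1/t) ||b||^2 for any t > 0
assert sq(a + b) <= (1 + t) * sq(a) + (1 + 1 / t) * sq(b) + 1e-12

# (29): the special case t = 1: ||a + b||^2 <= 2 ||a||^2 + 2 ||b||^2
assert sq(a + b) <= 2 * sq(a) + 2 * sq(b) + 1e-12

# (30): Jensen for the squared norm: ||(1/n) sum a_i||^2 <= (1/n) sum ||a_i||^2
A = rng.standard_normal((4, 5))  # n = 4 vectors a_1, ..., a_n in R^5
assert sq(A.mean(axis=0)) <= np.mean([sq(row) for row in A]) + 1e-12
```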
For an $L$-smooth function $f$ we have that for all $x, y \in \mathbb{R}^{d}$,
(31) $f(y) \leq f(x) + \left\langle \nabla f(x), y - x \right\rangle + \frac{L}{2}\left\|y - x\right\|^{2}$,
(32) $\left\|\nabla f(x) - \nabla f(y)\right\| \leq L\left\|x - y\right\|$.
Given a point $x \in \mathbb{R}^{d}$ and a stepsize $\gamma > 0$, we define the one-step gradient descent mapping as
(33) $x^{+} := x - \gamma \nabla f(x)$.
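As an illustration of the quadratic upper bound (31) and the one-step mapping (33), the sketch below checks both on an assumed two-dimensional quadratic; the matrix and the random points are illustrative choices, not part of the paper.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x, which is L-smooth with
# L equal to the largest eigenvalue of A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

rng = np.random.default_rng(1)
x, y = rng.standard_normal(2), rng.standard_normal(2)

# (31): f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2
assert f(y) <= f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2) + 1e-12

# (33): one-step gradient descent mapping x+ = x - gamma * grad f(x)
gamma = 1.0 / L
x_plus = x - gamma * grad(x)

# Plugging y = x+ into (31) gives the classical per-step decrease
# f(x+) <= f(x) - (gamma / 2) ||grad f(x)||^2 whenever gamma <= 1/L.
assert f(x_plus) <= f(x) - 0.5 * gamma * np.sum(grad(x) ** 2) + 1e-12
```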
9 Relations Between Assumptions
9.1 Proof of Proposition 3
Proof.
Let be a positive definite matrix. Let and set and . For simplicity, consider uniform sampling of a single function; that is, each component function is chosen with equal probability, and we define
Then for ,
(34)  
It is easy to see that,
Using this in (34),
Suppose that (RG) holds. Then there exist constants such that
where we used that the inequality holds for all . Choosing a point for which the left-hand side exceeds this bound yields a contradiction. ∎
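The sampling scheme used in the proof — picking a single component function uniformly at random and returning its gradient — can be sketched as follows. The quadratic components $f_i(x) = \frac{1}{2} x^{\top} A_i x$ below are illustrative assumptions; the check confirms that the resulting gradient estimator is unbiased.

```python
import numpy as np

# Pick a single index i uniformly at random and return grad f_i(x) alone.
rng = np.random.default_rng(2)
n, d = 5, 3
As = [M @ M.T for M in rng.standard_normal((n, d, d))]  # each A_i is PSD

grad_i = lambda i, x: As[i] @ x       # gradient of the component f_i
full_grad = lambda x: sum(As) @ x / n  # gradient of f = (1/n) sum_i f_i

x = rng.standard_normal(d)

# With uniform probabilities p_i = 1/n the estimator g(x) = grad f_i(x)
# is unbiased: E[g(x)] = sum_i p_i * grad f_i(x) = grad f(x).
expectation = sum((1.0 / n) * grad_i(i, x) for i in range(n))
assert np.allclose(expectation, full_grad(x))
```

Unbiasedness is exactly the property assumed of the stochastic gradient in (2); the proof's counterexample shows that it does not by itself control the second moment.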
9.2 Formal Statement and Proof of Theorem 4.3
Theorem 1 (Formal). Suppose that $f$ is $L$-smooth. Then the following relations hold:

The relaxed growth condition implies the expected smoothness condition (ES).
Proof.

Putting the corresponding constants into (ES) shows that it is satisfied.
∎
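Implications of this kind typically trade a gradient-norm term for a function-value gap via a standard consequence of $L$-smoothness and boundedness from below: minimizing the upper bound (31) over $y$ yields $\|\nabla f(x)\|^{2} \leq 2L\left(f(x) - f^{\inf}\right)$. A numerical check of this key inequality on an assumed quadratic (all specific values are illustrative):

```python
import numpy as np

# f(x) = 0.5 * x^T A x + b^T x is L-smooth with L = largest eigenvalue of A,
# and is bounded below since A is positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b

x_star = np.linalg.solve(A, -b)  # minimizer, so f_inf = f(x_star)
f_inf = f(x_star)

# ||grad f(x)||^2 <= 2 L (f(x) - f_inf) at randomly sampled points
rng = np.random.default_rng(3)
for _ in range(100):
    x = rng.standard_normal(2)
    assert np.sum(grad(x) ** 2) <= 2 * L * (f(x) - f_inf) + 1e-9
```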
10 Proofs for Sections 4.5 and 4.6
10.1 Proof of Proposition 4.5
Proof.
We start with the definition, then use the convexity of the squared norm and the linearity of expectation: