1 Introduction
We consider the optimization problem

(1) $x^* = \arg\min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) \right]$,

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is smooth (but not necessarily convex). Further, we assume that $f$ has a unique global minimizer $x^*$ (this assumption can be relaxed; we enforce it for simplicity of exposition) and is $\mu$–strongly quasiconvex Karimi et al. (2016); Necoara et al. (2018):

(2) $f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x \rangle + \frac{\mu}{2}\|x^* - x\|^2$

for all $x \in \mathbb{R}^d$.
1.1 Background and contributions
Stochastic gradient descent (SGD) Robbins & Monro (1951); Nemirovski & Yudin (1978, 1983); Shalev-Shwartz et al. (2007); Nemirovski et al. (2009); Hardt et al. (2016) has become the workhorse for training supervised machine learning models whose training problem has the generic form (1).

Linear convergence of SGD. Among the first non-asymptotic analyses of SGD is that of Moulines & Bach (2011), who provided, among other results, a linear convergence analysis for strongly convex objectives up to a certain noise level. Needell et al. (2016) improved upon these results by removing the quadratic dependence on the condition number in the iteration complexity, and considered importance sampling. The analysis of Needell et al. (2016) was later extended to a minibatch variant where the minibatches are formed by partitioning the data Needell & Ward (2017). These works are the main starting point for ours.
Contributions: We further tighten and generalize these results to virtually all forms of sampling. We introduce an expected smoothness assumption (Assumption 2.1), first proposed in Gower et al. (2018) in the context of a certain class of variance-reduced methods. This assumption is a joint property of $f$ and the sampling scheme $\mathcal{D}$ utilized by an SGD method, and allows us to prove a generic complexity result (Theorem 3.1) that holds for arbitrary sampling schemes. Ours is the first analysis of SGD under this assumption. We obtain linear convergence rates without strong convexity; in particular, assuming only strong quasiconvexity (a class which includes some nonconvex functions as well). Furthermore, we do not require the individual functions $f_i$ to be convex.
Gradient noise assumptions. Shamir & Zhang (2013) extended the analysis of SGD to convex nonsmooth optimization (including the strongly convex case). However, their proofs still rely on the assumption that the variance of the stochastic gradient is uniformly bounded over all iterates of the algorithm. The same assumption was used in the analysis of several recent papers Recht et al. (2011); Hazan & Kale (2014); Rakhlin et al. (2012). Bottou et al. (2018) establish linear convergence of SGD under a somewhat less restrictive condition, namely that the second moment of the stochastic gradient at $x^k$ is bounded by a constant plus a multiple of $\|\nabla f(x^k)\|^2$ for all $k$. Recently, Nguyen et al. (2018) turned this assumption into a theorem by deriving such bounds under some reasonable conditions, and provided further insights into the workings of SGD and its parallel asynchronous cousin, Hogwild!. Similar conditions have also been proved and used in the analysis of decentralized variants of SGD Lian et al. (2017); Assran et al. (2018). Based on a strong growth condition on the stochastic gradients, Cevher & Vu (2017) give necessary and sufficient conditions for the linear convergence of SGD. This strong growth condition holds when using SGD to solve a consistent linear system, but it fails for a wide range of problems.
Contributions: Our analysis does not directly assume a growth condition. Instead, we make use of the remarkably weak expected smoothness assumption.
Optimal minibatch size. Recently it was shown experimentally by Goyal et al. (2017) that using larger minibatch sizes is key to the efficient training of large-scale nonconvex problems, leading to the training of ImageNet in under 1 hour. The authors conjectured that the stepsize should grow linearly with the minibatch size.

Contributions: We prove (see Section 4) that this is the case, up to a certain optimal minibatch size, and provide exact formulas for the dependence of the stepsize on the minibatch size.
Learning schedules. Chee & Toulis (2018) develop techniques for detecting the convergence of SGD within a region around the solution.

Contributions: We provide a closed-form formula for when SGD should switch from a constant to a decreasing stepsize (see Theorem 3.2). Further, we show precisely how the optimal stepsize (learning rate) increases and the iteration complexity decreases as the minibatch size increases, for both independent sampling and sampling with replacement. We also recover the well-known convergence rate of gradient descent (GD) when the minibatch size is $n$; this is the first time a generic SGD analysis recovers the correct rate of GD.
Overparametrized models. There has been some recent work on analysing SGD in the setting where the underlying model being trained has more parameters than there are data points. In this zero-noise setting, Ma et al. (2018) showed that SGD converges linearly.

Contributions: In the case of overparametrized models, we extend the findings of Ma et al. (2018) to independent sampling and sampling with replacement by deriving the optimal minibatch size. (Recently, the results of Ma et al. (2018) were extended to the accelerated case by Vaswani et al. (2018); however, we do not study accelerated methods in this work.) Moreover, we provide results in the more general setting where the model is not necessarily overparametrized.

Practical performance. We corroborate our theoretical results with extensive experimental testing.
1.2 Stochastic reformulation
In this work we provide a single theorem through which we can analyse all importance sampling and minibatch variants of SGD. To do this, we need to introduce the notion of a sampling vector, which we will use to rewrite our problem (1).

Definition 1.1 (Sampling vector). We say that a random vector $v = (v_1, \dots, v_n) \in \mathbb{R}^n$ drawn from some distribution $\mathcal{D}$ is a sampling vector if its mean is the vector of all ones:

(3) $\mathbb{E}_{\mathcal{D}}[v_i] = 1$, for all $i = 1, \dots, n$.
With each distribution $\mathcal{D}$ we now associate a stochastic reformulation of (1) as follows:

(4) $\min_{x \in \mathbb{R}^d} \mathbb{E}_{\mathcal{D}}\left[ f_v(x) := \frac{1}{n}\sum_{i=1}^n v_i f_i(x) \right]$.

By the definition of the sampling vector, $f_v(x)$ and $\nabla f_v(x)$ are unbiased estimators of $f(x)$ and $\nabla f(x)$, respectively, and hence problem (4) is indeed equivalent to (i.e., a reformulation of) the original problem (1). In the case of the gradient, for instance, we get

(5) $\mathbb{E}_{\mathcal{D}}[\nabla f_v(x)] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\mathcal{D}}[v_i]\, \nabla f_i(x) = \nabla f(x)$.
Similar but different stochastic reformulations were recently proposed by Richtárik & Takáč (2017) and further used in (Loizou & Richtárik, 2017) for the more special problem of solving linear systems, and by Gower et al. (2018) in the context of variancereduced methods. Reformulation (4) can be solved using SGD in a natural way:
(6) $x^{k+1} = x^k - \gamma \nabla f_{v^k}(x^k)$,

where $v^k \sim \mathcal{D}$ is sampled i.i.d. at each iteration and $\gamma > 0$ is a stepsize. However, for different distributions $\mathcal{D}$, (6) has a different interpretation as an SGD method for solving the original problem (1). In our main result we will analyse (6) for any $\mathcal{D}$ satisfying (3). By substituting specific choices of $\mathcal{D}$, we obtain specific variants of SGD for solving (1).
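The update (6) can be sketched in a few lines of code. The instance below is a toy least-squares problem of our own making (not from the paper), with a consistent linear system (so the gradient noise is zero) and the single-element uniform sampling vector described later; the stepsize $1/(2L_{\max})$ is a valid choice for this sampling since $L_{\max}$ upper-bounds the expected smoothness constant here.

```python
import numpy as np

def sgd_reformulation(grad_fi, n, sample_v, gamma, iters, x0, rng):
    """SGD (6) on the stochastic reformulation (4):
    x <- x - gamma * grad f_v(x), where grad f_v(x) = (1/n) sum_i v_i grad f_i(x)."""
    x = x0.copy()
    for _ in range(iters):
        v = sample_v(rng)              # sampling vector: E[v] = vector of ones, cf. (3)
        g = np.zeros_like(x)
        for i in np.nonzero(v)[0]:     # v is sparse => few gradient evaluations
            g += v[i] * grad_fi(i, x)
        x -= gamma * g / n
    return x

# Toy data (made up for illustration): f_i(x) = 0.5*(a_i^T x - b_i)^2
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                         # consistent system => zero gradient noise
grad_fi = lambda i, x: A[i] * (A[i] @ x - b[i])

def sample_v(rng):
    """Single-element uniform sampling: v = n * e_i with probability 1/n."""
    v = np.zeros(n)
    v[rng.integers(n)] = n
    return v

L_max = max(np.sum(A * A, axis=1))     # smoothness constants L_i = ||a_i||^2
x = sgd_reformulation(grad_fi, n, sample_v, 1.0 / (2 * L_max), 10000, np.zeros(d), rng)
print(np.linalg.norm(x - x_star))      # small: zero noise gives linear convergence
```

With a different `sample_v` (minibatch, importance, or independent sampling), the same loop yields the corresponding SGD variant, which is exactly the point of the reformulation.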
2 Expected Smoothness and Gradient Noise
In our analysis of SGD (6) applied to the stochastic reformulation (4) we rely on a generic and remarkably weak assumption of expected smoothness, which we now define and relate to existing growth conditions.
2.1 Expected smoothness
Expected smoothness Gower et al. (2018) is an assumption that combines both the properties of the distribution $\mathcal{D}$ and the smoothness properties of the function $f$.
Assumption 2.1 (Expected Smoothness). We say that $f$ is $\mathcal{L}$–smooth in expectation with respect to a distribution $\mathcal{D}$ if there exists $\mathcal{L} > 0$ such that

(7) $\mathbb{E}_{\mathcal{D}}\|\nabla f_v(x) - \nabla f_v(x^*)\|^2 \le 2\mathcal{L}\,(f(x) - f(x^*))$

for all $x \in \mathbb{R}^d$. For simplicity, we will write $(f, \mathcal{D}) \sim ES(\mathcal{L})$ to say that (7) holds. When $\mathcal{D}$ is clear from the context, we will often not mention it, and simply state that the expected smoothness constant is $\mathcal{L}$.
In Section 3.3 we show how convexity and smoothness of the functions $f_i$ imply expected smoothness. However, the opposite implication does not hold. Indeed, the expected smoothness assumption can hold even when the $f_i$'s and $f$ are not convex, as we show in the next example.
Example 2.2 (Nonconvexity and expected smoothness). Let $f_i = f$ for $i = 1, \dots, n$, where $f$ is an $L$–smooth and nonconvex function which has a global minimum $x^*$ (such functions exist: there are invex functions satisfying these conditions Karimi et al. (2016); as an example, $f(x) = x^2 + 3\sin^2(x)$ is smooth, nonconvex, and has a unique global minimizer). Consequently $f_v = \bar{v} f$ and $\nabla f_v(x^*) = \bar{v}\,\nabla f(x^*) = 0$, where $\bar{v} := \frac{1}{n}\sum_{i=1}^n v_i$. We then have

$\mathbb{E}_{\mathcal{D}}\|\nabla f_v(x) - \nabla f_v(x^*)\|^2 = \mathbb{E}_{\mathcal{D}}[\bar{v}^2]\,\|\nabla f(x)\|^2 \le 2 L\, \mathbb{E}_{\mathcal{D}}[\bar{v}^2]\,(f(x) - f(x^*))$,

where the last inequality follows from Proposition A.1. So, $(f, \mathcal{D}) \sim ES(\mathcal{L})$ for $\mathcal{L} = L\,\mathbb{E}_{\mathcal{D}}[\bar{v}^2]$.
2.2 Gradient noise
Our second key assumption is finiteness of the gradient noise, defined next:

Assumption 2.3 (Finite Gradient Noise). The gradient noise $\sigma^2$, defined by

(8) $\sigma^2 := \mathbb{E}_{\mathcal{D}}\|\nabla f_v(x^*)\|^2$,

is finite.

This is a very weak assumption, and should intuitively be seen as an assumption on $\mathcal{D}$ rather than on $f$. For instance, if the sampling vector $v$ is nonnegative with probability one and $\mathbb{E}_{\mathcal{D}}[v_i^2]$ is finite for all $i$, then $\sigma^2$ is finite. When (1) is the training problem of an overparametrized model, which often occurs in deep neural networks, each individual loss function $f_i$ attains its minimum at $x^*$, and thus $\nabla f_i(x^*) = 0$ for all $i$. It follows that $\sigma^2 = 0$.

2.3 Key lemma and connection to the weak growth condition
A common assumption used to prove the convergence of SGD is uniform boundedness of the stochastic gradients: there exists a constant $c > 0$ such that $\mathbb{E}_{\mathcal{D}}\|\nabla f_v(x)\|^2 \le c$ for all $x$. (Alternatively, this is assumed only along the iterates of the method. But this too has issues, since it implicitly assumes that the iterates remain within a compact set, and yet it is used to prove convergence to within a compact set, raising the issue of a circular argument.) However, this assumption often does not hold, for instance when $f$ is strongly convex Bottou et al. (2018); Nguyen et al. (2018). We do not assume such a bound. Instead, we use the following direct consequence of expected smoothness to bound the expected norm of the stochastic gradients.
Lemma 2.4. If $(f, \mathcal{D}) \sim ES(\mathcal{L})$, then

(9) $\mathbb{E}_{\mathcal{D}}\|\nabla f_v(x)\|^2 \le 4\mathcal{L}\,(f(x) - f(x^*)) + 2\sigma^2$.
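The bound (9) can be sanity-checked numerically. The snippet below uses a toy least-squares instance of our own (not from the paper) with uniform single-element sampling, for which $L_{\max} = \max_i \|a_i\|^2$ is a valid expected smoothness constant; the inequality then holds at every test point.

```python
import numpy as np

# Toy check of (9): E||grad f_v(x)||^2 <= 4*L*(f(x) - f(x*)) + 2*sigma^2,
# for uniform single-element sampling of f_i(x) = 0.5*(a_i^T x - b_i)^2,
# taking L = L_max = max_i ||a_i||^2 (a valid ES constant in this setting).
rng = np.random.default_rng(1)
n, d = 30, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)                     # inconsistent system => nonzero noise

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)  # f(x) = (1/n) sum_i f_i(x)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]  # global minimizer of f

grads = lambda x: A * (A @ x - b)[:, None]     # row i = grad f_i(x)
L_max = max(np.sum(A * A, axis=1))             # L_i = ||a_i||^2
sigma2 = np.mean(np.sum(grads(x_star) ** 2, axis=1))  # E||grad f_v(x*)||^2, cf. (8)

ok = True
for _ in range(100):
    x = rng.standard_normal(d) * 3
    lhs = np.mean(np.sum(grads(x) ** 2, axis=1))       # E||grad f_v(x)||^2
    rhs = 4 * L_max * (f(x) - f(x_star)) + 2 * sigma2
    ok &= lhs <= rhs + 1e-9
print(ok)  # True: the bound holds at every sampled point
```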
When the gradient noise is zero ($\sigma^2 = 0$), inequality (9) is known as the weak growth condition Vaswani et al. (2018). We have the following corollary:
Corollary 2.5. If $(f, \mathcal{D}) \sim ES(\mathcal{L})$ and if $\sigma^2 = 0$, then $f$ satisfies the weak growth condition

$\mathbb{E}_{\mathcal{D}}\|\nabla f_v(x)\|^2 \le 2\rho\,(f(x) - f(x^*))$

with $\rho = 2\mathcal{L}$.
This corollary should be contrasted with Proposition 2 in Vaswani et al. (2018) and Lemma 1 in Nguyen et al. (2018), where it is shown, by assuming the functions $f_i$ to be smooth and convex, that the weak growth condition holds with a constant governed by $L_{\max} := \max_i L_i$. However, as we will show in Lemma C.1, $\mathcal{L} \le L_{\max}$, and hence our bound is often tighter.
3 Convergence Analysis
3.1 Main results
We now present our main theorem, and include its proof to highlight how we make use of expected smoothness and gradient noise.
Theorem 3.1. Assume $f$ is $\mu$–strongly quasiconvex and that $(f, \mathcal{D}) \sim ES(\mathcal{L})$. Choose $\gamma^k = \gamma \in \left(0, \frac{1}{2\mathcal{L}}\right]$ for all $k$. Then the iterates of SGD given by (6) satisfy:

(10) $\mathbb{E}\|x^k - x^*\|^2 \le (1 - \gamma\mu)^k \|x^0 - x^*\|^2 + \frac{2\gamma\sigma^2}{\mu}$.

Hence, given any $\varepsilon > 0$, choosing stepsize

(11) $\gamma = \min\left\{ \frac{1}{2\mathcal{L}},\ \frac{\varepsilon\mu}{4\sigma^2} \right\}$

and

(12) $k \ge \max\left\{ \frac{2\mathcal{L}}{\mu},\ \frac{4\sigma^2}{\varepsilon\mu^2} \right\} \log\left( \frac{2\|x^0 - x^*\|^2}{\varepsilon} \right)$

implies $\mathbb{E}\|x^k - x^*\|^2 \le \varepsilon$.
Proof. Let $r^k := x^k - x^*$. From (6), we have

$\|r^{k+1}\|^2 = \|r^k - \gamma \nabla f_{v^k}(x^k)\|^2 = \|r^k\|^2 - 2\gamma \langle r^k, \nabla f_{v^k}(x^k) \rangle + \gamma^2 \|\nabla f_{v^k}(x^k)\|^2$.

Taking expectation conditioned on $x^k$, and using unbiasedness (5) together with quasi-strong convexity (2), we obtain:

$\mathbb{E}\left[\|r^{k+1}\|^2 \mid x^k\right] \le (1 - \gamma\mu)\|r^k\|^2 - 2\gamma \left(f(x^k) - f(x^*)\right) + \gamma^2\, \mathbb{E}\left[\|\nabla f_{v^k}(x^k)\|^2 \mid x^k\right]$.

Taking expectations again and using Lemma 2.4:

$\mathbb{E}\|r^{k+1}\|^2 \le (1 - \gamma\mu)\,\mathbb{E}\|r^k\|^2 + 2\gamma(2\gamma\mathcal{L} - 1)\,\mathbb{E}\left[f(x^k) - f(x^*)\right] + 2\gamma^2\sigma^2 \le (1 - \gamma\mu)\,\mathbb{E}\|r^k\|^2 + 2\gamma^2\sigma^2$,

where in the last inequality we used that $2\gamma\mathcal{L} - 1 \le 0$, since $\gamma \le \frac{1}{2\mathcal{L}}$. Recursively applying the above and summing up the resulting geometric series gives

(13) $\mathbb{E}\|r^k\|^2 \le (1 - \gamma\mu)^k \|r^0\|^2 + 2\gamma^2\sigma^2 \sum_{j=0}^{k-1} (1 - \gamma\mu)^j \le (1 - \gamma\mu)^k \|r^0\|^2 + \frac{2\gamma\sigma^2}{\mu}$.

To obtain an iteration complexity result from the above, we use standard techniques, as shown in Section A.1. ∎
Note that we assume neither $f$ nor the $f_i$ to be convex. Theorem 3.1 states that SGD converges linearly up to an additive constant which depends on the gradient noise $\sigma^2$ and on the stepsize $\gamma$. We obtain a more accurate solution with a smaller stepsize, but then the convergence rate slows down. Since we control $\mathcal{D}$, we also control $\mathcal{L}$ and $\sigma^2$ (we compute these parameters for several distributions in Section 3.3). Furthermore, we can control the additive constant by carefully choosing the stepsize, as shown in the next result.
Theorem 3.2 (Decreasing stepsizes). Assume $f$ is $\mu$–strongly quasiconvex and that $(f, \mathcal{D}) \sim ES(\mathcal{L})$. Let $\mathcal{K} := \frac{\mathcal{L}}{\mu}$ and

(14) $\gamma^k = \frac{1}{2\mathcal{L}}$ for $k \le 4\lceil \mathcal{K} \rceil$, and $\gamma^k = \frac{2k + 1}{(k + 1)^2 \mu}$ for $k > 4\lceil \mathcal{K} \rceil$.

If $k \ge 4\lceil \mathcal{K} \rceil$, then the SGD iterates given by (6) satisfy:

(15) $\mathbb{E}\|x^k - x^*\|^2 \le \frac{8\sigma^2}{\mu^2}\frac{1}{k} + \frac{16\lceil \mathcal{K} \rceil^2}{e^2 k^2}\|x^0 - x^*\|^2$.
3.2 Choosing the distribution $\mathcal{D}$

For (6) to be efficient, the sampling vector $v$ should be sparse. For this reason we will construct $v$ so that only a (small and random) subset of its entries is nonzero.

Before we formally define $v$, let us first establish some random set terminology. Let $[n] := \{1, 2, \dots, n\}$, and let $e_1, \dots, e_n$ denote the standard basis vectors in $\mathbb{R}^n$. Subsets of $[n]$ will be selected using a random set-valued map $S$, referred to in the literature as a sampling Richtárik & Takáč (2016); Qu & Richtárik (2016). A sampling is uniquely characterized by choosing subset probabilities $p_C \ge 0$ for all subsets $C$ of $[n]$:

(16) $\mathbb{P}[S = C] = p_C$, with $\sum_{C \subseteq [n]} p_C = 1$.

We will only consider proper samplings. A sampling is called proper if $p_i := \mathbb{P}[i \in S]$ is positive for all $i \in [n]$.
The first analysis of a randomized optimization method with an arbitrary (proper) sampling was performed by Richtárik & Takáč (2016) in the context of randomized coordinate descent for strongly convex functions. This arbitrary sampling paradigm was later adopted in many other settings, including accelerated coordinate descent for strongly convex functions Hanzely & Richtárik (2018), coordinate and accelerated descent for convex functions Qu & Richtárik (2016), primal-dual methods Qu et al. (2015); Chambolle et al. (2018), and variance-reduced methods with convex Csiba & Richtárik (2015) and nonconvex Horváth & Richtárik (2018) objectives. Arbitrary sampling arises as a special case of our more general analysis by specializing the sampling vector to one dependent on a sampling $S$. We now define a practical sampling vector as follows:
Lemma 3.3. Let $S$ be a proper sampling, and let $p_i := \mathbb{P}[i \in S]$. Then the random vector $v = v(S)$ given by

(17) $v_i = \frac{1_{i \in S}}{p_i}, \quad i \in [n]$,

is a sampling vector.

Proof. Note that $v_i = \frac{1_{i \in S}}{p_i}$, where $1_{i \in S}$ is the indicator function of the event $i \in S$. It follows that $\mathbb{E}[v_i] = \frac{\mathbb{E}[1_{i \in S}]}{p_i} = \frac{p_i}{p_i} = 1$. ∎
We can further specialize and define the following commonly used samplings. Each sampling gives rise to a particular sampling vector (i.e., distribution ), which in turn gives rise to a particular stochastic reformulation (4) and SGD variant (6).
Independent sampling. The sampling $S$ includes every $i \in [n]$, independently, with probability $p_i > 0$. This type of sampling was considered in different contexts in Horváth & Richtárik (2018); Hanzely & Richtárik (2018).

Partition sampling. A partition $\mathcal{G}$ of $[n]$ is a set of nonempty subsets of $[n]$ such that $\bigcup_{C \in \mathcal{G}} C = [n]$ and $C \cap C' = \emptyset$ for any distinct $C, C' \in \mathcal{G}$. A partition sampling $S$ is a sampling such that $\mathbb{P}[S = C] > 0$ for all $C \in \mathcal{G}$ and $\sum_{C \in \mathcal{G}} \mathbb{P}[S = C] = 1$.

Single element sampling. Only the singleton sets $\{i\}$ for $i \in [n]$ have a nonzero probability of being sampled; that is, $p_{\{i\}} := \mathbb{P}[S = \{i\}] > 0$ and $\sum_{i=1}^n p_{\{i\}} = 1$. We have $p_i = p_{\{i\}}$.

$\tau$–nice sampling. We say that $S$ is $\tau$–nice if it samples from all subsets of $[n]$ of cardinality $\tau$ uniformly at random. In this case we have $p_i = \frac{\tau}{n}$ for all $i \in [n]$. So $p_C = 1/\binom{n}{\tau}$ for all subsets $C$ with $\tau$ elements.
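Each of these samplings yields a sampling vector via (17). The sketch below (our own illustration, with made-up probabilities) constructs the sampling vectors for a $\tau$-nice, a single-element, and an independent sampling on a small $n$, and verifies property (3), $\mathbb{E}[v] = (1, \dots, 1)$, by exact enumeration.

```python
import numpy as np
from itertools import combinations

n = 5

def v_from_set(S, p):
    """Sampling vector (17): v_i = 1_{i in S} / p_i, where p_i = P[i in S]."""
    v = np.zeros(n)
    v[list(S)] = 1.0 / p[list(S)]
    return v

# tau-nice sampling: all subsets of size tau equally likely => p_i = tau/n
tau = 2
p_nice = np.full(n, tau / n)
subsets = list(combinations(range(n), tau))
Ev_nice = sum(v_from_set(S, p_nice) for S in subsets) / len(subsets)

# single-element sampling with (made-up) nonuniform probabilities q_i
q = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
Ev_single = sum(q[i] * v_from_set((i,), q) for i in range(n))

# independent sampling: i in S independently with probability p_i;
# here E[v_i] = p_i * (1/p_i) = 1 by construction, shown by enumeration
p_ind = np.array([0.5, 0.4, 0.9, 0.3, 0.7])
Ev_ind = np.zeros(n)
for mask in range(2 ** n):
    S = [i for i in range(n) if mask >> i & 1]
    prob = np.prod([p_ind[i] if i in S else 1 - p_ind[i] for i in range(n)])
    Ev_ind += prob * v_from_set(S, p_ind)

print(Ev_nice, Ev_single, Ev_ind)  # each is the all-ones vector, verifying (3)
```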
3.3 Bounding $\mathcal{L}$ and $\sigma^2$

By assuming that the functions $f_i$ are convex and smooth, we can calculate closed-form expressions for the expected smoothness constant $\mathcal{L}$ and the gradient noise $\sigma^2$. In particular, we make the following smoothness assumption:
Assumption 3.4. For each $i \in [n]$ there exists a symmetric positive semidefinite matrix $\mathbf{M}_i$ such that

(18) $f_i(x + h) \le f_i(x) + \langle \nabla f_i(x), h \rangle + \tfrac{1}{2} \langle \mathbf{M}_i h, h \rangle$

for all $x, h \in \mathbb{R}^d$. In this case we say that $f_i$ is $\mathbf{M}_i$–smooth. Furthermore, we assume that each $f_i$ is convex.
To better relate the above assumption to the standard smoothness assumptions, we make the following remark.

Remark 3.5. As a consequence of Assumption 3.4, each $f_i$ is $L_i$–smooth with $L_i := \lambda_{\max}(\mathbf{M}_i)$, and $f$ is $L$–smooth. Let $L_{\max} := \max_i L_i$ and $\bar{L} := \frac{1}{n}\sum_{i=1}^n L_i$.
Using Assumption 3.4 and a sampling $S$, we establish the following bounds on $\mathcal{L}$.
Theorem 3.6. Let $S$ be a proper sampling, and let $v = v(S)$ (i.e., $v$ is defined by (17)). Suppose each $f_i$ is $\mathbf{M}_i$–smooth and convex, and let $f_v$ be defined as in (4). Then $(f, \mathcal{D}) \sim ES(\mathcal{L})$, where
(19)  
and . If , then
(20) 
By applying the above result to specific samplings, we obtain the following practical bounds on $\mathcal{L}$:

Proposition 3.7.

(i) For single element sampling, we have
(21) 
(ii) For independent sampling, we have
(22) 
(iii) For $\tau$–nice sampling, we have
(23)  
(24) 
(iv) For partition sampling with partition $\mathcal{G}$, we have
(25) 
For $v = v(S)$, formulas for the gradient noise $\sigma^2$ are provided in the next result:

Theorem 3.8. Let $v = v(S)$, where $S$ is a proper sampling. Then
(26) 
Specializing the above theorem to specific samplings gives the following formulas for $\sigma^2$:

Proposition 3.9.

(i) For single element sampling, we have
(27) 
(ii) For independent sampling, we have
(28) 
(iii) For $\tau$–nice sampling, we have
(29) 
(iv) For partition sampling with partition $\mathcal{G}$, we have
(30) 
Generally, we do not know the gradients $\nabla f_i(x^*)$ appearing in these formulas. But if we have prior knowledge that $x^*$ belongs to some set $X$, we can obtain upper bounds on $\sigma^2$ for these samplings from Proposition 3.9 in a straightforward way.
4 Optimal Minibatch Size

Here we develop iteration complexity results for different samplings by plugging the bounds on $\mathcal{L}$ and $\sigma^2$ from Section 3.3 into Theorem 3.1. To keep the notation brief, in this section we drop the logarithmic term from the iteration complexity results. Furthermore, for brevity and to better compare our results to others in the literature, we will use the constants $L$, $L_{\max}$ and $\bar{L}$ (see Remark 3.5). Finally, we introduce one more shorthand constant for brevity.
Gradient descent. As a first sanity check, we consider the case where $S = [n]$ with probability one; that is, each iteration (6) uses the full-batch gradient. Thus $p_i = 1$ for all $i$, and it is not hard to see that for $\tau = n$ in (23) we have $\mathcal{L} = L$; moreover, $\sigma^2 = 0$ since $\nabla f_v(x^*) = \nabla f(x^*) = 0$. Consequently, the resulting iteration complexity (12) is now $\frac{2L}{\mu}$ (up to the logarithmic factor). This is exactly the rate of gradient descent, which is precisely what we would expect since the resulting method is gradient descent. Though an obvious sanity check, we believe this is the first convergence theorem for SGD that includes gradient descent as a special case. Clearly, this is a necessary prerequisite if we are to hope to understand the complexity of minibatching.
4.1 Nonzero gradient noise
To better appreciate how our iteration complexity evolves with increasing minibatch size, we now consider independent sampling with uniform probabilities and $\tau$–nice sampling.
Independent sampling. Inserting the bounds (22) and (28) into (12) gives the following iteration complexity:
(31) 
This is a completely new minibatch complexity result, which opens up the possibility of optimizing the minibatch size and the sampling probabilities. For instance, if we fix uniform probabilities $p_i = \frac{\tau}{n}$, then (31) becomes the complexity given by (32), where
(32) 
This complexity result corresponds to using the stepsize
(33) 
when the noise term is active; otherwise only the left-hand-side term in the minimization remains. The stepsize (33) is increasing in $\tau$, since both $\mathcal{L}(\tau)$ and $\sigma^2(\tau)$ decrease as $\tau$ increases.
With such a simple expression for the iteration complexity, we can choose a minibatch size that optimizes the total complexity. Defining the total complexity as the number of iterations times the number of gradient evaluations ($\tau$) per iteration gives

(34) 

Minimizing (34) in $\tau$ is easy, because it is a max of a linearly increasing term and a linearly decreasing term in $\tau$. Consequently, if the increasing term already dominates at $\tau = 1$, then $\tau^* = 1$; otherwise the optimal minibatch size is given by

(35) 

Since the decreasing term is proportional to the noise $\sigma^2$ and to $1/\varepsilon$, while the increasing term is proportional to the smoothness, the latter case occurs when there is comparatively a lot of noise or when the precision is high. As we will see in Section 4.2, this logic extends to the case where the noise is zero.
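The optimization of the total complexity described above can be sketched generically. The constants below are made up for illustration (they are not the paper's formulas); the point is only the shape of the problem: a max of a linearly increasing and a linearly decreasing function of $\tau$ is minimized at their crossing point, clipped to $[1, n]$.

```python
import numpy as np

# Illustrative sketch with made-up constants: total complexity
# T(tau) = max( a + b*tau,  c - d*tau ), increasing vs decreasing in tau.
n = 100
a, b = 50.0, 2.0        # increasing term (smoothness-driven)
c, d = 400.0, 3.0       # decreasing term (noise-driven)
T = lambda tau: max(a + b * tau, c - d * tau)

# The minimum of the max of an increasing and a decreasing line lies at their
# crossing point; round to the better neighbouring integer within [1, n]:
cross = (c - a) / (b + d)
candidates = [int(np.clip(t, 1, n)) for t in (np.floor(cross), np.ceil(cross))]
tau_star = min(candidates, key=T)

brute = min(range(1, n + 1), key=T)  # brute force over all batch sizes
print(tau_star, brute)               # both give the same optimal batch size
```

If the increasing term already dominates at $\tau = 1$ (lots of smoothness relative to noise), the crossing point falls below 1 and the clipping yields $\tau^* = 1$, matching the case distinction in the text.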
$\tau$–nice sampling. Inserting the bounds (24) and (29) into (12) gives the following iteration complexity, where
(36)  
(37) 
which holds for the stepsize
(38) 
Again, this is an increasing function of $\tau$.
We are now again able to calculate the minibatch size that optimizes the total complexity, given by the number of iterations multiplied by $\tau$. Once again, the total complexity is a max of a linearly increasing term and a linearly decreasing term in $\tau$. Consequently, if the increasing term already dominates at $\tau = 1$, then $\tau^* = 1$; otherwise the optimal minibatch size is given by
(39) 
4.2 Zero gradient noise
Consider the case where the gradient noise is zero ($\sigma^2 = 0$). According to Theorem 3.1, the resulting complexity of SGD with constant stepsize is given by the very simple expression

(40) $\frac{2\mathcal{L}}{\mu}$,

where we have dropped the logarithmic term. In this setting, due to Corollary 2.5, we know that $f$ satisfies the weak growth condition. Thus our results are directly comparable to those developed in Ma et al. (2018) and Vaswani et al. (2018).
In particular, Theorem 1 in Ma et al. (2018) states that when running SGD with minibatches formed by sampling with replacement, the resulting iteration complexity is
(41) 
again dropping the logarithmic term. Gaining insight into the complexity (40) is now a matter of studying the expected smoothness constant $\mathcal{L}$ for different sampling strategies.
Independent sampling. Setting the gradient noise to zero and using uniform probabilities $p_i = \frac{\tau}{n}$ in (31) gives
(42) 
$\tau$–nice sampling. If we use a $\tau$–nice sampling and the gradient noise is zero, then the resulting iteration complexity is given by
(43) 
5 Importance Sampling
In this section we propose importance sampling for single element sampling and for independent sampling, respectively. Due to lack of space, the details of this section are in the appendix, Section I. Again, we drop the log term in (12) and adopt the notation of Remark 3.5.
5.1 Single element sampling
For single element sampling, plugging (21) and (27) into (12) gives the following iteration complexity
In order to optimize this iteration complexity over the probabilities $p_1, \dots, p_n$, we would need to solve an $n$–dimensional linearly constrained nonsmooth convex minimization problem, which could be harder than the original problem (1). So instead, we focus on minimizing $\mathcal{L}$ and $\sigma^2$ over the probabilities separately. We then use the two resulting (sub)optimal probability vectors to construct a sampling.
In particular, for single element sampling we can recover the partially biased sampling developed in Needell et al. (2016). First, from (21) it is easy to see that the probabilities that minimize $\mathcal{L}$ are $p_i = \frac{L_i}{\sum_j L_j}$ for all $i$. Using these (sub)optimal probabilities we can construct a partially biased sampling by letting $p_i = \frac{1}{2}\left(\frac{1}{n} + \frac{L_i}{\sum_j L_j}\right)$. Plugging this sampling into (21) and (27) yields bounds on $\mathcal{L}$ and $\sigma^2$, respectively. This sampling is the same as the partially biased sampling in Needell et al. (2016). From (12) in Theorem 3.1, we get that the total complexity is now given by
(44) 
For uniform sampling, $\mathcal{L} = L_{\max}$. Hence, compared to uniform sampling, the iteration complexity of partially biased sampling is at most two times larger, but can be substantially smaller in the extreme case where $L_{\max}$ is much larger than the average constant $\bar{L}$.
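The partially biased probabilities described above are easy to compute. The snippet below uses made-up smoothness constants $L_i$ for illustration; the mixture with the uniform distribution keeps every $p_i \ge \frac{1}{2n}$, which is what prevents any single function from being sampled too rarely and controls the noise term.

```python
import numpy as np

# Partially biased single-element sampling (Needell et al., 2016): mix the
# uniform distribution with probabilities proportional to the smoothness
# constants L_i. The L_i values below are made up for illustration.
L = np.array([1.0, 2.0, 5.0, 50.0, 2.0])   # per-function smoothness constants
n = len(L)

p_importance = L / L.sum()                  # minimizes the bound on L in (21)
p_partial = 0.5 / n + 0.5 * p_importance    # partially biased mixture

# Valid distribution, and every p_i >= 1/(2n):
print(p_partial, p_partial.sum())
```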
5.2 Minibatches
Importance sampling for minibatches was first considered in (Csiba & Richtárik, 2018), but not in the context of SGD. Here we propose the first importance sampling for minibatch SGD. In Section I.2 of the appendix we introduce the use of partially biased sampling together with independent sampling, and show that we can achieve a total complexity of (by Proposition I.3)

(45) 

which not only eliminates the dependence on the worst-case constant $L_{\max}$, but also improves as the minibatch size increases.
6 Experiments
In this section, we empirically validate our theoretical results. We perform three experiments in each of which we highlight a different aspect of our contributions.
In the first two experiments we focus on ridge regression and regularized logistic regression problems (problems with a strongly convex objective $f$ and convex components $f_i$), and we evaluate the performance of SGD on both synthetic and real data. In particular, in the first experiment (Section 6.1) we numerically verify the performance of SGD (in the case of uniform single element sampling) as predicted by Theorems 3.1 and 3.2, for both constant and decreasing stepsizes. In the second experiment (Section 6.2) we compare the convergence of SGD for several choices of the distribution $\mathcal{D}$ (different sampling strategies), as described in the previous sections. In the last experiment (Section 6.3) we focus on the problem of principal component analysis (PCA), which by construction can be seen as a problem with a strongly convex objective $f$ but with nonconvex components $f_i$ Allen-Zhu & Yuan (2016); Garber & Hazan (2015); Shalev-Shwartz (2016).

In all experiments, we evaluate SGD using the relative error measure $\|x^k - x^*\|^2 / \|x^0 - x^*\|^2$. For all implementations, the starting point $x^0$ is standard Gaussian. We run each method until the relative error drops below a prespecified tolerance or until a prespecified maximum number of epochs is reached. For the horizontal axis we always use the number of epochs. The code for the experiments is written in Python 3. For more experiments we refer the interested reader to Section J of the appendix.
Regularized regression problems: In the case of the ridge regression problem we solve

$\min_{x \in \mathbb{R}^d} \frac{1}{2n}\|Ax - b\|^2 + \frac{\lambda}{2}\|x\|^2$,

while for the regularized logistic regression problem we solve

$\min_{x \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp(-b_i\, a_i^\top x)\right) + \frac{\lambda}{2}\|x\|^2$.

In both problems, $A \in \mathbb{R}^{n \times d}$ (with rows $a_i^\top$) and $b$ are the given data, and $\lambda > 0$ is the regularization parameter. For the generation of the synthetic data in both problems, the rows of the matrix $A$ were sampled from the standard Gaussian distribution. For the synthetic data in the case of ridge regression we choose $b$ to be a Gaussian vector, while in the case of logistic regression the labels satisfy $b_i \in \{-1, +1\}$. The regularization parameter $\lambda$ varies depending on the experiment. For our experiments on real data we choose several LIBSVM Chang & Lin (2011) datasets.

6.1 Constant vs decreasing stepsize
We now compare the performance of SGD in the constant and decreasing stepsize regimes considered in Theorems 3.1 (see (11)) and 3.2 (see (14)), respectively. As expected from the theory, we see in Figure 1 that the decreasing stepsize regime is vastly superior at reaching a higher precision than the constant stepsize variant. In our plots, the vertical red line denotes the switching point predicted by Theorem 3.2, and highlights the iteration at which SGD needs to change its update rule from a constant to a decreasing stepsize.
6.2 Minibatches
In Figures 2 and 5 we compare single element sampling (uniform and importance), independent sampling (uniform, uniform with the optimal batch size, and importance) and $\tau$–nice sampling (with a fixed $\tau$ and with the optimal $\tau^*$). The probabilities for importance sampling in the single element and independent samplings are calculated by formulas (65) and (73) (see the appendix). Formulas for the optimal minibatch size in independent and $\tau$–nice sampling are given in (35) and (39), respectively. Observe that minibatching with the optimal $\tau^*$ gives the best convergence. In addition, note that for a constant stepsize, the importance sampling variants depend on the accuracy $\varepsilon$. It is clear that until the error approaches the required accuracy, the importance sampling variants are comparable to or better than their uniform counterparts.
6.3 Sum-of-nonconvex functions
In Figure 3, our goal is to illustrate that Theorem 3.1 holds even if the functions $f_i$ are nonconvex. The scheme of the experiment is similar to the one from Allen-Zhu & Yuan (2016). In particular, we first generate random Gaussian vectors, and then consider a minimization problem whose components involve diagonal matrices $D_1, \dots, D_n$ satisfying $\sum_{i=1}^n D_i = 0$. In particular, to guarantee that $\sum_i D_i = 0$, we randomly select half of the matrices and assign their $j$-th diagonal value to a fixed constant; for the other half we assign the negated value. We repeat this for all diagonal values. Note that under this construction, each $f_i$ is a nonconvex function. Once again, in the first plot we observe that while both variants are equally fast in the beginning, the decreasing stepsize variant is better at reaching a higher accuracy than the fixed stepsize variant. In the second plot we see, as expected, that all four minibatch versions of SGD outperform single element SGD. However, while the $\tau$–nice and independent samplings with a fixed $\tau$ lead to only a slight improvement, the theoretically optimal choice $\tau^*$ leads to a vast improvement.
References
 Allen-Zhu & Yuan (2016) Allen-Zhu, Z. and Yuan, Y. Improved SVRG for non-strongly-convex or sum-of-nonconvex objectives. In International Conference on Machine Learning, pp. 1080–1089, 2016.
 Assran et al. (2018) Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.
 Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for largescale machine learning. SIAM Review, 60(2):223–311, 2018.
 Cevher & Vu (2017) Cevher, V. and Vu, B. C. On the linear convergence of the stochastic gradient method with constant stepsize. arXiv:1712.01906, pp. 1–9, 2017.
 Chambolle et al. (2018) Chambolle, A., Ehrhardt, M. J., Richtárik, P., and Schönlieb, C.-B. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783–2808, 2018.

 Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
 Chee & Toulis (2018) Chee, J. and Toulis, P. Convergence diagnostics for stochastic gradient descent with constant learning rate. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1476–1485. PMLR, 09–11 Apr 2018.
 Csiba & Richtárik (2015) Csiba, D. and Richtárik, P. Primal method for ERM with flexible minibatching schemes and nonconvex losses. arXiv:1506.02227, 2015.
 Csiba & Richtárik (2018) Csiba, D. and Richtárik, P. Importance sampling for minibatches. Journal of Machine Learning Research, 19(27):1–21, 2018.
 Garber & Hazan (2015) Garber, D. and Hazan, E. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.
 Gower et al. (2018) Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasigradient methods: Variance reduction via Jacobian sketching. arxiv:1805.02632, 2018.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
 Hanzely & Richtárik (2018) Hanzely, F. and Richtárik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. arXiv Preprint arXiv: 1809.09354, 2018.
 Hardt et al. (2016) Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: stability of stochastic gradient descent. In 33rd International Conference on Machine Learning, 2016.
 Hazan & Kale (2014) Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic stronglyconvex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
 Horváth & Richtárik (2018) Horváth, S. and Richtárik, P. Nonconvex variance reduced optimization with arbitrary sampling. arXiv:1809.04146, 2018.
 Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
 Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
 Loizou & Richtárik (2017) Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

 Ma et al. (2018) Ma, S., Bassily, R., and Belkin, M. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pp. 3331–3340, 2018.
 Moulines & Bach (2011) Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.
 Necoara et al. (2018) Necoara, I., Nesterov, Y., and Glineur, F. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pp. 1–39, 2018. doi: 10.1007/s10107-018-1232-1.
 Needell & Ward (2017) Needell, D. and Ward, R. Batched stochastic gradient descent with weighted sampling. In Approximation Theory XV, volume 204 of Springer Proceedings in Mathematics & Statistics, pp. 279–306. Springer, 2017.
 Needell et al. (2016) Needell, D., Srebro, N., and Ward, R. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, Series A, 155(1):549–573, 2016.
 Nemirovski & Yudin (1978) Nemirovski, A. and Yudin, D. B. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Mathematics Doklady, 19, 1978.
 Nemirovski & Yudin (1983) Nemirovski, A. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley Interscience, 1983.
 Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
 Nguyen et al. (2018) Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. SGD and Hogwild! Convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3750–3758. PMLR, 2018.
 Qu & Richtárik (2016) Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.
 Qu & Richtárik (2016) Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.
 Qu et al. (2015) Qu, Z., Richtárik, P., and Zhang, T. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems, pp. 865–873, 2015.
 Rakhlin et al. (2012) Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In 29th International Conference on Machine Learning, volume 12, pp. 1571–1578, 2012.
 Recht et al. (2011) Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.
 Richtárik & Takáč (2016) Richtárik, P. and Takáč, M. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.
 Richtárik & Takáč (2016) Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.
 Richtárik & Takáč (2017) Richtárik, P. and Takáč, M. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
 Shalev-Shwartz (2016) Shalev-Shwartz, S. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pp. 747–754, 2016.
 Shalev-Shwartz et al. (2007) Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: primal estimated sub-gradient solver for SVM. In 24th International Conference on Machine Learning, pp. 807–814, 2007.
 Shamir & Zhang (2013) Shamir, O. and Zhang, T. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pp. 71–79, 2013.
 Vaswani et al. (2018) Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for overparameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.
Appendix A Elementary Results

In this section we collect some elementary results, some of which we use repeatedly.

Proposition A.1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be $L$–smooth, and assume it has a minimizer $x^*$ on $\mathbb{R}^d$. Then

$\|\nabla f(x)\|^2 \le 2L\,(f(x) - f(x^*))$ for all $x \in \mathbb{R}^d$.

Proof. Lipschitz continuity of the gradient implies that

$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$ for all $x, y \in \mathbb{R}^d$.

Now plugging $y = x - \frac{1}{L}\nabla f(x)$ into the above inequality, we get

$f(x^*) \le f\left(x - \tfrac{1}{L}\nabla f(x)\right) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|^2$,

and rearranging gives the result. ∎