We consider stochastic optimization methods for the finite-sum problem

    min_x F(x) := (1/n) Σ_{i=1}^n f_i(x),    (1.1)

where each function f_i is smooth and convex, and the sum F is strongly convex. A classical approach to solving (1.1) is stochastic gradient descent (Sgd). At each iteration, Sgd independently samples an index i uniformly from {1, ..., n}, and uses the (stochastic) gradient ∇f_i to compute its update. The stochasticity makes each iteration of Sgd cheap, and the independent uniform sampling of i makes ∇f_i(x) an unbiased estimator of the full gradient ∇F(x). These properties are central to Sgd's effectiveness in large-scale machine learning, and underlie much of its theoretical analysis (see, for instance, [34, 26, 2, 5, 30]).
However, what is actually used in practice is the without-replacement version of Sgd, henceforth called RandomShuffle. Specifically, at each epoch RandomShuffle independently samples a uniformly random permutation of the n functions (some implementations shuffle the data only once at load time, rather than at each epoch). It then iterates over the functions in the order given by the sampled permutation and updates in a manner similar to Sgd. By avoiding random sampling at each iteration, RandomShuffle can be computationally more practical; furthermore, as one would expect, RandomShuffle is empirically known to converge faster than Sgd.
This discrepancy between theory and practice has been a long-standing problem in the theory of Sgd. It has drawn renewed attention recently, with the goal of better understanding the convergence of RandomShuffle. The key difficulty is that without-replacement sampling produces statistically dependent samples, which greatly complicates the analysis. Two positive results at opposite extremes are nevertheless available: Shamir shows that RandomShuffle is not much worse than usual Sgd, provided the number of epochs is not too large; while Gürbüzbalaban et al. show that RandomShuffle converges faster than Sgd asymptotically, at the rate O(1/T²).
But it remains unclear what happens in between, after a reasonable finite number of epochs are run. This regime is the most compelling one to study, since in practice one runs neither one nor infinitely many epochs. This motivates the central question of our paper:
Does RandomShuffle converge faster than Sgd after a reasonable number of epochs?
We answer this question positively in this paper; our results are more precisely summarized below.
(Table: summary of the settings analyzed for each algorithm: quadratic, Lipschitz Hessian, sparse data, and PL condition.)
1.1 Summary of results
We follow the common practice of reporting convergence rates in terms of T, the number of calls to the (stochastic / incremental) gradient oracle. For instance, Sgd converges at the rate O(1/T) for solving (1.1), ignoring logarithmic terms. The underlying argument views Sgd as stochastic approximation with noise of bounded variance, thereby ignoring the finite-sum structure of (1.1). Our key observation for RandomShuffle is that one should reasonably include a dependence on n in the bound (see Section 3.3). This compromise leads to a better dependence on T, which in turn shows how RandomShuffle beats Sgd after a finite number of epochs. Our main contributions are the following:
Under a mild assumption of second-order differentiability, and assuming strong convexity, we establish a convergence rate of Õ(1/T² + n³/T³) for RandomShuffle, where n is the number of components in (1.1) and T is the total number of iterations (Theorems 1 and 2). From the bounds we can calculate the precise number of epochs after which RandomShuffle is strictly better than Sgd.
We prove that a dependence on n is necessary for beating the Sgd rate O(1/T). This tradeoff precludes the possibility of proving a convergence rate of the type O(1/T^α) with some α > 1 in the general case, and justifies our choice of introducing n into the rate (Theorem 3).
Assuming a sparse data setting common in machine learning, we further improve the convergence rate of RandomShuffle; the improved rate is strictly better than Sgd's, indicating RandomShuffle's advantage in such cases (Theorem 4).
We extend our results to the non-convex class of functions satisfying the Polyak-Łojasiewicz condition, establishing a similar rate for RandomShuffle (Theorem 5).
We exhibit a class of examples where RandomShuffle is provably faster than Sgd after an arbitrary number of iterations (even fewer than one epoch) (Theorem 7).
We provide a detailed discussion of various aspects of our results in Section 6, including explicit comparisons to Sgd, the role of condition numbers, as well as some limitations. Finally, we end by noting some extensions and open problems in Section 7. As one of the extensions, for non-strongly convex problems, we prove that RandomShuffle achieves a convergence rate comparable to Sgd's, with a possibly smaller constant in the bound in certain parameter regimes (Theorem 6).
1.2 Related work
Recht and Ré conjecture a tantalizing matrix AM-GM inequality that would underlie RandomShuffle's superiority over Sgd. While limited progress on this conjecture has been reported [14, 38], the correctness of the full conjecture remains wide open. Using the technique of transductive Rademacher complexity, Shamir shows that RandomShuffle is not worse than Sgd, provided the number of iterations is not too large. An asymptotic analysis, which proves that RandomShuffle attains an O(1/T²) rate for large T, is given by Gürbüzbalaban et al. Ying et al. show that for a fixed step size, RandomShuffle asymptotically converges to a distribution closer to the optimum than Sgd does.
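The scalar (commuting) case of the conjectured matrix AM-GM inequality is just the classical AM-GM inequality, and it already suggests why one epoch of without-replacement updates can contract error at least as fast, in expectation, as with-replacement updates. The following sketch (with arbitrary made-up scalars) checks this numerically; it illustrates only the scalar case, not the open matrix conjecture.

```python
import math
import random

# Scalar analogue of the Recht-Re matrix AM-GM conjecture: for positive
# scalars a_1..a_n, any without-replacement "product of updates" equals
# prod(a_i), while the expected with-replacement product over n independent
# uniform draws equals mean(a)**n. AM-GM gives prod(a_i) <= mean(a)**n.
random.seed(0)
for _ in range(100):
    n = random.randint(2, 6)
    a = [random.uniform(0.1, 2.0) for _ in range(n)]
    without_repl = math.prod(a)        # same value for every permutation
    with_repl = (sum(a) / n) ** n      # E[product] by independence of draws
    assert without_repl <= with_repl + 1e-12
```

In the matrix case the factors no longer commute, which is exactly what makes the full conjecture hard.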
When the functions are visited in a deterministic order (e.g., cyclic), the method becomes Incremental Gradient Descent (Igd), which has a long history. Kohonen shows that Igd converges to a limit cycle for quadratic functions under a constant step size. Convergence to a neighborhood of the optimum for more general functions is studied in several works, under the assumption that the step size is bounded away from zero. With a properly diminishing step size, Nedić and Bertsekas show that a convergence rate in terms of distance to the optimum can be achieved under strong convexity of the finite sum; this rate is further improved under a second-order differentiability assumption.
In practice, RandomShuffle has long been the standard heuristic. Based on numerical experiments, Bottou observes a convergence rate of approximately O(1/T²) for RandomShuffle. Without-replacement sampling also improves data-access efficiency in distributed settings; see, for instance, [9, 18]. The permutation-sampling idea has further been embedded into more sophisticated algorithms; see [6, 8, 32] for variance-reduced methods, and related work on decomposition methods.
Finally, we note a related body of work on coordinate descent, where a similar question has been studied: when does a random permutation over coordinates behave well? Gürbüzbalaban et al. give two kinds of quadratic problems on which the cyclic version of coordinate descent beats the with-replacement one, a stronger statement since it implies that random permutation also beats the with-replacement method on these problems. However, such a deterministic version of the algorithm suffers from poor worst-case behavior. Indeed, a setting has been analyzed in which cyclic coordinate descent can be dramatically worse than both the with-replacement and random-permutation versions of coordinate descent. Lee and Wright further study this setting, and analyze how the random-permutation version of coordinate descent avoids the slow convergence of the cyclic version. Wright et al. propose a more general class of quadratic functions on which random permutation outperforms cyclic coordinate descent.
2 Background and problem setup
For problem (1.1), we assume that the finite-sum function F is strongly convex, i.e.,

    F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + (μ/2)‖y − x‖²  for all x, y,

where μ > 0 is the strong convexity parameter. Furthermore, we assume each component function f_i is L-smooth, so that for all x, y, there exists a constant L such that

    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖.

Furthermore, we assume that the component functions are twice differentiable with Lipschitz continuous Hessians. We use ∇²f_i(x) to denote the Hessian of f_i at x. Specifically, for each f_i, we assume that for all x, y, there exists a constant L_H such that

    ‖∇²f_i(x) − ∇²f_i(y)‖ ≤ L_H‖x − y‖.    (2.2)
The norm ‖·‖ denotes the spectral norm for matrices and the ℓ₂ norm for vectors. We denote the unique minimizer of F by x*, and the index set {1, ..., n} by [n]. Complexity bounds are stated as Õ(·), with all logarithmic terms hidden. All other parameters that might be hidden in the complexity bounds will be clarified in the corresponding sections.
2.1 The algorithms under study: Sgd and RandomShuffle
For both Sgd and RandomShuffle, we use a step size η that is fixed before the algorithms are run. The iterates generated by both methods are denoted x_0, x_1, ..., x_T; here x_0 is the initial point and T is the total number of iterations (i.e., the number of stochastic gradients used).
Sgd is defined as follows: at each iteration t, it picks an index i_t independently and uniformly from the index set [n], and then performs the update

    x_{t+1} = x_t − η ∇f_{i_t}(x_t).
In contrast, RandomShuffle runs as follows: at each epoch k, it picks one permutation σ_k independently and uniformly from the set of all permutations of [n]. Then, it sequentially visits each of the n component functions of the finite sum (1.1) according to σ_k and performs the update

    x_{k,i} = x_{k,i−1} − η ∇f_{σ_k(i)}(x_{k,i−1}),

for i = 1, ..., n. Here x_{k,i} represents the i-th iterate within the k-th epoch. For two consecutive epochs k and k+1, one has x_{k+1,0} = x_{k,n}; for the initial point, one has x_{1,0} = x_0. For convenience of analysis, we always assume RandomShuffle is run for an integer number of epochs, i.e., T = Kn for some integer K. This is a reasonable assumption, given that our main interest is the regime where several epochs of RandomShuffle are run.
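To make the contrast concrete, here is a minimal Python sketch of both update rules on a toy one-dimensional least-squares instance; the component functions and all numbers below are made up for illustration and are not the paper's experimental setup.

```python
import random

def sgd(grads, x0, eta, T, n, seed=0):
    # With-replacement Sgd: sample an index i_t uniformly at every step.
    rng = random.Random(seed)
    x = x0
    for _ in range(T):
        i = rng.randrange(n)
        x -= eta * grads[i](x)
    return x

def random_shuffle(grads, x0, eta, epochs, n, seed=0):
    # RandomShuffle: draw a fresh uniform permutation at each epoch,
    # then take one pass visiting every component exactly once.
    rng = random.Random(seed)
    x = x0
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        for i in perm:
            x -= eta * grads[i](x)
    return x

# Toy instance (made up): f_i(x) = (x - b_i)^2 / 2, so F is minimized at mean(b).
b = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, bi=bi: x - bi for bi in b]
x_rs = random_shuffle(grads, x0=0.0, eta=0.02, epochs=200, n=4)
x_sg = sgd(grads, x0=0.0, eta=0.02, T=800, n=4)
assert abs(x_rs - 2.5) < 0.05
assert abs(x_sg - 2.5) < 1.0
```

Both methods make the same number of gradient calls here (T = Kn); the only difference is the sampling scheme.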
3 Convergence analysis of RandomShuffle
The goal of this section is to develop the theoretical analysis of RandomShuffle. Specifically, we answer the following question: when can we show that RandomShuffle is better than Sgd? We begin by analyzing quadratic functions in Section 3.1, where the analysis benefits from a constant Hessian. Subsequently, in Section 3.2, we extend the analysis to the general (smooth) strongly convex setting. A key idea in our analysis is to make the convergence rate bounds sensitive to n, the number of components in the finite sum (1.1). In Section 3.3, we discuss and justify the necessity of introducing n into our convergence bound.
3.1 RandomShuffle for quadratics
We first consider the quadratic instance of (1.1), where

    f_i(x) = (1/2) xᵀA_i x + b_iᵀx,    (3.1)

where each A_i is positive semi-definite and each b_i is a vector. Note that in the analysis of strongly convex problems, the quadratic case is often the one in which tight bounds are attained.
Quadratic functions have a constant Hessian ∇²f_i(x) ≡ A_i, which eases the analysis. As for the usual Sgd, our bound depends on the following constants: (i) the strong convexity parameter μ and the component-wise Lipschitz constant L; (ii) a diameter bound D (i.e., every iterate remains bounded; this can be enforced by explicit projection if needed); and (iii) a gradient bound G, with ‖∇f_i(x)‖ ≤ G for each i and any x satisfying (ii). We omit these constants for clarity, but discuss the condition number further in Section 6.
Our main result for RandomShuffle is the following (omitting logarithmic terms):
Let F be defined by (3.1). The sample complexity for RandomShuffle to reach an ε-accurate solution is no more than Õ(1/√ε + n/ε^{1/3}).
We observe that in the regime where T gets large, our result matches the asymptotic O(1/T²) rate. It provides more information, however, when the number of epochs is not so large that the n-dependent term can be neglected; this setting is clearly the most compelling one to study. Formally, we recover the main result of Gürbüzbalaban et al. as follows:

As T → ∞, RandomShuffle achieves an asymptotic O(1/T²) convergence rate when run with a proper step size schedule.
3.2 RandomShuffle for strongly convex problems
Next, we consider the more general case where each component function is convex and the sum is strongly convex. Surprisingly, one can easily adapt the proof methodology of Theorem 1 to this setting. (Intuitively, the variation of the Hessian over the domain could raise challenges; however, the convergence rate we obtain is quite similar to the quadratic case, with only a mild dependence on the Hessian Lipschitz constant.) To this end, our analysis requires the further assumption that each component function is twice differentiable and its Hessian satisfies the Lipschitz condition (2.2) with constant L_H.
Under these assumptions, we obtain the following result:
Define constant . So long as , with step size , RandomShuffle achieves convergence rate:
3.3 Understanding the dependence on n
Since the motivation for our convergence analysis is to show that RandomShuffle behaves better than Sgd, we naturally hope for convergence bounds with a better dependence on T than the O(1/T) bound for Sgd. In an ideal situation, one might hope for a rate of the form O(1/T^α) with some α > 1. One intuitive objection to this goal is evident: if we allow T < n, then by setting T = O(√n), with-replacement sampling rarely repeats an index (by the birthday paradox), so Sgd is essentially the same as RandomShuffle in this regime. Therefore, an n-independent bound of the form O(1/T^α) is unlikely to hold.
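The birthday-paradox heuristic can be made quantitative: with T on the order of √n with-replacement draws, a repeated index occurs only with constant probability, so Sgd's samples are then statistically close to a partial pass of RandomShuffle. A quick sketch:

```python
def collision_prob(n, T):
    # Probability that T uniform with-replacement draws from n items
    # contain at least one repeat (exact birthday-paradox computation).
    p_distinct = 1.0
    for k in range(T):
        p_distinct *= (n - k) / n
    return 1.0 - p_distinct

# For T = sqrt(n), the collision probability is approximately
# 1 - exp(-T^2 / (2n)) = 1 - exp(-1/2), i.e., around 0.39.
p = collision_prob(n=10_000, T=100)
assert 0.3 < p < 0.5
```

So for T up to about √n the two sampling schemes are hard to distinguish, which is why any rate beating O(1/T) in that regime must involve n.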
However, this argument is not rigorous once we require a positive number of epochs to be run (at least one full pass over the data). To this end, we provide the following result, which shows the impossibility of obtaining a rate O(1/T^α) with α > 1 even when T ≥ n is required.
Given knowledge of the problem constants, and under the assumption of constant step sizes, no step size choice for RandomShuffle leads to a convergence rate of O(1/T^α) for any α > 1, if n is not allowed to appear in the bound.
Here ᵀ denotes the transpose of a vector, A is a positive definite matrix, and b is a vector. Running RandomShuffle on (3.2) leads to a closed-form expression for RandomShuffle's error. Then, by setting T = n (i.e., running RandomShuffle for only one epoch) and assuming a convergence rate of O(1/T^α), we derive a contradiction by properly choosing A and b. The detailed proof can be found in Appendix C. We directly have the following corollary:
Given knowledge of the problem constants, under the assumption T ≥ n and a constant step size, there is no step size choice that leads to a convergence rate of O(1/T^α) for α > 1.
This result indicates that in order to achieve a better dependence on T using constant step sizes, the bound should either: (i) depend on n; (ii) make some stronger assumption on T being large enough (at the least, excluding T = n); or (iii) leverage a more versatile step size schedule, which could be hard to design and analyze.
Although Theorem 3 shows that one cannot hope (under constant step sizes) for a better dependence on T for RandomShuffle without an extra dependence on n, whether the dependence on n we have obtained is optimal requires further discussion. In the special case T = n, numerical evidence shows that RandomShuffle behaves at least as well as Sgd; however, our bound fails to even show that RandomShuffle converges in this setting. It is therefore reasonable to conjecture that a better dependence on n exists. In the following section, we improve the dependence on n in a specific setting, but whether a better dependence on n can be achieved in general remains open. (Convergence rates with a dependence on n also appear for some variance-reduction methods; see, for instance, [15, 7]. Sample complexity lower bounds have also been shown to depend on n in similar settings.)
4 Sparse functions
In the literature on large-scale machine learning, sparsity is a common feature of data. When the data are sparse, each training data point has only a few non-zero features. In such a setting, each iteration of Sgd modifies only a few coordinates of the decision variable. Commonly occurring sparse problems include large-scale logistic regression, matrix completion, and graph cuts.
Sparse data provides a promising setting in which RandomShuffle might be powerful. Intuitively, when data are sparse, the with-replacement sampling used by Sgd is likely to miss some decision variables, while RandomShuffle is guaranteed to update every decision variable in each epoch. In this section, we present theoretical results justifying this intuition.
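The "missing" intuition can be quantified: in one epoch's worth of draws (n samples), with-replacement sampling leaves roughly a 1/e fraction of the components unvisited in expectation, while RandomShuffle visits all of them by construction. A small sketch (the simulation parameters are arbitrary):

```python
import random

def missed_fraction(n, seed=0):
    # Fraction of the n components never sampled in n with-replacement draws.
    rng = random.Random(seed)
    seen = {rng.randrange(n) for _ in range(n)}
    return 1 - len(seen) / n

# The average over a few runs concentrates near (1 - 1/n)^n ~ 1/e ~ 0.368.
avg = sum(missed_fraction(10_000, seed=s) for s in range(20)) / 20
assert 0.33 < avg < 0.41
```

If a missed component is the only one touching some coordinate, that coordinate receives no update at all during the epoch, which is exactly the failure mode that sparsity amplifies.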
Formally, a sparse finite-sum problem assumes the form
where S_i ⊆ {1, ..., d} denotes a small subset of coordinates and x_{S_i} denotes the entries of the vector x indexed by S_i. By representing each subset S_i with a node, and adding an edge between nodes i and j whenever S_i ∩ S_j ≠ ∅, we get a graph with n nodes. Following earlier notation, we consider the sparsity factor σ of this graph:
One obvious fact is that σ ≤ 1. The statistic (4.1) indicates how likely it is that two subsets of indices intersect, which reflects the sparsity of the problem. For a problem with strong sparsity, we anticipate a relatively small value of σ. We summarize our result in the following theorem:
Define constant . So long as , with step size RandomShuffle achieves convergence rate:
Compared with Theorem 2, the bound in Theorem 4 depends on the sparsity parameter σ, so we can exploit sparsity to obtain a faster convergence rate. The key to proving Theorem 4 lies in constructing a tighter bound on the error term in the main recursion (see Section 5) by including a discount due to sparsity.
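To make the graph-based quantity concrete, the sketch below computes one plausible reading of the sparsity factor; the exact normalization in (4.1) is not recoverable from this text, so the definition used here (the maximum fraction of components whose support intersects a given component's support) is an assumption for illustration.

```python
def sparsity_factor(supports):
    # Assumed reading of the sparsity factor: for each component i, count the
    # fraction of components j (including i itself) whose support intersects
    # S_i, and take the maximum over i.
    n = len(supports)
    sets = [set(s) for s in supports]
    return max(sum(1 for Sj in sets if Si & Sj) / n for Si in sets)

# Disjoint supports: each component conflicts only with itself.
assert sparsity_factor([[0], [1], [2], [3]]) == 0.25
# Heavily overlapping supports drive the factor up to 1.
assert sparsity_factor([[0, 1], [1, 2], [0, 2], [2, 3]]) == 1.0
```

Under any such normalization, smaller values correspond to sparser problems and, per Theorem 4, to faster RandomShuffle rates.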
We end this section by noting the following simple corollary:
When , there is some constant only dependent on , , , , , such that as long as , for a proper step size, RandomShuffle achieves convergence rate
5 Proof sketch of Theorem 1
In this section we provide a proof sketch for Theorem 1. The central idea is to establish an inequality of the form

    E‖x_{k,n} − x*‖² ≤ (1 − a)‖x_{k,0} − x*‖² + bη³ + cη⁴,    (5.1)

where x_{k,0} and x_{k,n} are the beginning and final points of the k-th epoch, respectively, and the expectation is over the permutation of functions in epoch k. The constant a captures the speed of the linearly convergent part, while b and c together bound the error introduced by randomness. The underlying motivation for the bound (5.1) is this: when the latter two terms depend on the step size η with order at least η³, then by expanding the recursion over all epochs and choosing η appropriately, we obtain the convergence rate claimed in Theorem 1.
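To see why a recursion of this shape yields a rate with higher powers of 1/T, here is a schematic unrolling; the symbols a, b, c and the step-size scaling below are placeholders standing in for the paper's exact constants, not the constants themselves.

```latex
% Unrolling an epoch recursion of the form (5.1),
%   F_{k+1} \le (1-a)\,F_k + b\,\eta^3 + c\,\eta^4, \qquad a \in (0,1),
% over K epochs gives
F_K \;\le\; (1-a)^K F_0 \;+\; \frac{b\,\eta^3 + c\,\eta^4}{a}.
% If the contraction satisfies a \approx n\mu\eta per epoch, then
% (1-a)^K \le e^{-\mu\eta T}, which decays polynomially in T once
% \eta is of order (\log T)/(\mu T); the residual term then carries
% factors of \eta^2 and \eta^3, i.e., strictly higher powers of 1/T
% than Sgd's O(1/T) noise floor.
```

The whole technical work of the proof is in showing that the per-epoch error really is of order η³ and higher, rather than the η² that a crude bound would give.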
By the definition of the RandomShuffle update and simple calculations, we have the following key equality for one epoch of RandomShuffle:
The idea behind this equality is to split the progress made by RandomShuffle in a given epoch into two parts: a part that behaves like full gradient descent, and a part that captures the effects of random sampling. In particular, for a permutation σ, the latter part involves the gradient error of RandomShuffle for epoch k, i.e.,
which is a random variable depending on σ. Thus, the corresponding terms are also random variables that depend on σ, and require taking expectations. The main body of our analysis involves bounding each of these terms separately.
The first term can be easily bounded by exploiting the strong convexity of F, using a standard inequality (Theorem 2.1.11 in Nesterov's introductory lectures), as follows:
A key step toward establishing (5.1) is to bound the expected error term, where the expectation is over the permutation. However, it is not easy to bound this term directly by a quantity of order η³. Instead, we decompose it further into three parts: (i) the first part depends on the distance to the optimum (which is then absorbed into the contraction term of (5.1)); (ii) the second part depends on the gradient norm (which is then dominated by the gradient norm term in the bound (5.2)); and (iii) the third part has an at least cubic dependence on η (which is then jointly captured by the last two terms of (5.1)). Specifically, by introducing second-order information and a somewhat involved analysis, we obtain the following bound:
Over the randomness of the permutation, we have the inequality:
where i is drawn uniformly from [n].
Since x* is the minimizer, we have an elegant bound on the second-order interaction term:
With i drawn uniformly from [n], and x* the minimizer of the sum function, we have:
We handle the next term by dominating it with the gradient norm term of the bound (5.2), and finally bound the second permutation-dependent term using the following lemma.
For any permutation realized in the k-th epoch, we have the bound
Using this bound, the corresponding term can be absorbed into the last term of (5.1).
Based on the above results, we obtain a recursive inequality of the form (5.1). Expanding the recursion and substituting the step-size choice into it ultimately leads to the bound of Theorem 1 (see (A.17) in the Appendix for the dependence on hidden constants). The detailed technical steps can be found in Appendix A.
6 Discussion of results
We discuss below our results in more detail, including their implications, strengths, and limitations.
Comparison with Sgd.
It is well known that under strong convexity, Sgd converges at a rate of Θ(1/T). A direct comparison of the bounds shows the following fact: RandomShuffle is provably better than Sgd after Θ(√n) epochs. This is an acceptable number of epochs even for some of the largest data sets in the current machine learning literature. To our knowledge, this is the first result rigorously showing that RandomShuffle behaves better than Sgd within a reasonable number of epochs. To some extent, this result confirms the belief and observation that RandomShuffle is the “correct” choice in practice, at least when the number of epochs is comparable to √n.
When the algorithm is run in a deterministic fashion, i.e., the functions are visited in a fixed order, a better convergence rate than Sgd can also be achieved as T becomes large. For instance, a result of Gürbüzbalaban et al. translates into such a bound for the deterministic case. This directly implies the same bound for RandomShuffle, since random permutation can only have a weaker worst case. According to that bound, however, the number of epochs required for RandomShuffle to achieve an error smaller than Sgd's is unrealistically large for most applications.
Comparison with Gd.
Another interesting viewpoint comes from comparing RandomShuffle with Gradient Descent (Gd). One limitation of our result is that we do not exhibit a regime where RandomShuffle is better than Gd. By averaging each epoch's gradients and running exact Gd on (1.1), one obtains a linear convergence rate, so our convergence rate for RandomShuffle is worse than that of Gd. This follows naturally from the epoch-based recursion (5.1) in our proof methodology, since within one epoch the sum of the gradients is only shown to be no worse than a full gradient. It is true that Gd behaves better in the long run, as its bound's dependence on n is negligible, and comparing with Gd is not the major goal of this paper. However, being worse than Gd even when T is relatively small suggests that the dependence on n can probably still be improved. It may be worth investigating whether RandomShuffle can be better than both Sgd and Gd in some regime; however, different techniques may be required.
It is also a limitation that our bound only holds after a certain number of epochs. Moreover, this number of epochs depends on the condition number. This limits the interest of our result to problems that are not too ill-conditioned; otherwise, the required number of epochs is itself unrealistic. We are currently not certain whether similar bounds can be proved when T is allowed to take smaller values, or even after only one epoch.
Dependence on the condition number.
It should be noted that the condition number κ = L/μ can sometimes be large. Therefore, it is informative to view our result in a κ-dependent form. In particular, we still treat the other problem constants as fixed, but no longer κ. Our results then translate into κ-dependent convergence rates (see inequalities (A.17) and (E.13) in the Appendix), with corresponding κ-dependent sample complexities for the quadratic and strongly convex cases.
At first sight, the dependence on κ in our convergence rate may seem relatively high. However, it is important to note that our sample complexity's dependence on κ is actually better than what is known for Sgd. A convergence bound for Sgd has long been known that translates, in our notation, into a κ-dependent sample complexity. Although a better κ-dependence has been shown for convergence in function value, no better dependence is known, as far as we know, for convergence in squared distance to the optimum; moreover, the known lower bound for strongly convex stochastic optimization suggests that translating function-value guarantees into squared-distance guarantees is likely to introduce another factor of κ. It is therefore reasonable to believe that the known κ-dependence for Sgd cannot be improved, and that it is worse than that of RandomShuffle.
Sparse data setting.
Notably, in the sparse setting (with sparsity factor σ), the proven convergence rate is strictly better than the rate of Sgd. This result matches the following intuition: when each dimension is touched by only a few component functions, forcing the algorithm to visit every function avoids missing certain dimensions. For larger σ, a similar speedup can be observed; in fact, as long as σ is sufficiently small, the proven bound is better than that of Sgd. This supports the use of RandomShuffle in sparse settings.
7 Extensions and open problems

In this section, we provide some further extensions before concluding with some open problems.
7.1 RandomShuffle for nonconvex optimization
The first extension we discuss is to nonconvex finite-sum problems. In particular, we study RandomShuffle applied to functions satisfying the Polyak-Łojasiewicz (PL) condition (such functions are also known as gradient dominated):

    (1/2)‖∇F(x)‖² ≥ μ (F(x) − F*)  for all x.

Here μ > 0 is a real number and F* is the minimal function value of F. Strong convexity is a special case of this condition, with μ the strong convexity parameter. One important implication of the condition is that every stationary point is a global minimum. However, F can be non-convex under this condition, and the condition does not imply that the minimizer is unique.
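A standard illustration that the PL condition covers genuinely non-convex functions is f(x) = x² + 3 sin²(x), which is non-convex yet gradient dominated; the PL constant μ = 1/32 usually quoted for it in the literature is checked numerically below (a grid-based sanity check, not a proof).

```python
import math

# f(x) = x^2 + 3 sin^2(x): non-convex (f'' < 0 at x = pi/2) but satisfies
# the PL inequality 0.5 * f'(x)^2 >= mu * (f(x) - f*), with f* = 0 at x = 0.
f = lambda x: x ** 2 + 3 * math.sin(x) ** 2
df = lambda x: 2 * x + 3 * math.sin(2 * x)
d2f = lambda x: 2 + 6 * math.cos(2 * x)

assert d2f(math.pi / 2) < 0          # witness of non-convexity
mu = 1 / 32                          # PL constant quoted in the literature
xs = [-10 + 0.01 * k for k in range(2001)]
assert all(0.5 * df(x) ** 2 + 1e-12 >= mu * f(x) for x in xs)
```

Every stationary point of such a function is a global minimum, even though the function itself is not convex.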
This condition was proposed and analyzed by Polyak, who showed a linear convergence rate for Gd under it. Since then, many other optimization methods have been proven efficient under this condition (including second-order methods and variance-reduced gradient methods). Notably, Sgd can be shown to converge at rate O(1/T) in this setting (see the appendix for a proof).
Assume each component function has a Lipschitz continuous gradient, and the average function satisfies the Polyak-Łojasiewicz condition with some constant μ. We have the following extension of our previous results:
Under the Polyak-Łojasiewicz condition, define condition number . So long as , with step size , RandomShuffle achieves convergence rate:
7.2 RandomShuffle for convex problems
An important extension of RandomShuffle is to the general (smooth) convex case, without assuming strong convexity. There are no previous results on the convergence rate of RandomShuffle in this setting that show it to be faster than Sgd. The only result we are aware of is by Shamir, who shows RandomShuffle to be no worse than Sgd in the general (smooth) convex setting. We extend our results to the general convex case, and show a convergence rate that is possibly faster than Sgd's, albeit only in the constant terms.
We take the viewpoint of gradients with errors, and denote by e_i(x) the difference between a component gradient and the full gradient:

    e_i(x) := ∇f_i(x) − ∇F(x).

Different assumptions bounding this error term have been studied in the optimization literature. We assume there is a constant ε that bounds the norm of the gradient error:

    ‖e_i(x)‖ ≤ ε,

where i is any index and x is any point in the domain. Obviously, ε ≤ 2G, with G the gradient norm bound as before. (Another common assumption is that the variance of the gradient is bounded. We adopt the stronger norm bound here for simplicity of analysis; due to the finite-sum structure, there is at most a modest difference between the two assumptions.)
Assume the error bound above holds with i drawn uniformly from [n], and let x* be an arbitrary minimizer of F. Set the step size
Let x̄ denote the average of the epoch-ending iterates of RandomShuffle. Then:
We now discuss several aspects of this result:
Firstly, it is interesting to consider the asymptotic behavior. From this theorem we can observe three levels of possible asymptotic convergence rates (ignoring constants) for RandomShuffle: (1) in the most general situation, it converges as O(1/√T); (2) when the functions are quadratic (i.e., the Hessian Lipschitz constant vanishes) and the variance vanishes locally at the optimum, the rate improves; (3) when the functions are quadratic and the variance vanishes globally, it improves further.
Secondly, recall the known O(1/√T) convergence rate for Sgd. We can further bound the gradient-error constant ε by 2G. Therefore, when ε is relatively small and the functions are quadratic, our bound takes the same O(1/√T) form as Sgd's, with the leading constant possibly smaller than Sgd's in certain parameter regimes.
One obvious limitation of this result is that, when the gradients have no variance globally, it fails to recover the faster rate of Gd. This indicates the possibility of tighter bounds via a more involved analysis. We leave this (either improving the dependence on the gradient error in the presence of noise, or recovering Gd's rate when there is no noise) as an open question.
7.3 Vanishing variance
Our previous results show that RandomShuffle converges faster than Sgd after a certain number of epochs. However, one may want to see whether it is possible to show faster convergence of RandomShuffle after only one epoch, or even within one epoch. In this section, we study a specialized class of strongly convex problems where RandomShuffle has faster convergence rate than Sgd after an arbitrary number of iterations.
We build our example on a vanishing-variance setting: ∇f_i(x*) = 0 for every i at the optimal point x*. Moulines and Bach show that when F is strongly convex, Sgd converges linearly in this setting. For the construction of our example, we assume a slightly stronger condition: each component function is strongly convex.
Given n pairs of positive numbers (μ_i, L_i) with μ_i ≤ L_i, a dimension d, and a point x*, we define a valid problem as a d-dimensional finite-sum function in which each component f_i is μ_i-strongly convex with an L_i-Lipschitz continuous gradient, and the point x* minimizes all component functions simultaneously (which is equivalent to all component gradients vanishing at x*). Let P denote the set of all such problems, called valid problems below. For a problem in P, let the corresponding random variables denote the results of running RandomShuffle and Sgd, respectively, from the same initial point, for the same number of iterations T, with the same step size η.
We have the following result on the worst-case convergence rate of RandomShuffle and Sgd:
Given n pairs of positive numbers (μ_i, L_i) with μ_i ≤ L_i, a dimension d, a point x*, and an initial set of points, let P be the set of valid problems. Then for a suitable step size and any number of iterations, the worst-case error over P of RandomShuffle is no larger than that of Sgd.
This theorem shows that, in the setting above, RandomShuffle has a better worst-case convergence rate than Sgd after an arbitrary number of iterations.
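The one-dimensional case already illustrates the mechanism behind this phenomenon. With shared-minimizer quadratics, scalar factors commute, so every permutation yields the same one-epoch contraction, and the classical AM-GM inequality shows it is at most Sgd's expected contraction. The curvatures below are made up for illustration:

```python
# 1-D vanishing-variance instance: f_i(x) = a_i * (x - 1)^2 / 2, all minimized
# at x* = 1. Over one epoch of RandomShuffle, x - x* is multiplied by
# prod(1 - eta * a_i), which is identical for every permutation since scalars
# commute. Over n steps of Sgd, the expected multiplier is
# (1 - eta * mean(a))**n, and AM-GM gives
# prod(1 - eta * a_i) <= (1 - eta * mean(a))**n whenever all factors are
# nonnegative: RandomShuffle contracts at least as fast in expectation.
a = [0.5, 1.0, 2.0, 4.5]       # hypothetical curvatures
eta = 0.2                      # step size with eta * max(a) < 1
n = len(a)
rs_factor = 1.0
for ai in a:
    rs_factor *= 1 - eta * ai
sgd_factor = (1 - eta * sum(a) / n) ** n
assert 0 <= rs_factor < sgd_factor
```

In higher dimensions the per-step factors are matrices that need not commute, which is why the general comparison requires the worst-case analysis of the theorem rather than this direct scalar argument.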
8 Conclusion and open problems
A long-standing problem in the theory of stochastic gradient descent (Sgd) is to prove that RandomShuffle converges faster than the usual with-replacement Sgd. In this paper, we provide the first non-asymptotic convergence rate analysis for RandomShuffle. We show in particular that after Θ(√n) epochs, RandomShuffle behaves strictly better than Sgd under strong convexity and second-order differentiability. The introduction of a dependence on n into the bound plays an important role in obtaining the better dependence on T. We further improve the dependence on n in sparse data settings, demonstrating RandomShuffle's advantage in such situations.
An important open problem remains: how (and to what extent) can the bound be improved so that RandomShuffle is provably better than Sgd for smaller T? A possible direction is to improve the dependence on n arising in our bounds, though different analysis techniques may be required. It is worth noting that in some special settings this improvement is achievable (for example, in the setting of Theorem 7, RandomShuffle is shown to be better than Sgd for any number of iterations). However, showing that RandomShuffle converges faster in general remains open.
- Arjevani and Shamir  Y. Arjevani and O. Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016.
- Bertsekas  D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
- Bottou  L. Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, 2009.
- Bottou  L. Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
- Bottou et al.  L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
- De and Goldstein  S. De and T. Goldstein. Efficient distributed sgd with variance reduction. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 111–120. IEEE, 2016.
- Defazio et al. [2014a] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014a.
- Defazio et al. [2014b] A. Defazio, J. Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pages 1125–1133, 2014b.
- Feng et al.  X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-rdbms analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 325–336. ACM, 2012.
- Gürbüzbalaban et al. [2015a] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Convergence rate of incremental gradient and newton methods. arXiv preprint arXiv:1510.08562, 2015a.
- Gürbüzbalaban et al. [2015b] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015b.
- Gürbüzbalaban et al.  M. Gürbüzbalaban, A. E. Ozdaglar, P. A. Parrilo, and N. D. Vanli. When cyclic coordinate descent outperforms randomized coordinate descent. In NIPS, 2017.
- Hazan and Kale  E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
- Israel et al.  A. Israel, F. Krahmer, and R. Ward. An arithmetic–geometric mean inequality for products of three matrices. Linear Algebra and its Applications, 488:1–12, 2016.
- Johnson and Zhang  R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
- Kohonen  T. Kohonen. An adaptive associative memory principle. IEEE Transactions on Computers, 100(4):444–445, 1974.
- Lee and Wright  C.-P. Lee and S. J. Wright. Random permutations fix a worst case for cyclic coordinate descent. arXiv preprint arXiv:1607.08320, 2016.
- Lee et al.  J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.
- Moulines and Bach  E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
- Nedić and Bertsekas  A. Nedić and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pages 223–264. Springer, 2001.
- Nemirovski et al.  A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- Nemirovskii et al.  A. Nemirovskii, D. B. Yudin, and E. R. Dawson. Problem complexity and method efficiency in optimization. Wiley, 1983.
- Nesterov  Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- Nesterov and Polyak  Y. Nesterov and B. T. Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- Polyak  B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.
- Rakhlin et al.  A. Rakhlin, O. Shamir, K. Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML. Citeseer, 2012.
- Recht and Ré  B. Recht and C. Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. arXiv preprint arXiv:1202.4184, 2012.
- Recht et al.  B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
- Reddi et al.  S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
- Shalev-Shwartz and Ben-David  S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Shalev-Shwartz and Zhang  S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
- Shamir  O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.
- Solodov  M. V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
- Sra et al.  S. Sra, S. Nowozin, and S. J. Wright. Optimization for machine learning. Mit Press, 2012.
- Sun and Ye  R. Sun and Y. Ye. Worst-case complexity of cyclic coordinate descent: gap with randomized version. arXiv preprint arXiv:1604.07130, 2016.
- Wright and Lee  S. J. Wright and C.-P. Lee. Analyzing random permutations for cyclic coordinate descent. arXiv preprint arXiv:1706.00908, 2017.
- Ying et al.  B. Ying, K. Yuan, S. Vlaski, and A. H. Sayed. Stochastic learning under random reshuffling. arXiv preprint arXiv:1803.07964, 2018.
- Zhang  T. Zhang. A note on the non-commutative arithmetic-geometric mean inequality. arXiv preprint arXiv:1411.5058, 2014.
Appendix A Proof of Theorem 1
Assume where is a positive integer. Denote by the th iterate of the th epoch. We have , , . Assume the permutation used in the th epoch is . Define the error term
For one epoch of RandomShuffle, we have the following inequality:
where the inequality is due to Theorem 2.1.11 in .
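Since the original symbols are not reproduced here, the per-epoch recursion underlying this bound can be written, in notation of our own choosing ($x_i^k$ for the $i$th iterate of the $k$th epoch, $\sigma_k$ for that epoch's permutation, $\gamma$ for the step size), as:

```latex
x_{i+1}^{k} = x_i^{k} - \gamma \, \nabla f_{\sigma_k(i)}\bigl(x_i^{k}\bigr),
\qquad i = 1, \dots, n,
\qquad x_1^{k+1} = x_{n+1}^{k}.
```

That is, within an epoch each component gradient is applied exactly once, in the order given by $\sigma_k$, and the last iterate of one epoch seeds the next.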
Taking the expectation of (A.1) over the randomness of the permutation , we obtain
It remains to bound the two terms that depend on . First, we bound the norm of :
where the first and second inequalities follow from the triangle inequality for vector norms, the third from the definition of , and the fourth from the definition of . By this result, we have
For the term, we need a more careful bound. Since the Hessian of a quadratic function is constant, we use to denote the Hessian matrix of the function . We begin with the following decomposition:
Here we define random variables