# Random Shuffling Beats SGD after Finite Epochs

A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version RandomShuffle converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis will show, in general a dependence on n is unavoidable without further changes to the algorithm. We show that for sparse data RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient dominated functions, as well as non-strongly convex settings.

## 1 Introduction

We consider stochastic optimization methods for the finite-sum problem

$$F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \tag{1.1}$$

where each function $f_i$ is smooth and convex, and the sum $F$ is strongly convex. A classical approach to solving (1.1) is stochastic gradient descent (Sgd). At each iteration Sgd independently samples an index $i$ uniformly from $\{1,\dots,n\}$, and uses the (stochastic) gradient $\nabla f_i$ to compute its update. The stochasticity makes each iteration of Sgd cheap, and the uniformly independent sampling of $i$ makes $\nabla f_i$ an unbiased estimator of the full gradient $\nabla F$. These properties are central to Sgd's effectiveness in large scale machine learning, and underlie much of its theoretical analysis (see for instance, [34, 26, 2, 5, 30]).

However, what is actually used in practice is the without replacement version of Sgd, henceforth called RandomShuffle. Specifically, at each epoch RandomShuffle samples a random permutation of the functions uniformly independently (some implementations shuffle the data only once at load, rather than at each epoch). Then, it iterates over these functions according to the sampled permutation and updates in a manner similar to Sgd. Avoiding the use of random sampling at each iteration, RandomShuffle can be computationally more practical [4]; furthermore, as one would expect, empirically RandomShuffle is known to converge faster than Sgd [3].

This discrepancy between theory and practice has been a long-standing problem in the theory of Sgd. It has drawn renewed attention recently, with the goal of better understanding convergence of RandomShuffle. The key difficulty is that without-replacement sampling leads to statistically non-independent samples, which greatly complicates analysis. Two extreme-case positive results are however available: Shamir [32] shows that RandomShuffle is not much worse than usual Sgd, provided the number of epochs is not too large; while Gürbüzbalaban et al. [11] show that RandomShuffle converges faster than Sgd asymptotically, at the rate $O(1/T^2)$.

But it remains unclear what happens in between, after a reasonable finite number of epochs are run. This regime is the most compelling one to study, since in practice one runs neither one nor infinitely many epochs. This motivates the central question of our paper:

Does RandomShuffle converge faster than Sgd after a reasonable number of epochs?

We answer this question positively in this paper; our results are more precisely summarized below.

### 1.1 Summary of results

We follow the common practice of reporting convergence rates depending on $T$, the number of calls to the (stochastic / incremental) gradient oracle. For instance, Sgd converges at the rate $O(1/T)$ for solving (1.1), ignoring logarithmic terms in the bound [26]. The underlying argument is to view Sgd as stochastic approximation with noise [21], therefore ignoring the finite-sum structure of (1.1). Our key observation for RandomShuffle is that one should reasonably include dependence on $n$ in the bound (see Section 3.3). Such a compromise leads to a better dependence on $T$, which further shows how RandomShuffle beats Sgd after a finite number of epochs. Our main contributions are the following:

• Under a mild assumption on second order differentiability, and assuming strong convexity, we establish a convergence rate of $O(1/T^2 + n^3/T^3)$ for RandomShuffle, where $n$ is the number of components in (1.1), and $T$ is the total number of iterations (Theorems 1 and 2). From the bounds we can calculate the precise number of epochs after which RandomShuffle is strictly better than Sgd.

• We prove that a dependence on $n$ is necessary for beating the Sgd rate $O(1/T)$. This tradeoff precludes the possibility of proving a convergence rate of the type $O(1/T^{1+\delta})$ with some $\delta > 0$ in the general case, and justifies our choice of introducing $n$ into the rate (Theorem 3).

• Assuming a sparse data setting common in machine learning, we further improve the convergence rate of RandomShuffle to $O(1/T^2)$. This rate is strictly better than that of Sgd, indicating RandomShuffle's advantage in such cases (Theorem 4).

• We extend our results to the non-convex function class satisfying the Polyak-Łojasiewicz condition, establishing a similar rate for RandomShuffle (Theorem 5).

• We show a class of examples where RandomShuffle is provably faster than Sgd after an arbitrary number of iterations (even fewer than one epoch) (Theorem 7).

We provide a detailed discussion of various aspects of our results in Section 6, including explicit comparisons to Sgd, the role of condition numbers, as well as some limitations. Finally, we end by noting some extensions and open problems in Section 7. As one of the extensions, for non-strongly convex problems, we prove that RandomShuffle achieves a convergence rate comparable to that of Sgd, with a possibly smaller constant in the bound in certain parameter regimes (Theorem 6).

### 1.2 Related work

Recht and Ré [27] conjecture a tantalizing matrix AM-GM inequality that underlies RandomShuffle's superiority over Sgd. While limited progress on this conjecture has been reported [14, 38], the correctness of the full conjecture is still wide open. With the technique of transductive Rademacher complexity, Shamir [32] shows that RandomShuffle is not worse than Sgd provided the number of iterations is not too large. Asymptotic analysis is provided in [11], which proves that RandomShuffle limits to a rate $O(1/T^2)$ for large $T$. Ying et al. [37] show that for a fixed step size, RandomShuffle converges to a distribution closer to optimal than Sgd asymptotically.

When the functions are visited in a deterministic order (e.g., cyclic), the method turns into Incremental Gradient Descent (Igd), which has a long history [2]. Kohonen [16] shows that Igd converges to a limit cycle under constant step size and quadratic functions. Convergence to neighborhood of optimality for more general functions is studied in several works, under the assumption that step size is bounded away from zero (see for instance [33]). With properly diminishing step size, Nedić and Bertsekas [20] show that an convergence rate in terms of distance to optimal can be achieved under strong convexity of the finite-sum. This rate is further improved in [10] to under a second order differentiability assumption.

In the real world, RandomShuffle has been proposed as a standard heuristic [4]. With numerical experiments, Bottou [3] observes an approximately $O(1/T^2)$ convergence rate of RandomShuffle. Without-replacement sampling also improves data-access efficiency in distributed settings, see for instance [9, 18]. The permutation-sampling idea has been further embedded into more complicated algorithms; see [6, 8, 32] for variance-reduced methods, and [31] for decomposition methods.

Finally, we note a related body of work on coordinate descent, where a similar problem has been studied: when does random permutation over coordinates behave well? Gürbüzbalaban et al. [12] give two kinds of quadratic problems for which the cyclic version of coordinate descent beats the with-replacement one, which is a stronger result indicating that random permutation also beats the with-replacement method. However, such a deterministic version of the algorithm suffers from a poor worst case. Indeed, in [35] a setting is analyzed where cyclic coordinate descent can be dramatically worse than both the with-replacement and random permutation versions of coordinate descent. Lee and Wright [17] further study this setting, and analyze how the random permutation version of coordinate descent avoids the slow convergence of the cyclic version. In [36], Wright et al. propose a more general class of quadratic functions where random permutation outperforms cyclic coordinate descent.

## 2 Background and problem setup

For problem (1.1), we assume the finite sum function $F$ is strongly convex, i.e.,

$$F(x) \ge F(y) + \langle \nabla F(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2,$$

where $x, y \in \mathbb{R}^d$, and $\mu > 0$ is the strong convexity parameter. Furthermore, we assume each component function is $L$-smooth, so that for $i = 1, \dots, n$, there exists a constant $L$ such that

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|. \tag{2.1}$$

Furthermore, we assume that the component functions are second order differentiable with a Lipschitz continuous Hessian. We use $H_i(x)$ to denote the Hessian of function $f_i$ at $x$. Specifically, for each $i = 1, \dots, n$, we assume that there exists a constant $L_H$ such that, for all $x, y$,

$$\|H_i(x) - H_i(y)\| \le L_H\|x - y\|. \tag{2.2}$$

The norm $\|\cdot\|$ is the spectral norm for matrices and the $\ell_2$ norm for vectors. We denote the unique minimizer of $F$ as $x^*$, and the index set $\{1, \dots, n\}$ as $[n]$. The complexity bound is represented as $O(\cdot)$, with all logarithmic terms hidden. All other parameters that might be hidden in the complexity bounds will be clarified in corresponding sections.

### 2.1 The algorithms under study: Sgd and RandomShuffle

For both Sgd and RandomShuffle, we use $\gamma$ as the step size, which is predetermined before the algorithms are run. The sequences generated by both methods are denoted as $(x_k)_{k=0}^{T}$; here $x_0$ is the initial point and $T$ is the total number of iterations (i.e., number of stochastic gradients used).

Sgd is defined as follows: for each iteration $1 \le k \le T$, it picks an index $s(k)$ independently uniformly from the index set $[n]$, and then performs the update

$$x_k = x_{k-1} - \gamma \nabla f_{s(k)}(x_{k-1}). \tag{Sgd}$$

In contrast, RandomShuffle runs as follows: for each epoch $t$, it picks one permutation $\sigma_t$ independently uniformly from the set of all permutations of $[n]$. Then, it sequentially visits each of the component functions of the finite-sum (1.1) and performs the update

$$x^t_k = x^t_{k-1} - \gamma \nabla f_{\sigma_t(k)}(x^t_{k-1}), \tag{RandomShuffle}$$

for $1 \le k \le n$. Here $x^t_k$ represents the $k$-th iterate within the $t$-th epoch. For two consecutive epochs $t$ and $t+1$, one has $x^{t+1}_0 = x^t_n$; for the initial point one has $x^1_0 = x_0$. For convenience of analysis, we always assume RandomShuffle is run for an integer number of epochs, i.e., $T$ is a multiple of $n$. This is a reasonable assumption given our main interest is when several epochs of RandomShuffle are run.
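The two update rules differ only in how the index is chosen at each step. The following is a minimal sketch of both, assuming a hypothetical least-squares instance $f_i(x) = \frac12 (a_i^\top x - b_i)^2$ and a constant step size; the helper names (`make_problem`, `grad_i`, `sgd`, `random_shuffle`) and all parameter values are illustrative and do not come from the paper.

```python
import numpy as np

def make_problem(n=64, d=5, noise=0.1, seed=0):
    """Hypothetical instance: f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    b = A @ rng.normal(size=d) + noise * rng.normal(size=n)
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]  # minimizer of F
    return A, b, x_star

def grad_i(A, b, x, i):
    """Gradient of the i-th component function at x."""
    return (A[i] @ x - b[i]) * A[i]

def sgd(A, b, x0, gamma, T, rng):
    """With-replacement sampling: a fresh uniform index at every step."""
    x, visited, n = x0.copy(), [], A.shape[0]
    for _ in range(T):
        i = int(rng.integers(n))
        x -= gamma * grad_i(A, b, x, i)
        visited.append(i)
    return x, visited

def random_shuffle(A, b, x0, gamma, epochs, rng):
    """Without-replacement sampling: a fresh permutation at every epoch."""
    x, visited, n = x0.copy(), [], A.shape[0]
    for _ in range(epochs):
        for i in rng.permutation(n):
            x -= gamma * grad_i(A, b, x, int(i))
            visited.append(int(i))
    return x, visited
```

Both routines use the same number of component gradients when `T == epochs * n`, which is the budget under which the rates in this paper are compared.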

## 3 Convergence analysis of RandomShuffle

The goal of this section is to build theoretical analysis for RandomShuffle. Specifically, we answer the following question: when can we show RandomShuffle to be better than Sgd? We begin by analyzing quadratic functions in Section 3.1, where the analysis benefits from having a constant Hessian. Subsequently, in Section 3.2, we extend our analysis to the general (smooth) strongly convex setting. A key idea in our analysis is to make the convergence rate bounds sensitive to $n$, the number of components in the finite-sum (1.1). In Section 3.3, we discuss and justify the necessity of introducing $n$ into our convergence bound.

### 3.1 RandomShuffle for quadratic functions

We first consider the quadratic instance of (1.1), where

$$f_i(x) = \frac{1}{2}x^T A_i x + b_i^T x, \quad i = 1, \dots, n, \tag{3.1}$$

where $A_i \in \mathbb{R}^{d \times d}$ is positive semi-definite, and $b_i \in \mathbb{R}^d$. Note that in the analysis of strongly convex problems, the quadratic case is often the one in which tight bounds are attained.

Quadratic functions have a constant Hessian function $H_i(x) = A_i$, which eases our analysis. Similar to the usual Sgd, our bound also depends on the following constants: (i) the strong convexity parameter $\mu$ and the component-wise Lipschitz constant $L$; (ii) a diameter bound $\|x - x^*\| \le D$ (i.e., any iterate $x$ remains bounded; this can be enforced by explicit projection if needed); and (iii) bounded gradients $\|\nabla f_i(x)\| \le G$ for each $f_i$ ($1 \le i \le n$) and any $x$ satisfying (ii). We omit these constants for clarity, but discuss the condition number further in Section 6.

Our main result for RandomShuffle is the following (omitting logarithmic terms):

###### Theorem 1.

With defined by (3.1), let the condition number of problem (1.1) be . So long as , with step size , RandomShuffle achieves convergence rate:

$$\mathbb{E}\left[\|x_T - x^*\|^2\right] \le O\left(\frac{1}{T^2} + \frac{n^3}{T^3}\right).$$

We provide a proof sketch in Section 5, deferring the fairly involved technical details to Appendix A. In terms of sample complexity, Theorem 1 yields the following corollary:

###### Corollary 1.

Let be defined by (3.1). The sample complexity for RandomShuffle to achieve is no more than .

We observe that in the regime when $T$ gets large, our result matches [11]. But it provides more information when the number of epochs is not so large that the $n^3/T^3$ term can be neglected. This setting is clearly the most compelling to study. Formally, we recover the main result of [11] as the following:

###### Corollary 2.

As $T \to \infty$, RandomShuffle achieves the asymptotic convergence rate $O(1/T^2)$ when run with the proper step size schedule.

### 3.2 RandomShuffle for strongly convex problems

Next, we consider the more general case where each component function $f_i$ is convex and the sum $F$ is strongly convex. Surprisingly, one can easily adapt the methodology of the proof of Theorem 1 to this setting. (Intuitively, the change of the Hessian over the domain can raise challenges; however, our convergence rate here is quite similar to the quadratic case, with only a mild dependence on the Hessian Lipschitz constant.) To this end, our analysis requires one further assumption: each component function is second order differentiable and its Hessian satisfies the Lipschitz condition (2.2) with constant $L_H$.

Under these assumptions, we obtain the following result:

###### Theorem 2.

Define constant . So long as , with step size , RandomShuffle achieves convergence rate:

$$\mathbb{E}\left[\|x_T - x^*\|^2\right] \le O\left(\frac{1}{T^2} + \frac{n^3}{T^3}\right).$$

Except for an extra dependence on $L_H$ and a mildly different step size, this rate is essentially the same as that in the quadratic case. The proof of the result can be found in Appendix B. Due to the similar formulation, most of the consequences noted in Section 3.1 also hold in this general setting.

### 3.3 Understanding the dependence on n

Since the motivation of our convergence rate analysis is to show that RandomShuffle behaves better than Sgd, we would hope that our convergence bounds have a better dependence on $T$ than the $O(1/T)$ bound for Sgd. In an ideal situation, one may hope for a rate of the form $O(1/T^{1+\delta})$ with some $\delta > 0$. One intuitive criticism toward this goal is evident: if we allow $T < n$, then by setting $T = O(\sqrt{n})$, RandomShuffle is essentially the same as Sgd by the birthday paradox. Therefore, an $o(1/T)$ bound is unlikely to hold.

However, this argument is not rigorous when we require a positive number of epochs to be run (at least one round through all the data). To this end, we provide the following result indicating the impossibility of obtaining $o(1/T)$ even when $T \ge n$ is required.

###### Theorem 3.

Given the information of . Under the assumption of constant step sizes, no step size choice for RandomShuffle leads to a convergence rate for any , if we do not allow to appear in the bound.

The key idea to prove Theorem 3 is by constructing a special instance of problem (1.1). In particular, the following quadratic instance of (1.1) lays the foundation of our proof:

$$f_i(x) = \begin{cases} \frac{1}{2}(x - b)' A (x - b), & i \text{ odd}, \\ \frac{1}{2}(x + b)' A (x + b), & i \text{ even}. \end{cases} \tag{3.2}$$

Here $(\cdot)'$ denotes the transpose of a vector, $A$ is some positive definite matrix, and $b$ is some vector. Running RandomShuffle on (3.2) leads to a closed-form expression for RandomShuffle's error. Then by setting $T = n$ (i.e., only running RandomShuffle for one epoch) and assuming a convergence rate of $o(1/T)$, we deduce a contradiction by properly setting $A$ and $b$. The detailed proof can be found in Appendix C. We directly have the following corollary:
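To illustrate why instance (3.2) admits a closed-form error, here is a minimal sketch for the scalar case, assuming $A = a > 0$ and $b \in \mathbb{R}$ (a simplification made here, not the construction used in Appendix C). Each step reads $x_k = (1 - \gamma a)x_{k-1} + \gamma a b\, s_k$ with $s_k = +1$ for odd $i$ and $s_k = -1$ for even $i$, so unrolling one epoch gives $x_n = (1-\gamma a)^n x_0 + \gamma a b \sum_{k=1}^{n} (1-\gamma a)^{n-k} s_k$. The function names are hypothetical.

```python
import numpy as np

def signs_for(perm):
    """+1 for odd component index i, -1 for even i (perm holds i - 1)."""
    i = np.asarray(perm) + 1
    return np.where(i % 2 == 1, 1.0, -1.0)

def run_one_epoch(a, b, gamma, perm, x0=0.0):
    """Iterate x <- x - gamma * a * (x - s * b) in the order given by perm."""
    x = x0
    for s in signs_for(perm):
        x = x - gamma * a * (x - s * b)
    return x

def closed_form(a, b, gamma, perm, x0=0.0):
    """Unrolled recursion for the scalar case (derived above, an assumption)."""
    s = signs_for(perm)
    n = len(s)
    q = 1.0 - gamma * a
    weights = q ** np.arange(n - 1, -1, -1)  # q^(n-k) for k = 1..n
    return q**n * x0 + gamma * a * b * np.dot(weights, s)
```

With $n$ even the minimizer of the sum is $x^* = 0$, so the value returned for $x_0 = 0$ is exactly the one-epoch error, and it depends on the permutation only through the weighted sum of signs.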

###### Corollary 3.

Given the information of , under the assumption and constant step size, there is no step size choice that leads to a convergence rate for .

This result indicates that in order to achieve a better dependence on $T$ using constant step sizes, the bound should either: (i) depend on $n$; (ii) make some stronger assumptions on $T$ being large enough (at least exclude $T = n$); or (iii) leverage a more versatile step size schedule, which could potentially be hard to design and analyze.

Although Theorem 3 shows that one may not hope (under constant step sizes) for a better dependence on $T$ for RandomShuffle without an extra dependence on $n$, whether the current dependence on $n$ we have obtained is optimal still requires further discussion. In the special case $T = n$, numerical evidence has shown that RandomShuffle behaves at least as well as Sgd. However, our bound fails to even show that RandomShuffle converges in this setting. Therefore, it is reasonable to conjecture that a better dependence on $n$ exists. In the following section, we improve the dependence on $n$ under a specific setting. But whether a better dependence on $n$ can be achieved in general remains open. (Convergence rates with dependence on $n$ also appear in some variance reduction methods, see for instance [15, 7]. Sample complexity lower bounds have also been shown to depend on $n$ under similar settings, see e.g., [1].)

## 4 Sparse functions

In the literature on large-scale machine learning, sparsity is a common feature of data. When the data are sparse, each training data point has only a few non-zero features. Under such a setting, each iteration of Sgd only modifies a few dimensions of the decision variables. Some commonly occurring sparse problems include large-scale logistic regression, matrix completion, and graph cuts.

Sparse data provides a promising setting under which RandomShuffle might be powerful. Intuitively, when data are sparse, with-replacement sampling used by Sgd is likely to miss some decision variables, while RandomShuffle is guaranteed to update all possible decision variables in one epoch. In this section, we show some theoretical results justifying such intuition.

Formally, a sparse finite-sum problem assumes the form

$$F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x_{e_i}),$$

where $e_i$ ($1 \le i \le n$) denotes a small subset of $\{1, \dots, d\}$ and $x_{e_i}$ denotes the entries of the vector $x$ indexed by $e_i$. Define the set $E := \{e_i : 1 \le i \le n\}$. By representing each subset $e_i$ with a node, and considering edges $(e_i, e_j)$ for all $e_i \cap e_j \neq \emptyset$, we get a graph with $n$ nodes. Following the notation in [28], we consider the sparsity factor of the graph:

$$\rho := \frac{\max_{1 \le i \le n} \left|\{e_j \in E : e_i \cap e_j \neq \emptyset\}\right|}{n}. \tag{4.1}$$

One obvious fact is $1/n \le \rho \le 1$. The statistic (4.1) indicates how likely it is that two subsets of indices intersect, which reflects the sparsity of the problem. For a problem with strong sparsity, we may anticipate a relatively small value for $\rho$. We summarize our result with the following theorem:
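The sparsity factor (4.1) can be computed directly from the supports of the component functions. The sketch below assumes each support is given as a non-empty collection of coordinate indices and counts conflicts per component index (so a support always conflicts with itself); the function name `sparsity_factor` is hypothetical.

```python
def sparsity_factor(supports):
    """Largest number of supports that intersect a single support, divided by n."""
    sets = [frozenset(e) for e in supports]
    n = len(sets)
    max_conflicts = max(sum(1 for ej in sets if ei & ej) for ei in sets)
    return max_conflicts / n

# Four components over five coordinates; only the first two supports overlap.
example = [{0, 1}, {1, 2}, {3}, {4}]
rho = sparsity_factor(example)
```

In this example the first two supports share coordinate 1, so each of them intersects two supports (itself and the other), while the remaining supports intersect only themselves. Fully disjoint supports give the smallest possible value $1/n$, and supports that all share one coordinate give the value 1.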

###### Theorem 4.

Define constant . So long as , with step size RandomShuffle achieves convergence rate:

$$\mathbb{E}\left[\|x_T - x^*\|^2\right] \le O\left(\frac{1}{T^2} + \frac{\rho^2 n^3}{T^3}\right).$$

Compared with Theorem 2, the bound in Theorem 4 depends on the parameter $\rho$, so we can exploit sparsity to obtain a faster convergence rate. The key to proving Theorem 4 lies in constructing a tighter bound for the error term in the main recursion (see Section 5) by including a discount due to sparsity.

We end this section by noting the following simple corollary:

###### Corollary 4.

When , there is some constant only dependent on , , , , , such that as long as , for a proper step size, RandomShuffle achieves convergence rate

$$\mathbb{E}\left[\|x_T - x^*\|^2\right] \le O\left(\frac{1}{T^2}\right).$$

## 5 Proof sketch of Theorem 1

In this section we provide a proof sketch for Theorem 1. The central idea is to establish an inequality

 (5.1)

where and are the beginning and final points of the -th epoch, respectively, and the randomness is over the permutation of functions in epoch . The constant captures the speed of convergence for the linear convergence part, while and together bound the error introduced by randomness. The underlying motivation for the bound (5.1) is: when the latter two terms depend on the step size with order at least , then by expanding the recursion over all the epochs, and setting , we can obtain a convergence of .

By the definition of the RandomShuffle update and simple calculations, we have the following key equality for one epoch of RandomShuffle:

The idea behind this equality is to split the progress made by RandomShuffle in a given epoch into two parts: a part that behaves like full gradient descent ( and ), and a part that captures the effects of random sampling ( and ). In particular, for a permutation , denotes the gradient error of RandomShuffle for epoch , i.e.,

$$R^t = \sum_{i=1}^{n} \nabla f_{\sigma_t(i)}(x^t_{i-1}) - \sum_{i=1}^{n} \nabla f_{\sigma_t(i)}(x^t_0),$$

which is a random variable dependent on

. Thus, the terms and are also random variables that depend on , and require taking expectations. The main body of our analysis involves bounding each of these terms separately.

The term can be easily bounded by exploiting the strong convexity of , using a standard inequality (Theorem 2.1.11 in [23]), as follows

 (5.2)

The first term (gradient norm term) in (5.2) is used to dominate later emerging terms in our bounds on and , while the second term (distance term) in (5.2) will be absorbed into in (5.1).

A key step toward building (5.1) is to bound , where the expectation is over . However, it is not easy to directly bound this term with for some constant . Instead, we decompose this term further into three parts: (i) the first part depends on (which will be then captured by in (5.1)); (ii) the second part depends on (which will be then dominated by gradient norm term in ’s bound (5.2)); and (iii) the third part has an at least dependence on (which will be then jointly captured by and in (5.1)). Specifically, by introducing second-order information and somewhat involved analysis, we obtain the following bound for :

###### Lemma 1.

Over the randomness of the permutation, we have the inequality:

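$$-2\gamma\left\langle x^t_0 - x^*, \mathbb{E}[R^t]\right\rangle \le \cdots \tag{5.3}$$

$$\qquad\qquad + \gamma^3 \mu^{-1} n^2 (n-1)\|\Delta\|^2 + 2\mu^{-1}\gamma^5 L^4 G^2 n^5. \tag{5.4}$$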

Where with uniformly drawn from .

Since is the minimizer, we have an elegant bound on the second-order interaction term:

###### Lemma 2.

Define with uniformly drawn from , and is the minimizer of sum function, then

$$\|\Delta\| \le \frac{1}{n-1} L G.$$

We tackle by dominating it with the gradient norm term of ’s bound (5.2), and finally bound the second permutation dependent term using the following lemma.

###### Lemma 3.

For any possible permutation in the $t$-th epoch, we have the bound

$$\|R^t\| \le \frac{n(n-1)}{2}\gamma G L.$$

Using this bound, the term can be captured by in (5.1).
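The bound of Lemma 3 can be recovered from smoothness and bounded gradients alone. The following is a short sketch of that argument, assuming the constants $L$ (smoothness, (2.1)) and $G$ (gradient bound) from Section 3.1; it is a reconstruction for illustration and not necessarily the proof given in the appendix. Within epoch $t$, the iterate $x^t_{i-1}$ is reached from $x^t_0$ by $i-1$ steps, each of length at most $\gamma G$:

```latex
\begin{aligned}
\|x^t_{i-1} - x^t_0\|
  &= \gamma \Big\| \sum_{j=1}^{i-1} \nabla f_{\sigma_t(j)}(x^t_{j-1}) \Big\|
   \le \gamma (i-1) G, \\
\|R^t\|
  &\le \sum_{i=1}^{n} \big\| \nabla f_{\sigma_t(i)}(x^t_{i-1}) - \nabla f_{\sigma_t(i)}(x^t_0) \big\|
   \le \sum_{i=1}^{n} L \,\|x^t_{i-1} - x^t_0\| \\
  &\le \gamma G L \sum_{i=1}^{n} (i-1)
   = \frac{n(n-1)}{2}\,\gamma G L .
\end{aligned}
```

The bound holds for every permutation, which is why no expectation is needed in the statement of the lemma.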

Based on the above results, we get a recursive inequality of the form (5.1). Expanding the recursion and substituting into it the step-size choice ultimately leads to an bound of the form (see (A.17) in the Appendix for dependence on hidden constants). The detailed technical steps can be found in Appendix A.

## 6 Discussion of results

We discuss below our results in more detail, including their implications, strengths, and limitations.

#### Comparison with Sgd.

It is well-known that under strong convexity Sgd converges with a rate of $O(1/T)$ [26]. A direct comparison indicates the following fact: RandomShuffle is provably better than Sgd after $O(\sqrt{n})$ epochs. This is an acceptable number of epochs for even some of the largest data sets in the current machine learning literature. To our knowledge, this is the first result rigorously showing that RandomShuffle behaves better than Sgd within a reasonable number of epochs. To some extent, this result confirms the belief and observation that RandomShuffle is the "correct" choice in real life, at least when the number of epochs is comparable with $\sqrt{n}$.
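The crossover point can be read off from the two rates. A rough calculation, ignoring the hidden constants and logarithmic factors (which are assumptions of this sketch and would shift the threshold by constant factors), compares the RandomShuffle bound $1/T^2 + n^3/T^3$ with the Sgd bound $1/T$:

```latex
\begin{aligned}
\frac{1}{T^2} \le \frac{1}{T} \quad &\text{for all } T \ge 1, \\
\frac{n^3}{T^3} \le \frac{1}{T}
  \;\Longleftrightarrow\; T^2 \ge n^3
  \;\Longleftrightarrow\; T \ge n^{3/2}
  \;&\Longleftrightarrow\; \frac{T}{n} \ge \sqrt{n}.
\end{aligned}
```

So both terms fall below $1/T$ once the number of epochs $T/n$ exceeds roughly $\sqrt{n}$. For instance, with $n = 10^4$ components this threshold is on the order of $100$ epochs.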

#### Deterministic variant.

When the algorithm is run in a deterministic fashion, i.e., the functions are visited in a fixed order, a better convergence rate than that of Sgd can also be achieved as $T$ becomes large. For instance, a result in [10] translates into an $O(n^2/T^2)$ bound for the deterministic case. This directly implies the same bound for RandomShuffle, since random permutation always has the weaker worst case. But according to this bound, at least $n$ epochs are required for RandomShuffle to achieve an error smaller than Sgd, which is not a realistic number of epochs in most applications.

#### Comparison with Gd.

Another interesting viewpoint is by comparing RandomShuffle with Gradient Descent (Gd). One of the limitations of our result is that we do not show a regime where RandomShuffle can be better than Gd. By computing the average for each epoch and running exact Gd on (1.1), one can get a convergence rate of the form . This fact shows that our convergence rate for RandomShuffle is worse than Gd. This comes naturally from the epoch based recursion (5.1) in our proof methodology, since for one epoch the sum of the gradients is only shown to be no worse than a full gradient. It is true that Gd should behave better in long-term as the dependence on is negligible, and comparing with Gd is not the major goal for this paper. However, being worse than Gd even when is relatively small indicates that the dependence on probably can still be improved. It may be worth investigating whether RandomShuffle can be better than both Sgd and Gd in some regime. However, different techniques may be required.

#### Epochs required.

It is also a limitation that our bound only holds after a certain number of epochs. Moreover, this number of epochs is dependent on (e.g., epochs for the quadratic case). This limits the interest of our result to cases when the problem is not too ill-conditioned. Otherwise, such a number of epochs will be unrealistic by itself. We are currently not certain whether similar bounds can be proved when allowing to assume smaller values, or even after only one epoch.

#### Dependence on κ.

It should be noticed that can be large sometimes. Therefore, it may be informative to view our result in a -dependent form. In particular, we still assume , , are constant, but no longer . We use the bound and assume is constant. Since , we now have . Our results translate into -dependent convergence rates of (see inequalities (A.17) (E.13) in the Appendix). The corresponding -dependent sample complexity turns into for quadratic problems, and for strongly convex ones.

At first sight, the dependence on in the convergence rate may seem relatively high. However, it is important to notice that our sample complexity’s dependence on is actually better than what is known for Sgd. A convergence bound for Sgd has long been known [26], which translates into a , -dependent sample complexity in our notation. Although better dependence has been shown for (see e.g., [13]), no better dependence has been shown for as far as we know. Furthermore, according to [22], the lower bound to achieve for strongly convex using stochastic gradients is . Translating this into the sample complexity to achieve is likely to introduce another into the bound. Therefore, it is reasonable to believe that is the best sample complexity one can get for Sgd (which is worse than RandomShuffle), to achieve .

#### Sparse data setting.

Notably, in the sparse setting (with sparsity factor ), the proven convergence rate is strictly better than the rate of Sgd. This result follows the following intuition: when each dimension is only touched by several functions, letting the algorithm to visit every function would avoid missing certain dimensions. For larger , similar speedup can be observed. In fact, so long as we have , the proven bound is better off than Sgd. Such a result confirms the usage of RandomShuffle under sparse setting.

## 7 Extensions

In this section, we provide some further extensions before concluding with some open problems.

### 7.1 RandomShuffle for nonconvex optimization

The first extension that we discuss is to nonconvex finite sum problems. In particular, we study RandomShuffle applied to functions satisfying the Polyak-Łojasiewicz condition (also known as gradient dominated functions):

$$\frac{1}{2}\|\nabla F(x)\|^2 \ge \mu\left(F(x) - F^*\right), \quad \forall x.$$

Here $\mu > 0$ is some real number, and $F^*$ is the minimal function value of $F$. Strong convexity is a special case of this condition, with $\mu$ being the strong convexity parameter. One important implication of this condition is that every stationary point is a global minimum. However, the function $F$ can be non-convex under such a setting. Also, the condition does not imply a unique minimizer of the function.
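To make the condition concrete, the sketch below checks it numerically for the one-dimensional function $F(x) = x^2 + 3\sin^2(x)$, which has minimal value $F^* = 0$ at $x = 0$ and is non-convex. This function and the constant $\mu = 1/32$ are illustrative choices that do not come from the paper, and a grid check on a bounded interval is evidence rather than a proof.

```python
import numpy as np

def F(x):
    return x**2 + 3.0 * np.sin(x)**2

def dF(x):
    # derivative of x^2 + 3 sin^2(x)
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def d2F(x):
    # second derivative; negative values indicate non-convexity
    return 2.0 + 6.0 * np.cos(2.0 * x)

def check_pl_on_grid(mu, lo=-10.0, hi=10.0, num=20001, f_star=0.0):
    """Return the smallest PL slack and the smallest curvature on the grid."""
    xs = np.linspace(lo, hi, num)
    slack = 0.5 * dF(xs)**2 - mu * (F(xs) - f_star)
    return float(slack.min()), float(d2F(xs).min())

min_slack, min_curvature = check_pl_on_grid(mu=1.0 / 32.0)
```

A non-negative `min_slack` means the inequality holds at every grid point, while a negative `min_curvature` confirms that the function is not convex on the interval.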

This setting was proposed and analyzed in [25], where a linear convergence rate for Gd was shown. Later, many other optimization methods were proven efficient under this condition (see [24] for second order methods and [29] for variance reduced gradient methods). Notably, Sgd can be proven to converge with rate $O(1/T)$ under this setting (see appendix for a proof).

Assume that each component function $f_i$ is Lipschitz continuous, and that the average function $F$ satisfies the Polyak-Łojasiewicz condition with some constant $\mu$. We have the following extension of our previous result:

###### Theorem 5.

Under the Polyak-Łojasiewicz condition, define condition number . So long as , with step size , RandomShuffle achieves convergence rate:

$$\mathbb{E}\left[F(x_T) - F^*\right] \le O\left(\frac{1}{T^2} + \frac{n^3}{T^3}\right).$$

### 7.2 RandomShuffle for convex problems

An important extension of RandomShuffle is to the general (smooth) convex case without assuming strong convexity. There are no previous results on the convergence rate of RandomShuffle in this setting that show it to be faster than Sgd. The only result we are aware of is by Shamir [32], who shows RandomShuffle to be not worse than Sgd in the general (smooth) convex setting. We extend our results to the general convex case, and show a convergence rate that is possibly faster than Sgd, albeit only up to constant terms.

We take the viewpoint of gradients with errors, and denote the difference between component gradient and full gradient as the error:

$$\nabla F(x) - \nabla f_i(x) = e_i(x).$$

Different assumptions bounding the error term have been studied in the optimization literature. We assume that there is a constant $\delta$ that bounds the norm of the gradient error:

$$\|e_i(x)\| \le \delta, \quad \forall x.$$

Here $i$ is any index and $x$ is any point in the domain. Obviously, $\delta \le 2G$, with $G$ being the gradient norm bound as before. (Another common assumption is that the variance of the gradient, i.e., $\frac{1}{n}\sum_{i=1}^{n}\|e_i(x)\|^2$, is bounded. We made the stronger assumption here for the sake of a simpler analysis. However, the two assumptions differ by at most an $n$-dependent factor, due to the finite sum structure.)

###### Theorem 6.

Assume with uniformly drawn from , is an arbitrary minimizer of . Set stepsize

 γ=min⎧⎨⎩116nL,√DTn(∥Δ∥+LHLD2+2LHDG),(DTn2L2δ)13,(1Tn3L4)14⎫⎬⎭.

Let $\bar{x}$ be the average of the epoch-ending points of RandomShuffle. Then

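$$F(\bar{x}) - F(x^*) \le \frac{2D\sqrt{nD\left(\|\Delta\| + L_H L D^2 + 2 L_H D G\right)}}{\sqrt{T}} + O\!\left(\left(\frac{n}{T}\right)^{2/3}\delta^{1/3} + \left(\frac{n}{T}\right)^{3/4}\right).$$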
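The step size in Theorem 6 is a minimum over four candidate values, which is easy to misread in typeset form. The sketch below evaluates it, assuming the four terms read $1/(16nL)$, $\sqrt{D/(Tn(\|\Delta\| + L_H L D^2 + 2 L_H D G))}$, $(D/(Tn^2L^2\delta))^{1/3}$ and $(1/(Tn^3L^4))^{1/4}$. The function name and argument names are hypothetical, and treating a candidate with a zero denominator as infinite (that is, skipping it) is an assumption made here, not something the theorem states.

```python
import math


def theorem6_step_size(n, T, L, L_H, D, G, delta, Delta_norm):
    """Evaluate the step size gamma of Theorem 6 as a min over four candidates.

    Delta_norm stands for ||Delta|| and delta for the gradient-error bound.
    Candidates whose denominator is zero are skipped (assumed to be +infinity).
    """
    candidates = [1.0 / (16 * n * L)]

    # Second candidate: sqrt(D / (T n (||Delta|| + L_H L D^2 + 2 L_H D G))).
    c = Delta_norm + L_H * L * D**2 + 2 * L_H * D * G
    if c > 0:
        candidates.append(math.sqrt(D / (T * n * c)))

    # Third candidate: (D / (T n^2 L^2 delta))^(1/3).
    if delta > 0:
        candidates.append((D / (T * n**2 * L**2 * delta)) ** (1.0 / 3.0))

    # Fourth candidate: (1 / (T n^3 L^4))^(1/4).
    candidates.append((1.0 / (T * n**3 * L**4)) ** 0.25)

    return min(candidates)


# Quadratic components with no gradient error: only the first and last terms compete.
gamma_quadratic = theorem6_step_size(
    n=10, T=1000, L=1.0, L_H=0.0, D=1.0, G=1.0, delta=0.0, Delta_norm=0.0
)
print(gamma_quadratic)
```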

We make a few remarks on this result.

First, it is interesting to see what happens asymptotically. From this theorem we can observe three levels of possible asymptotic convergence rates for RandomShuffle (ignoring $n$): (1) in the most general situation, it converges as $O(1/\sqrt{T})$; (2) when the functions are quadratic (i.e., $L_H = 0$) and the variance vanishes locally (i.e., $\Delta = 0$), it converges as $O(1/T^{2/3})$; (3) when the functions are quadratic (i.e., $L_H = 0$) and the variance vanishes globally (i.e., $\delta = 0$), it converges as $O(1/T^{3/4})$.

Secondly, we should notice that there is a known convergence rate of for Sgd. Also, we can further bound with . Therefore, when is relatively small and quadratic functions (i.e., ), our bound translates into form of , with constant in front of possibly smaller than Sgd by constant in certain parameter space.

One obvious limitation of this result is: when globally there is no variance of gradients, it fails to recover the rate of Gd. This indicates the possibility of tighter bounds using more involved analysis. We leave this possibility (either improving upon the dependence on under existence of noise, or recovering when there is no noise) as an open question.

### 7.3 Vanishing variance

Our previous results show that RandomShuffle converges faster than Sgd after a certain number of epochs. However, one may ask whether faster convergence of RandomShuffle can be shown after only one epoch, or even within one epoch. In this section, we study a specialized class of strongly convex problems where RandomShuffle has a faster convergence rate than Sgd after an arbitrary number of iterations.

We build our example on a vanishing variance setting: $\nabla f_i(x^*) = 0$ for all $i$ at the optimal point $x^*$. Moulines and Bach [19] show that when $F$ is strongly convex, Sgd converges linearly in this setting. For the construction of our example, we assume a slightly stronger condition: each component function $f_i$ is strongly convex.

Given pairs of positive numbers such that , a dimension and a point , we define a valid problem as a dimensional finite sum function where each component is strongly convex and has Lipschitz continuous gradient, with some minimizing all functions at the same time (which is equivalent to vanishing gradient). Let be the set of all such problems, called valid problems below. For a problem , let random variable be the result of running RandomShuffle from initial point for iterations with step size on problem . Similarly, let be the result of running Sgd from initial point for iterations with step size on problem .

We have the following result on the worst-case convergence rate of RandomShuffle and Sgd:

###### Theorem 7.

Given pairs of positive numbers such that , a dimension , a point and an initial set . Let be the set of valid problems. For step size and any , there is

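$$\max_{P \in \mathcal{P},\ x_0 \in D_R(x^*)} \mathbb{E}\left[\|X_{RS}(T, x_0, \gamma, P) - x^*\|^2\right] \le \max_{P \in \mathcal{P},\ x_0 \in D_R(x^*)} \mathbb{E}\left[\|X_{SGD}(T, x_0, \gamma, P) - x^*\|^2\right].$$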

This theorem indicates that, in the setting above, RandomShuffle has a worst-case convergence rate at least as good as that of Sgd after an arbitrary number of iterations.
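A one-dimensional instance makes the vanishing variance setting concrete. The sketch below is an illustration on a single hypothetical problem, not the worst-case comparison of Theorem 7: it assumes components $f_i(x) = \tfrac{a_i}{2}(x - x^*)^2$ with curvatures $a_i > 0$ that share the minimizer $x^*$, and it only compares the two methods at epoch boundaries. Each step multiplies the error $x - x^*$ by $(1 - \gamma a_i)$, so both expected squared errors have closed forms and no sampling is needed.

```python
def random_shuffle_error_factor(curvatures, gamma, epochs):
    """Factor multiplying (x0 - x*)^2 after `epochs` passes of RandomShuffle.

    Every permutation visits each component once per epoch, so the factor
    is the product of (1 - gamma * a_i)^2 and does not depend on the order.
    """
    per_epoch = 1.0
    for a in curvatures:
        per_epoch *= (1.0 - gamma * a) ** 2
    return per_epoch ** epochs


def sgd_expected_error_factor(curvatures, gamma, epochs):
    """Factor multiplying (x0 - x*)^2 in expectation after n * epochs Sgd steps.

    Indices are drawn independently and uniformly, so the expectation of the
    product is the per-step mean of (1 - gamma * a_i)^2 raised to the step count.
    """
    n = len(curvatures)
    per_step = sum((1.0 - gamma * a) ** 2 for a in curvatures) / n
    return per_step ** (n * epochs)


curvatures = [1.0, 2.0, 4.0, 8.0]  # hypothetical curvatures a_i
gamma = 0.1
rs = random_shuffle_error_factor(curvatures, gamma, epochs=1)
sgd = sgd_expected_error_factor(curvatures, gamma, epochs=1)
print(rs, sgd)
```

In this toy instance the ordering of the two factors follows from the inequality between the geometric and arithmetic means of the numbers $(1 - \gamma a_i)^2$.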

## 8 Conclusion and open problems

A long-standing problem in the theory of stochastic gradient descent (Sgd) is to prove that RandomShuffle converges faster than the usual with-replacement Sgd. In this paper, we provide the first non-asymptotic convergence rate analysis for RandomShuffle. We show in particular that after a reasonable number of epochs, RandomShuffle behaves strictly better than Sgd under strong convexity and second-order differentiability. The introduction of a dependence on $n$ into the bound plays an important role in obtaining a better dependence on $T$. We further improve the dependence on $n$ for sparse data settings, showing RandomShuffle's advantage in such situations.

An important open problem remains: how (and to what extent) can we improve the bound such that RandomShuffle can be shown to be better than Sgd for smaller . A possible direction is to improve the dependence arising in our bounds, though different analysis techniques may be required. It is worth noting that for some special settings, this improvement can be achieved. (For example in the setting of Theorem 7, RandomShuffle is shown better than Sgd for any number of iterations.) However, showing RandomShuffle converges better in general, remains open.

## Appendix A Proof of Theorem 1

###### Proof.

Assume where is positive integer. Notate as the th iteration for th epoch. There is , , . Assume the permutation used in th epoch is . Define error term

For one epoch of RandomShuffle, we have the following inequality:

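$$
\begin{aligned}
\|x_n^t - x^*\|^2 &= \|x_0^t - x^*\|^2 - 2\gamma\Big\langle x_0^t - x^*, \sum_{i=1}^n \nabla f_{\sigma_t(i)}(x_{i-1}^t)\Big\rangle + \gamma^2\Big\|\sum_{i=1}^n \nabla f_{\sigma_t(i)}(x_{i-1}^t)\Big\|^2 \\
&= \|x_0^t - x^*\|^2 - 2n\gamma\big\langle x_0^t - x^*, \nabla F(x_0^t)\big\rangle - 2\gamma\big\langle x_0^t - x^*, R^t\big\rangle + \gamma^2\big\|n\nabla F(x_0^t) + R^t\big\|^2 \\
&\le \|x_0^t - x^*\|^2 - 2n\gamma\big\langle x_0^t - x^*, \nabla F(x_0^t)\big\rangle - 2\gamma\big\langle x_0^t - x^*, R^t\big\rangle + 2n^2\gamma^2\|\nabla F(x_0^t)\|^2 + 2\gamma^2\|R^t\|^2 \\
&\le \|x_0^t - x^*\|^2 - 2n\gamma\left[\frac{L\mu}{L+\mu}\|x_0^t - x^*\|^2 + \frac{1}{L+\mu}\|\nabla F(x_0^t)\|^2\right] - 2\gamma\big\langle x_0^t - x^*, R^t\big\rangle + 2n^2\gamma^2\|\nabla F(x_0^t)\|^2 + 2\gamma^2\|R^t\|^2 \\
&= \left(1 - 2n\gamma\frac{L\mu}{L+\mu}\right)\|x_0^t - x^*\|^2 - \left(2n\gamma\frac{1}{L+\mu} - 2n^2\gamma^2\right)\|\nabla F(x_0^t)\|^2 - 2\gamma\big\langle x_0^t - x^*, R^t\big\rangle + 2\gamma^2\|R^t\|^2,
\end{aligned}
\tag{A.1}
$$

where $R^t = \sum_{i=1}^n \nabla f_{\sigma_t(i)}(x_{i-1}^t) - \sum_{i=1}^n \nabla f_{\sigma_t(i)}(x_0^t)$ is the error term and we used $\sum_{i=1}^n \nabla f_{\sigma_t(i)}(x_0^t) = n\nabla F(x_0^t)$. The first inequality uses $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, and the second is due to Theorem 2.1.11 in [23].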

Taking the expectation of (A.1) over the randomness of the permutation $\sigma_t$, we have

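$$
\begin{aligned}
\mathbb{E}\left[\|x_n^t - x^*\|^2\right] \le{}& \left(1 - 2n\gamma\frac{L\mu}{L+\mu}\right)\|x_0^t - x^*\|^2 - \left(2n\gamma\frac{1}{L+\mu} - 2n^2\gamma^2\right)\|\nabla F(x_0^t)\|^2 \\
& - 2\gamma\big\langle x_0^t - x^*, \mathbb{E}[R^t]\big\rangle + 2\gamma^2\,\mathbb{E}\left[\|R^t\|^2\right].
\end{aligned}
\tag{A.2}
$$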

What remains is to bound the two terms that depend on $R^t$. First, we bound the norm of $R^t$:

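$$
\begin{aligned}
\|R^t\| &= \Big\|\sum_{i=1}^n \nabla f_{\sigma_t(i)}(x_{i-1}^t) - \sum_{i=1}^n \nabla f_{\sigma_t(i)}(x_0^t)\Big\| \\
&\le \sum_{i=1}^n \big\|\nabla f_{\sigma_t(i)}(x_{i-1}^t) - \nabla f_{\sigma_t(i)}(x_0^t)\big\| \\
&= \sum_{i=1}^n \Big\|\sum_{j=1}^{i-1}\big(\nabla f_{\sigma_t(i)}(x_j^t) - \nabla f_{\sigma_t(i)}(x_{j-1}^t)\big)\Big\| \\
&\le \sum_{i=1}^n \sum_{j=1}^{i-1} \big\|\nabla f_{\sigma_t(i)}(x_j^t) - \nabla f_{\sigma_t(i)}(x_{j-1}^t)\big\| \\
&\le \sum_{i=1}^n \sum_{j=1}^{i-1} L\,\|x_j^t - x_{j-1}^t\| \\
&= \sum_{i=1}^n \sum_{j=1}^{i-1} L\,\big\|-\gamma\nabla f_{\sigma_t(j)}(x_{j-1}^t)\big\| \\
&\le \sum_{i=1}^n \sum_{j=1}^{i-1} L\gamma G \\
&= \frac{n(n-1)}{2}\gamma G L,
\end{aligned}
$$

where the first and second inequalities follow from the triangle inequality for the vector norm, the third inequality is by the definition of $L$, and the fourth inequality is by the definition of $G$. By this result, we have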

 (A.3)
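The bound $\|R^t\| \le \frac{n(n-1)}{2}\gamma G L$ derived above can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration, not part of the proof: it uses hypothetical one-dimensional quadratics $f_i(x) = \tfrac{a_i}{2}(x - b_i)^2$, takes $L = \max_i a_i$, and takes $G$ to be the largest gradient magnitude actually encountered during the epoch, which is all the derivation needs.

```python
import random


def error_term_and_bound(a, b, x0, gamma, perm):
    """Run one epoch of RandomShuffle on f_i(x) = a_i / 2 * (x - b_i)^2.

    Returns (|R^t|, n(n-1)/2 * gamma * G * L), where R^t accumulates the
    difference between each gradient at the current iterate and the same
    component's gradient at the epoch start x0.
    """
    n = len(a)
    x = x0
    R = 0.0
    G = 0.0  # largest gradient magnitude seen along the trajectory
    for i in perm:
        g = a[i] * (x - b[i])              # gradient at the current iterate
        R += g - a[i] * (x0 - b[i])        # minus gradient at the epoch start
        G = max(G, abs(g))
        x -= gamma * g                     # incremental gradient step
    L = max(a)                             # Lipschitz constant of the gradients
    return abs(R), n * (n - 1) / 2 * gamma * G * L


rng = random.Random(0)
n = 6
a = [rng.uniform(0.5, 3.0) for _ in range(n)]    # hypothetical curvatures
b = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # hypothetical minimizers
perm = list(range(n))
rng.shuffle(perm)
r_norm, bound = error_term_and_bound(a, b, x0=0.7, gamma=0.05, perm=perm)
print(r_norm, bound)
```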

For the inner-product term, we need a more careful bound. Since the Hessian is constant for quadratic functions, we use $H_i$ to denote the Hessian matrix of function $f_i$. We begin with the following decomposition:

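$$
\begin{aligned}
R^t &= \sum_{i=1}^n \big[\nabla f_{\sigma_t(i)}(x_{i-1}^t) - \nabla f_{\sigma_t(i)}(x_0^t)\big] \\
&= \sum_{i=1}^n \big[H_{\sigma_t(i)}(x_{i-1}^t - x_0^t)\big] \\
&= \sum_{i=1}^n \Big\{H_{\sigma_t(i)} \sum_{j=1}^{i-1}\big[-\gamma\nabla f_{\sigma_t(j)}(x_{j-1}^t)\big]\Big\} \\
&= -\gamma\sum_{i=1}^n \Big[H_{\sigma_t(i)} \sum_{j=1}^{i-1}\nabla f_{\sigma_t(j)}(x_0^t)\Big] - \gamma\sum_{i=1}^n \Big\{H_{\sigma_t(i)} \sum_{j=1}^{i-1}\big[\nabla f_{\sigma_t(j)}(x_{j-1}^t) - \nabla f_{\sigma_t(j)}(x_0^t)\big]\Big\} \\
&= A^t + B^t.
\end{aligned}
\tag{A.4}
$$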

Here we define random variables

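$$
\begin{aligned}
A^t &= -\gamma\sum_{i=1}^n \Big[H_{\sigma_t(i)} \sum_{j=1}^{i-1}\nabla f_{\sigma_t(j)}(x_0^t)\Big], \\
B^t &= -\gamma\sum_{i=1}^n \Big\{H_{\sigma_t(i)} \sum_{j=1}^{i-1}\big[\nabla f_{\sigma_t(j)}(x_{j-1}^t) - \nabla f_{\sigma_t(j)}(x_0^t)\big]\Big\}.
\end{aligned}
$$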

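We have

$$\mathbb{E}[A^t] = -\frac{n(n-1)}{2}\gamma\,\mathbb{E}_{i\ne j}\big[H_i\nabla f_j(x_0^t)\big], \tag{A.5}$$

$$
\begin{aligned}
\|B^t\| &\le \gamma\sum_{i=1}^n \|H_{\sigma_t(i)}\| \sum_{j=1}^{i-1}\big\|\nabla f_{\sigma_t(j)}(x_{j-1}^t) - \nabla f_{\sigma_t(j)}(x_0^t)\big\| \\
&\le \gamma\sum_{i=1}^n L \sum_{j=1}^{i-1}(j-1)\gamma G L \\
&= \gamma^2 L^2 G \sum_{i=1}^n \frac{(i-1)(i-2)}{2} \\
&\le \frac{1}{2}\gamma^2 L^2 G n^3.
\end{aligned}
\tag{A.6}
$$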
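The identity (A.5) is a counting argument over permutations: each of the $n(n-1)/2$ position pairs contributes a uniformly random ordered pair of distinct indices. The sketch below checks this by brute-force enumeration for a small $n$. It is a one-dimensional illustration with hypothetical values, where `h[i]` stands in for the (scalar) Hessian $H_i$ and `g[j]` for the gradient $\nabla f_j(x_0^t)$.

```python
from itertools import permutations
import math


def expected_A_by_enumeration(h, g, gamma):
    """Average of A^t over all permutations, computed directly from its definition."""
    n = len(h)
    total = 0.0
    count = 0
    for sigma in permutations(range(n)):
        prefix = 0.0   # sum of gradients of components visited earlier in the epoch
        inner = 0.0
        for i in sigma:
            inner += h[i] * prefix
            prefix += g[i]
        total += -gamma * inner
        count += 1
    return total / count


def expected_A_closed_form(h, g, gamma):
    """Right-hand side of (A.5): -n(n-1)/2 * gamma * E_{i != j}[H_i grad f_j]."""
    n = len(h)
    products = [h[i] * g[j] for i in range(n) for j in range(n) if i != j]
    mean_over_distinct_pairs = sum(products) / len(products)
    return -n * (n - 1) / 2 * gamma * mean_over_distinct_pairs


h = [1.0, 2.0, 3.0, 4.0]      # hypothetical scalar Hessians
g = [0.5, -1.0, 2.0, 0.25]    # hypothetical gradients at the epoch start
lhs = expected_A_by_enumeration(h, g, gamma=0.1)
rhs = expected_A_closed_form(h, g, gamma=0.1)
print(lhs, rhs, math.isclose(lhs, rhs, rel_tol=1e-9))
```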

Using (A.4) and (A.5), we can decompose the inner product of $x_0^t - x^*$ and $\mathbb{E}[R^t]$ into:

 −2γ⟨x