1 Introduction
Stochastic gradient descent (SGD) is a popular optimization method for cost functions of the finite-sum form:
(1.1) $F(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$.
More specifically, instead of computing the full gradient at each iteration, SGD computes a cheaper approximation of it by computing $\nabla f_i(x)$, where the index $i$ is randomly sampled from $\{1, \dots, n\}$. In sampling such an index at each iteration, one can naturally think of the following two options: (i) with-replacement and (ii) without-replacement sampling. Interestingly, without-replacement SGD (also known as random reshuffling or random shuffling) has been empirically shown to outperform with-replacement SGD [bottou2009curiously, bottou2012stochastic]. However, the traditional analysis of SGD only covers the with-replacement version, raising the question of whether one can capture this phenomenon theoretically.
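The two sampling schemes differ only in how the component index is drawn at each step. The following minimal sketch (function names are ours, purely illustrative) contrasts them: with-replacement sampling draws an index independently at every iteration, while without-replacement sampling draws a fresh permutation for each epoch.

```python
import random

def with_replacement_indices(n, epochs, rng):
    # Draw each index uniformly and independently at every iteration.
    return [rng.randrange(n) for _ in range(n * epochs)]

def without_replacement_indices(n, epochs, rng):
    # Draw a fresh random permutation per epoch; every component is
    # visited exactly once per pass (random reshuffling).
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order

rng = random.Random(0)
idx = without_replacement_indices(5, 3, rng)
# Each epoch of length n touches every component exactly once.
for e in range(3):
    assert sorted(idx[5 * e : 5 * (e + 1)]) == list(range(5))
```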
The first contribution is established by Gürbüzbalaban, Ozdaglar, and Parrilo [gurbuzbalaban2015random] for the case where the cost function is strongly convex and each component is quadratic and smooth. Under this setting, they prove an asymptotic convergence rate of $\mathcal{O}(1/K^2)$ for $K$ epochs (the number of entire passes through the component indices $1, \dots, n$) when $n$ is treated as a constant; their proof is based on an asymptotic version of Chung’s lemma [fabian1967stochastic, Lemma 4.2]. This is indeed an asymptotic improvement over the $\mathcal{O}(1/K)$ convergence rate achieved by with-replacement SGD. However, in modern applications $n$ cannot be treated as a constant, and it is important to characterize the non-asymptotic behavior of the algorithm. For instance, in machine learning applications, $n$ is typically equal to the number of data points in a training set, and the algorithm is typically run for a few epochs and then terminated. Hence, it is important to investigate the dependence on $n$ as well as non-asymptotic convergence rates. Scrutinizing the analysis of [gurbuzbalaban2015random, (58)], one can actually deduce a sharper asymptotic convergence rate, yet the exact dependence on $n$, as well as a non-asymptotic version of the analysis, remains open; the key ingredient for making the analysis in [gurbuzbalaban2015random] non-asymptotic is a non-asymptotic version of Chung’s lemma [chung1954stochastic, Lemma 1], but as pointed out by Fabian [fabian1967stochastic, discussion above Lemma 4.2], the original proof of that result has some errors.
Table 1. Comparison of convergence rate bounds for without-replacement SGD with $n$ components and $K$ epochs; $\kappa$ denotes the condition number and $\tilde{\mathcal{O}}$ hides polylogarithmic factors.

Sum of quadratics:

Upper bound in [gurbuzbalaban2015random]: $\mathcal{O}\big(\tfrac{1}{K^2}\big)$ (asymptotic result)
Upper bound in [haochen2018random]: $\tilde{\mathcal{O}}\big(\tfrac{1}{(nK)^2} + \tfrac{1}{K^3}\big)$ (requires $K \gtrsim \mathrm{poly}(\kappa)$)
Upper bound in [rajput2020closing]: $\tilde{\mathcal{O}}\big(\tfrac{1}{(nK)^2} + \tfrac{1}{nK^3}\big)$ (requires $K \gtrsim \mathrm{poly}(\kappa)$)
Lower bound in [safran2019good]: $\Omega\big(\tfrac{1}{(nK)^2} + \tfrac{1}{nK^3}\big)$ (constant step size)
Our upper bound: $\mathcal{O}\big(\tfrac{1}{(nK)^2} + \tfrac{1}{nK^3}\big)$ (all $K$)

Non-quadratic strongly convex:

Upper bound in [nagaraj2019sgd]: $\tilde{\mathcal{O}}\big(\tfrac{1}{nK^2}\big)$ (requires $K \gtrsim \kappa^2$)
Lower bound in [rajput2020closing]: $\Omega\big(\tfrac{1}{nK^2}\big)$ (constant step size)
Our upper bound: $\mathcal{O}\big(\tfrac{1}{nK^2}\big)$ (all $K$)
Recently, a few subsequent efforts have been made to characterize non-asymptotic convergence rates in terms of both $n$ and $K$. HaoChen and Sra [haochen2018random] develop the first non-asymptotic convergence rate of $\tilde{\mathcal{O}}\big(\tfrac{1}{(nK)^2} + \tfrac{1}{K^3}\big)$ for the same setting as [gurbuzbalaban2015random], under a condition requiring the number of epochs $K$ to be large relative to the condition number $\kappa$. However, compared with the baseline rate of SGD, i.e., $\mathcal{O}\big(\tfrac{1}{nK}\big)$, this convergence rate becomes an improvement only after roughly $\sqrt{n}$ epochs. This result is strengthened in follow-up works that demonstrate the superiority of without-replacement SGD:

The work by Nagaraj, Jain, and Netrapalli [nagaraj2019sgd] considers a more general setting where the components $f_i$ no longer have to be quadratic, but just convex and smooth. Under this setting, they introduce clever coupling arguments to prove a non-asymptotic convergence rate of $\tilde{\mathcal{O}}\big(\tfrac{1}{nK^2}\big)$ under the condition $K \gtrsim \kappa^2$, which is more stringent than that of [haochen2018random]. Note that this rate is better than the baseline of $\mathcal{O}\big(\tfrac{1}{nK}\big)$ as soon as the technical condition is fulfilled. This upper bound is shown to be asymptotically tight up to polylogarithmic factors by Rajput, Gupta, and Papailiopoulos [rajput2020closing]. See Table 1 for comparisons.

The convergence rate with a better dependency on $n$ for the strongly convex setting given by [nagaraj2019sgd] (at the cost of a severer requirement on $K$) has motivated researchers to revisit the sum-of-quadratics case of [gurbuzbalaban2015random] and obtain a convergence rate with a better dependency on $n$ than that of [haochen2018random]. The first set of results in this direction is given by Safran and Shamir [safran2019good], where an asymptotic lower bound of $\Omega\big(\tfrac{1}{(nK)^2} + \tfrac{1}{nK^3}\big)$ is developed. They also establish a matching upper bound for the one-dimensional case up to polylogarithmic factors, evidencing that their lower bound is likely to have the correct dependency on $n$. The question of the correct dependency is settled by Rajput, Gupta, and Papailiopoulos [rajput2020closing], where a non-asymptotic convergence rate of $\tilde{\mathcal{O}}\big(\tfrac{1}{(nK)^2} + \tfrac{1}{nK^3}\big)$ is established under a condition on $K$, building on the coupling arguments in [nagaraj2019sgd]. See Table 1 for comparisons.
Despite such noticeable development over the years, there are key limitations shared among the existing non-asymptotic results (see Section 3 for precise details):

All the results place requirements on the number of epochs of the form $K \gtrsim \kappa^{c}$ for some constant $c > 0$. (One notable exception is [nagaraj2019sgd, Theorem 2]; however, that result only proves a rate that is not an improvement over the baseline rate of with-replacement SGD.)

All the non-asymptotic results have extra polylogarithmic terms, raising the question of whether one can remove them with an improved analysis.
In this paper, we aim to overcome these limitations.
1.1 Summary of our main results
We establish tight convergence rates for without-replacement SGD (see Table 1). The key to obtaining tight convergence rates is to consider iteration-dependent step sizes during the first epoch and a constant step size within the $k$-th epoch for each $k \geq 2$. Our analysis builds on the per-iteration/per-epoch progress bounds developed in the prior art [nagaraj2019sgd, rajput2020closing] (see Section 2). The main distinction lies in turning those progress bounds into global convergence rate bounds. One such tool for obtaining a non-asymptotic convergence rate is a version of Chung’s lemma [chung1954stochastic, Lemma 1], developed in the stochastic approximation literature. Unfortunately, the original proof has some errors, as pointed out by Fabian [fabian1967stochastic, discussion above Lemma 4.2], and even assuming its correctness, it turns out that the lemma is not sufficient for obtaining the desired convergence rate bound (see Section 5). To overcome this difficulty, in Section 6 we introduce a variant of Chung’s lemma (Lemma 2) that can handle the case where there are two asymptotic parameters $n$ and $K$; this lemma may be of independent interest. Our approach removes the epoch requirements of the form $K \gtrsim \kappa^{c}$ as well as the extra polylogarithmic terms in the convergence rates.
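The shape of the two-phase schedule described above can be sketched as follows. All constants here ($C$, the index shift `i0`, and the exact scalings) are placeholders of ours, not the paper’s: steps vary within the first epoch and are constant within each later epoch while decaying across epochs.

```python
# Sketch of a two-phase step-size schedule: iteration-varying during the
# first epoch, constant within each epoch k >= 2.  Constants are placeholders.
def step_size(k, i, n, mu, C=2.0, i0=10):
    if k == 1:
        return C / (mu * (i + i0))        # varies within the first epoch
    return C / (mu * n * k)               # constant within epoch k >= 2

n, mu = 100, 0.5
first = [step_size(1, i, n, mu) for i in range(1, n + 1)]
assert all(s > t for s, t in zip(first, first[1:]))        # decreasing in epoch 1
assert step_size(2, 1, n, mu) == step_size(2, n, n, mu)    # constant within an epoch
assert step_size(3, 1, n, mu) < step_size(2, 1, n, mu)     # decays across epochs
```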
1.2 Other related work
Apart from the works mentioned above, there are a few other theoretical studies of without-replacement SGD, albeit with different objectives. Shamir [shamir2016without] demonstrates that without-replacement SGD is not worse than with-replacement SGD. His proof techniques use tools from transductive learning theory, and as a result, his results only cover the first epoch and the case where the cost function is a generalized linear function. As for another study, a recent work by Nguyen et al. [nguyen2020unified] studies some nonconvex settings. For the strongly convex case, they also obtain a convergence guarantee under some weaker assumptions (for instance, the components $f_i$ are not required to be convex). However, this rate does not beat the baseline rate of with-replacement SGD, i.e., $\mathcal{O}\big(\tfrac{1}{nK}\big)$.
1.3 Problem setup and notation
Given the cost function (1.1) of the finite-sum form, we consider the following optimization problem:
(1.2) $\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$.
We call $F$ the cost function and the $f_i$’s the component functions. Also, let $x^\star$ be the optimum solution of (1.2).
In solving the above optimization problem (1.2), we consider the without-replacement version of SGD with the initial iterate $x_1^1$. For $k \geq 1$, the $k$-th epoch (pass) is executed by first randomly shuffling the component functions with a permutation $\sigma_k$ on $\{1, \dots, n\}$ and then going through each of them as
(1.3) $x_{i+1}^{k} = x_i^{k} - \gamma_i^{k} \nabla f_{\sigma_k(i)}(x_i^{k})$ for $i = 1, \dots, n$,
where $x_1^{k+1} = x_{n+1}^{k}$ for all $k \geq 1$, and $\gamma_i^{k}$ is the step size for the $i$-th iteration of the $k$-th epoch. For simplicity, we denote the output of the $k$-th epoch by $y_k$, i.e., $y_k := x_{n+1}^{k}$.
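A runnable sketch of one epoch of the shuffled update, together with a tiny one-dimensional least-squares instance; the setup, names, and step-size constants are ours, for illustration only.

```python
import numpy as np

def shuffled_sgd_epoch(x, grads, step_sizes, rng):
    # One epoch of without-replacement SGD in the spirit of (1.3): draw a
    # fresh permutation, then take one step per component in that order.
    n = len(grads)
    perm = rng.permutation(n)
    for i in range(n):
        x = x - step_sizes[i] * grads[perm[i]](x)
    return x  # the epoch's output

# Tiny demo: F(x) = (1/n) * sum_i 0.5 * (x - a_i)^2, minimized at mean(a).
a = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, ai=ai: x - ai for ai in a]
rng = np.random.default_rng(0)
x = 0.0
for k in range(1, 301):
    # constant step within the epoch, decaying across epochs
    x = shuffled_sgd_epoch(x, grads, [0.5 / k] * len(a), rng)
assert abs(x - a.mean()) < 0.05
```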
Lastly, we provide formal definitions of the regularity assumptions for functions. For the definitions, let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function and let $G$, $L$, and $\mu$ be some positive numbers.
Definition 1.
We say $f$ is $G$-Lipschitz if $|f(x) - f(y)| \leq G \|x - y\|$ for all $x, y$.
Definition 2.
We say $f$ is $L$-smooth if $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for any $x, y$.
Definition 3.
We say $f$ is $\mu$-strongly convex if $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2$ for any $x, y$.
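These regularity notions are easy to spot-check numerically. A sketch for a quadratic $f(x) = \tfrac{1}{2} x^\top A x$ with $A$ symmetric positive definite, which is $L$-smooth with $L = \lambda_{\max}(A)$ and $\mu$-strongly convex with $\mu = \lambda_{\min}(A)$ (note such a quadratic is not globally Lipschitz, so the Lipschitz property can only hold for it on a bounded region); the concrete matrix is ours.

```python
import numpy as np

# Spot-check smoothness and strong convexity for f(x) = 0.5 * x^T A x.
rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
A = B @ B.T + np.eye(3)                      # symmetric positive definite
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
eigs = np.linalg.eigvalsh(A)                 # ascending eigenvalues
mu, L = eigs[0], eigs[-1]

for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    # L-smoothness: the gradient is L-Lipschitz.
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # mu-strong convexity: quadratic lower bound on f.
    assert f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * mu * np.linalg.norm(y - x) ** 2 - 1e-9
```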
Starting from the next section, we first focus on the case where the cost function $F$ is strongly convex and the components $f_i$ are convex, Lipschitz, and smooth. Later, in Section 6.2, we will demonstrate how our techniques can be extended to obtain tight convergence rates for the case where the cost function is additionally assumed to be quadratic.
2 Preliminaries: existing per-iteration/per-epoch bounds
We first need to quantify the progress made by the algorithm over each iteration. For without-replacement SGD, there are two different types of analyses:

Per-iteration analysis, where one characterizes the progress made at each iteration.

Per-epoch analysis, where one characterizes the aggregate progress made over one epoch.
For the per-iteration analysis, Nagaraj, Jain, and Netrapalli [nagaraj2019sgd] develop coupling arguments to prove that the progress made by without-replacement SGD is not worse than that of with-replacement SGD. In particular, their coupling arguments demonstrate the closeness in expectation between the iterates of without- and with-replacement SGD. The following is a consequence of their coupling argument:
Proposition 1 (Per-iteration analysis [nagaraj2019sgd, implicit in Section A.1]).
Assume that each component function $f_i$ is convex, $G$-Lipschitz, and $L$-smooth, and that the cost function $F$ is $\mu$-strongly convex. Then, for any sufficiently small step size $\gamma_i^k$ for the $i$-th iteration of the $k$-th epoch, the following bound holds between adjacent iterates:
(2.1) 
where the expectation is taken over the randomness within the $k$-th epoch.
However, with the above analysis, one can only obtain results comparable to with-replacement SGD, as manifested in [nagaraj2019sgd, Theorem 2]. In order to characterize better progress, one needs to characterize the aggregate progress made over one epoch as a whole. A nice property of without-replacement SGD, when considered over one epoch, is the following observation due to Nedić and Bertsekas [nedic2001convergence, Chapter 2]. For each epoch, say the $k$-th, assuming that the iterates stay close to the initial iterate $x_1^k$ of the epoch, the aggregate update direction will closely approximate the full gradient at $x_1^k$, i.e.,
(2.2) $\sum_{i=1}^{n} \nabla f_{\sigma_k(i)}(x_i^{k}) \approx \sum_{i=1}^{n} \nabla f_i(x_1^{k}) = n \nabla F(x_1^{k})$.
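This observation is easy to see numerically on a least-squares toy problem (the setup below is entirely ours): with a small step size, the within-epoch iterates barely drift from the epoch’s starting point, so the sum of the component gradients seen during the epoch approximates $n$ times the full gradient there, regardless of the visiting order.

```python
import numpy as np

# Numerical illustration of the observation (2.2) on a toy problem.
rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]   # gradient of 0.5*(a_i.x - b_i)^2
full_grad = lambda x: sum(grad_i(i, x) for i in range(n)) / n

x1 = rng.standard_normal(d)
gamma = 1e-5                                     # small step => little drift
x, agg = x1.copy(), np.zeros(d)
for i in rng.permutation(n):                     # one shuffled epoch
    g = grad_i(i, x)
    agg += g
    x = x - gamma * g
# The aggregate direction (divided by n) is close to the full gradient at x1.
assert np.linalg.norm(agg / n - full_grad(x1)) < 0.05
```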
Based on this observation, together with the coupling arguments, Nagaraj, Jain, and Netrapalli [nagaraj2019sgd] obtain the following improved bound for one epoch as a whole:
Proposition 2 (Per-epoch analysis [nagaraj2019sgd, implicit in Section 5.1]).
Under the same setting as Proposition 1, let $\gamma^k$ be the step size for the $k$-th epoch, i.e., $\gamma_i^k = \gamma^k$ for all $i$. Then, the following bound holds between the outputs $y_{k-1}$ and $y_k$ of the $(k-1)$-th and $k$-th epochs:
(2.3) 
where the expectation is taken over the randomness within the $k$-th epoch.
Having these per-iteration/per-epoch progress bounds, the final ingredient of the non-asymptotic convergence rate analysis is to turn them into across-epoch global convergence bounds.
3 Limitations of the previous approach
In this section, we explain the approach used in the previous works [haochen2018random, nagaraj2019sgd, rajput2020closing] to obtain non-asymptotic bounds and illustrate its limitations. To turn the per-iteration/per-epoch progress bounds (2.1) and (2.3) into non-asymptotic convergence rates, the previous works take the simple approach of adopting constant step sizes. For instance, we illustrate the approach in [nagaraj2019sgd] as follows:
Illustration of previous approach: In Proposition 2, let us choose a constant step size $\gamma^k = \gamma$ for all epochs. One can then disregard the second term in the upper bound (2.3) as long as $\gamma$ is sufficiently small. Suppose that we choose $\gamma$ small enough that this holds. Then the per-epoch bound (2.3) becomes:
(3.1) 
Now, recursively applying (3.1) for $k = 1, \dots, K$, we obtain the following bound:
Having established this, they choose $\gamma$ appropriately (logarithmically larger than $\tfrac{1}{\mu n K}$) to obtain:
One can easily see that the smallness assumption on $\gamma$ is satisfied whenever $K$ is sufficiently large. ∎
The limitations of the constant step size approach are manifested in the above illustration: (i) it incurs extra polylogarithmic terms in the convergence rate bound, and (ii) it requires the number of epochs to be sufficiently large. Indeed, all previous works share these limitations. Having noticed the limitations, one might then wonder whether one can obtain better results by abandoning the constant step size.
4 Chung’s lemma: analytic tools for varying step sizes
As an effort to overcome the limitations, let us allow step sizes to vary across iterations or epochs. Let us first consider the per-iteration progress bound in Proposition 1. Since Proposition 1 works for any iteration, one can disregard the epoch structure and simply denote by $x_t$ the $t$-th iterate and by $\gamma_t$ the step size used for the $t$-th iteration. Choosing step sizes inversely proportional to $t$, shifted by an initial index chosen to ensure the step sizes are admissible, the per-iteration bound (2.1) becomes:
(4.1) 
In fact, for bounds of the type (4.1), there are suitable tools for obtaining convergence rates: versions of Chung’s lemma [chung1954stochastic], developed in the stochastic approximation literature. Among the various versions of Chung’s lemma, there is one non-asymptotic version [chung1954stochastic, Lemma 1]:
Lemma 1 (Non-asymptotic Chung’s lemma).
Let $(b_t)_{t \geq 1}$ be a sequence of positive real numbers. Suppose that there exist an initial index $t_0$ and real numbers $c$, $c_1$, and $p$ with $c > p > 0$ such that $(b_t)$ satisfies the following inequality:
(4.2) $b_{t+1} \leq \left(1 - \frac{c}{t}\right) b_t + \frac{c_1}{t^{p+1}}$ for all $t \geq t_0$.
Then, for any $t \geq t_0$ we have the following bound:
(4.3)  
(4.4) 
Proof.
Unfortunately, the original “proof” contains some errors, as pointed out by Fabian [fabian1967stochastic, discussion above Lemma 4.2]. We are able to correct the original proof; for this, see Section 7. ∎
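The behavior the lemma predicts can be confirmed numerically. In the sketch below the constants are arbitrary illustrative choices of ours (with the initial index beyond $c$ so that every contraction factor is positive); for the recursion taken with equality, $t^p b_t$ converges to $c_1/(c-p)$, matching the $\mathcal{O}(t^{-p})$ decay the lemma asserts.

```python
# Numerical sanity check of the Chung-type recursion:
#   b_{t+1} <= (1 - c/t) * b_t + c1 / t**(p+1),   with c > p > 0,
# which forces b_t = O(t**(-p)).
c, c1, p = 3.0, 2.0, 1.0
t0 = 4                      # start beyond c so every factor 1 - c/t is positive
b = 1.0
for t in range(t0, 100_000):
    b = (1.0 - c / t) * b + c1 / t ** (p + 1)
# Limit of t**p * b_t for the equality recursion is c1 / (c - p) = 1 here.
assert abs(100_000 ** p * b - c1 / (c - p)) < 0.05
```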
Let us apply Lemma 1 to (4.1) as a warm-up. From (4.1), one can read off admissible choices of the constants in Lemma 1. Hence, we obtain:
Corollary 1.
Under the setting of Proposition 1, let $C$ be a constant, and consider the step sizes chosen as in (4.1). Then the following convergence rate holds for any iteration index:
(4.5) 
5 An illustrative failed attempt using Chung’s lemma
Let us apply Lemma 1 to Proposition 2. For illustrative purposes, consider an ideal situation where, instead of the actual progress bound (2.3), a nice epoch progress bound of the form (3.1) holds:
(5.1) 
Following the same principle as in the previous section, let us take step sizes decaying with the epoch index, scaled by some constant. On the other hand, to make things simpler, let us assume that one can take the initial index to be small. Plugging this step size into (5.1), we obtain a bound of the Chung form for some constants:
which then yields the following non-asymptotic bound due to Lemma 1:
(5.2) 
Although the last two terms in (5.2) are what we desire (see Table 1), the first term is undesirable. Even if we choose the constant in the step size large, this bound still contains a term that is an obstacle when one tries to show superiority over the baseline of with-replacement SGD; the resulting rate beats the baseline only when the number of epochs is sufficiently large relative to $n$. Therefore, for the target convergence bound, one needs other versions of Lemma 1.
6 A variant of Chung’s lemma and tight convergence rates
As we have seen in the previous section, Chung’s lemma is not enough to capture the desired convergence rate. In this section, to capture the right order in both $n$ and $K$, we develop a variant of Chung’s lemma.
Lemma 2.
Let $n$ be an integer, and let $(b_k)$ be a sequence of positive real numbers. Suppose that there exist an initial index and real constants such that the following are satisfied:
(6.1)  
(6.2) 
Then, for any $K$, we have the following bound for $b_K$:
(6.3) 
6.1 Tight convergence rate for strongly convex costs
Now we use Lemma 2 to obtain a tight convergence rate. Let $C$ be an arbitrarily chosen constant. For the first epoch, we take an iteration-varying step size, with the initial index chosen to ensure the step sizes are admissible. Then, similarly to Corollary 1, yet this time by using the bound (4.3) in Lemma 1, one can derive the following bound:
(6.4) 
where , i.e., .
Next, let us establish bounds of the form (6.2) for the $k$-th epoch for $k \geq 2$. From the second epoch on, we use the same step size within each epoch. More specifically, for the $k$-th epoch we choose a constant step size. Using a similar argument to the one used to obtain (3.1) in Section 3, Proposition 2 yields the following bound for $k \geq 2$:
(6.5) 
For the remaining range, recursively applying Proposition 1 implies:
(6.6) 
Therefore, combining (6.5) and (6.6), we obtain the following bound, which holds for any $k \geq 2$:
(6.7) 
Let us modify the coefficient of the last term in (6.7) so that it fits into the form of (6.2) in Lemma 2. This coefficient can be rewritten as
which is then upper bounded by the desired quantity. Thus, (6.7) can be rewritten as:
(6.8) 
Now applying Lemma 2 with (6.4) and (6.8) implies the following result:
Theorem 1 (Strongly convex costs).
Assume that each component function $f_i$ is convex, $G$-Lipschitz, and $L$-smooth, and that the cost function $F$ is $\mu$-strongly convex. For any constant $C$, consider the step sizes chosen as above: iteration-varying during the first epoch and constant within each subsequent epoch. Then, the following convergence bound holds for any number of epochs $K$:
(6.9) 
where , , and .
6.2 Tight convergence rate for quadratic costs
Similarly, yet using an improved per-epoch progress bound developed in [rajput2020closing], one can prove the following better convergence rate for the case where the cost function $F$ is additionally assumed to be quadratic. Note that this case is slightly more general than the setting considered in [gurbuzbalaban2015random, haochen2018random], where each component $f_i$ is assumed to be quadratic. The details are deferred to Appendix A.
Theorem 2 (Quadratic costs).
Under the setting of Theorem 1, we additionally assume that the cost function $F$ is quadratic. For any constant $C$, consider the step sizes chosen as in Theorem 1 (with the constants adjusted accordingly). Then, the following convergence bound holds for any number of epochs $K$:
(6.10) 
where , , , and .
7 Proofs of the versions of Chung’s lemma (Lemmas 1 and 2)
We begin with an elementary fact that we will use throughout:
Proposition 3 (Integral approximation; see e.g. [lehman2010mathematics, Theorem 14.3]).
Let $f$ be a nondecreasing continuous function. Then, for any integers $a \leq b$, $\int_{a-1}^{b} f(t)\,dt \leq \sum_{t=a}^{b} f(t) \leq \int_{a}^{b+1} f(t)\,dt$. Similarly, if $f$ is nonincreasing, then for any integers $a \leq b$, $\int_{a}^{b+1} f(t)\,dt \leq \sum_{t=a}^{b} f(t) \leq \int_{a-1}^{b} f(t)\,dt$.
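The proposition is easy to confirm numerically. A quick check for one nondecreasing function (the bounds in the comment are the standard integral-comparison bounds we take the proposition to assert; the concrete choices of $f$, $a$, $b$ are ours):

```python
# Check the integral approximation for the nondecreasing f(t) = t**2, whose
# antiderivative is F(t) = t**3 / 3:
#   integral_{a-1}^{b} f  <=  sum_{t=a}^{b} f(t)  <=  integral_{a}^{b+1} f.
f = lambda t: t ** 2
F = lambda t: t ** 3 / 3.0
a, b = 3, 50
s = sum(f(t) for t in range(a, b + 1))
assert F(b) - F(a - 1) <= s <= F(b + 1) - F(a)
```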
We first prove Lemma 1, thereby providing a correct proof of the non-asymptotic Chung’s lemma.
7.1 A correct proof of Chung’s lemma (Proof of Lemma 1)
For simplicity, let us define the following quantities for each index at or beyond the initial index:
Using these notations, the recursive relation (4.2) becomes:
(7.1) 
After recursively applying (7.1), one obtains the following bound:
(7.2) 
Now let us upper and lower bound the product of the contraction factors. Note that
and hence Proposition 3 implies
(7.3) 
Therefore, we have
Applying Proposition 3 to the above and using the explicit antiderivative, we obtain the following upper bounds:
Combining all three cases, we conclude:
Plugging this back into (7.2) and using (7.3) to upper bound the first term, we obtain (4.3). Using (7.3) once again to upper bound the coefficient of the last term, we obtain (4.4). This completes the proof.
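The treatment of the product of contraction factors can also be checked numerically. Since $\log(1-x) \leq -x$ and $\sum_{t=t_0}^{T} 1/t \geq \log\frac{T+1}{t_0}$ by integral comparison, the product is at most $(t_0/(T+1))^{c}$; the constants below are arbitrary choices of ours with $t_0 > c$ so that every factor is positive.

```python
# Numerical check: prod_{t=t0}^{T} (1 - c/t) <= (t0 / (T+1))**c.
c, t0, T = 1.5, 10, 5000
prod = 1.0
for t in range(t0, T + 1):
    prod *= 1.0 - c / t
assert 0.0 < prod <= (t0 / (T + 1)) ** c
```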
7.2 Proof of Lemma 2
The proof is broadly analogous to that of Lemma 1, although some distinctions are required to capture the correct asymptotics in the two parameters $n$ and $K$. To simplify notation, let us define the following quantities:
Using these notations, the recursive relations (6.1) and (6.2) become:
(7.4)  
(7.5) 
Recursively applying (7.5) and then (7.4), we obtain: