Stochastic gradient descent (SGD) is a popular optimization method for cost functions of the form:
More specifically, instead of computing the full gradient at each iteration, SGD computes a cheaper approximation of it: the gradient of a single component function whose index is randomly sampled from the set of component indices. In sampling such an index at each iteration, one can naturally think of the following two options: (i) with-replacement and (ii) without-replacement sampling. Interestingly, without-replacement SGD (also known as random reshuffle or random shuffle) has been empirically shown to outperform with-replacement SGD [bottou2009curiously, bottou2012stochastic]. However, the traditional analysis of SGD only covers the with-replacement version, raising the question of whether one can capture this phenomenon theoretically.
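The two sampling schemes can be contrasted with a minimal sketch. The function names and the choice of eight components are assumptions made purely for illustration.

```python
import random

def with_replacement_indices(n, rng):
    """n independent uniform draws from {0, ..., n-1}; repeats are possible."""
    return [rng.randrange(n) for _ in range(n)]

def without_replacement_indices(n, rng):
    """One epoch of random reshuffling: every index appears exactly once."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

rng = random.Random(0)
n = 8  # hypothetical number of component functions
wr = with_replacement_indices(n, rng)
wor = without_replacement_indices(n, rng)
print("with replacement:   ", wr)
print("without replacement:", wor)
```

Over one epoch, the without-replacement sequence visits every component exactly once, whereas the with-replacement sequence may repeat some components and miss others.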
The first result in this direction was established by Gürbüzbalaban, Ozdaglar, and Parrilo [gurbuzbalaban2015random] for the case where the cost function is strongly convex and each component function is quadratic and smooth.
Under this setting, they prove an asymptotic convergence rate of for epochs (the number of entire passes through the component indices ) when is treated as a constant. (Their proof is based on an asymptotic version of Chung’s lemma [fabian1967stochastic, Lemma 4.2].)
This is indeed an asymptotic improvement over the convergence rate of achieved by with-replacement SGD.
However, in modern applications, the number of component functions cannot be treated as a constant, and it is important to characterize the non-asymptotic behavior of the algorithm. For instance, in machine learning applications, the number of component functions is typically equal to the number of data points in the training set, and the algorithm is typically run for a few epochs and then terminated. Hence, it is important to investigate the dependence on the number of component functions as well as non-asymptotic convergence rates.
Scrutinizing the analysis of [gurbuzbalaban2015random, (58)], one can actually deduce the asymptotic convergence rate of , yet the exact dependence on and making this analysis non-asymptotic remain open. (Note that the key ingredient for making the analysis in [gurbuzbalaban2015random] non-asymptotic is a non-asymptotic version of Chung’s lemma [chung1954stochastic, Lemma 1], but as pointed out by Fabian [fabian1967stochastic, Discussion above Lemma 4.2], the original proof of the result has some errors.)
|Result|Convergence rate|Remark|
|---|---|---|
|Sum of quadratics|||
|Upper bound in [gurbuzbalaban2015random]||asymptotic result|
|Upper bound in [haochen2018random]||requires|
|Upper bound in [rajput2020closing]||requires|
|Lower bound in [safran2019good]||constant step size|
|Our upper bound||all|
|Non-quadratic strongly convex|||
|Upper bound in [nagaraj2019sgd]||requires|
|Lower bound in [rajput2020closing]||constant step size|
|Our upper bound||all|
Recently, a few subsequent efforts have been made to characterize non-asymptotic convergence rates in terms of both and . HaoChen and Sra [haochen2018random] develop the first non-asymptotic convergence rate of under the condition for the same setting as [gurbuzbalaban2015random], where denotes the condition number. However, compared with the baseline rate of SGD, i.e., , this convergence rate becomes an improvement only after epochs. This result is strengthened in follow-up works that demonstrate the superiority of without-replacement SGD:
The work by Nagaraj, Jain, and Netrapalli [nagaraj2019sgd] considers a more general setting where ’s no longer have to be quadratic, but just convex and smooth. Under this setting, they introduce clever coupling arguments to prove the non-asymptotic convergence rate of under the condition of which is more stringent than that of [haochen2018random]. Note that this rate is better than the baseline of as soon as the technical condition is fulfilled. This upper bound is shown to be asymptotically tight up to poly-logarithmic factors by Rajput, Gupta, and Papailiopoulos [rajput2020closing]. See Table 1 for comparisons.
The convergence rate with a better dependency on for the strongly convex setting given by [nagaraj2019sgd] (at the cost of a more severe requirement on ) has motivated researchers to revisit the quadratic sum case of [gurbuzbalaban2015random] and obtain a convergence rate that has better dependency on than that of [haochen2018random]. The first set of results in this direction is given by Safran and Shamir [safran2019good], where an asymptotic lower bound of is developed. They also establish a matching upper bound for the -dimensional case up to poly-logarithmic factors, evidencing that their lower bound is likely to have the correct dependency on . The question of correct dependency is settled by Rajput, Gupta, and Papailiopoulos [rajput2020closing], where the non-asymptotic convergence rate of under the condition is established, building on the coupling arguments in [nagaraj2019sgd]. See Table 1 for comparisons.
Despite such noticeable development over the years, there are key limitations shared among the existing non-asymptotic results (see Section 3 for precise details):
All the results place requirements on the number of epochs of the form for some constant . (One notable exception is [nagaraj2019sgd, Theorem 2]. However, that result only proves the rate of , which is not an improvement over the baseline rate of for with-replacement SGD.)
All the non-asymptotic results have extra poly-logarithmic terms, raising a question whether one can remove them with improved analysis.
In this paper, we aim to overcome these limitations.
1.1 Summary of our main results
We establish tight convergence rates for without-replacement SGD (see Table 1). The key to obtaining tight convergence rates is to consider iteration-dependent step sizes for , during the first epoch and the constant step size for the -th epoch (). Our analysis builds on the per-iteration/epoch progress bounds developed in prior work [nagaraj2019sgd, rajput2020closing] (see Section 2). The main distinction lies in turning those progress bounds into global convergence rate bounds. One tool for obtaining a non-asymptotic convergence rate is a version of Chung’s lemma [chung1954stochastic, Lemma 1], developed in the stochastic approximation literature. Unfortunately, the original proof has some errors, as pointed out by Fabian [fabian1967stochastic, Discussion above Lemma 4.2], and even assuming its correctness, the lemma turns out to be insufficient for obtaining the desired convergence rate bound (see Section 5). To overcome this difficulty, in Section 6, we introduce a variant of Chung’s lemma (Lemma 2) that can handle the case where there are two asymptotic parameters and ; this lemma may be of independent interest. Our approach removes the epoch requirement of type as well as the extra poly-logarithmic terms in the convergence rates.
1.2 Other related work
Apart from the works mentioned above, there are a few other theoretical studies of without-replacement SGD, although with different objectives. Shamir [shamir2016without] demonstrates that without-replacement SGD is not worse than SGD. His proof techniques use tools from transductive learning theory, and as a result, his results only cover the first epoch and the case where is a generalized linear function. In another study, Nguyen et al. [nguyen2020unified] consider some non-convex settings. For the strongly convex case, they also obtain a convergence guarantee of under some weaker assumptions (for instance, ’s are not required to be convex). However, this rate does not beat the baseline rate of SGD, i.e., .
1.3 Problem setup and notation
Given the cost function (1.1) of the finite-sum form, we consider the following optimization problem:
We call the cost function and ’s the component functions. Also, let be the optimum solution of (1.2).
In solving the above optimization problem (1.2), we consider the without-replacement version of SGD with the initial iterate . For , the -th epoch (pass) is executed by first randomly shuffling the component functions with the permutation on and going through each of them as
where for all and , and is the step size for the -th iteration of the -th epoch. For simplicity, we denote the output of the -th epoch by , i.e., .
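The update rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `shuffled_sgd`, `grad_i`, and `step_size`, the toy least-squares components, and the constant step size 0.05 are all hypothetical choices.

```python
import numpy as np

def shuffled_sgd(grad_i, x0, n, num_epochs, step_size, rng):
    """Without-replacement SGD: reshuffle the n components at the start of each epoch.

    grad_i(i, x)    -> gradient of the i-th component function at x
    step_size(k, i) -> step size for iteration i of epoch k (both 1-indexed)
    """
    x = np.array(x0, dtype=float)
    for k in range(1, num_epochs + 1):
        perm = rng.permutation(n)  # fresh random permutation for epoch k
        for i, idx in enumerate(perm, start=1):
            x = x - step_size(k, i) * grad_i(idx, x)
    return x  # output of the last epoch

# Toy problem (assumption): f_i(x) = 0.5 * (a_i^T x - b_i)^2 with a consistent system.
rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.normal(size=(n, d))
x_star = np.array([1.0, -2.0, 0.5])
b = A @ x_star  # every component is minimized at x_star

grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
x_out = shuffled_sgd(grad_i, np.zeros(d), n, num_epochs=200,
                     step_size=lambda k, i: 0.05, rng=rng)
print(np.linalg.norm(x_out - x_star))
```

Passing the step size as a function of the epoch and iteration indices keeps the sketch compatible with both constant and varying schedules.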
Lastly, we provide formal definitions of the regularity assumptions for functions. For definitions, let be a differentiable function and be some positive numbers.
We say is -Lipschitz if for all .
We say is -smooth if for any .
We say is -strongly convex if for any .
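For reference, the textbook forms of these three conditions are written out below. The symbols $h$, $G$, $L$, and $\mu$ are assumed names for the differentiable function and the positive constants; the statements are the standard ones and may differ cosmetically from the definitions intended here.

```latex
\begin{align*}
\text{$G$-Lipschitz:} \quad
  & |h(x) - h(y)| \le G \,\|x - y\|
  && \text{for all } x, y, \\
\text{$L$-smooth:} \quad
  & \|\nabla h(x) - \nabla h(y)\| \le L \,\|x - y\|
  && \text{for any } x, y, \\
\text{$\mu$-strongly convex:} \quad
  & h(y) \ge h(x) + \langle \nabla h(x),\, y - x \rangle + \tfrac{\mu}{2}\,\|y - x\|^2
  && \text{for any } x, y.
\end{align*}
```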
Starting from the next section, we first focus on the case where is -strongly convex and ’s are convex, -Lipschitz and -smooth. Later in Section 6.2, we will demonstrate how our techniques can be extended to obtain tight convergence rates for the case where ’s are additionally assumed to be quadratic.
2 Preliminaries: existing per-iteration/-epoch bounds
We first need to quantify the progress made by the algorithm over each iteration. For without-replacement SGD, there are two different types of analyses:
Per-iteration analysis where one characterizes the progress made at each iteration.
Per-epoch analysis where one characterizes the aggregate progress made over one epoch.
For per-iteration analysis, Nagaraj, Jain, and Netrapalli [nagaraj2019sgd] develop coupling arguments to prove that the progress made by without-replacement SGD is not worse than with-replacement SGD. In particular, their coupling arguments demonstrate the closeness in expectation between the iterates of without- and with-replacement SGD. The following is a consequence of their coupling argument:
Proposition 1 (Per-iteration analysis [nagaraj2019sgd, implicit in Section A.1]).
Assume for that each component function is convex, -Lipschitz and -smooth and the cost function is -strongly convex. Then, for any step size for the -th iteration of the -th epoch such that , the following bound holds between the adjacent iterates:
where the expectation is taken over the randomness within the -th epoch.
However, with the above analysis, one can only obtain results comparable to with-replacement SGD, as manifested in [nagaraj2019sgd, Theorem 2]. In order to characterize better progress, one needs to characterize the aggregate progress made over one epoch as a whole. A nice property of without-replacement SGD when considered over one epoch is the following observation due to Nedić and Bertsekas [nedic2001convergence, Chapter 2]. For each epoch, say the -th epoch, assuming that the iterates stay close to the initial iterate , the aggregate update direction will closely approximate the full gradient at , i.e.,
Based on this observation, together with the coupling arguments, Nagaraj, Jain, and Netrapalli [nagaraj2019sgd] obtain the following improved bound for one epoch as a whole:
Proposition 2 (Per-epoch analysis [nagaraj2019sgd, implicit in Section 5.1]).
Under the same setting as Proposition 1, let be the step size for the -th epoch, i.e., for . Then, the following bound holds between the output of the -th and -th epochs and :
where the expectation is taken over the randomness within the -th epoch.
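The observation of Nedić and Bertsekas that motivates the per-epoch bound can be checked numerically. The sketch below assumes toy least-squares components and a hypothetical helper `epoch_direction`; it measures how far the average gradient used during one shuffled epoch is from the full gradient at the starting iterate, for decreasing step sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]   # gradient of 0.5 * (a_i^T x - b_i)^2
full_grad = lambda x: A.T @ (A @ x - b) / n      # gradient of the average of the components

def epoch_direction(x0, eta, perm):
    """Run one shuffled epoch with constant step eta; return the average gradient used."""
    x, total = x0.copy(), np.zeros_like(x0)
    for idx in perm:
        g = grad_i(idx, x)
        total += g
        x = x - eta * g
    return total / n

x0 = rng.normal(size=d)
perm = rng.permutation(n)
gaps = [np.linalg.norm(epoch_direction(x0, eta, perm) - full_grad(x0))
        for eta in (1e-2, 1e-3, 1e-4)]
print(gaps)
```

If the iterates did not move at all, the average would equal the full gradient exactly, because each component is visited once per epoch; the gap shrinks as the step size (and hence the drift within the epoch) decreases.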
Having these per-iteration/-epoch progress bounds, the final ingredient of the non-asymptotic convergence rate analysis is to turn these bounds into across-epochs global convergence bounds.
3 Limitations of the previous approach
In this section, we explain the approach used in the previous works [haochen2018random, nagaraj2019sgd, rajput2020closing] to obtain non-asymptotic bounds and illustrate the limitations of their approach. To turn the per-iteration/-epoch progress bounds (2.1) and (2.3) into non-asymptotic convergence rates, the previous works take a simple approach of adopting constant stepsizes. For instance, we illustrate the approach in [nagaraj2019sgd] as follows:
Illustration of previous approach: In Proposition 2, let us choose the constant step size . Since , one can disregard the second term in the upper bound (2.3) as long as . Suppose that we choose small enough that holds. Then, since we also have , the per-epoch bound (2.3) becomes:
Now, recursively applying (3.1) for , we obtain the following bound:
Having established this, they choose for some to obtain:
One can easily see that the assumption is satisfied whenever .∎
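The unrolling step in the illustration follows a generic pattern that can be sketched numerically. The recursion below, `a_next = (1 - rho) * a + b`, is an assumed stand-in for a contraction with an additive error term; `rho` and `b` are placeholder constants, not the quantities appearing in the bounds above.

```python
def unroll(a0, rho, b, K):
    """Iterate a_{k+1} = (1 - rho) * a_k + b for K steps (worst case of the inequality)."""
    a = a0
    for _ in range(K):
        a = (1.0 - rho) * a + b
    return a

def closed_form(a0, rho, b, K):
    """Geometric-series form: (1 - rho)^K * a0 + b * (1 - (1 - rho)^K) / rho."""
    q = (1.0 - rho) ** K
    return q * a0 + b * (1.0 - q) / rho

a0, rho, b, K = 1.0, 0.1, 1e-4, 100
iterated = unroll(a0, rho, b, K)
formula = closed_form(a0, rho, b, K)
crude_bound = (1.0 - rho) ** K * a0 + b / rho
print(iterated, formula, crude_bound)
```

With a constant step size, the first term decays geometrically while the second term is a floor that depends on the step size, so the step size has to be tuned against the total number of epochs; in the illustration above, this tuning is what introduces the logarithmic factors and the requirement on the number of epochs.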
The limitations of the constant step size approach are manifested in the above illustration: (i) it incurs extra poly-logarithmic terms in the convergence rate bound, and (ii) it requires the number of epochs to be sufficiently large. Indeed, all previous works share these limitations. Having noticed these limitations, one might then wonder whether one can obtain better results by abandoning constant step sizes.
4 Chung’s lemma: analytic tools for varying stepsize
As an effort to overcome the limitations, let us allow step sizes to vary across iterations or epochs. Let us first consider the per-iteration progress bound in Proposition 1. Since Proposition 1 works for any iterations, one can disregard the epoch structure and simply denote by the -th iterate and by the step size used for the -th iteration. Choosing for all with the initial index , where we choose to ensure , the per-iteration bound (2.1) becomes (we also use ):
In fact, for the bounds of type (4.1), there are suitable tools for obtaining convergence rates: versions of Chung’s lemma [chung1954stochastic], developed in the stochastic approximation literature. Among the various versions of Chung’s lemma, there is one non-asymptotic version [chung1954stochastic, Lemma 1]:
Lemma 1 (Non-asymptotic Chung’s lemma).
Let be a sequence of positive real numbers. Suppose that there exist an initial index and real numbers , and such that satisfies the following inequality:
Then, for any we have the following bound:
Proof. Unfortunately, the original “proof” contains some errors as pointed out by Fabian [fabian1967stochastic, Discussion above Lemma 4.2]. We are able to correct the original proof; for this, see Section 7. ∎
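The behavior that Chung-type lemmas capture can be illustrated with a small simulation. The recursion used below, `xi_next = (1 - a/k) * xi + c / k**(p + 1)` with `a > p`, is the classical form from the stochastic approximation literature and is an assumption here; the symbols `a`, `p`, `c`, `k0`, and the chosen values are hypothetical and need not match the lemma's exact hypotheses.

```python
def chung_recursion(a, p, c, k0, xi0, k_max):
    """Iterate xi_{k+1} = (1 - a/k) * xi_k + c / k^(p+1) from index k0 up to k_max."""
    xi = xi0
    for k in range(k0, k_max):
        xi = (1.0 - a / k) * xi + c / k ** (p + 1)
    return xi  # value of the sequence at index k_max

a, p, c = 2.0, 1.0, 1.0      # assumed constants with a > p
k0, xi0, k_max = 2, 1.0, 100_000

xi_final = chung_recursion(a, p, c, k0, xi0, k_max)
scaled = k_max ** p * xi_final   # should settle near c / (a - p) for large k_max
print(scaled, c / (a - p))
```

The rescaled sequence approaches the constant `c / (a - p)`, which is the asymptotic content of the classical lemma; a non-asymptotic version additionally has to track the contribution of the initial value.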
Corollary 1. Under the setting of Proposition 1, let be a constant, and consider the step size for . Then the following convergence rate holds for any :
5 An illustrative failed attempt using Chung’s lemma
Following the same principle as the previous section, let us take for some constant . On the other hand, to make things simpler, let us assume that one can take . Plugging this stepsize into (5.1), we obtain the following bound for some constants :
which then yields the following non-asymptotic bound due to Lemma 1:
Although the last two terms in (5.2) are what we desire (see Table 1), the first term is undesirable. Even if we choose large, this bound will still contain the term , which is an obstacle when one tries to show the superiority over the baseline of ; note that the former is a better rate than the baseline only if . Therefore, for the target convergence bound, one needs other versions of Lemma 1.
6 A variant of Chung’s lemma and tight convergence rates
As we have seen in the previous section, Chung’s lemma is not enough for capturing the desired convergence rate. In this section, to capture the right order for both and , we develop a variant of Chung’s lemma.
Lemma 2 (Variant of Chung’s lemma).
Let be an integer, and be a sequence of positive real numbers. Suppose that there exist an initial index and real numbers , and such that the following are satisfied:
Then, for any we have the following bound for :
6.1 Tight convergence rate for strongly convex costs
Now we use Lemma 2 to obtain a tight convergence rate. Let for and . Let be an arbitrarily chosen constant. For the first epoch, we take the following iteration-varying step size: , where to ensure . Then, similarly to Corollary 1, yet this time by using the bound (4.3) in Lemma 1, one can derive the following bound:
where , i.e., .
Next, let us establish bounds of the form (6.2) for the -th epoch for . From the second epoch on, we use the same step size within an epoch. More specifically, for the -th epoch we choose . Using an argument similar to the one that yields (3.1) in Section 3, Proposition 2 yields the following bound for :
For , recursively applying Proposition 1 with the fact implies:
which is then upper bounded by . Thus, (6.7) can be rewritten as:
Theorem 1 (Strongly convex costs).
Assume for that each component function is convex, -Lipschitz and -smooth and the cost function is -strongly convex. For any constant , let , and consider the step sizes for and for , for . Then, the following convergence bound holds for any :
where , , and .
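The two-phase structure of the step sizes (varying within the first epoch, constant within each later epoch) can be sketched as follows. The functional form and the constants `c` and `k0` are hypothetical placeholders chosen only to show the structure; they are not the step sizes stated in Theorem 1.

```python
def step_size(k, i, n, c=1.0, k0=10):
    """Hypothetical two-phase schedule (c and k0 are placeholder constants).

    Epoch 1:      the step size decays with the iteration index i.
    Epoch k >= 2: one step size is shared by all n iterations of the epoch.
    """
    if k == 1:
        return c / (k0 + i)
    return c / (k0 + n * k)

n = 5  # hypothetical number of component functions
schedule = [[step_size(k, i, n) for i in range(1, n + 1)] for k in range(1, 4)]
for k, row in enumerate(schedule, start=1):
    print(f"epoch {k}:", [round(eta, 4) for eta in row])
```

Such a schedule can be passed directly to any routine that accepts a step size indexed by epoch and iteration.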
6.2 Tight convergence rate for quadratic costs
Similarly, yet using an improved per-epoch progress bound developed in [rajput2020closing], one can prove the following better convergence rate for the case where is additionally assumed to be quadratic. Note that this case is slightly more general than the setting considered in [gurbuzbalaban2015random, haochen2018random] where each is assumed to be quadratic. The details are deferred to Appendix A.
Theorem 2 (Quadratic costs).
Under the setting of Theorem 1, we additionally assume that is quadratic. For any constant , let , and consider the step sizes for and for , for . Then, the following convergence bound holds for any :
where , , , and .
We begin with an elementary fact that we will use throughout:
Proposition 3 (Integral approximation; see e.g. [lehman2010mathematics, Theorem 14.3]).
Let be a non-decreasing continuous function. Then, for any integers , . Similarly, if is non-increasing, then for any integers , .
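A quick numerical check of the integral approximation for a non-increasing function is sketched below. It uses the standard sandwich form of the bound (the sum over integers from `m` to `n` lies between the integral over `[m, n+1]` and the integral over `[m-1, n]`), which is an assumption about the exact statement intended here; the function `1/x` and the endpoints are illustrative choices.

```python
import math

def harmonic_sum(m, n):
    """Sum of 1/i for integers i = m, ..., n."""
    return sum(1.0 / i for i in range(m, n + 1))

m, n = 5, 1000
s = harmonic_sum(m, n)
lower = math.log((n + 1) / m)   # integral of 1/x over [m, n+1]
upper = math.log(n / (m - 1))   # integral of 1/x over [m-1, n]
print(lower, s, upper)
```

Bounds of this kind convert sums of step sizes into logarithms, which is the typical way such a fact enters proofs involving products of contraction factors.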
We first prove Lemma 1, thereby coming up with a correct proof of the non-asymptotic Chung’s lemma.
7.1 A correct proof of Chung’s lemma (Proof of Lemma 1)
For simplicity, let us define the following quantities for :
Using this notation, the recursive relation (4.2) becomes:
After recursively applying (7.1) for , one obtains the following bound:
Now let us upper and lower bound the product of ’s. Note that
and hence Proposition 3 with implies
Therefore, we have
Applying Proposition 3 with to the above, since is an anti-derivative of , we obtain the following upper bounds:
Combining all three cases, we conclude:
7.2 Proof of Lemma 2
The proof is generally analogous to that of Lemma 1, while some distinctions are required to capture the correct asymptotics for the two parameters and . To simplify notation, let us define the following quantities for :