We consider minimizing an empirical risk function , which is defined as the sample average of loss values over a possibly large but finite training set:
where the denotes the training data samples and the loss function is assumed convex and differentiable. We assume the empirical risk is strongly-convex which ensures that the minimizer, , is unique. Problems of the form (1
When the size of the dataset is large, it is impractical to solve (1) directly via traditional gradient descent by evaluating the full gradient at every iteration. One simple, yet powerful remedy is to employ the stochastic gradient method (SGD) [2, 3, 4, 5, 6, 7, 8, 9]. Rather than compute the full gradient over the entire data set, these algorithms pick one index at random at every iteration, and employ to approximate . Specifically, at iteration
, the update for estimating the minimizer is of the form:
is the step-size parameter. Note that we are using boldface notation to refer to random variables. Traditionally, the index
is uniformly distributed over the discrete set.
that incorporating random reshuffling into the gradient descent implementation helps achieve better performance. More broadly than in the case of the pure SGD algorithm, it has also been observed that applying random reshuffling in variance-reduction algorithms, like SVRG, SAGA, can accelerate the convergence speed[17, 18, 19, 20]. The reshuffling technique has also been applied in distributed system to reduce the communication and computation cost.
In random reshuffling implementations, the data points are no longer picked independently and uniformly at random. Instead, the gradient descent algorithm is run multiple times over the data where each run is indexed by
and is referred to as an epoch. For each epoch, the original data is first reshuffled and then passed over in order. In this manner, the-th sample of epoch can be viewed as , where the symbol represents a uniform random permutation of the indices. We can then express the random reshuffling algorithm for the th epoch in the following manner:
with the boundary condition:
In other words, the initial condition for epoch is the last iterate from epoch . The boldface notation for the symbols and in (3) emphasizes the random nature of these variables due to the randomness in the permutation operation. While the samples over one epoch are no longer picked independently from each other, the uniformity of the permutation function implies the following useful properties[19, 22, 23]:
where represents the collection of permuted indices for the samples numbered through .
Several recent works [12, 13, 24] have pursued justifications for the enhanced behavior of random reshuffling implementations over independent sampling (with replacement). The work  examined the convergence rate of the learning process under diminishing step-sizes, i.e., , where is some positive constant. It analytically showed that, for strongly convex objective functions, the convergence rate under random reshuffling can be improved from in vanilla SGD to . The incremental gradient methods[26, 27], which can be viewed as the deterministic version of random reshuffling, shares similar conclusions, i.e., random reshuffling helps accelerate the convergence rate from to under decaying step-sizes. Also, in the work , it establishes that random reshuffling will not degrade performance relative to the stochastic gradient descent implementation, provided the number of epochs is not too large.
In this work, we focus on a different setting than[12, 13, 24] involving random reshuffling under constant rather than decaying step-sizes. In this case, convergence is only guaranteed to a small neighborhood of the optimizer albeit at a linear rate. The analysis will establish analytically that random reshuffling outperforms independent sampling (with replacement) by showing that the mean-square-error of the iterate at the end of each run in the random reshuffling strategy will be in the order of . This is a significant improvement over the performance of traditional stochastic gradient descent, which is . Furthermore, we derive an analytical expression for the steady-state mean-square-error performance of the algorithm, which is exact for quadratic risks and provides a good approximation for general risks. This helps clarify in greater detail the differences between sampling with and without replacement We also explain the periodic behavior that is observed in random reshuffling implementations.
I-a Overview of results
Next, we examine more closely the value of the scaling constant in the factor by introducing a long term model and deriving an expression for its mean-square-deviation (MSD) performance — Theorem 2. The theorem reveals how the number of samples , step-size , and Hessian of the loss function impact performance.
In Theorem 3 we provide an expression for an upper bound for the MSD performance at all points close to steady-state. The result of the theorem helps explain the periodic behavior that is observed in random reshuffling implementations.
The mismatch between the original reshuffling and the long model is provided in Lemma 2.
Ii Analysis of the Stochastic Gradient Under Random Reshuffling
Ii-a Properties of the Gradient Approximation
We start by examining the properties of the stochastic gradient under random reshuffling. One main source of difficulty that we shall encounter in the analysis of performance under random reshuffling is the fact that a single sample of the stochastic gradient is now a biased estimate of the true gradient and, moreover, it is no longer independent of past selections, . This is in contrast to implementations where samples are picked independently at every iteration. Indeed, note that conditioned on previously picked data and on the previous iterate, we have:
since at the beginning of one epoch, no data has been selected yet. Perhaps surprisingly, we will be showing that the biased construction of the stochastic gradient estimate not only does not hurt the performance of the algorithm, but instead significantly improves it. In large part, the analysis will revolve around considering the accuracy of the gradient approximation over an entire epoch, rather than focusing on single samples at a time. Recall that by construction in random reshuffling, every sample is picked once and only once over one epoch. This means that the sample average (rather than the true mean) of the gradient noise process is zero since
for any and any reshuffling order . This property will become key in the analysis.
Ii-B Convergence Analysis
We can now establish a key convergence and performance property for the random reshuffling algorithm, which provides solid analytical justification for its observed improved performance in practice.
To begin with, we assume that the risk function satisfies the following conditions, which are automatically satisfied by many learning problems of interest, such as mean-square-error or logistic regression analysis and their regularized versions — see, e.g.,[28, 29, 30, 31, 32].
Assumption 1 (Condition on loss function).
It is assumed that is differentiable and has a -Lipschitz continuous gradient, i.e., for every and any :
where . We also assume is -strongly convex:
If we introduce , then each and are also -Lipschitz continuous.
The following theorem focuses on the convergence of the starting point of each epoch and establishes in (14) that it actually approaches a smaller neighborhood of size around . Afterwards, using this result, we also show that the same performance level holds for all iterates and not just for the starting points of the epochs.
To simplify the notation, we introduce the constant , which is the gradient noise variance at optimal point :
Theorem 1 (Stability of starting points).
when the step-size is sufficiently small, namely, for 111 The proof in this theorem is based on the worst case scenario, which implies the inequalities hold for any realizations. Therefore, this proof is also applicable to the deterministic cyclic sampling case.. The convergence to steady-state regime occurs at an exponential rate, dictated by the parameter:
See Appendix A ∎
Having established the stability of the first point of every epoch, we can establish the stability of every point.
Corollary 1 (Full Stability).
Under assumption 1, it holds that
for all when the step-size is sufficiently small.
See Appendix C ∎
With the previous established Theorem 1, it is also easy to gain the convergence theorem under decaying step-sizes.
Corollary 2 (Convergence under decaying step-sizes).
Under assumption 1 and the decaying step-sizes is employed, the iterate converge to the minimizer exactly as with rate.
See Appendix D ∎
Iii Illustrating Behavior and Periodicity
In this section we illustrate the theoretical findings so far by numerical simulations. We consider the following logistic regression problem:
is the feature vector,is the scalar label, and
The constant is the regularization parameter. In the first simulation, we compare the performance of the standard stochastic gradient descent (SGD) algorithm (2) with replacement and the random reshuffling (RR) algorithm (3). We set and . Each
is generated from the normal distribution, where is a diagonal matrix with each diagonal entry generated from the uniform distribution . To generate , we first generate an auxiliary random vector with each entry following . Next, we generate from a uniform distribution . If then is set as ; otherwise is set as . We select during all simulations. Figure 1 illustrates the mean-square-deviation (MSD) performance, i.e., , of the SGD and RR algorithms when . It is observed that the RR algorithm oscillates during the steady-state regime, and that the MSD at the is the best among all iterates during epoch . Furthermore, it is also observed that RR has better MSD performance than SGD. Similar observations also occur in Fig. 2, where . It is worth noting that the gap between SGD and RR is much larger in Fig. 2 than in Fig. 1.
Next, in the second simulation we verify the conclusion that the MSD for the starting point of each epoch for the random reshuffling algorithm, i.e., , can achieve instead of . We still consider the regularized logistic regression problem (17) and (18), and the same experimental setting. Recall that in Lemma 1, we proved that
which indicates that when is reduced a factor of 10, the MSD-performance should be improved by at least dB. We observe a decay of about 20dB per decade in Fig. 3 for a logistic regression problem with data points and 30dB per decade in Fig. 4 with .
Iv Introducing a Long-Term Model
We proved in the earlier sections that the mean-square error under random reshuffling approaches a small neighborhood around the minimizer. Our objective now is to assess more accurately the size of the constant that multiplies in the result, and examine how this constant may depend on various parameters including the amount of data, , and the form of the loss function . To do that, we proceed in two steps. First, we introduce an auxiliary long-term model in (28) below and subsequently determine how far the performance of this model is from the original system described by (27) further ahead.
Iv-a Error Dynamics
In order to quantify the performance of the random reshuffling implementation more accurately than the figure obtained earlier, we will need to impose a condition on the smoothness of the Hessian matrix of the risk function.
Assumption 2 (Hessian is Lipschitz continuous).
The risk function has a Lipschitz continuous Hessian matrix, i.e., there exists a constant , such that
Under this assumption, the gradient vector, , can be expressed in Taylor expansion in the form[28, p. 378]:
where the residual term satisfies:
As such, we can rewrite algorithm (3) in the form:
To ease the notation, we introduce the Hessian matrix and the gradient noise process:
so that (23) is simplified as:
Note that recursion (27) relates to , which are the starting points of two successive epochs. In this way, we have now transformed recursion (3), which runs from one sample to another within the same epoch, into a relation that runs from one starting point to another over two successive epochs.
To proceed, we will ignore the last two terms in (27) and consider the following approximate model, which we shall refer to as a long-term model.
Obviously, the state evolution will be different than (27) and is therefore denoted by the prime notation, . Observe, however, that in model (28) the gradient noise process is still being evaluated at the original state vector, , and not at the new state vector, .
Iv-B Performance of the Long-Term Model across Epochs
Note that the gradient noise in (28) has the form of a weighted sum over one epoch. This noise clearly satisfies the property:
We also know that satisfies the Markov property, i.e., it is independent of all previous and , where , conditioned on . To motivate the next lemma consider the following auxiliary setting.
Assume we have a collection of vectors in whose sum is zero. We define a random walk over these vectors in the following manner. At each time instant, we select a random vector uniformly and with replacement from this set and move from the current location along the vector to the next location. If we keep repeating this construction, we obtain behavior that is represented by the right plot in Fig. 5. Assume instead that we repeat the same experiment except that now we assume the data is first reshuffled and then vectors are selected uniformly without replacement. Because of the zero sum property, and because sampling is now performed without replacement, we find that in this second implementation we always return to the origin after selections. This situation is illustrated in the left plot of the same Fig. 5. The next lemma considers this scenario and provides useful expressions that allow us to estimate the expected location after or more (unitl ) movements. These results will be used in the sequel in our analysis of the performance of stochastic learning under RR.
Suppose we have a set of vectors with the constraint . Assume the elements of are randomly reshuffled and then selected uniformly without replacement. Let be any nonnegative constant, be any symmetric positive semi-definite matrix, and introduce
Define the following functions for any :
It then holds that
The proof is provided in Appendix E. ∎
We now return to the stochastic gradient implementation under random reshuffling. Recall from (10) that the stochastic gradient satisfies the zero sample mean property so that
at any given point . Applying Lemma 1, we readily conclude that
Similarly, we conclude for the gradient noise at the optimal :
Theorem 2 (Performance of Long-term Model across Epochs).
See Appendix F. ∎
The simulations in Fig. 6 show that the MSD expression (41) fits well the performance of the original random reshuffling algorithm. We will establish this fact analytically in the sequel. For now, the simulation is simply confirming that the performance of the long-term model is a good indication of the performance of the original stochastic gradient implementation under RR.
Iv-C Performance of the Long-Term Model over Iterations
In the previous section we examined the performance of the long-term model at the starting points of successive epochs. In this section, we examine the performance of the same model at any iterate as time approaches . This analysis will help explain the oscillations that are observed in the learning curves in the simulations. First, similar to (39), we need to determine the covariance matrix for any . From Lemma 1, we immediately get that
Theorem 3 (Performance Upper-bound for Long- Term Model).
See Appendix G.∎
We need to point out unlike that (41), expression (144) is an upper-bound rather than an actual performance expression. Still, this bound can help provide useful insights on the periodic behavior that is observed in the simulations. The expression (44) on the right-hand side is a convex combination of two performance measures as defined in (45), where the second term is always larger than the first term but approaching it as increases towards . This behavior will become clearer later in the context of an example and the hyperbolic representation in section V-B.
Before we continue, we would like to comment on the convergence curve under random reshuffling. Unlike the convergence curve under uniform sampling, we observe periodic fluctuations under random reshuffling in Figures 2 and 6. The main reason for this behavior is the fact that the gradient noise is no longer i.i.d. in steady-state. Specifically, the noise variance is now a function of the iterate and it assumes its lowest value at the beginning and end of every epoch. In lemma 1, we show that the variance of the random walk process resulting from random reshuffling at each iteration in Eq. (34). We plot the function for and in Fig. 7. Since the mean-square performance of the algorithm is related to the variance of the gradient noise, it is expected that this bell-shape behavior will be reflected in to the MSD curve as well, thus, resulting in better performance at the beginning and end of every epoch.
Iv-D Mismatch Bound
Lemma 2 (Mismatch Bound).
Proof: See Appendix I.
V Quadratic Risks and Hyperbolic Representation
Lastly, we consider an example involving a quadratic (least-squares) risk to show that, in this case, the long-term model provides the exact MSD for the original algorithm. The analysis will also provide some insights into expression (41). It also motivates a hyperbolic representation for the MSD, which helps provides some more insights into the MSD behavior.
V-a Quadratic Risks
Thus, consider the following quadratic risk function:
where has full column rank. We have:
Since the gradient noise is independent of , we have
Moreover, since the risk is quadratic, it also holds that
Therefore, the long-term model is exactly the same as the original algorithm. For this example, we can calculate the following quantities:
In special case when the columns of are orthogonal and normalized, i.e., , we can simplify the MSD expression (41) by noting that
In order to provide further insights on this MSD expression, we simplify it under a small assumption. We could introduce the Taylor series:
However, this approximation can be bad if is large, which is not uncommon in big data. Instead, we appeal to:
and arrive at the simplified expression:
For comparison purposes, we know that a simplified expression for MSD under uniform sampling has the following expression:
Hence, the random reshuffling case has an extra multiplicative factor:
We plot versus in the left plot of Fig. 8 where we ignore . Now it is clear from the figure that the smaller the step size or the smaller sample size are, the larger the improvement in performance is. In contrast, when goes to infinity, the term will converge to , i.e., the same performance as uniform sampling situation, which is consistent with the infinite-horizon case. Lastly, noting that
and using the approximation (61):
Substituting in (44), we get for :
Since is monotonically increasing,