Stochastic Learning under Random Reshuffling

03/21/2018 ∙ by Bicheng Ying, et al. ∙ EPFL

In empirical risk optimization, it has been observed that stochastic gradient implementations that rely on random reshuffling of the data achieve better performance than implementations that rely on sampling the data uniformly. Recent works have pursued justifications for this behavior by examining the convergence rate of the learning process under diminishing step-sizes. This work focuses on the constant step-size case. In this case, convergence is guaranteed to a small neighborhood of the optimizer albeit at a linear rate. The analysis establishes analytically that random reshuffling outperforms uniform sampling by showing explicitly that iterates approach a smaller neighborhood of size O(μ^2) around the minimizer rather than O(μ). Furthermore, we derive an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement. We also explain the periodic behavior that is observed in random reshuffling implementations.

I Motivation

We consider minimizing an empirical risk function $J(w)$, which is defined as the sample average of loss values over a possibly large but finite training set:

$J(w) \;\triangleq\; \frac{1}{N}\sum_{n=1}^{N} Q(w; x_n)$        (1)

where $\{x_n\}_{n=1}^{N}$ denotes the training data samples and the loss function $Q(w; x_n)$ is assumed convex and differentiable. We assume the empirical risk $J(w)$ is strongly convex, which ensures that the minimizer, $w^\star$, is unique. Problems of the form (1) are common in many areas of machine learning, including linear regression, logistic regression, and their regularized versions.

When the size $N$ of the dataset is large, it is impractical to solve (1) directly via traditional gradient descent by evaluating the full gradient at every iteration. One simple yet powerful remedy is to employ the stochastic gradient method (SGD) [2, 3, 4, 5, 6, 7, 8, 9]. Rather than compute the full gradient over the entire data set, these algorithms pick one index $\boldsymbol{\sigma}(i)$ at random at every iteration and employ $\nabla Q(\boldsymbol{w}_{i-1}; x_{\boldsymbol{\sigma}(i)})$ to approximate $\nabla J(\boldsymbol{w}_{i-1})$. Specifically, at iteration $i$, the update for estimating the minimizer is of the form [10]:

$\boldsymbol{w}_i = \boldsymbol{w}_{i-1} - \mu\, \nabla Q(\boldsymbol{w}_{i-1}; x_{\boldsymbol{\sigma}(i)})$        (2)

where $\mu > 0$ is the step-size parameter. Note that we are using boldface notation to refer to random variables. Traditionally, the index $\boldsymbol{\sigma}(i)$ is uniformly distributed over the discrete set $\{1, 2, \ldots, N\}$.
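As a point of reference, the following Python sketch implements recursion (2) with uniform sampling. It is a minimal illustration, not the paper's code: the per-sample gradient passed in is arbitrary, and the quadratic loss in the usage example is only a placeholder whose minimizer (the sample mean) is easy to verify.

import numpy as np

def sgd_uniform(grad_Q, data, w0, mu, num_iters, rng=None):
    """Stochastic gradient descent (2) with uniform sampling (with replacement).

    grad_Q(w, x) returns the gradient of the per-sample loss Q(w; x).
    """
    rng = np.random.default_rng() if rng is None else rng
    w = w0.copy()
    N = len(data)
    for _ in range(num_iters):
        n = rng.integers(N)                 # index sigma(i), uniform over {1,...,N}
        w = w - mu * grad_Q(w, data[n])     # stochastic gradient step (2)
    return w

# Placeholder quadratic loss Q(w; x) = 0.5*||x - w||^2, whose minimizer is the sample mean.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
w_hat = sgd_uniform(lambda w, x: w - x, data, np.zeros(5), mu=0.05, num_iters=5000, rng=rng)
print(np.linalg.norm(w_hat - data.mean(axis=0)))    # close to zero, up to the O(mu) steady-state error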

It has been noted in the literature [11, 12, 13, 14] that incorporating random reshuffling into the stochastic gradient implementation helps achieve better performance. Beyond the pure SGD algorithm, it has also been observed that applying random reshuffling in variance-reduction algorithms, such as SVRG [15] and SAGA [16], can accelerate convergence [17, 18, 19, 20]. The reshuffling technique has also been applied in distributed systems to reduce communication and computation costs [21].

In random reshuffling implementations, the data points are no longer picked independently and uniformly at random. Instead, the gradient descent algorithm is run multiple times over the data, where each run is indexed by $k$ and is referred to as an epoch. For each epoch, the original data is first reshuffled and then passed over in order. In this manner, the $i$-th sample of epoch $k$ can be viewed as $x_{\boldsymbol{\sigma}^k(i)}$, where the symbol $\boldsymbol{\sigma}^k$ represents a uniform random permutation of the indices $\{1, 2, \ldots, N\}$. We can then express the random reshuffling algorithm for the $k$-th epoch in the following manner:

$\boldsymbol{w}_i^k = \boldsymbol{w}_{i-1}^k - \mu\, \nabla Q(\boldsymbol{w}_{i-1}^k; x_{\boldsymbol{\sigma}^k(i)}), \qquad i = 1, 2, \ldots, N$        (3)

with the boundary condition:

$\boldsymbol{w}_0^{k} = \boldsymbol{w}_N^{k-1}$        (4)

In other words, the initial condition for epoch $k$ is the last iterate from epoch $k-1$. The boldface notation for the symbols $\boldsymbol{w}_i^k$ and $\boldsymbol{\sigma}^k$ in (3) emphasizes the random nature of these variables due to the randomness in the permutation operation. While the samples over one epoch are no longer picked independently of each other, the uniformity of the permutation function implies the following useful properties [19, 22, 23]:

(5)
(6)
(7)

where the conditioning set represents the collection of permuted indices selected for the samples numbered 1 through $i-1$.
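For comparison with the sketch above, a minimal implementation of the random reshuffling recursion (3)-(4) is given next: a fresh permutation is drawn at the start of every epoch, and the last iterate of one epoch carries over as the starting point of the next. Names are illustrative only.

import numpy as np

def sgd_random_reshuffling(grad_Q, data, w0, mu, num_epochs, rng=None):
    """Random reshuffling: one ordered pass over a fresh permutation per epoch."""
    rng = np.random.default_rng() if rng is None else rng
    w = w0.copy()
    N = len(data)
    for _ in range(num_epochs):
        sigma = rng.permutation(N)                  # uniform random permutation, cf. (3)
        for i in range(N):
            w = w - mu * grad_Q(w, data[sigma[i]])  # recursion (3)
        # no reset between epochs: w carries over, i.e., boundary condition (4)
    return w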

Several recent works [12, 13, 24] have pursued justifications for the enhanced behavior of random reshuffling implementations over independent sampling (with replacement). The work [13] examined the convergence rate of the learning process under diminishing step-sizes that decay in inverse proportion to the iteration index. It showed analytically that, for strongly convex objective functions, the convergence rate under random reshuffling improves from $O(1/k)$ for vanilla SGD [25] to $O(1/k^2)$, where $k$ indexes the epochs. The incremental gradient methods [26, 27], which can be viewed as the deterministic version of random reshuffling, share a similar conclusion, i.e., random reshuffling helps accelerate the convergence rate under decaying step-sizes. Also, the work [24] establishes that random reshuffling will not degrade performance relative to the stochastic gradient descent implementation, provided the number of epochs is not too large.

In this work, we focus on a different setting than [12, 13, 24], involving random reshuffling under constant rather than decaying step-sizes. In this case, convergence is only guaranteed to a small neighborhood of the optimizer, albeit at a linear rate. The analysis will establish analytically that random reshuffling outperforms independent sampling (with replacement) by showing that the mean-square-error of the iterate at the end of each run of the random reshuffling strategy is in the order of $O(\mu^2)$. This is a significant improvement over the performance of traditional stochastic gradient descent, which is $O(\mu)$ [10]. Furthermore, we derive an analytical expression for the steady-state mean-square-error performance of the algorithm, which is exact for quadratic risks and provides a good approximation for general risks. This helps clarify in greater detail the differences between sampling with and without replacement. We also explain the periodic behavior that is observed in random reshuffling implementations.

I-A Overview of results

  • Section II provides a stability proof, which shows that, under a constant step-size, random reshuffling converges to a small neighborhood around the minimizer. The radius of the neighborhood improves from $O(\mu)$ under uniform sampling to $O(\mu^2)$ under random reshuffling — Theorem 1.

  • Next, we examine more closely the value of the scaling constant in the $O(\mu^2)$ factor by introducing a long-term model and deriving an expression for its mean-square-deviation (MSD) performance — Theorem 2. The theorem reveals how the number of samples $N$, the step-size $\mu$, and the Hessian of the loss function impact performance.

  • In Theorem 3 we provide an upper bound on the MSD performance at all points close to steady-state. The result of the theorem helps explain the periodic behavior that is observed in random reshuffling implementations.

  • The mismatch between the original reshuffling recursion and the long-term model is bounded in Lemma 2.

  • Inspired by quadratic risks, we simplify the MSD expressions in Theorems 2 and 3 by using hyperbolic tangent functions — equations (76) and (80).

  • In equations (81)–(85), we show that as the sample size $N$ increases, the MSD expression established in Theorem 2 reduces to the same expression as in the uniform sampling case.

II Analysis of the Stochastic Gradient Under Random Reshuffling

II-A Properties of the Gradient Approximation

We start by examining the properties of the stochastic gradient under random reshuffling. One main source of difficulty that we shall encounter in the analysis of performance under random reshuffling is the fact that a single sample of the stochastic gradient is now a biased estimate of the true gradient and, moreover, is no longer independent of past selections. This is in contrast to implementations where samples are picked independently at every iteration. Indeed, conditioned on previously picked data and on the previous iterate, we have:

(8)

The difference in (8) is generally nonzero in view of definition (1). For the first iteration of every epoch, however, it can be verified that the following holds:

(9)

since at the beginning of an epoch, no data has been selected yet. Perhaps surprisingly, we will show that the biased construction of the stochastic gradient estimate not only does not hurt the performance of the algorithm, but instead significantly improves it. In large part, the analysis will revolve around assessing the accuracy of the gradient approximation over an entire epoch, rather than focusing on single samples at a time. Recall that, by construction, in random reshuffling every sample is picked once and only once over one epoch. This means that the sample average (rather than the true mean) of the gradient noise process is zero since

$\frac{1}{N}\sum_{i=1}^{N}\left[\nabla Q(w; x_{\boldsymbol{\sigma}(i)}) - \nabla J(w)\right] = 0$        (10)

for any $w$ and any reshuffling order $\boldsymbol{\sigma}$. This property will become key in the analysis.
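Property (10) is purely combinatorial: a permutation visits every index exactly once, so the per-epoch sample average of the gradient noise vanishes for any point and any order. The brief numerical check below illustrates this; the quadratic per-sample gradient is only a stand-in for a generic $\nabla Q(w; x_n)$.

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(20, 3))
w = rng.normal(size=3)

grad_Q = lambda w, x: w - x                                # placeholder per-sample gradient
grad_J = np.mean([grad_Q(w, x) for x in data], axis=0)     # true (full) gradient

sigma = rng.permutation(len(data))                         # one reshuffled epoch
noise = [grad_Q(w, data[n]) - grad_J for n in sigma]       # gradient noise terms

# The sample average of the gradient noise over the epoch is exactly zero.
print(np.linalg.norm(np.mean(noise, axis=0)))              # ~1e-16 (numerical zero)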

II-B Convergence Analysis

We can now establish a key convergence and performance property for the random reshuffling algorithm, which provides solid analytical justification for its observed improved performance in practice.

To begin with, we assume that the risk function satisfies the following conditions, which are automatically satisfied by many learning problems of interest, such as mean-square-error or logistic regression analysis and their regularized versions — see, e.g., [28, 29, 30, 31, 32].

Assumption 1 (Condition on loss function).

It is assumed that each loss $Q(w; x_n)$ is differentiable and has a $\delta$-Lipschitz continuous gradient, i.e., for every $n$ and any $w_1, w_2$:

(11)

where $\delta > 0$. We also assume the risk $J(w)$ is $\nu$-strongly convex:

(12)

It then follows that the aggregate gradient $\nabla J(w)$ is also $\delta$-Lipschitz continuous.

The following theorem focuses on the convergence of the starting point of each epoch and establishes in (14) that it actually approaches a smaller neighborhood of size $O(\mu^2)$ around $w^\star$. Afterwards, using this result, we also show that the same performance level holds for all iterates and not just for the starting points of the epochs.

To simplify the notation, we introduce a constant that captures the gradient noise variance at the optimal point $w^\star$:

(13)
Theorem 1 (Stability of starting points).

Under assumption 1, the starting point of each run in (3), i.e., $\boldsymbol{w}_0^k$, satisfies

(14)

when the step-size $\mu$ is sufficiently small. (The proof of this theorem is based on a worst-case argument, which implies that the inequalities hold for any realization; therefore, the proof is also applicable to the deterministic cyclic sampling case.) The convergence to the steady-state regime occurs at an exponential rate, dictated by the parameter:

(15)
Proof.

See Appendix A

Having established the stability of the first point of every epoch, we can now establish the stability of every iterate.

Corollary 1 (Full Stability).

Under assumption 1, it holds that

(16)

for all iterates when the step-size $\mu$ is sufficiently small.

Proof.

See Appendix C

With Theorem 1 established, it is also straightforward to obtain a convergence result under decaying step-sizes.

Corollary 2 (Convergence under decaying step-sizes).

Under assumption 1, when decaying step-sizes are employed, the iterates converge exactly to the minimizer $w^\star$.

Proof.

See Appendix D

III Illustrating Behavior and Periodicity

In this section we illustrate the theoretical findings so far by numerical simulations. We consider the following logistic regression problem:

(17)

where $h_n$ is the feature vector, $\gamma(n)$ is the scalar label, and

(18)

The constant $\rho$ is the regularization parameter. In the first simulation, we compare the performance of the standard stochastic gradient descent (SGD) algorithm (2), which samples with replacement, against the random reshuffling (RR) algorithm (3). Each feature vector $h_n$ is generated from a zero-mean normal distribution whose covariance is a diagonal matrix with each diagonal entry drawn from a uniform distribution. To generate the label $\gamma(n)$, we first generate an auxiliary random vector with independent standard normal entries; we then draw a uniform random variable and set $\gamma(n)$ to one label value if this variable falls below the logistic probability determined by $h_n$ and the auxiliary vector, and to the other label value otherwise. The same regularization parameter is used in all simulations. Figure 1 illustrates the mean-square-deviation (MSD) performance, i.e., $\mathbb{E}\,\|\boldsymbol{w}_i - w^\star\|^2$, of the SGD and RR algorithms. It is observed that the RR algorithm oscillates during the steady-state regime, and that the MSD at the starting point of each epoch is the best among all iterates within that epoch. Furthermore, it is also observed that RR has better MSD performance than SGD. Similar observations hold in Fig. 2, which corresponds to a smaller step-size value. It is worth noting that the gap between SGD and RR is much larger in Fig. 2 than in Fig. 1.

Fig. 1: RR has better mean-square-deviation (MSD) performance than standard SGD. The dotted black curve is drawn by connecting the MSD values at the starting points of the successive epochs.
Fig. 2: RR has much better MSD performance than standard SGD for a smaller step-size value. The dotted black curve is drawn by connecting the MSD values at the starting points of the successive epochs.
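A sketch of how an experiment of this kind can be reproduced is given below. The dimensions, step-size, and regularization value are illustrative placeholders (the exact values behind the figures are not reproduced here), and the data-generation scheme follows the description above.

import numpy as np

rng = np.random.default_rng(2)
N, M, mu, rho = 1000, 10, 0.01, 0.1       # illustrative values only

# Synthetic regularized logistic regression data (assumed generation scheme).
H = rng.normal(size=(N, M)) * np.sqrt(rng.uniform(1.0, 2.0, size=M))
w_aux = rng.normal(size=M)                # auxiliary vector used to draw the labels
gamma = np.where(rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-H @ w_aux)), 1.0, -1.0)

def grad_Q(w, n):
    # Gradient of rho/2*||w||^2 + ln(1 + exp(-gamma(n)*h_n'w)) for a single sample.
    z = gamma[n] * (H[n] @ w)
    return rho * w - gamma[n] * H[n] / (1.0 + np.exp(z))

def full_grad(w):
    z = gamma * (H @ w)
    return rho * w - ((gamma / (1.0 + np.exp(z))) @ H) / N

# Reference minimizer w_star computed via full-batch gradient descent.
w_star = np.zeros(M)
for _ in range(5000):
    w_star = w_star - 0.5 * full_grad(w_star)

def run(reshuffle, epochs=50):
    w, msd = np.zeros(M), []
    for _ in range(epochs):
        order = rng.permutation(N) if reshuffle else rng.integers(N, size=N)
        for n in order:
            w = w - mu * grad_Q(w, n)
        msd.append(np.sum((w - w_star) ** 2))   # MSD at the start of the next epoch
    return msd

msd_sgd, msd_rr = run(reshuffle=False), run(reshuffle=True)
print("steady-state MSD  SGD: %.3e   RR: %.3e" % (msd_sgd[-1], msd_rr[-1]))

Plotting the per-iteration MSD of both runs against the iteration index should reproduce the qualitative behavior described above: both curves converge linearly, with the RR curve settling at a lower level and exhibiting periodic dips near epoch boundaries.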

Next, in the second simulation, we verify the conclusion that the MSD at the starting point of each epoch of the random reshuffling algorithm, i.e., $\mathbb{E}\,\|\boldsymbol{w}_0^k - w^\star\|^2$, can achieve $O(\mu^2)$ instead of $O(\mu)$. We still consider the regularized logistic regression problem (17) and (18), under the same experimental setting. Recall that in Theorem 1, we proved that

(19)

which indicates that when $\mu$ is reduced by a factor of 10, the MSD performance should improve by at least 20 dB. We observe a decay of about 20 dB per decade in Fig. 3 for one logistic regression problem, and of about 30 dB per decade in Fig. 4 for a problem with a different number of data points.
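To spell out the decibel arithmetic behind this statement: if the steady-state MSD scales as $\mu^2$, reducing $\mu$ by a factor of 10 reduces the MSD by a factor of 100, i.e.,

$10\log_{10}\!\big(\mathrm{MSD}(\mu)/\mathrm{MSD}(\mu/10)\big) \;\approx\; 10\log_{10}\!\big(\mu^{2}/(\mu/10)^{2}\big) \;=\; 10\log_{10}(100) \;=\; 20~\mathrm{dB},$

whereas an $O(\mu)$ law, as under uniform sampling, would only yield $10\log_{10}(10) = 10$ dB of improvement per decade.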

Fig. 3: Mean-square-deviation performance at steady-state versus the step-size for a logistic regression problem. The slope is around 20 dB per decade.
Fig. 4: Mean-square-deviation performance at steady-state versus the step-size for a logistic regression problem with a different number of data points. The slope is around 30 dB per decade.

IV Introducing a Long-Term Model

We proved in the earlier sections that the mean-square error under random reshuffling approaches a small $O(\mu^2)$ neighborhood around the minimizer. Our objective now is to assess more accurately the size of the constant that multiplies $\mu^2$ in this result, and to examine how this constant may depend on various parameters, including the amount of data $N$ and the form of the loss function. To do so, we proceed in two steps. First, we introduce an auxiliary long-term model in (28) below and subsequently determine how far the performance of this model is from the original system described by (27) further ahead.

IV-A Error Dynamics

In order to quantify the performance of the random reshuffling implementation more accurately than the bound obtained earlier, we need to impose a condition on the smoothness of the Hessian matrix of the risk function.

Assumption 2 (Hessian is Lipschitz continuous).

The risk function $J(w)$ has a Lipschitz continuous Hessian matrix, i.e., there exists a nonnegative constant such that

(20)

Under this assumption, the gradient vector can be expressed through a Taylor expansion of the form [28, p. 378]:

(21)

where the residual term satisfies:

(22)

As such, we can rewrite algorithm (3) in the form:

(23)

To ease the notation, we introduce the Hessian matrix and the gradient noise process:

(24)

so that (23) is simplified as:

(25)

Now, property (9) motivates us to expand (25) into the following error recursion by adding and subtracting the same gradient noise term evaluated at the starting point of the epoch:

(26)

Iterating (26) and using (4) we can establish the following useful relation, which we call upon in the sequel:

(27)

Note that recursion (27) relates $\boldsymbol{w}_0^{k+1}$ to $\boldsymbol{w}_0^{k}$, which are the starting points of two successive epochs. In this way, we have transformed recursion (3), which runs from one sample to another within the same epoch, into a relation that runs from one starting point to another over two successive epochs.

To proceed, we will ignore the last two terms in (27) and consider the following approximate model, which we shall refer to as the long-term model:

(28)

Obviously, the state evolution of this model is different from (27), and it is therefore denoted by the prime notation. Observe, however, that in model (28) the gradient noise process is still being evaluated at the original state vector and not at the new (primed) state vector.

IV-B Performance of the Long-Term Model across Epochs

Note that the gradient noise in (28) has the form of a weighted sum over one epoch. This noise clearly satisfies the property:

(29)

We also know that the long-term model iterate satisfies the Markov property, i.e., conditioned on the starting point of the current epoch, it is independent of all earlier iterates and permutations. To motivate the next lemma, consider the following auxiliary setting.

Assume we have a collection of $N$ vectors whose sum is zero. We define a random walk over these vectors in the following manner. At each time instant, we select a random vector uniformly and with replacement from this set and move from the current location along the selected vector to the next location. If we keep repeating this construction, we obtain behavior that is represented by the right plot in Fig. 5. Assume instead that we repeat the same experiment, except that the data is now first reshuffled and the vectors are then selected uniformly without replacement. Because of the zero-sum property, and because sampling is now performed without replacement, we find that in this second implementation the walk always returns to the origin after every $N$ selections. This situation is illustrated in the left plot of the same Fig. 5. The next lemma considers this scenario and provides useful expressions that allow us to estimate the expected location after any number of movements up to $N$. These results will be used in the sequel in our analysis of the performance of stochastic learning under RR.

Fig. 5: Random walk (with replacement) versus random reshuffling walk (without replacement). Lines with the same color represent the $i$-th move across different epochs.
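As a quick numerical illustration of this construction (the vector set below is arbitrary apart from the zero-sum constraint, and all names are illustrative), sampling without replacement returns the walk exactly to the origin after every full pass of N selections, while sampling with replacement generally does not:

import numpy as np

rng = np.random.default_rng(3)
N, M = 50, 2
V = rng.normal(size=(N, M))
V = V - V.mean(axis=0)                   # enforce the zero-sum constraint

def walk(with_replacement, epochs=4):
    pos, path = np.zeros(M), []
    for _ in range(epochs):
        idx = rng.integers(N, size=N) if with_replacement else rng.permutation(N)
        for n in idx:
            pos = pos + V[n]
            path.append(pos.copy())
    return np.array(path)

rr_path = walk(with_replacement=False)   # reshuffling walk
ur_path = walk(with_replacement=True)    # plain random walk

ends = np.arange(1, 5) * N - 1           # positions after each full pass
print(np.linalg.norm(rr_path[ends], axis=1))   # numerically zero at every epoch end
print(np.linalg.norm(ur_path[ends], axis=1))   # generally nonzero and drifting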
Lemma 1.

Suppose we have a set of $N$ vectors whose sum is zero. Assume the elements of the set are randomly reshuffled and then selected uniformly without replacement. Consider any nonnegative constant and any symmetric positive semi-definite matrix, and introduce

(30)
(31)

Define the following functions for any iteration index $i$:

(32)
(33)

It then holds that

(34)
(35)
Proof.

The proof is provided in Appendix E. ∎

We now return to the stochastic gradient implementation under random reshuffling. Recall from (10) that the stochastic gradient satisfies the zero sample mean property so that

(36)

at any given point $w$. Applying Lemma 1, we readily conclude that

(37)

where

(38)

Similarly, we conclude for the gradient noise at the optimal point $w^\star$:

(39)

where

(40)
Theorem 2 (Performance of Long-term Model across Epochs).

Under assumptions 1 and 2, when the step-size $\mu$ is sufficiently small, the mean-square-deviation (MSD) of the long-term model (28) is given by

(41)

The convergence to the steady-state regime occurs at an exponential rate, dictated by the parameter:

(42)
Proof.

See Appendix F. ∎

The simulations in Fig. 6 show that the MSD expression (41) fits the performance of the original random reshuffling algorithm well. We will establish this fact analytically in the sequel. For now, the simulation simply confirms that the performance of the long-term model is a good indication of the performance of the original stochastic gradient implementation under RR.

Fig. 6: Mean-square-deviation performance of the random reshuffling algorithm on a least-mean-square cost function.

IV-C Performance of the Long-Term Model over Iterations

In the previous section, we examined the performance of the long-term model at the starting points of successive epochs. In this section, we examine the performance of the same model at any iterate within an epoch as time approaches infinity. This analysis will help explain the oscillations that are observed in the learning curves in the simulations. First, similar to (39), we need to determine the covariance matrix of the gradient noise at an arbitrary iteration index. From Lemma 1, we immediately get that

(43)
Theorem 3 (Performance Upper-Bound for Long-Term Model).

Under assumptions 1 and 2, when the step-size $\mu$ is sufficiently small, an upper bound on the mean-square-deviation (MSD) of the long-term model (28) at all iterations is given by

(44)
(45)
Proof.

See Appendix G.∎

We need to point out that, unlike (41), expression (44) is an upper bound rather than an exact performance expression. Still, this bound can help provide useful insights into the periodic behavior that is observed in the simulations. The expression (44) on the right-hand side is a convex combination of the two performance measures defined in (45), where the second term is always larger than the first but approaches it as the iteration index increases towards the end of the epoch. This behavior will become clearer later in the context of an example and the hyperbolic representation in Section V-B.

Before we continue, we would like to comment on the convergence curve under random reshuffling. Unlike the convergence curve under uniform sampling, we observe periodic fluctuations under random reshuffling in Figures 2 and 6. The main reason for this behavior is the fact that the gradient noise is no longer i.i.d. in steady-state. Specifically, the noise variance is now a function of the position within the epoch, and it assumes its lowest value at the beginning and end of every epoch. In Lemma 1, we characterized in Eq. (34) the variance of the random walk process resulting from random reshuffling at each iteration. We plot this function in Fig. 7 for two parameter choices. Since the mean-square performance of the algorithm is related to the variance of the gradient noise, it is expected that this bell-shaped behavior will be reflected in the MSD curve as well, thus resulting in better performance at the beginning and end of every epoch.

Fig. 7: The variance function in (34) versus the iteration index, for different parameter values.
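The bell shape can be checked directly in the simplest unweighted instance of this random walk, i.e., ignoring the step-size weighting that enters (34). For sampling without replacement from a zero-mean set, the variance of the partial sum after i draws follows the standard finite-population-correction law, proportional to i(N-i)/(N-1): it is small at the start, peaks near mid-epoch, and vanishes at the epoch boundary. The Monte Carlo check below is a sketch of this simplified case only, not of the full expression in (34).

import numpy as np

rng = np.random.default_rng(4)
N, trials = 20, 20000
v = rng.normal(size=N)
v = v - v.mean()                          # zero-sum population, as in the walk of Fig. 5
pop_var = np.mean(v ** 2)

# Monte Carlo variance of the partial sums over freshly reshuffled orders.
partial = np.empty((trials, N))
for t in range(trials):
    partial[t] = np.cumsum(v[rng.permutation(N)])
mc_var = partial.var(axis=0)

# Finite-population-correction prediction: Var(sum of i draws) = i*(N-i)/(N-1)*pop_var.
i = np.arange(1, N + 1)
predicted = i * (N - i) / (N - 1) * pop_var

print(np.round(mc_var, 2))                # bell-shaped: rises, peaks near i = N/2, returns to zero at i = N
print(np.round(predicted, 2))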

IV-D Mismatch Bound

Now we provide an upper bound on the mismatch between the long-term model (28) and the original algorithm (3).

Lemma 2 (Mismatch Bound).

After sufficiently many iterations, the difference between the long-term model trajectory (28) and the original trajectory (3) satisfies

(46)

Proof. See Appendix I. ∎

V Quadratic Risks and Hyperbolic Representation

Lastly, we consider an example involving a quadratic (least-squares) risk to show that, in this case, the long-term model provides the exact MSD for the original algorithm. The analysis will also provide some insight into expression (41). It also motivates a hyperbolic representation for the MSD, which helps provide further insight into the MSD behavior.

V-A Quadratic Risks

Consider the following quadratic risk function:

(47)

where the matrix in (47) has full column rank. We have:

(48)
(49)
(50)
(51)

Since the gradient noise is independent of the iterate, we have

(52)

Moreover, since the risk is quadratic, it also holds that

(53)

Therefore, the long-term model is exactly the same as the original algorithm. For this example, we can calculate the following quantities:

(54)
(55)
(56)
(57)

In the special case when the columns of the matrix in (47) are orthogonal and normalized, we can simplify the MSD expression (41) by noting that

(58)

and, hence,

(59)

In order to provide further insight into this MSD expression, we simplify it under an additional approximation. We could introduce the Taylor series:

(60)

However, this approximation can be poor when the argument is large, which is not uncommon with large datasets. Instead, we appeal to:

(61)

Notice that (61) retains the hyperbolic form rather than truncating it, and is therefore a tighter approximation than (60) when the argument is large. Based on this, we further approximate:

(62)

and arrive at the simplified expression:

(63)

For comparison purposes, a simplified expression for the MSD under uniform sampling is given by [10]:

(64)

Hence, the random reshuffling case has an extra multiplicative factor:

(65)

We plot this factor in the left plot of Fig. 8 (a negligible term is ignored). It is clear from the figure that the smaller the step-size or the sample size, the larger the improvement in performance. In contrast, as the argument grows to infinity, the factor converges to one, i.e., the same performance as in the uniform sampling situation, which is consistent with the infinite-horizon case. Lastly, noting that

(66)

and using the approximation (61):

(67)

Substituting into (44), we get:

(68)

Since this function is monotonically increasing,