Many canonical machine learning problems boil down to solving a convex empirical risk minimization problem of the form
where each individual function is convex (e.g. the loss on a given example in the training data), and the set is convex. In large-scale applications, where both can be huge, a very popular approach is to employ stochastic gradient methods. Generally speaking, these methods maintain some iterate , and at each iteration, sample an individual function , and perform some update to based on . Since the update is with respect to a single function, this update is usually computationally cheap. Moreover, when the sampling is done independently and uniformly at random,
is an unbiased estimator of the true gradient, which allows for good convergence guarantees after a reasonable number of iterations (see for instance [33, 5, 9, 29]).
However, in practical implementations of such algorithms, it is actually quite common to use without-replacement sampling, or equivalently, pass sequentially over a random shuffling of the functions . Intuitively, this forces the algorithm to process more equally all data points, and often leads to better empirical performance. Moreover, without-replacement sampling is often easier and faster to implement, as it requires sequential data access, as opposed to the random data access required by with-replacement sampling (see for instance [6, 7, 17, 30, 19]).
1.1 What is Known so Far?
Unfortunately, without-replacement sampling is not covered well by current theory. The challenge is that unlike with-replacement sampling, the functions processed at every iteration are not statistically independent, and their correlations are difficult to analyze. Since this lack of theory is the main motivation for our paper, we describe the existing known results in some detail, before moving to our contributions.
To begin with, there exist classic convergence results which hold deterministically for every order in which the individual functions are processed, and in particular when we process them by sampling without replacement (e.g. ). However, these can be exponentially worse than those obtained using random without-replacement sampling, and this gap is inevitable (see for instance ).
More recently, Recht and Ré  studied this problem, attempting to show that at least for least squares optimization, without-replacement sampling is always better (or at least not substantially worse) than with-replacement sampling on a given dataset. They showed this reduces to a fundamental conjecture about arithmetic-mean inequalities for matrices, and provided partial results in that direction, such as when the individual functions themselves are assumed to be generated i.i.d. from some distribution. However, the general question remains open.
) based on random reshuffling: Each epoch, the algorithm draws a new permutation onuniformly at random, and processes the individual functions in that order. Under smoothness and strong convexity assumptions, the authors obtain convergence guarantees of essentially after epochs, vs. using with-replacement sampling (with the notation including certain dependencies on the problem parameters and data size). Thus, without-replacement sampling is shown to be strictly better than with-replacement sampling, after sufficiently many passes over the data. However, this leaves open the question of why without-replacement sampling works well after a few – or even just one – passes over the data. Indeed, this is often the regime at which stochastic gradient methods are most useful, do not require repeated data reshuffling, and their good convergence properties are well-understood in the with-replacement case.
1.2 Our Results
In this paper, we provide convergence guarantees for stochastic gradient methods, under several scenarios, in the natural regime where the number of passes over the data is small, and in particular that no data reshuffling is necessary. We emphasize that our goal here will be more modest than those of [30, 19]: Rather than show superiority of without-replacement sampling, we only show that it will not be significantly worse (in a worst-case sense) than with-replacement sampling. Nevertheless, such guarantees are novel, and still justify the use of with-replacement sampling, especially in situations where it is advantageous due to data access constraints or other reasons. Moreover, these results have a useful application in the context of distributed learning and optimization, as we will shortly describe.
Our main contributions can be summarized as follows:
For convex functions on some convex domain , we consider algorithms which perform a single pass over a random permutation of individual functions, and show that their suboptimality can be characterized by a combination of two quantities, each from a different field: First, the regret which the algorithm can attain in the setting of adversarial online convex optimization [10, 32, 20], and second, the transductive Rademacher complexity of with respect to the individual functions, a notion stemming from transductive learning theory [40, 16].
As a concrete application of the above, we show that if each function corresponds to a convex Lipschitz loss of a linear predictor, and the algorithm belongs to the class of algorithms which in the online setting attain regret on such functions (which includes, for example, stochastic gradient descent), then the suboptimality using without-replacement sampling, after processing functions, is . Up to constants, the guarantee is the same as that obtained using with-replacement sampling.
We turn to consider more specifically the stochastic gradient descent algorithm, and show that if the objective function is -strongly convex, and the functions are also smooth, then the suboptimality bound becomes , which matches the with-replacement guarantees (although with replacement, smoothness is not needed, and the dependence on some parameters hidden in the is somewhat better).
In recent years, a new set of fast stochastic algorithms to solve Eq. (1) has emerged, such as SAG , SDCA (as analyzed in [34, 35], SVRG , SAGA , Finito , S2GD  and others. These algorithms are characterized by cheap stochastic iterations, involving computations of individual function gradients, yet unlike traditional stochastic algorithms, enjoy a linear convergence rate (runtime scaling logarithmically with the required accuracy). To the best of our knowledge, all existing analyses require sampling with replacement. We consider a representative algorithm from this set, namely SVRG, and the problem of regularized least squares, and show that similar guarantees can be obtained using without-replacement sampling. This result has a potentially interesting implication: Under the mild assumption that the problem’s condition number is smaller than the data size, we get that SVRG can converge to an arbitrarily accurate solution (even up to machine precision), without the need to reshuffle the data – only a single shuffle at the beginning suffices. Thus, at least for this problem, we can obatin fast and high-quality solutions even if random data access is expensive.
A further application of the SVRG result is in the context of distributed learning: By simulating without-replacement SVRG on data randomly partitioned between several machines, we get a nearly-optimal algorithm for regularized least squares, in terms of communication and computational complexity, as long as the condition number is smaller than the data size per machine (up to logarithmic factors). This builds on the work of Lee et al. , who were the first to recognize the applicability of SVRG to distributed optimization. However, their results relied on with-replacement sampling, and are applicable only for much smaller condition numbers.
We note that our focus is on scenarios where no reshufflings are necessary. In particular, the and bounds apply for all , namely up to one full pass over a random permutation of the entire data. However, our techniques are also applicable to a constant number of passes, by randomly reshuffling the data after every pass. In a similar vein, our SVRG result can be readily extended to a situation where each epoch of the algorithm is done on an independent permutation of the data. We leave a full treatment of this to future work.
The paper is organized as follows. In Section 2, we introduce the basic notation and definitions used throughout the paper. In Section 3, we discuss the case of convex and Lipschitz functions. In Section 4, we consider strongly convex functions, and in Section 5, we provide the analysis for the SVRG algorithm. We provide full proofs in Section 6, and conclude with a summary and open problems in Section 7. Some technical proofs and results are relegated to appendices.
2 Preliminaries and Notation
We use bold-face symbols to denote vectors. Given a vector, denotes it’s -th coordinate. We utilize the standard notation to hide constants, and to hide constants and logarithmic factors.
Given convex functions from to , we define our objective function as
with some fixed optimum . In machine learning applications, each individual
usually corresponds to a loss with respect to a data point, hence will use the terms “individual function”, “loss function” and “data point” interchangeably throughout the paper.
We let be a permutation over chosen uniformly at random. In much of the paper, we consider methods which draw loss functions without replacement according to that permutation (that is, ). We will use the shorthand notation
to denote the average loss with respect to the first and last loss functions respectively, as ordered by the permutation (intuitively, the losses in are those already observed by the algorithm at the beginning of iteration , whereas the losses in are those not yet observed). We use the convention that , and the same goes for other expressions of the form throughout the paper, when . We also define quantities such as and similarly.
A function is -strongly convex, if for any ,
where is any (sub)-gradient of at . Note that any convex function is -strongly convex. We also say a function is -smooth if for any ,
-smoothness also implies that the function is differentiable, and its gradient is -Lipschitz. Based on these properties, it is easy to verify that if , and is -strongly convex and -smooth, then
We will also require the notion of trandsuctive Rademacher complexity, as developed by El-Yaniv and Pechyony [16, Definition 1], with a slightly different notation adapted to our setting:
Let be a set of vectors in . Let be positive integers such that , and denote . We define the transductive Rademacher Complexity as
where are i.i.d. random variables such that
are i.i.d. random variables such that
This quantity is an important parameter is studying the richness of the set , and will prove crucial in providing some of the convergence results presented later on. Note that it differs slightly from standard Rademacher complexity, which is used in the theory of learning from i.i.d. data (e.g. [3, 27, 33], where the Rademacher variables only take values, and is replaced by ).
3 Convex Lipschitz Functions
We begin by considering loss functions which are convex and -Lipschitz over some convex domain . We assume the algorithm sequentially goes over some permuted ordering of the losses, and before processing the -th loss function, produces an iterate . Moreover, we assume that the algorithm has a regret bound in the adversarial online setting, namely that for any sequence of convex Lipschitz losses , and any ,
for some scaling sub-linearly in 111For simplicity, we assume the algorithm is deterministic given , but all results in this section also hold for randomized algorithms (in expectation over the algorithm’s randomness), assuming the expected regret of the algorithm w.r.t. any is at most .. For example, online gradient descent (which is equivalent to stochastic gradient descent when the losses are i.i.d.), with a suitable step size, satisfies , where is the Lipschitz parameter and upper bounds the norm of any vector in . A similar regret bound can also be shown for other online algorithms (see [20, 32, 41]).
Since the ideas used for analyzing this setting will also be used in the more complicated results which follow, we provide the analysis in some detail. We first have the following straightforward theorem, which upper bounds the expected suboptimality in terms of regret and the expected difference between the average loss on prefixes and suffixes of the data.
Suppose the algorithm has a regret bound , and sequentially processes where is a random permutation on . Then in expectation over ,
The left hand side in the inequality above can be interpreted as an expected bound on , where is drawn uniformly at random from . Alternatively, by Jensen’s inequality and the fact that is convex, the same bound also applies on , where .
The proof of the theorem relies on the following simple but key lemma, which expresses the expected difference between with-replacement and without-replacement sampling in an alternative form, similar to Thm. 1 and one which lends itself to tools and ideas from transductive learning theory. This lemma will be used in proving all our main results.
Let be a permutation over chosen uniformly at random. Let be random variables which conditioned on , are independent of . Then
for , and for we have .
Proof of Thm. 1.
Adding and subtracting terms, and using the facts that is a permutation chosen uniformly at random, and is fixed,
Applying the regret bound assumption on the sequence of losses , the above is at most
Since (as a random variable over the permutation of the data) depends only on , we can use Lemma 1 (where , and noting that the expectation above is when ), and get that the above equals
Having reduced the expected suboptimality to the expected difference , the next step is to upper bound it with : Namely, having split our loss functions at random to two groups of size and , how large can be the difference between the average loss of any on the two groups? Such uniform convergence bounds are exactly the type studied in transductive learning theory, where a fixed dataset is randomly split to a training set and a test set, and we consider the generalization performance of learning algorithms ran on the training set. Such results can be provided in terms of the transductive Rademacher complexity of , and combined with Thm. 1, lead to the following bound in our setting:
Suppose that each is chosen from a fixed domain , that the algorithm enjoys a regret bound , and that . Then in expectation over the random permutation ,
Thus, we obtained a generic bound which depends on the online learning characteristics of the algorithm, as well as the statistical learning properties of the class on the loss functions. The proof (as the proofs of all our results from now on) appears in Section 6.
We now instantiate the theorem to the prototypical case of bounded-norm linear predictors, where the loss is some convex and Lipschitz function of the prediction of a predictor on an instance , possibly with some regularization:
Under the conditions of Thm. 2, suppose that , and each loss function has the form for some -Lipschitz , , and a fixed function . Then
As discussed earlier, in the setting of Corollary 1, typical regret bounds are on the order of . Thus, the expected suboptimality is , all the way up to (i.e. a full pass over a random permutation of the data). Up to constants, this is the same as the suboptimality attained by iterations of with-replacement sampling, using stochastic gradient descent or similar algorithms .
4 Faster Convergence for Strongly Convex Functions
We now consider more specifically the stochastic gradient descent algorithm on a convex domain , which can be described as follows: We initialize at some , and perform the update steps
where are fixed step sizes, is projection on , and is a subgradient of at . Moreover, we assume the objective function is -strongly convex for some . In this scenario, using with-replacement sampling (i.e. is a subgradient of an independently drawn ), performing iterations as above and returning a randomly sampled iterate or their average results in expected suboptimality , where is an upper bound on the expected squared norm of [29, 33]. Here, we study a similar situation in the without-replacement case.
In the result below, we will consider specifically the case where each is a Lipschitz and smooth loss of a linear predictor , possibly with some regularization. The smoothness assumption is needed to get a good bound on the transductive Rademacher complexity of quantities such as . However, the technique can be potentially applied to more general cases.
Suppose has diameter , and that is -strongly convex on . Assume that where , is possibly some regularization term, and each is -Lipschitz and -smooth on . Furthermore, suppose . Then for any , if we run SGD for iterations with step size , we have
where is a universal positive constant.
As in the results of the previous section, the left hand side is the expected optimality of a single where is chosen uniformly at random, or an upper bound on the expected suboptimality of the average . This result is similar to the with-replacement case, up to numerical constants and the additional term in the numerator. We note that in some cases, is the dominant term anyway222 is generally on the order of , which is the same as if is the dominant term. This happens for instance with the squared loss, whose Lipschitz parameter is on the order of .. However, it is not clear that the term is necessary, and removing it is left to future work.
Remark 1 ( Factor).
It is possible to remove the logarithmic factors in Thm. 3, by considering not , but rather only an average over a suffix of the iterates ( for some fixed ), or by a weighted averaging scheme. This has been shown in the case of with-replacement sampling ([29, 24, 38]), and the same proof techniques can be applied here.
The proof of Thm. 3 is somewhat more challenging than the results of the previous section, since we are attempting to get a faster rate of rather than , all the way up to . A significant technical obstacle is that our proof technique relies on concentration of averages around expectations, which on samples does not go down faster than . To overcome this, we apply concentration results not on the function values (i.e. as in the previous section), but rather on gradient-iterate inner products, i.e. , where is the optimal solution. To get good bounds, we need to assume these gradients have a certain structure, which is why we need to make the assumption in the theorem about the form of each . Using transductive Rademacher complexity tools, we manage to upper bound the expectation of these inner products by quantities roughly of the form (assuming here for simplicity). We now utilize the fact that in the strongly convex case, itself decreases to zero with at a certain rate, to get fast rates decreasing as . The full proof appears in Subsection 6.5.
5 Without-Replacement SVRG for Least Squares
In this section, we will consider a more sophisticated stochastic gradient approach, namely the SVRG algorithm of , designed to solve optimization problems with a finite sum structure as in Eq. (1). Unlike purely stochastic gradient procedures, this algorithm does require several passes over the data. However, assuming the condition number is smaller than the data size (where is the smoothness of each , and is the strong convexity parameter of ), only gradient evaluations are required to get an -accurate solution, for any . Thus, we can get a high-accuracy solution after the equivalent of a small number of data passes. As discussed in the introduction, over the past few years several other algorithms have been introduced and shown to have such a behavior. We will focus on the algorithm in its basic form, where the domain is unconstrained and equals .
The existing analysis of SVRG () assumes stochastic iterations, which involves sampling the data with replacement. Thus, it is natural to consider whether a similar convergence behavior occurs using without-replacement sampling. As we shall see, a positive reply has at least two implications: The first is that as long as the condition number is smaller than the data size, SVRG can be used to obtain a high-accuracy solution, without the need to reshuffle the data: Only a single shuffle at the beginning suffices, and the algorithm terminates after a small number of sequential passes (logarithmic in the required accuracy). The second implication is that such without-replacement SVRG can be used to get a nearly-optimal algorithm for convex distributed learning and optimization on randomly partitioned data, as long as the condition number is smaller than the data size at each machine.
The SVRG algorithm using without-replacement sampling on a dataset of size is described as Algorithm 1. It is composed of multiple epochs (indexed by ), each involving one gradient computation on the entire dataset, and stochastic iterations, involving gradient computations with respect to individual data points. Although the gradient computation on the entire dataset is expensive, it is only needed to be done once per epoch. Since the algorithm will be shown to have linear convergence as a function of the number of epochs, this requires only a small (logarithmic) number of passes over the data.
We will consider specifically the regularized least mean squares problem, where
for some and . Moreover, we assume that is -strongly convex (note that necessarily ). For convenience, we will assume that are all at most (this is without much loss of generality, since we can always re-scale the loss functions by an appropriate factor). Note that under this assumption, each as well as are also -smooth.
There is some universal constant , such that for any and any , if we run algorithm 1, using parameters satisfying
then after epochs of the algorithm, we obtain a point for which
In particular, by taking and , the algorithm attains an -accurate solution after stochastic iterations of without-replacement sampling, and sequential passes over the data to compute gradients of . This implies that as long as (which stands for the condition number of the problem) is smaller than , the number of without-replacement stochastic iterations is smaller than the data size . Therefore, assuming the data is randomly shuffled, we can get a solution using only sequential passes over the data, without the need to reshuffle.
We make a few remarks about the result:
Remark 2 ( Parameter).
The condition that with probability is needed for technical reasons in our analysis, which requires some uniform upper bound on . That being said, only appears inside log factors, so even a very crude bound would suffice. In appendix B, we indeed show that under the theorem’s conditions, there is always a valid satisfying , which can be plugged into Thm. 4. Even then, we conjecture that this bound on can be significantly improved, or perhaps even removed altogether from the analysis.
Remark 3 ( Factor).
The logarithmic dependence on the dimension is due to an application of a matrix Bernstein inequality for without-replacement sampling on matrices. However, with some additional effort, it might be possible to shave off this log factor and get a fully dimension-free result, by deriving an appropriate matrix concentration result depending on the intrinsic dimension (see Section 7 in  for derivations in the with-replacement setting). We did not systematically pursue this, as it is not the focus of our work.
Remark 4 (Extension to other losses).
Our proof applies as-is in the slightly more general setting where each is of the form , where are bounded and is a rank-1 positive semidefinite matrix (i.e. of the form for some vector ). The rank-1 assumption is required in one step of the proof, when deriving Eq. (14). Using a somewhat different analysis, we can also derive a convergence bound for general strongly convex and smooth losses, but only under the condition that the epoch size scales as (rather than ), which means that the strong convexity parameter can be at most (ignoring logarithmic factors). Unfortunately, in the context of without-replacement sampling, this is a somewhat trivial regime: For such , the analysis of with-replacement SVRG requires sampling less than individual loss functions for convergence. But by the birthday paradox, with-replacement and without-replacement sampling are then statistically indistinguishable, so we could just argue directly that the expected suboptimality of without-replacement SVRG would be similar to that of with-replacement SVRG.
The proof of Thm. 4 (in Subsection 6.6) uses ideas similar to those employed in the previous section, but is yet again more technically challenging. Recall that in the previous section, we were able to upper bound certain inner products by a quantity involving both the iteration number and , namely the Euclidean distance of the iterate to the optimal solution. To get Thm. 4, this is not good enough, and we need to upper bound similar inner products directly in terms of the suboptimality , as well as . We are currently able to get a satisfactory result for least squares, where the Hessians of the objective function are fixed, and the concentration tools required boil down to concentration of certain stochastic matrices, without the need to apply transductive Rademacher complexity results. The full proof appears in Subsection 6.6.
5.1 Application of Without-Replacement SVRG to distributed learning
An important variant of the problems we discussed so far is when training data (or equivalently, the individual loss functions ) are split between different machines, who need to communicate in order to reach a good solution. This naturally models situations where data is too large to fit at a single machine, or where we wish to speed up the optimization process by parallelizing it on several computers. There has been much research devoted to this question in the past few years, with some examples including [43, 4, 2, 46, 1, 8, 26, 42, 14, 11, 15, 21, 37, 36, 44, 25].
A substantial number of these works focus on the setting where the data is split equally at random between machines (or equivalently, that data is assigned to each machine by sampling without replacement from )). Intuitively, this creates statistical similarities between the data at different machines, which can be leveraged to aid the optimization process. Recently, Lee et al.  proposed a simple and elegant solution, at least in certain parameter regimes. This is based on the observation that SVRG, according to its with-replacement analysis, requires epochs, where in each epoch one needs to compute an exact gradient of the objective function , and gradients of individual losses chosen uniformly at random (where is the required suboptimality, is the smoothness parameter of each , and is the strong convexity parameter of ). Therefore, if each machine had i.i.d. samples from , whose union cover , the machines could just simulate SVRG: First, each machine splits its data to batches of size . Then, each SVRG epoch is simulated by the machines computing a gradient of , involving a fully parallel computation and one communication round333Specifically, each machine computes a gradient with respect to an average of a different subset of , which can be done in parallel, and then perform distributed averaging to get a gradient with respect to . For simplicity we assume a broadcast model where this requires a single communication round, but this can be easily generalized., and one machine computing gradients with respect to one of its batches. Overall, this would require communication rounds, and runtime, where is the number of data points per machine (ignoring communication time, and assuming constant time to compute a gradient of ). Under the reasonable assumption that the condition number is less than , this requires runtime linear in the data size per machine, and logarithmic in the target accuracy , with only a logarithmic number of communication rounds. Up to logarithmic factors, this is essentially the best one can hope for with a distributed algorithm444Moreover, a lower bound in  implies that under very mild conditions, the number of required communication rounds is at least . Thus, a logarithmic number of communication rounds is unlikely to be possible once is significantly larger than ..
Unfortunately, the reasoning above crucially relies on each machine having access to i.i.d. samples, which can be reasonable in some cases, but is different than the more common assumption that the data is randomly split among the machines. To circumvent this issue,  propose to communicate individual data points / losses between machines, so as to simulate i.i.d. sampling. However, by the birthday paradox, this only works well in the regime where the overall number of samples required () is not much larger than . Otherwise, with-replacement and without-replacement sampling becomes statistically very different, and a large number of data points would need to be communicated. In other words, if communication is an expensive resource, then the solution proposed in  only works well up to condition number being roughly . In machine learning applications, the strong convexity parameter often comes from explicit regularization designed to prevent over-fitting, and needs to scale with the data size, usually so that is between and . Thus, the solution above is communication-efficient only when is relatively small
However, the situation immediately improves if we can use a without-replacement version of SVRG, which can easily be simulated with randomly partitioned data: The stochastic batches can now be simply subsets of each machine’s data, which are statistically identical to sampling without replacement. Thus, no data points need to be sent across machines, even if is large. We present the pseudocode as Algorithm 2.
Let us consider the analysis of no-replacement SVRG provided in Thm. 4. According to this analysis, by setting , then as long as the total number of batches is at least , and , then Algorithm 2 will attain an -suboptimal solution in expectation. In other words, without any additional communication, we extend the applicability of distributed SVRG (at least for regularized least squares) from the regime to , which covers most scenarios relevant to machine learning.
Before providing the proofs of our main theorems, we develop in Subsection 6.1 below some important technical results on transductive Rademacher complexity, required in some of the proofs.
6.1 Results on Transductive Rademacher Complexity
In this subsection, we develop a few important tools and result about Rademacher complexity, that will come handy in our analysis. We begin by the following theorem, which is a straightforward corollary of Theorem 1 from  (attained by simplifying and upper-bounding some of the parameters).
Suppose for some . Let be a permutation over chosen uniformly at random, and define , . Then for any , with probability at least ,
We note that the theorem in  actually bounds , but the proof is completely symmetric to the roles of and , and hence also implies the formulation above.
We will also need the well-known contraction property, which states that the Rademacher complexity of a class of vectors can only increase by a factor of if we apply on each coordinate a fixed -Lipschitz function:
Let are real-valued, -Lipschitz functions, and given some , define . Then
This is a slight generalization of Lemma 5 from  (which is stated for , but the proof is exactly the same).
In our analysis, we will actually only need bounds on the expectation of , which is weaker than what Thm. 5 provides. Although such a bound can be developed from scratch, we find it more convenient to simply get such a bound from Thm. 5. Specifically, combining Thm. 5 and Lemma 7 from Appendix A, we have the following straightforward corollary:
Suppose for some . Let be a permutation over chosen uniformly at random, and define , . Then
Moreover, if for any permutation , then
We now turn to collect a few other structural results, which will be useful when studying the Rademacher complexity of linear predictors or loss gradients derived from such predictors.
Given two sets of vectors , for some , define
The proof resembles the proof of the contraction inequality for standard Rademacher complexity (see for instance Lemma 26.9 in ).
By definition of , it is enough to prove that
since the right hand side can be upper bounded by . To get this, we will treat the coordinates one-by-one, starting with the first coordinate and showing that
Repeating the same argument for coordinates will yield Eq. (3).
For any values and in the coordinates of some and respectively, we have
Recalling that are i.i.d. and take values of and with probability (and otherwise), we can write the left hand side of Eq. (4) as