# Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work sharply analyzes: (1) mini-batching, a method of averaging many samples of the gradient, both to reduce the variance of a stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method that averages the final few iterates of SGD in order to decrease the variance in SGD's final iterate. This work presents the first tight non-asymptotic generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. These results are utilized in providing a highly parallelizable SGD algorithm that obtains the optimal statistical error rate with nearly the same number of serial updates as batch gradient descent, which improves significantly over existing SGD-style methods. Finally, this work sheds light on some fundamental differences in SGD's behavior when dealing with agnostic noise in the (non-realizable) least squares regression problem. In particular, the work shows that the stepsizes that ensure optimal statistical error rates for the agnostic case must be a function of the noise properties. The central analysis tools used by this paper are obtained by generalizing the operator view of averaged SGD introduced by Défossez and Bach (2015), followed by a novel analysis that bounds these operators to characterize the generalization error. These techniques may be of broader interest in analyzing various computational aspects of stochastic approximation.


## 1 Introduction and Problem Setup

With the ever increasing size of modern day datasets, practical algorithms for machine learning are increasingly constrained to spend less time and use less memory. This makes it particularly desirable to employ simple streaming algorithms that generalize well in a few passes over the dataset.

Stochastic gradient descent (SGD) is perhaps the simplest and most well studied algorithm that meets these constraints. The algorithm repeatedly samples an instance from the stream of data and updates the current parameter estimate using the gradient of the sampled instance. Despite its simplicity, SGD has been immensely successful and is the de-facto method for large scale learning problems. The merits of SGD for large scale learning and the associated computation versus statistics tradeoffs is discussed in detail by the seminal work of Bottou and Bousquet (2007).

While a powerful machine learning tool, SGD in its simplest forms is inherently serial. Over the past years, as dataset sizes have grown, there have been remarkable developments in processing capabilities, with multi-core/distributed/GPU computing infrastructure available in abundance. The presence of this computing power has triggered the development of parallel/distributed machine learning algorithms (Mann et al. (2009); Zinkevich et al. (2011); Bradley et al. (2011); Niu et al. (2011); Li et al. (2014); Zhang and Xiao (2015)) that possess the capability to utilize multiple cores/machines. However, despite this exciting line of work, it is yet unclear how to best parallelize SGD and fully utilize these computing infrastructures.

This paper takes a step towards answering this question, by characterizing the behavior of constant stepsize SGD for the problem of strongly convex stochastic least squares regression (LSR) under two averaging schemes widely believed to improve the performance of SGD. In particular, this work considers the natural parallelization technique of mini-batching, where multiple data points are processed simultaneously and the current iterate is updated using the average gradient over these samples, and combines it with the variance-reducing technique of tail-averaging, where the average of many of the final iterates is returned as SGD's estimate of the solution.

In this work, parallelization arguments are structured through the lens of a work-depth tradeoff: work refers to the total computation required to reach a certain generalization error, and depth refers to the number of serial updates. Depth, defined in this manner, is a reasonable estimate of the runtime of the algorithm on a large multi-core architecture with shared memory, where there is no communication overhead, and has strong implications for parallelizability on other architectures.

### 1.1 Problem Setup and Notations

We use boldface small letters (x, w, etc.) for vectors, boldface capital letters (H, Σ, etc.) for matrices and normal script font letters (M, etc.) for tensors. We use ⊗ to denote the outer product of two vectors or matrices. The Loewner ordering between two PSD matrices is represented using ⪯, i.e., A ⪯ B if and only if B − A is positive semi-definite.

This paper considers the stochastic approximation problem of Least Squares Regression (LSR). Let L(w) be the expected square loss over tuples (x, y) sampled from a distribution D:

$$L(w) = \frac{1}{2}\cdot\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(y-\langle w,x\rangle)^2\big] \quad \forall\, w\in\mathbb{R}^d. \tag{1}$$

Let w* be a minimizer of the problem (1). Now, let the Hessian of the problem (1) be denoted as:

$$H \;\overset{\mathrm{def}}{=}\; \nabla^2 L(w) = \mathbb{E}\big[xx^\top\big].$$

Next, we define the fourth moment tensor M of the inputs as:

$$M \;\overset{\mathrm{def}}{=}\; \mathbb{E}\big[x\otimes x\otimes x\otimes x\big].$$

Let the noise ε_{x,y} in a sample (x, y) with respect to the minimizer w* of (1) be denoted as:

$$\epsilon_{x,y} \;\overset{\mathrm{def}}{=}\; y - \langle w^*, x\rangle.$$

Finally, let the noise covariance matrix Σ be denoted as:

$$\Sigma \;\overset{\mathrm{def}}{=}\; \mathbb{E}\big[\epsilon_{x,y}^2\, xx^\top\big].$$

The homoscedastic (or, additive noise/well-specified) case of LSR refers to the case when ε_{x,y} is mutually independent of x. This is the case, say, when y = ⟨w*, x⟩ + ε, with ε sampled from a Gaussian, independent of x. In this case, Σ = σ²H, where σ² = E[ε²]; the subscript on ε is suppressed owing to the independence of ε from any sample x. On the other hand, the heteroscedastic (or, mis-specified) case refers to the setting when ε_{x,y} is correlated with the input x. In this paper, all our results apply to the general mis-specified case of the LSR problem.
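As a quick numerical check (on a hypothetical Gaussian instance, not one from the paper), the sketch below estimates Σ from samples and verifies that it matches σ²H under additive noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 200_000, 0.5

# A toy well-specified (homoscedastic) instance:
# x ~ N(0, H) for a fixed covariance H, and y = <w*, x> + eps with
# eps ~ N(0, sigma^2) drawn independently of x.
H = np.diag([1.0, 0.5, 0.1])
w_star = np.array([1.0, -2.0, 3.0])
x = rng.multivariate_normal(np.zeros(d), H, size=n)
eps = rng.normal(0.0, sigma, size=n)
y = x @ w_star + eps

# Empirical noise covariance Sigma = E[eps_{x,y}^2 * x x^T],
# where eps_{x,y} = y - <w*, x>.
resid = y - x @ w_star
Sigma_hat = (x * (resid ** 2)[:, None]).T @ x / n

# Under additive noise, Sigma = sigma^2 * H.
print(np.round(Sigma_hat, 3))
```

In a mis-specified instance the residual ε_{x,y} would correlate with x, and Sigma_hat would no longer be a multiple of H.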

#### 1.1.1 Assumptions

We make the following assumptions about the problem.

• Finite fourth moment: The fourth moment tensor M = E[x⊗x⊗x⊗x] exists and is finite.

• Strong convexity: The Hessian H of (1) is positive definite, i.e., H ≻ 0.

The finite fourth moment condition is a standard regularity assumption for the analysis of SGD and related algorithms. Strong convexity is also a standard assumption and guarantees that the minimizer of (1), i.e., w*, is unique.

#### 1.1.2 Important Quantities

In this section, we will introduce some important quantities required to present our results. Let I denote the d×d identity matrix. For any matrix A, let ‖A‖₂ denote its spectral norm. Let H_L and H_R represent the left and right multiplication operators of the matrix H, so that for any matrix A, we have H_L A = HA and H_R A = AH.

• Fourth moment bound: Let R² be the smallest number such that E[‖x‖²·xx⊤] ⪯ R²·H.

• Smallest eigenvalue: Let μ be the smallest eigenvalue of H, i.e., μ = λ_min(H).

The fourth moment bound implies that R² ≥ Tr(H) ≥ ‖H‖₂. Furthermore, strong convexity implies that the smallest eigenvalue μ of H is strictly greater than zero (μ > 0).
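For a discrete input distribution (a hypothetical example), both quantities are computable in closed form; one way to obtain R² is as the top eigenvalue of H^{-1/2} E[‖x‖² xx⊤] H^{-1/2}:

```python
import numpy as np

# A hypothetical discrete input distribution: x takes the value xs[i]
# with probability ps[i], so all moments below are exact finite sums.
xs = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ps = np.array([0.25, 0.25, 0.5])

H = sum(p * np.outer(x, x) for p, x in zip(ps, xs))             # E[x x^T]
M4 = sum(p * (x @ x) * np.outer(x, x) for p, x in zip(ps, xs))  # E[||x||^2 x x^T]

# R^2: smallest number with E[||x||^2 x x^T] <= R^2 H in the Loewner order,
# i.e., the top eigenvalue of H^{-1/2} M4 H^{-1/2}.
C = np.linalg.cholesky(H)            # H = C C^T
Cinv = np.linalg.inv(C)
R2 = np.linalg.eigvalsh(Cinv @ M4 @ Cinv.T).max()
mu = np.linalg.eigvalsh(H).min()     # smallest eigenvalue of H

print(R2, np.trace(H), mu)           # R^2 >= Tr(H) >= ||H||_2 and mu > 0
```

The eigenvalues of C⁻¹ M4 C⁻ᵀ coincide with those of H^{-1/2} M4 H^{-1/2}, so the Cholesky factor works just as well as a symmetric square root here.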

#### 1.1.3 Stochastic Gradient Descent: Mini-Batching and Iterate Averaging

In this paper, we work with a stochastic first order oracle. This oracle, when queried at w, samples an instance (x, y) ∼ D and uses it to return an unbiased estimate of the gradient of L(·):

$$\widehat{\nabla}L(w) = -(y-\langle w,x\rangle)\cdot x; \qquad \mathbb{E}\big[\widehat{\nabla}L(w)\big] = \nabla L(w).$$

We consider the stochastic gradient descent (SGD) method (Robbins and Monro, 1951), which minimizes L(·) by following the direction opposite to this noisy stochastic gradient estimate, i.e.:

$$w_t = w_{t-1} - \gamma\cdot\widehat{\nabla}L_t(w_{t-1}), \quad \text{with } \widehat{\nabla}L_t(w_{t-1}) = -(y_t - \langle w_{t-1}, x_t\rangle)\cdot x_t,$$

with γ being a constant step size/learning rate; ∇̂L_t(·) is the stochastic gradient evaluated at w_{t−1} using the sample (x_t, y_t). We consider two algorithmic primitives used in conjunction with SGD, namely mini-batching and tail-averaging (also referred to as iterate/suffix averaging).

Mini-batching involves querying the gradient oracle several times and using the average of the returned stochastic gradients to take a single step. That is,

$$w_t = w_{t-1} - \gamma\cdot\Big(\frac{1}{b}\sum_{i=1}^{b}\widehat{\nabla}L_{t,i}(w_{t-1})\Big),$$

where b is the batch size. Note that at iteration t, mini-batching involves repeatedly querying the stochastic gradient oracle at w_{t−1} a total of b times. For each query i at iteration t, the oracle samples an instance (x_{t,i}, y_{t,i}) ∼ D and returns a stochastic gradient estimate ∇̂L_{t,i}(w_{t−1}). These estimates are averaged and then used to perform a single step from w_{t−1} to w_t. Mini-batching enables the possibility of parallelization owing to the use of cheap matrix-vector multiplications for computing stochastic gradient estimates. Furthermore, mini-batching allows for a possible reduction of variance owing to the effect of averaging several stochastic gradient estimates.

Tail-averaging (or suffix averaging) refers to returning the average of the final few iterates of a stochastic gradient method as a means to improve its variance properties (Ruppert, 1988; Polyak and Juditsky, 1992). In particular, assuming the stochastic gradient method is run for n steps, tail-averaging involves returning

$$\bar{w} = \frac{1}{n-s}\sum_{t=s+1}^{n} w_t$$

as an estimate of w*. Note that s can be interpreted as being c·n, with c < 1 being some constant.
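Putting the two primitives together, here is a minimal NumPy sketch of mini-batch, tail-averaged SGD on a toy well-specified instance (the sampler, step size, and iteration counts are illustrative choices, not the paper's Algorithm 1 verbatim):

```python
import numpy as np

def minibatch_tail_avg_sgd(sample, w0, gamma, b, n_iters, s):
    """Constant step-size SGD on the square loss with batch size b;
    returns the average of iterates s+1, ..., n_iters."""
    w = w0.copy()
    w_sum = np.zeros_like(w0)
    for t in range(1, n_iters + 1):
        X, y = sample(b)                 # b fresh samples from D
        grad = -(y - X @ w) @ X / b      # average of -(y_i - <w, x_i>) x_i
        w = w - gamma * grad
        if t > s:
            w_sum += w
    return w_sum / (n_iters - s)

# Toy instance: x ~ N(0, I_d), y = <w*, x> + additive Gaussian noise.
rng = np.random.default_rng(1)
w_star = np.array([1.0, -1.0, 2.0, 0.5])
def sample(b):
    X = rng.normal(size=(b, len(w_star)))
    return X, X @ w_star + 0.1 * rng.normal(size=b)

w_bar = minibatch_tail_avg_sgd(sample, np.zeros(4), gamma=0.1, b=8,
                               n_iters=2000, s=1000)
print(np.round(w_bar, 3))
```

The run consumes b·n_iters fresh samples in total; the bias contracts during the s unaveraged iterations, and the tail-average over the remaining iterates drives down the variance, landing w̄ close to w*.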

Typical excess risk bounds (or, generalization error bounds) for the stochastic approximation problem involve the contribution of two error terms, namely: (i) the bias, which refers to the dependence on the initial conditions (i.e., on w₀ through the initial excess risk L(w₀) − L(w*)), and (ii) the variance, which refers to the dependence on the noise introduced by the use of a stochastic first order oracle.

#### 1.1.4 Optimal Error Rates for the Stochastic Approximation problem

Under standard regularity conditions often employed in the statistics literature, the minimax optimal rate on the excess risk is achieved by the standard Empirical Risk Minimizer (or, Maximum Likelihood Estimator) (Lehmann and Casella, 1998; van der Vaart, 2000). Given n i.i.d. samples Sₙ = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ drawn from D, define the empirical risk minimization problem as obtaining

$$w^*_n = \operatorname*{argmin}_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - \langle w, x_i\rangle\big)^2.$$

Let us define the noise variance σ̂²_MLE to represent

$$\hat{\sigma}^2_{\mathrm{MLE}} \;\overset{\mathrm{def}}{=}\; \mathbb{E}\big[\|\widehat{\nabla}L(w^*)\|^2_{H^{-1}}\big] = \operatorname{Tr}\big(H^{-1}\Sigma\big).$$

The asymptotic minimax rate of the Empirical Risk Minimizer on every problem instance is σ̂²_MLE/n (Lehmann and Casella, 1998; van der Vaart, 2000), i.e.,

$$\lim_{n\to\infty} \frac{\mathbb{E}_{S_n}\big[L(w^*_n)\big] - L(w^*)}{\hat{\sigma}^2_{\mathrm{MLE}}/n} = 1.$$

For the well-specified case (i.e., the additive noise case, where Σ = σ²H), we have σ̂²_MLE = dσ². Seminal works of Ruppert (1988) and Polyak and Juditsky (1992) prove that tail-averaged SGD, with averaging from the start (s = 0), achieves the minimax rate for the well-specified case in the limit of n → ∞.
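A small sketch (with an assumed diagonal H, purely for illustration) shows how σ̂²_MLE = Tr(H⁻¹Σ) behaves in the well-specified case versus a mis-specified case whose noise concentrates on the smallest-eigenvalue direction of H:

```python
import numpy as np

d, sigma2 = 3, 0.25
H = np.diag([2.0, 1.0, 0.5])

# Well-specified: Sigma = sigma^2 * H, so Tr(H^{-1} Sigma) = d * sigma^2.
Sigma_well = sigma2 * H
s2_well = np.trace(np.linalg.solve(H, Sigma_well))

# Mis-specified: same total noise energy Tr(Sigma), but concentrated on the
# direction where H has its smallest eigenvalue; sigma_MLE^2 is inflated.
Sigma_mis = np.diag([0.0, 0.0, np.trace(Sigma_well)])
s2_mis = np.trace(np.linalg.solve(H, Sigma_mis))

print(s2_well, s2_mis)  # 0.75 vs 1.75
```

The same total noise is statistically harder when it aligns with poorly conditioned directions of H, which is exactly the regime where model mismatch matters.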

Goal: In this paper, we seek to provide a non-asymptotic understanding of (a) mini-batching and issues of learning rate versus batch-size, (b) tail-averaging, (c) the effect of the model mis-specification, (d) a batch size doubling scheme for parallelizing statistical estimation, (e) a communication efficient parallelization scheme namely, parameter-mixing/model averaging and (f) the behavior of learning rate versus batch size on the final iterate of the mini-batch SGD procedure, on the behavior of excess risk of SGD (in terms of both the bias and the variance terms) for the streaming LSR problem, with the goal of achieving the minimax rate on every problem instance.

### 1.2 This Paper’s Contributions

The main contributions of this paper are as follows:

• This work shows that mini-batching yields near-linear parallelization speedups over the standard serial SGD (i.e., with batch size 1), as long as the mini-batch size is smaller than a problem dependent quantity (which we denote by b_thresh). When batch sizes increase beyond b_thresh, mini-batching is inefficient (owing to the lack of serial updates), thus obtaining only sub-linear speedups over mini-batching with a batch size b_thresh. A by-product of this analysis sheds light on how the step sizes naturally interpolate from ones used by standard serial SGD (with batch size 1) to ones used by batch gradient descent.

• While the final iterate of SGD decays the bias at a geometric rate but does not obtain minimax rates on the variance, the averaged iterate (Polyak and Juditsky, 1992; Défossez and Bach, 2015) decays the bias at a sublinear rate while achieving minimax rates on the variance. This work rigorously shows that tail-averaging obtains the best of both worlds: decaying the bias at a geometric rate and obtaining near-minimax rates (up to constants) on the variance. This result corroborates empirical findings (Merity et al., 2017) that indicate the benefits of tail-averaging in general contexts such as training long short-term memory models (LSTMs).

• Next, this paper precisely characterizes the tradeoff of learning rate versus batch size and its effect on the excess risk of the final iterate of an SGD procedure, which provides theoretical evidence for empirical observations (Goyal et al., 2017; Smith et al., 2017) described in the context of deep learning and non-convex optimization.

• Combining the above results, this paper provides a mini-batching and tail-averaging version of SGD that is highly parallelizable: the number of serial steps (which is a proxy for the un-parallelizable time) of this algorithm nearly matches that of offline gradient descent and is lower than the serial time of all existing streaming LSR algorithms. See Table 1 for comparison. We note that these results are obtained by providing a tight finite-sample analysis of the effects of mini-batching and tail-averaging with large constant learning rate schemes.

• We provide a non-asymptotic analysis of parameter mixing/model averaging schemes for the streaming LSR problem. Model averaging schemes are an attractive proposition for distributed learning owing to their communication efficient nature, and they are particularly effective in the regime where the estimation error (i.e., variance) is the dominating term in the excess risk. Here, we characterize the excess risk (in terms of both the bias and the variance) of the model averaging procedure, which sheds light on the situations in which it is an effective parallelization scheme (i.e., when this scheme yields linear parallelization speedups).

• All the results in this paper are established for the general mis-specified case of the streaming LSR problem. This establishes a fundamental difference in the behavior of SGD when dealing with mis-specified models in contrast to existing analyses that deal with the well-specified case. In particular, this analysis reveals a surprising insight that the maximal stepsizes (that ensure minimax optimal rates) are a function of the noise properties of the mis-specified problem instance. The main takeaway of this analysis is that the maximal step sizes (that permit achieving minimax rates) for the mis-specified case can be much lower than ones employed in the well-specified case: indeed, a problem instance that yields such a separation between the maximal learning rates for the well specified and the mis-specified case is presented.

The tool employed in obtaining these results generalizes the operator view of averaged SGD with batch size 1 (Défossez and Bach, 2015), together with a clear exposition of the bias-variance decomposition from Jain et al. (2017a), to obtain a sharp bound on the excess risk for mini-batch, tail-averaged constant step-size SGD. Note that the work of Défossez and Bach (2015) does not establish minimax rates while working with large constant step sizes; this shortcoming is remedied by this paper through a novel sharp analysis that rigorously establishes minimax optimal rates while working with large constant step sizes. Furthermore, note that while straightforward operator norm bounds on the matrix operators suffice to show convergence of the SGD method, they turn out to be rather loose (particularly for bounding the variance). To tighten these bounds, this paper presents a fine grained analysis that bounds the trace of the SGD operators when applied to the relevant matrices. The bounds of this paper and their advantages compared to existing algorithms are summarized in Table 1.

While this paper’s results focus on strongly convex streaming least square regression, we believe that our techniques and results extend more broadly. This paper aims to serve as the basis for future work on analyzing SGD and parallelization of large scale algorithms for machine learning.

Paper organization: Section 2 presents the related work. Section 3 presents the main results of this work. Section 4 outlines the proof techniques. Section 5 presents experimental simulations to demonstrate the practical utility of the established mini-batching limits and tail-averaging. The proofs of all the claims and theorems are provided in the appendix.

## 2 Related Work

Stochastic approximation has been the focus of much efforts starting with the work of Robbins and Monro (1951), and has been analyzed in subsequent works including Nemirovsky and Yudin (1983); Kushner and Yin (1987, 2003). These questions and the related issues of computation versus statistics tradeoffs have received renewed attention owing to their relevance in the context of modern large scale machine learning, as highlighted by the work of Bottou and Bousquet (2007).

Geometric Rates on initial error: For offline optimization with strongly convex objectives, gradient descent (Cauchy, 1847) and fast gradient methods (Polyak, 1964; Nesterov, 1983) achieve linear convergence. However, a multiplicative coupling of the number of samples and the condition number in the computational effort is a major drawback in the large scale context. These limitations are addressed through developments in offline stochastic methods (Roux et al., 2012; Shalev-Shwartz and Zhang, 2012; Johnson and Zhang, 2013; Defazio et al., 2014) and their accelerated variants (Shalev-Shwartz and Zhang, 2013a; Frostig et al., 2015a; Lin et al., 2015; Defazio, 2016; Allen-Zhu, 2016), which offer near linear running time in the number of samples and condition number, with passes over the dataset stored in memory.

For stochastic approximation with strongly convex objectives, SGD offers linear rates on the bias without achieving minimax rates on the variance (Bach and Moulines, 2011; Needell et al., 2016; Bottou et al., 2016). In contrast, iterate averaged SGD (Ruppert, 1988; Polyak and Juditsky, 1992) offers a sub-linear rate on the bias (Défossez and Bach, 2015; Dieuleveut and Bach, 2015) while achieving minimax rates on the variance. Note that all these results consider the well-specified (additive noise) case when stating the generalization error bounds. We are unaware of any results that provide sharp non-asymptotic analysis of SGD and the related step size issues in the general mis-specified case. Streaming SVRG (Frostig et al., 2015b) offers a geometric rate on the bias and optimal statistical error rates; we will return to a discussion of Streaming SVRG below. In terms of methods faster than SGD, our own effort (Jain et al., 2017b) provides the first accelerated stochastic approximation method that improves over SGD on every problem instance.

Parallelization of Machine Learning algorithms: In offline optimization, Bradley et al. (2011) study parallel coordinate descent for sparse optimization. Parallelization via mini-batching has been studied in Cotter et al. (2011); Takác et al. (2013); Shalev-Shwartz and Zhang (2013b); Takác et al. (2015). These results compare worst case upper bounds on the training error to argue parallelization speedups, thus providing weak upper bounds on mini-batching limits. Parameter mixing/model averaging (Mann et al., 2009) guarantees linear parallelization speedups on the variance but does not improve the bias. Approaches that attempt to reconcile communication-computation tradeoffs (Li et al., 2014) indicate that increased mini-batching hurts convergence, and this is likely an artifact of comparing weak upper bounds. Hogwild (Niu et al., 2011) indicates near-linear parallelization speedups in the harder asynchronous optimization setting, relying on specific input structures like hard sparsity; these bounds are obtained by comparing worst case upper bounds on the training error. Refer to the oracle models paragraph below for details on these worst case upper bounds.

In the stochastic approximation context, Dekel et al. (2012) study mini-batching in an oracle model that assumes bounded variance of stochastic gradients. These results compare worst case bounds on the generalization error to prescribe mini-batching limits, which renders these limits too loose (as mentioned in their paper). Our paper's mini-batching result offers guidelines on batch sizes for linear parallelization speedups by comparing generalization bounds that hold on a per-problem basis, as opposed to worst case bounds; refer to the paragraph on oracle models for more details. Parameter mixing in the stochastic approximation context (Rosenblatt and Nadler, 2014; Zhang et al., 2015) offers linear parallelization speedups on the variance error while not improving the bias (Rosenblatt and Nadler, 2014). Finally, Duchi et al. (2015) guarantee asymptotic optimality of asynchronous optimization with linear parallelization speedups on the variance.

Oracle models and optimality: In stochastic approximation, there are at least two lines of thought with regards to oracle models and notions of optimality. One line involves considering the case of bounded noise (Kushner and Yin, 2003; Kushner and Clark, 1978), or, bounded variance of the stochastic gradient, which in the least squares setting amounts to assuming bounds on

$$\widehat{\nabla}L(w) - \nabla L(w) = (xx^\top - H)(w - w^*) - \epsilon_{x,y}\cdot x.$$

This implies that additional assumptions are required on the compactness of the parameter set (which are enforced via projection steps); such assumptions do not hold in practical implementations of stochastic gradient methods, nor in the setting considered by this paper. Thus, the mini-batching thresholds in Cotter et al. (2011); Niu et al. (2011); Dekel et al. (2012); Li et al. (2014) present bounds in the above worst-case oracle model by comparing weak upper bounds on the training/test error.

Another view of optimality (Anbar, 1971; Fabian, 1973) considers an objective where the goal is to match the rate of the statistically optimal estimator (i.e., the maximum likelihood estimator) on every problem instance. Polyak and Juditsky (1992) consider this oracle model for the LSR problem and prove that the distribution of the averaged SGD estimator on every problem matches that of the maximum likelihood estimator under certain regularity conditions (Lehmann and Casella, 1998). A recent line of work (Bach and Moulines, 2013; Frostig et al., 2015b) aims to provide non-asymptotic guarantees for SGD and its variants in this oracle model. This paper aims to understand mini-batching and other computational aspects of parallelizing stochastic approximation on every problem instance by working in this practically relevant oracle model. Refer to Jain et al. (2017b) for more details.

Comparing offline and streaming algorithms: Firstly, offline algorithms require performing multiple passes over a dataset stored in memory. Note that results and convergence rates established in the finite sum/offline optimization context do not directly translate into rates on the generalization error. Indeed, such a translation requires going through concentration arguments and a generalization error analysis. Refer to Frostig et al. (2015b) for more details.

Comparison to streaming SVRG: Streaming SVRG does not function in the stochastic first order oracle model (Agarwal et al., 2012) satisfied by SGD as run in practice since it requires gradients at two points from a single sample (Frostig et al., 2015b). Furthermore, in contrast to this work, its depth bounds depend on a stronger fourth moment property due to lack of mini-batching.

## 3 Main Results

We begin by characterizing the behavior of the learning rate as a function of the batch size.

Maximal Learning Rates: We first characterize the largest learning rate γ^div_{b,max} that permits the convergence of the mini-batch stochastic gradient descent update. The following generalized eigenvector problem allows for the computation of γ^div_{b,max}:

$$\frac{2}{\gamma^{\mathrm{div}}_{b,\max}} = \sup_{W\in\mathcal{S}(d)} \frac{\langle W, MW\rangle + (b-1)\cdot\operatorname{Tr}(WHWH)}{b\cdot\operatorname{Tr}(WHW)}. \tag{2}$$

This characterization generalizes the divergent stepsize characterization of Défossez and Bach (2015) to batch sizes b > 1. The derivation of the above characterization can be found in appendix A.5.1. We note that this characterization sheds light on how the divergent learning rates interpolate from the batch size b = 1 regime to the batch gradient descent learning rate (setting b to ∞), which turns out to be 2/‖H‖₂. A property of γ^div_{b,max} worth noting is that it does not depend on properties of the noise (Σ), and depends only on the second and fourth moment properties of the covariates x.

We note that in this paper, our interest does not lie in the largest non-divergent stepsizes γ^div_{b,max}, but in the set of (maximal) stepsizes γ_{b,max} (≤ γ^div_{b,max}) that are sufficient to guarantee minimax error rates of O(σ̂²_MLE/n). For the LSR problem, these maximal learning rates are:

$$\gamma_{b,\max} \;\overset{\mathrm{def}}{=}\; \frac{2b}{R^2\cdot\rho_m + (b-1)\,\|H\|_2}, \quad \text{where } \rho_m \;\overset{\mathrm{def}}{=}\; \frac{d\,\big\|(H_L+H_R)^{-1}\Sigma\big\|_2}{\operatorname{Tr}\big((H_L+H_R)^{-1}\Sigma\big)}. \tag{3}$$

Note that ρ_m captures a notion of the “degree” of model mismatch, and how it impacts the learning rate γ_{b,max}; for the additive noise/well-specified/homoscedastic case, ρ_m = 1 (and in general, 1 ≤ ρ_m ≤ d). Thus, for problems where the second and fourth moment properties of the input are held the same, the well-specified variant of the LSR problem admits a strictly larger learning rate (that achieves minimax rates on the variance) compared to the mis-specified case. Furthermore, in stark contrast to the well-specified case, γ_{b,max} in the mis-specified case depends not just on the second and fourth moment properties of the input, but also on the noise covariance Σ. We show that our characterization of γ_{b,max} in the mis-specified case is tight, in that there exist problem instances where γ_{b,max} (equation 3) is off from the maximal learning rate in the well-specified case (obtained by setting ρ_m = 1 in equation 3) by a factor of the dimension d, and is still the largest step size yielding minimax rates. We also note that there could exist mis-specified problem instances where a step size exceeding γ_{b,max} achieves minimax rates. Characterizing the maximal learning rate that achieves minimax rates on every mis-specified problem instance is an interesting open question. We return to the characterization of γ_{b,max} in section 3.1.
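To make equation (3) concrete, the sketch below (with an assumed H, Σ, and R², chosen purely for illustration) computes (H_L+H_R)⁻¹Σ by solving the Sylvester equation HX + XH = Σ through an eigendecomposition of H, and then evaluates ρ_m and γ_{b,max}:

```python
import numpy as np

def rho_m(H, Sigma):
    """rho_m = d * ||X||_2 / Tr(X), where X = (H_L + H_R)^{-1} Sigma,
    i.e., X solves the Sylvester equation H X + X H = Sigma."""
    lam, Q = np.linalg.eigh(H)
    S = Q.T @ Sigma @ Q
    X = Q @ (S / (lam[:, None] + lam[None, :])) @ Q.T
    return H.shape[0] * np.linalg.eigvalsh(X).max() / np.trace(X)

def gamma_b_max(b, R2, H, Sigma):
    # Equation (3).
    return 2 * b / (R2 * rho_m(H, Sigma) + (b - 1) * np.linalg.norm(H, 2))

H = np.diag([1.0, 0.1])
R2 = 2.0 * np.trace(H)   # an assumed fourth-moment constant for illustration

# Well-specified noise Sigma = sigma^2 H gives X = (sigma^2 / 2) I, rho_m = 1,
# recovering gamma_{1,max} = 2 / R^2.
print(rho_m(H, 0.25 * H), gamma_b_max(1, R2, H, 0.25 * H))
# Noise concentrated on a single direction drives rho_m up toward d = 2,
# shrinking the maximal learning rate.
print(rho_m(H, np.diag([1.0, 0.0])))
```

Since ρ_m ∈ [1, d], the mis-specified maximal stepsize can be smaller than the well-specified one by as much as a factor of d, matching the separation discussed above.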

Note that this paper characterizes the performance of Algorithms 1 and 2 when run with a step size γ ≤ γ_{b,max}. The proofs turn out to be tedious for step sizes beyond γ_{b,max} and can be found in the initial version of this paper (Jain et al., 2016b); these were obtained by generalizing the operator view of analyzing SGD methods introduced by Défossez and Bach (2015). For the well-specified case, this paper's results hold for the same learning rate regimes as Bach and Moulines (2013); Frostig et al. (2015b), which are known to admit statistical optimality. We also note that in the additive noise case, we are unaware of a separation between γ_{b,max} and γ^div_{b,max}; but, as we will see, this is not of much consequence given that there exists a strict separation in the learning rate between well-specified and mis-specified problem instances.

Finally, note that the stochastic process viewpoint allows us to work with learning rates that are significantly larger compared to standard analyses that use function value contraction e.g., Bottou et al. (2016, Theorem 4.6). All existing works establishing mini-batching thresholds in the stochastic optimization setting e.g., Dekel et al. (2012) work in the worst case (bounded noise) oracle with small step sizes, and draw conclusions on mini-batch thresholds and effects by comparing weak upper bounds on the excess risk.

Mini-Batched Tail-Averaged SGD for the mis-specified case: We present our main result, which is the error bound for mini-batch tail-averaged SGD for the general mis-specified LSR problem.

###### Theorem 1.

Consider the general mis-specified case of the LSR problem (1). Running Algorithm 1 with a batch size b, a step size γ ≤ γ_{b,max} (equation 3), a number of unaveraged iterations s, and a total number of samples n (so that the algorithm performs n/b mini-batch updates), we obtain a tail-averaged iterate w̄ satisfying the following excess risk bound:

$$\mathbb{E}[L(\bar{w})] - L(w^*) \;\le\; \frac{2}{\gamma^2\mu^2}\cdot\frac{(1-\gamma\mu)^{s}}{(n/b - s)^2}\cdot\big(L(w_0)-L(w^*)\big) \;+\; \frac{4\cdot\hat{\sigma}^2_{\mathrm{MLE}}}{b\cdot(n/b - s)}. \tag{4}$$

In particular, with γ = γ_{b,max}, we have the following excess risk bound:

$$\mathbb{E}[L(\bar{w})] - L(w^*) \;\le\; \underbrace{\frac{2\kappa_b^2}{(n/b - s)^2}\,\exp\!\Big(\!-\frac{s}{\kappa_b}\Big)\,\big(L(w_0)-L(w^*)\big)}_{T_1} \;+\; \underbrace{\frac{4\cdot\hat{\sigma}^2_{\mathrm{MLE}}}{b\,(n/b - s)}}_{T_2},$$

with κ_b := 2/(γ_{b,max}·μ) = (R²·ρ_m + (b−1)‖H‖₂)/(b·μ).

Note that the above theorem indicates that the excess risk is composed of two terms, namely the bias (T₁), which represents the dependence on the initial conditions, and the variance (T₂), which depends on the statistical noise (σ̂²_MLE); the bias decays geometrically during the s unaveraged iterations, while the variance is minimax optimal (up to constants) provided n/b − s = Ω(n/b), e.g., s = c·n/b for a constant c < 1. We will understand this geometric decay of the bias more precisely.

Effect of tail-averaging SGD's iterates: To understand tail-averaging, we specialize theorem 1 with a batch size b = 1 to the well-specified case, i.e., where Σ = σ²H, ρ_m = 1, and σ̂²_MLE = dσ².

###### Corollary 2.

Consider the well-specified (additive noise) case of the streaming LSR problem (Σ = σ²H), with a batch size b = 1. With a learning rate γ = γ_{1,max} = 2/R², s unaveraged iterations and n total samples, we have the following excess risk bound:

$$\mathbb{E}[L(\bar{w})] - L(w^*) \;\le\; \underbrace{\frac{2\kappa_1^2}{(n-s)^2}\,\exp\!\Big(\!-\frac{s}{\kappa_1}\Big)\,\big(L(w_0)-L(w^*)\big)}_{T_1} \;+\; \underbrace{\frac{4\cdot d\sigma^2}{n-s}}_{T_2}, \quad \text{where } \kappa_1 = R^2/\mu.$$

Tail-averaging allows for a geometric decay of the initial error L(w₀) − L(w*) during the s unaveraged iterations, while averaging over the final n − s iterates (with s = c·n, c < 1) allows the variance to be minimax optimal (up to constants). We note that the work of Merity et al. (2017), which studies empirical optimization for training non-convex sequence models (e.g., long short-term memory models (LSTMs)), also indicates the benefits of tail-averaging.

Note that this particular case (i.e., the additive noise/well-specified case with batch size b = 1) with tail-averaging from the start (s = 0) is precisely the setting considered by Défossez and Bach (2015); their result (a) achieves a sub-linear rate on the bias and (b) shows the variance term to be minimax optimal only with learning rates that approach zero (i.e., γ → 0).
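A small simulation (illustrative parameters, not the paper's experiments) makes the contrast concrete: on a pure-variance instance, the final iterate's excess risk plateaus at a stepsize-dependent level, while the tail-averaged iterate keeps improving with the number of averaged iterates:

```python
import numpy as np

# Pure-variance instance: w0 = w* = 0, x ~ N(0, I_d), additive noise,
# so the excess risk is 0.5 * ||w||^2 (since H = I).
rng = np.random.default_rng(0)
trials, d, n, s, gamma, sigma = 200, 2, 2000, 1000, 0.1, 1.0

W = np.zeros((trials, d))          # one SGD run per row
W_sum = np.zeros((trials, d))
for t in range(1, n + 1):
    X = rng.normal(size=(trials, d))
    eps = rng.normal(0.0, sigma, size=trials)
    resid = eps - np.sum(X * W, axis=1)     # y - <w, x> with w* = 0
    W = W + gamma * resid[:, None] * X
    if t > s:
        W_sum += W
W_tail = W_sum / (n - s)

risk_final = 0.5 * np.mean(np.sum(W ** 2, axis=1))
risk_tail = 0.5 * np.mean(np.sum(W_tail ** 2, axis=1))
print(risk_final, risk_tail)   # tail-averaging reduces the variance markedly
```

The final iterate's risk stays at a level proportional to γ, while the tail-average's risk scales like dσ²/(n − s), consistent with the T₂ term in Corollary 2.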

### 3.1 Effects Of Learning Rate, Batch Size and The Role of Mis-specified Models

We now consider the interplay of the learning rate, the batch size, and how model mis-specification plays into the mix. Towards this, we split this section into three parts: (a) understanding learning rate versus mini-batch size in the well-specified case, (b) how model mis-specification leads to a significant difference in the behavior of SGD, and (c) how model mis-specification manifests itself in the tradeoff between the learning rate and the batch size.

Effects of mini-batching in the well-specified case: As mentioned previously, in the well-specified case, $\rho_m = 1$ and $\Sigma = \sigma^2 H$. For this case, equation (3) specializes to:

$$\gamma_{b,\max} = \frac{2b}{R^2 + (b-1)\|H\|_2}. \qquad (5)$$

Observe that the learning rate grows linearly as a function of the batch size until a batch size $b_{\mathrm{thresh}} = 1 + R^2/\|H\|_2$. In the regime of batch sizes $b \le b_{\mathrm{thresh}}$, the resulting mini-batch SGD updates offer near-linear parallelization speedups over SGD with a batch size of $1$. Increasing batch sizes beyond $b_{\mathrm{thresh}}$ leads only to a sub-linear increase in the learning rate, which implies that we lose the linear parallelization speedup offered by mini-batching with batch sizes $b \le b_{\mathrm{thresh}}$. Losing the linear parallelization speedup is indicative of the following: consider doubling the batch size from $b$ to $2b$, with $b \ge b_{\mathrm{thresh}}$. If the bias error dominates the variance, we require performing nearly the same number of updates with a batch size $2b$ as we did with a batch size $b$ to achieve a similar excess risk bound; this implies we are inefficient in terms of the number of samples (or number of gradient computations) used to achieve a given excess risk. When the estimation error (variance) dominates the approximation error (bias), larger batch sizes (with $b > b_{\mathrm{thresh}}$) still serve to improve the variance term, thus allowing linear parallelization speedups via mini-batching.
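As a quick numerical illustration of this threshold behavior, the sketch below evaluates equation (5) for hypothetical constants $R^2$ and $\|H\|_2$ (these numbers are made up for illustration): below the threshold the maximal stepsize grows almost linearly in $b$, and far above it the stepsize saturates near $2/\|H\|_2$.

```python
# Equation (5), well-specified case: gamma_{b,max} = 2b / (R^2 + (b - 1) ||H||_2).
def gamma_max(b, R2, H_norm):
    return 2.0 * b / (R2 + (b - 1) * H_norm)

R2, H_norm = 100.0, 1.0            # hypothetical problem constants
b_thresh = 1 + R2 / H_norm         # batch-size threshold in the well-specified case

# Near-linear growth below the threshold: gamma_{8,max} is close to 8x gamma_{1,max}.
growth = gamma_max(8, R2, H_norm) / gamma_max(1, R2, H_norm)

# Saturation far above the threshold: gamma_{b,max} -> 2 / ||H||_2 (batch GD regime).
saturated = gamma_max(100 * b_thresh, R2, H_norm)
```
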

Note that with a large batch size $b \gg b_{\mathrm{thresh}}$, the learning rate $\gamma_{b,\max} \approx 2/\|H\|_2$ employed by mini-batch SGD resembles the ones used by batch gradient descent. This mini-batching characterization thus allows for understanding the tradeoff between the learning rate and the batch size. This behavior is noted in practice (empirically, but with no underlying rigorous theory) for a variety of problems going beyond linear regression/convex optimization, in the deep learning context (Goyal et al., 2017).

SGD’s behavior with mis-specified models: Next, this paper attempts to shed light on some fundamental differences in the behavior of SGD when dealing with the mis-specified case (as against the well-specified case, which is the focus of existing results (Polyak and Juditsky, 1992; Bach and Moulines, 2013; Dieuleveut and Bach, 2015; Défossez and Bach, 2015)) of the LSR problem. This paper’s results for the general mis-specified case with arbitrary batch sizes specialize to existing results for the additive noise/well-specified case with batch size $b = 1$ (Bach and Moulines, 2013; Dieuleveut and Bach, 2015). To understand these issues better, we consider the maximal learning rate in equation (3) with a batch size $b = 1$:

$$\gamma_{1,\max} = \frac{2}{R^2 \cdot \rho_m}. \qquad (6)$$

Recalling that $\rho_m \ge 1$, observe that the mis-specified case admits a maximal learning rate (with a view of achieving minimax rates) that is at most as large as that of the additive noise/well-specified case, where $\rho_m = 1$. Note that when $\operatorname{Tr}(H^{-1/2}\Sigma H^{-1/2})$ is nearly the same (say, up to constants) as the spectral norm $\|H^{-1/2}\Sigma H^{-1/2}\|_2$, then $\rho_m = \Theta(d)$ and $\gamma_{1,\max} = \Theta(1/(dR^2))$. This implies that there exist mis-specified models whose noise properties (captured through the noise covariance matrix $\Sigma$) prevent SGD from working with the large learning rates of $O(1/R^2)$ used in the well-specified case.

This notion is formalized in the following lemma, which presents an instance of the mis-specified case wherein SGD cannot employ the large learning rates used by the well-specified variant of the problem while retaining minimax optimality. This behavior is in stark contrast to algorithms such as streaming SVRG (Frostig et al., 2015b), which work with the same large learning rates in the mis-specified case as in the well-specified case while guaranteeing minimax optimal rates. The proof of Lemma 3 can be found in Appendix A.5.6.

###### Lemma 3.

Consider a streaming LSR example with Gaussian covariates (i.e., $x \sim \mathcal{N}(0, H)$) with a diagonal second moment matrix $H$ defined by:

$$H_{ii} = \begin{cases} 1 & \text{if } i = 1 \\ 1/d & \text{if } i > 1. \end{cases}$$

Further, let the noise covariance matrix be diagonal as well, with the following entries:

$$\Sigma_{ii} = \begin{cases} 1 & \text{if } i = 1 \\ 1/[(d-1)d] & \text{if } i > 1. \end{cases}$$

For this problem instance, $\gamma = O(1/(dR^2))$ is necessary for retaining minimax rates, while the well-specified variant of this problem permits a maximal learning rate $\gamma = O(1/R^2)$, thus implying an $O(d)$ separation in learning rates between the well-specified and mis-specified cases.
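The separation in Lemma 3 can be checked numerically. The quantity $\rho_m$ is defined earlier in the paper and not restated in this section; the sketch below assumes the form $\rho_m = d\,\|H^{-1/2}\Sigma H^{-1/2}\|_2 / \operatorname{Tr}(H^{-1}\Sigma)$, under which this instance gives $\rho_m = d/2$, and hence, via equation (6), a maximal learning rate that is $\Theta(d)$ smaller than in the well-specified case.

```python
import numpy as np

d = 100
H = np.diag([1.0] + [1.0 / d] * (d - 1))                   # second moment matrix, Lemma 3
Sigma = np.diag([1.0] + [1.0 / ((d - 1) * d)] * (d - 1))   # noise covariance, Lemma 3

# Assumed definition of rho_m (see lead-in):
# rho_m = d * ||H^{-1/2} Sigma H^{-1/2}||_2 / Tr(H^{-1} Sigma)
H_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(H)))
M = H_inv_sqrt @ Sigma @ H_inv_sqrt
rho_m = d * np.linalg.norm(M, 2) / np.trace(np.linalg.inv(H) @ Sigma)

# Equation (6): the mis-specified maximal stepsize is 1/rho_m times the
# well-specified one; only this ratio matters, so R^2 cancels out.
ratio = (2.0 / (1.0 * rho_m)) / 2.0
```
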

Learning rate versus mini-batch size in the mis-specified case: Note that for a batch size $b = 1$, as mentioned in equation (6), the learning rate for the mis-specified case in the most optimistic situation (when $\rho_m = 1$) can be at most as large as the learning rate for the well-specified case. Furthermore, we also know from the observations in the well-specified case that the learning rate tends to grow linearly as a function of the batch size until it hits the limit of $2/\|H\|_2$. Combining these observations, we revisit equation (3), which says:

$$\gamma_{b,\max} \overset{\text{def}}{=} \frac{2b}{R^2 \cdot \rho_m + (b-1)\|H\|_2}.$$

This implies that the mini-batching size threshold can be expressed as:

$$b_{\mathrm{thresh}} \overset{\text{def}}{=} 1 + \frac{R^2}{\|H\|_2}\cdot\rho_m. \qquad (7)$$

When $b \le b_{\mathrm{thresh}}$, we achieve near-linear parallelization speedups over running SGD with a batch size of $1$. Note that this characterization specializes to the batch size threshold presented in the well-specified case (i.e., where $\rho_m = 1$). Furthermore, this batch size threshold (in the mis-specified case) could be much larger than the threshold in the well-specified case; this is expected, since the learning rate for a batch size of $1$ in the mis-specified case can potentially be much smaller than the one used in the well-specified case. Furthermore, with a batch size $b \gg b_{\mathrm{thresh}}$, note that the learning rate is $\approx 2/\|H\|_2$, resembling the ones used with batch gradient descent.
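The threshold in equation (7) can be tabulated directly; again, $R^2$, $\|H\|_2$ and $\rho_m$ below are hypothetical constants chosen for illustration, not values from the paper.

```python
# Equation (7): b_thresh = 1 + (R^2 / ||H||_2) * rho_m, with the corresponding
# maximal stepsize gamma_{b,max} = 2b / (R^2 rho_m + (b - 1) ||H||_2)  (equation (3)).
def b_thresh(R2, H_norm, rho_m):
    return 1 + (R2 / H_norm) * rho_m

def gamma_max(b, R2, H_norm, rho_m):
    return 2.0 * b / (R2 * rho_m + (b - 1) * H_norm)

R2, H_norm = 50.0, 1.0
bw = b_thresh(R2, H_norm, 1.0)    # well-specified threshold (rho_m = 1)
bm = b_thresh(R2, H_norm, 4.0)    # a mis-specified instance with rho_m = 4

# At its own threshold, the stepsize is ~1/||H||_2 in both cases (batch GD scale),
# but the mis-specified case needs a much larger batch to get there.
gw = gamma_max(bw, R2, H_norm, 1.0)
gm = gamma_max(bm, R2, H_norm, 4.0)
```
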

Behavior of the final iterate: We now present the excess risk bound offered by the final iterate of a stochastic gradient scheme. This result is of much practical relevance in the context of modern machine learning and deep learning, where the final iterate is often used, and where the tradeoffs between learning rate and batch size are discussed in great detail (Smith et al., 2017). For this discussion, we consider the well-specified case owing to ease of presentation; our framework and results are generic enough to translate these observations to the mis-specified case.

###### Lemma 4.

Consider the well-specified case of the LSR problem. Running Algorithm 1 with a step size $\gamma \le \gamma_{b,\max}$, batch size $b$, $n$ total samples and with no iterate averaging (i.e., returning the final iterate $w_{\lfloor n/b\rfloor}$) yields a result satisfying the following excess risk bound:

$$\mathbb{E}[L(w_{\lfloor n/b\rfloor})] - L(w^*) \le \kappa_b\,(1-\gamma\mu)^{\lfloor n/b\rfloor}\big(L(w_0)-L(w^*)\big) + \frac{\gamma}{b}\,\sigma^2\operatorname{Tr}(H), \qquad (8)$$

where $\kappa_b = \frac{R^2 + (b-1)\|H\|_2}{b\mu}$. In particular, with a step size $\gamma = \gamma_{b,\max}/2 = \frac{b}{R^2 + (b-1)\|H\|_2}$, we have:

$$\mathbb{E}[L(w_{\lfloor n/b\rfloor})] - L(w^*) \le \kappa_b \cdot e^{-\frac{\lfloor n/b\rfloor}{\kappa_b}}\cdot\big(L(w_0)-L(w^*)\big) + \frac{\sigma^2\operatorname{Tr}(H)}{R^2 + (b-1)\|H\|_2}. \qquad (9)$$

Remarks: Noting that $\operatorname{Tr}(H) \le R^2$, the variance of the final iterate with batch size $1$ is $\sigma^2\operatorname{Tr}(H)/R^2 \le \sigma^2$. Next, with a batch size $b \le b_{\mathrm{thresh}}$, the final iterate still has a variance of $\approx \sigma^2\operatorname{Tr}(H)/R^2$; at a cursory glance this may appear interesting, in that by mini-batching we do not appear to gain much in terms of the variance. This is unsurprising given that in the regime $b \le b_{\mathrm{thresh}}$, the learning rate grows linearly in $b$, thus nullifying the variance reduction from averaging multiple stochastic gradients. Furthermore, this is in accordance with the linear parallelization speedups offered by a batch size $b \le b_{\mathrm{thresh}}$. Note however, once $b > b_{\mathrm{thresh}}$, any subsequent increase in batch size allows the variance of the final iterate to behave as $\approx \sigma^2\operatorname{Tr}(H)/((b-1)\|H\|_2) = O(1/b)$. Finally, note that once $b > b_{\mathrm{thresh}}$, doubling the batch size (in equation (9)) has the same effect as halving the learning rate from $\gamma$ to $\gamma/2$ (as seen from equation (8)), providing theoretical rigor to issues explored in training practical deep models (Smith et al., 2017).
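A small numerical reading of the variance term in equation (9) (with made-up constants) exhibits the two regimes discussed above: essentially no gain below the batch-size threshold, then an $O(1/b)$ decay beyond it.

```python
# Variance term of equation (9): sigma^2 Tr(H) / (R^2 + (b - 1) ||H||_2).
sigma2, trH, R2, H_norm = 1.0, 10.0, 100.0, 1.0   # hypothetical constants
variance = lambda b: sigma2 * trH / (R2 + (b - 1) * H_norm)
b_thresh = 1 + R2 / H_norm

plateau = variance(b_thresh) / variance(1)             # only a constant-factor gain
decay = variance(100 * b_thresh) / variance(b_thresh)  # ~1/b decay once b >> b_thresh
```
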

### 3.2 Parallelization via Doubling Batch Sizes and Model Averaging

We now elaborate on a highly parallelizable stochastic gradient method, which is epoch based and relies on doubling batch sizes across epochs to yield an algorithm that offers the same generalization error as offline (batch) gradient descent in nearly the same number of serial updates, while being a streaming algorithm that does not require storing the entire dataset in memory. Following this, we present a non-asymptotic bound for parameter mixing/model averaging, a communication-efficient parallelization scheme that has favorable properties when the estimation error (i.e., the variance) is the dominating term of the excess risk.

(Nearly) Matching the depth of Batch Gradient Descent: Theorem 1 establishes a sharp generalization error bound for Algorithm 1 in the general mis-specified case of LSR and shows that mini-batching (with $b \le b_{\mathrm{thresh}}$) proportionally decreases the depth (the number of sequential updates) of the algorithm. This section builds upon this result to present a simple and intuitive doubling-based streaming algorithm that works in epochs and processes a total of $n$ points. In each epoch, the mini-batch size is increased by a factor of $2$ while applying Algorithm 1 (with no tail-averaging) with twice as many samples as the previous epoch. After running over an initial fraction of the $n$ samples using this epoch-based approach, we run Algorithm 1 (with tail-averaging) on the remaining points. Note that each epoch decays the bias of the previous epoch at a linear (geometric) rate and halves the statistical error (since we double the mini-batch size). The final tail-averaging phase ensures that the variance is small.
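The epoch structure described above can be sketched as follows. This is an illustrative re-implementation on synthetic well-specified data, not the paper's Algorithm 2 verbatim: the epoch lengths, stepsize, and the split between the doubling phase and the final tail-averaged phase are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 5, 0.1
H = np.diag(np.linspace(1.0, 0.2, d))
w_star = np.ones(d)

def sample(m):
    X = rng.multivariate_normal(np.zeros(d), H, size=m)
    return X, X @ w_star + sigma * rng.standard_normal(m)

def minibatch_sgd(w, gamma, b, n_samples, tail_average=False):
    iters = n_samples // b
    acc = np.zeros_like(w)
    for _ in range(iters):
        X, y = sample(b)
        w = w - gamma * X.T @ (X @ w - y) / b   # averaged mini-batch gradient step
        acc += w
    return acc / iters if tail_average else w

# Doubling phase: each epoch doubles the batch size and uses twice as many samples,
# so every epoch performs the same number (n0 // b0) of serial updates.
n0, b0, gamma, epochs = 512, 4, 0.1, 5
w = np.zeros(d)
for e in range(epochs):
    w = minibatch_sgd(w, gamma, b0 * 2**e, n0 * 2**e)

# Final phase: one tail-averaged run with a large batch to drive down the variance.
w_bar = minibatch_sgd(w, gamma, b0 * 2**epochs, n0 * 2**epochs, tail_average=True)
excess_risk = 0.5 * (w_bar - w_star) @ H @ (w_bar - w_star)
```

Note the depth bookkeeping: the serial update count is (epochs + 1) times a fixed per-epoch count, even though the number of samples processed doubles every epoch.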

The next theorem formalizes this intuition and shows that, in the presence of an error oracle that provides us with the initial excess risk $L(w_0) - L(w^*)$ and the noise level $\hat\sigma^2_{\mathrm{MLE}}$, Algorithm 2 improves the depth exponentially over running mini-batch SGD with a fixed batch size.

###### Theorem 5.

Consider the general mis-specified case of LSR. Suppose in Algorithm 2 we use an initial batch size $b$, a step size set as in equation (3), and $t$ iterations in each epoch; then the output $\bar{w}$ satisfies the following excess risk bound:

$$\mathbb{E}[L(\bar{w})] - L(w^*) \le \left(\frac{2bt}{n}\right)^{\frac{t}{2\kappa\log(\kappa)}}\cdot\big(L(w_0)-L(w^*)\big) + \frac{80\,\hat\sigma^2_{\mathrm{MLE}}}{n}.$$

Remarks: The final error again has two parts: the bias term, which depends on the initial error $L(w_0)-L(w^*)$, and the variance term, which depends on the statistical noise $\hat\sigma^2_{\mathrm{MLE}}$. Note that the variance error decays at a rate of $O(\hat\sigma^2_{\mathrm{MLE}}/n)$, which is minimax optimal up to constant factors.

Algorithm 2 decays the bias at a superpolynomial rate when $t$ is chosen large enough. If Algorithm 2 has access to an initial error oracle that provides $L(w_0)-L(w^*)$ and $\hat\sigma^2_{\mathrm{MLE}}$, we can run Algorithm 2 with a constant batch size until the excess risk drops to the noise level and subsequently begin doubling the batch size. Such an algorithm indeed gives geometric convergence, with a generalization error bound of:

$$\mathbb{E}[L(\bar{w})] - L(w^*) \le \exp\!\left(-\frac{n\lambda_{\min}}{R^2\log(\kappa)}\cdot\frac{1}{\rho_m}\right)\{L(w_0)-L(w^*)\} + \frac{80\,\hat\sigma^2_{\mathrm{MLE}}}{n},$$

with a depth that nearly matches (up to logarithmic factors) the depth of standard offline gradient descent, despite Algorithm 2 being a streaming algorithm. The proof of this claim follows relatively straightforwardly from the proof of Theorem 5. This algorithm (aside from tail-averaging in the final epoch) resembles empirically effective schemes proposed in the context of training deep models (Smith et al., 2017).

Parameter Mixing/Model-Averaging: We consider a communication-efficient method for distributed optimization which involves running mini-batch tail-averaged SGD independently on $P$ separate machines (each containing its own independent samples) and averaging the resulting solution estimates. This is a well-studied scheme for distributed optimization (Mann et al., 2009; Zinkevich et al., 2011; Rosenblatt and Nadler, 2014; Zhang et al., 2015). As noted in Rosenblatt and Nadler (2014), these schemes do not appear to offer improvements in the bias error, while they offer near-linear parallelization speedups on the variance. We provide here a non-asymptotic characterization of the behavior of model averaging for the general mis-specified LSR problem.

###### Theorem 6.

Consider running Algorithm 1, i.e., mini-batch tail-averaged SGD (for the mis-specified LSR problem (1)), independently on $P$ machines, each of which contains $n/P$ samples. Let Algorithm 1 be run with a batch size $b$, learning rate $\gamma \le \gamma_{b,\max}$, and tail-averaging begun after $s$ iterations, and let machine $p$ output $\bar{w}^{(p)}$. The excess risk of the model-averaged estimator $\bar{w} = \frac{1}{P}\sum_{p=1}^{P}\bar{w}^{(p)}$ is upper bounded as:

$$\mathbb{E}[L(\bar{w})] - L(w^*) \le \frac{(1-\gamma\mu)^s}{\gamma^2\mu^2\left(\frac{n}{Pb}-s\right)^2}\cdot\frac{2+(P-1)(1-\gamma\mu)^s}{P}\cdot\big(L(w_0)-L(w^*)\big) + \frac{4\,\hat\sigma^2_{\mathrm{MLE}}}{Pb\left(\frac{n}{Pb}-s\right)}.$$

In particular, with $\gamma = \gamma_{b,\max}/2$, we have the following excess risk bound:

$$\mathbb{E}[L(\bar{w})] - L(w^*) \le \exp\!\left(-\frac{s}{\kappa_b}\right)\cdot\frac{\kappa_b^2}{\left(\frac{n}{Pb}-s\right)^2}\cdot\frac{2+(P-1)\exp(-s/\kappa_b)}{P}\cdot\big(L(w_0)-L(w^*)\big) + \frac{4\,\hat\sigma^2_{\mathrm{MLE}}}{Pb\left(\frac{n}{Pb}-s\right)}.$$

Remarks: We note that during the iterate-averaged phase (i.e., iterations $t > s$), model averaging offers no reduction of the bias, whereas during the (initial) $s$ unaveraged iterations, once $\exp(-s/\kappa_b) \lesssim 1/P$, we achieve near-linear speedups on the bias. Model averaging always offers linear parallelization speedups on the variance error; consequently, once the bias reduces to the noise level, model averaging offers linear parallelization speedups on the overall excess risk. Note that if $s$ is large enough that the bias term falls below the noise level, while $\frac{n}{Pb}-s = \Omega\!\left(\frac{n}{Pb}\right)$, then the excess risk is minimax optimal. Finally, we note that the theorem generalizes in a straightforward manner to the situation where each machine has a different number of samples.
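A runnable sketch of this parameter-mixing scheme on synthetic well-specified data (all constants illustrative): $P$ machines each run tail-averaged SGD on their own $n/P$ samples, and the resulting estimates are averaged.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, P, n = 4, 0.2, 8, 8000
H = np.diag([1.0, 0.5, 0.25, 0.125])
w_star = np.ones(d)

def tail_averaged_sgd(n_machine, gamma=0.1):
    # One machine: single-sample SGD, tail-averaging the last half of the iterates.
    s = n_machine // 2
    w, acc = np.zeros(d), np.zeros(d)
    for t in range(n_machine):
        x = rng.multivariate_normal(np.zeros(d), H)
        y = x @ w_star + sigma * rng.standard_normal()
        w -= gamma * (x @ w - y) * x
        if t >= s:
            acc += w
    return acc / (n_machine - s)

# Model averaging: mean of the P independently computed tail-averaged estimates.
w_avg = np.mean([tail_averaged_sgd(n // P) for _ in range(P)], axis=0)
excess_risk = 0.5 * (w_avg - w_star) @ H @ (w_avg - w_star)
```

The averaging step is embarrassingly parallel and needs a single round of communication (one $d$-dimensional vector per machine), which is the scheme's appeal.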

## 4 Proof Outline

We present here the framework for obtaining the results described in this paper; the framework builds on the operator view introduced in the work of Défossez and Bach (2015). We begin by introducing some notation, defining the centered estimate $\eta_t$ as:

$$\eta_t \overset{\text{def}}{=} w_t - w^*.$$

Mini-batch SGD (with a batch size $b$) moves from $\eta_{t-1}$ to $\eta_t$ using the following update:

$$\eta_t = \left(I - \frac{\gamma}{b}\sum_{i=1}^{b} x_{ti}\otimes x_{ti}\right)\eta_{t-1} + \frac{\gamma}{b}\sum_{i=1}^{b}\epsilon_{ti}x_{ti} = \left(I - \gamma\widehat{H}^b_t\right)\eta_{t-1} + \gamma\cdot\xi^b_t,$$

where $\widehat{H}^b_t \overset{\text{def}}{=} \frac{1}{b}\sum_{i=1}^{b} x_{ti}\otimes x_{ti}$ and $\xi^b_t \overset{\text{def}}{=} \frac{1}{b}\sum_{i=1}^{b}\epsilon_{ti}x_{ti}$. Next, the tail-averaged iterate $\bar{w}_{s,n}$ is associated with its own centered estimate $\bar\eta_{s,n} \overset{\text{def}}{=} \bar{w}_{s,n} - w^*$. The analysis proceeds by tracking the covariance of the centered estimates, i.e., by tracking $\mathbb{E}[\eta_t\otimes\eta_t]$.

Bias-Variance decomposition: The main results of this paper are derived by going through the bias-variance decomposition, which is well known in the context of stochastic approximation (Bach and Moulines, 2011, 2013; Frostig et al., 2015b). The bias-variance decomposition allows us to bound the generalization error by analyzing two sub-problems, namely: (i) the bias sub-problem, which analyzes the noiseless/realizable (consistent linear system) problem obtained by setting the noise $\epsilon = 0$, and (ii) the variance sub-problem, which involves starting at the solution (i.e., $w_0 = w^*$) and allowing the noise to drive the resulting process. The corresponding tail-averaged iterates are associated with their centered estimates $\bar\eta^{\mathrm{bias}}_{s,n}$ and $\bar\eta^{\mathrm{variance}}_{s,n}$, respectively. The bias-variance decomposition for the square loss establishes the following relation:

$$\mathbb{E}[\bar\eta_{s,n}\otimes\bar\eta_{s,n}] \preceq 2\cdot\left(\mathbb{E}[\bar\eta^{\mathrm{bias}}_{s,n}\otimes\bar\eta^{\mathrm{bias}}_{s,n}] + \mathbb{E}[\bar\eta^{\mathrm{variance}}_{s,n}\otimes\bar\eta^{\mathrm{variance}}_{s,n}]\right). \qquad (10)$$

Using the bias-variance decomposition, we obtain an estimate of the generalization error as:

$$\mathbb{E}[L(\bar{w}_{s,n})] - L(w^*) = \frac{1}{2}\cdot\big\langle H,\ \mathbb{E}[\bar\eta_{s,n}\otimes\bar\eta_{s,n}]\big\rangle \le \operatorname{Tr}\!\big(H\cdot\mathbb{E}[\bar\eta^{\mathrm{bias}}_{s,n}\otimes\bar\eta^{\mathrm{bias}}_{s,n}]\big) + \operatorname{Tr}\!\big(H\cdot\mathbb{E}[\bar\eta^{\mathrm{variance}}_{s,n}\otimes\bar\eta^{\mathrm{variance}}_{s,n}]\big).$$
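Because the update recursion is affine in $(\eta_{t-1}, \epsilon_t)$, coupling the bias recursion (noise zeroed out) and the variance recursion (started at $w^*$) to the same covariate and noise stream makes their sum reproduce the centered iterate exactly along every sample path. The following minimal sketch (synthetic data, batch size $1$, identity covariance; all choices illustrative) verifies this coupling numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, gamma, sigma = 4, 200, 0.05, 1.0
w_star = rng.standard_normal(d)

eta = -w_star.copy()        # centered iterate eta_0 = w_0 - w*, with w_0 = 0
eta_bias = -w_star.copy()   # bias recursion: same covariates, noise set to zero
eta_var = np.zeros(d)       # variance recursion: started at w*, driven by noise alone
for _ in range(n):
    x = rng.standard_normal(d)            # shared covariate draw (H = I here)
    eps = sigma * rng.standard_normal()   # shared noise draw
    A = np.eye(d) - gamma * np.outer(x, x)
    eta = A @ eta + gamma * eps * x
    eta_bias = A @ eta_bias               # no noise term
    eta_var = A @ eta_var + gamma * eps * x
```

Taking expectations of the outer product of this exact path-wise sum is what yields the psd inequality (10), with the factor of $2$ coming from the cross terms.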

We now provide a few lemmas that help us bound the behavior of the bias and variance error.

###### Lemma 7.

With a batch size $b$ and a step size $\gamma = \gamma_{b,\max}/2$, the centered bias estimate $\eta^{\mathrm{bias}}_t$ exhibits the following per-step contraction:

$$\big\langle I,\ \mathbb{E}[\eta^{\mathrm{bias}}_t\otimes\eta^{\mathrm{bias}}_t]\big\rangle \le c_{\kappa_b}\,\big\langle I,\ \mathbb{E}[\eta^{\mathrm{bias}}_{t-1}\otimes\eta^{\mathrm{bias}}_{t-1}]\big\rangle,$$

where $c_{\kappa_b} = 1 - 1/\kappa_b$, where $\kappa_b = \frac{R^2 + (b-1)\|H\|_2}{b\mu}$.

Lemma 7 ensures that the bias decays at a geometric rate during the burn-in iterations when the iterates are not averaged; this rate is effective only while the excess risk remains larger than the noise level.

We now turn to bounding the variance error. It turns out that it suffices to understand the behavior of the limiting centered variance estimate $\eta^{\mathrm{variance}}_\infty \overset{\text{def}}{=} \lim_{t\to\infty}\eta^{\mathrm{variance}}_t$.

###### Lemma 8.

Consider the well-specified case of the streaming LSR problem. With a batch size $b$ and a step size $\gamma = \gamma_{b,\max}/2$, the limiting centered variance $\eta^{\mathrm{variance}}_\infty$ has an expected covariance that is upper bounded in a psd sense as:

$$\mathbb{E}[\eta^{\mathrm{variance}}_\infty\otimes\eta^{\mathrm{variance}}_\infty] \preceq \frac{\sigma^2}{R^2 + (b-1)\|H\|_2}\cdot I.$$

Characterizing the behavior of the final iterate is crucial towards obtaining bounds on the behavior of the tail-averaged iterate. In particular, the final iterate having an excess variance risk bounded as in Lemma 8 appears crucial towards achieving minimax rates for the averaged iterate.

## 5 Experimental Simulations

We conduct experiments using a synthetic example to illustrate the implications of our theoretical results on mini-batching and tail-averaging. The data is sampled from a $d$-dimensional Gaussian with eigenvalues decaying as