 # A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent (SGD) which so far have required different intuitions and convergence analyses, have different applications, and have been developed separately in various communities. We show that our framework includes methods with and without the following tricks, and their combinations: variance reduction, importance sampling, mini-batch sampling, quantization, and coordinate sub-sampling. As a by-product, we obtain the first unified theory of SGD and randomized coordinate descent (RCD) methods, the first unified theory of variance-reduced and non-variance-reduced SGD methods, and the first unified theory of quantized and non-quantized methods. A key to our approach is a parametric assumption on the iterates and stochastic gradients. In a single theorem we establish a linear convergence result under this assumption and strong quasi-convexity of the loss function. Whenever we recover an existing method as a special case, our theorem gives the best known complexity result. Our approach can be used to motivate the development of new useful methods, and offers pre-proved convergence guarantees. To illustrate the strength of our approach, we develop five new variants of SGD, and through numerical experiments demonstrate some of their properties.


### 1 Introduction

In this paper we are interested in the optimization problem

$$\min_{x\in\mathbb{R}^d} f(x) + R(x), \qquad (1)$$

where $f:\mathbb{R}^d\to\mathbb{R}$ is convex, differentiable with Lipschitz gradient, and $R:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is a proximable (proper closed convex) regularizer. In particular, we focus on situations when it is prohibitively expensive to compute the gradient of $f$, while an unbiased estimator of the gradient can be computed efficiently. This is typically the case for stochastic optimization problems, i.e., when

$$f(x) = \mathbb{E}_{\xi\sim\mathcal{D}}\left[f_\xi(x)\right], \qquad (2)$$

where $\xi$ is a random variable, and $f_\xi:\mathbb{R}^d\to\mathbb{R}$ is smooth for all $\xi$. Stochastic optimization problems are of key importance in statistical supervised learning theory. In this setup, $x$ represents a machine learning model described by $d$ parameters (e.g., logistic regression or a deep neural network), $\mathcal{D}$ is an unknown distribution of labelled examples, $f_\xi(x)$ represents the loss of model $x$ on datapoint $\xi$, and $f$ is the generalization error. Problem (1) seeks to find the model $x$ minimizing the generalization error. In statistical learning theory one assumes that while $\mathcal{D}$ is not known, samples $\xi\sim\mathcal{D}$ are available. In such a case, $\nabla f(x)$ is not computable, while $\nabla f_\xi(x)$, which is an unbiased estimator of the gradient of $f$ at $x$, is easily computable.

Another prominent example, one of special interest in this paper, is that of functions $f$ which arise as averages of a very large number of smooth functions:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x). \qquad (3)$$

This problem often arises by approximation of the stochastic optimization loss function (2) via Monte Carlo integration, and is in this context known as the empirical risk minimization (ERM) problem. ERM is currently the dominant paradigm for solving supervised learning problems [shai_book]. If index $i$ is chosen uniformly at random from $\{1,2,\dots,n\}$, $\nabla f_i(x)$ is an unbiased estimator of $\nabla f(x)$. Typically, $\nabla f(x)$ is about $n$ times more expensive to compute than $\nabla f_i(x)$.

Lastly, in some applications, especially in distributed training of supervised models, one considers problem (3), with $n$ being the number of machines, and each $f_i$ also having a finite-sum structure, i.e.,

$$f_i(x) = \frac{1}{m}\sum_{j=1}^{m} f_{ij}(x), \qquad (4)$$

where $m$ corresponds to the number of training examples stored on machine $i$.

### 2 The Many Faces of Stochastic Gradient Descent

Stochastic gradient descent (SGD) [RobbinsMonro:1951; Nemirovski-Juditsky-Lan-Shapiro-2009; Vaswani2019-overparam] is a state-of-the-art algorithmic paradigm for solving optimization problems (1) in situations when $f$ is either of structure (2) or (3). In its generic form, (proximal) SGD defines the new iterate $x^{k+1}$ by subtracting a multiple of a stochastic gradient from the current iterate $x^k$, and subsequently applying the proximal operator of $R$:

$$x^{k+1} = \operatorname{prox}_{\gamma R}\left(x^k - \gamma g^k\right). \qquad (5)$$

Here, $g^k$ is an unbiased estimator of the gradient (i.e., a stochastic gradient),

$$\mathbb{E}\left[g^k \mid x^k\right] = \nabla f(x^k), \qquad (6)$$

and $\operatorname{prox}_{\gamma R}(x) := \arg\min_{u}\left\{\gamma R(u) + \frac{1}{2}\|u - x\|^2\right\}$. However, and this is the starting point of our journey in this paper, there are infinitely many ways of obtaining a random vector $g^k$ satisfying (6). On the one hand, this gives algorithm designers the flexibility to construct stochastic gradients in various ways in order to target desirable properties such as convergence speed, iteration cost, parallelizability and generalization. On the other hand, this poses considerable challenges in terms of convergence analysis. Indeed, if one aims to, as one should, obtain the sharpest bounds possible, dedicated analyses are needed to handle each of the particular variants of SGD.
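The update (5) with an unbiased estimator (6) can be sketched in a few lines. The example below is a minimal illustration and not the authors' implementation: it assumes the regularizer $R(x) = \lambda\|x\|_1$ (whose proximal operator is soft-thresholding), uniform single-index sampling, and a hypothetical toy least-squares problem; the names `prox_sgd_l1`, `toy_grad`, `A_ROWS` and `B_VALS` are made up for this sketch.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox_sgd_l1(grad_i, n, x0, gamma, lam, iters, seed=0):
    """Iterate x <- prox_{gamma R}(x - gamma g) with R(x) = lam * ||x||_1.

    grad_i(i, x) returns the gradient of f_i at x as a list. Picking i
    uniformly at random makes it an unbiased estimator of grad f(x)
    when f is the average of the f_i.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        i = rng.randrange(n)
        g = grad_i(i, x)
        x = [soft_threshold(xj - gamma * gj, gamma * lam)
             for xj, gj in zip(x, g)]
    return x


# Hypothetical toy data: f_i(x) = 0.5 * (<a_i, x> - b_i)^2.
A_ROWS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_VALS = [1.0, 0.0, 1.0]


def toy_grad(i, x):
    residual = sum(a * xj for a, xj in zip(A_ROWS[i], x)) - B_VALS[i]
    return [residual * a for a in A_ROWS[i]]


x_hat = prox_sgd_l1(toy_grad, n=3, x0=[0.0, 0.0], gamma=0.1, lam=0.05, iters=2000)
```

With a fixed stepsize the iterate only settles into a neighbourhood of the minimizer, which is the behaviour the discussion of vanilla SGD describes.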

Vanilla SGD (footnote 1: in this paper, by vanilla SGD we refer to SGD variants with or without importance sampling and mini-batching, but excluding variance-reduced variants, such as SAGA [SAGA] and SVRG [SVRG]). The flexibility in the design of efficient strategies for constructing $g^k$ has led to a creative renaissance in the optimization and machine learning communities, yielding a large number of immensely powerful new variants of SGD, such as those employing importance sampling [IProx-SDCA; NeedellWard2015], and mini-batching [mS2GD]. These efforts are subsumed by the recently developed and remarkably sharp analysis of SGD under the arbitrary sampling paradigm [SGD_AS], first introduced in the study of randomized coordinate descent methods by [NSync]. The arbitrary sampling paradigm covers virtually all stationary mini-batch and importance sampling strategies in a unified way, thus making headway towards theoretical unification of two separate strategies for constructing stochastic gradients. For strongly convex $f$, the SGD methods analyzed in [SGD_AS] converge linearly to a neighbourhood of the solution $x^*$ for a fixed stepsize $\gamma$. The size of the neighbourhood is proportional to the second moment of the stochastic gradient at the optimum ($\sigma^2$), to the stepsize ($\gamma$), and inversely proportional to the modulus of strong convexity. The effect of various sampling strategies, such as importance sampling and mini-batching, is twofold: i) improvement of the linear convergence rate by enabling larger stepsizes, and ii) modification of $\sigma^2$. However, none of these strategies (footnote 2: except for the full batch strategy, which is prohibitively expensive) is able to completely eliminate the adverse effect of $\sigma^2$. That is, SGD with a fixed stepsize does not reach the optimum, unless one happens to be in the overparameterized case characterized by the identity $\sigma^2 = 0$.

Variance reduced SGD. While sampling strategies such as importance sampling and mini-batching reduce the variance of the stochastic gradient, in the finite-sum case (3) a new type of variance reduction strategy has been developed over the last few years [SAG; SAGA; SVRG; SDCA; QUARTZ; nguyen2017sarah; Loopless]. These variance-reduced SGD methods differ from the sampling strategies discussed before in a significant way: they can iteratively learn the stochastic gradients at the optimum, and in so doing are able to eliminate the adverse effect of the gradient noise $\sigma^2 > 0$ which, as mentioned above, prevents the iterates of vanilla SGD from converging to the optimum. As a result, for strongly convex $f$, these new variance-reduced SGD methods converge linearly to $x^*$, with a fixed stepsize. At the moment, these variance-reduced variants require a markedly different convergence theory from the vanilla variants of SGD. An exception to this is the situation when $\sigma^2 = 0$, as then variance reduction is not needed; indeed, vanilla SGD already converges to the optimum, and with a fixed stepsize. We end the discussion here by remarking that this hints at a possible existence of a more unified theory, one that would include both vanilla and variance-reduced SGD.

Distributed SGD, quantization and variance reduction. When SGD is implemented in a distributed fashion, the problem is often expressed in the form (3), where $n$ is the number of workers/nodes, and $f_i$ corresponds to the loss based on data stored on node $i$. Depending on the number of data points stored on each node, it may or may not be efficient to compute the gradient of $f_i$ in each iteration. In general, SGD is implemented in this way: each node $i$ first computes a stochastic gradient $g_i^k$ of $f_i$ at the current point $x^k$ (maintained individually by each node). These gradients are then aggregated by a master node [DANE; RDME], in-network by a switch [switchML], or a different technique best suited to the architecture used. To alleviate the communication bottleneck, various lossy update compression strategies such as quantization [1bit; Gupta:2015limited; zipml], sparsification [RDME; alistarh2018convergence; tonko] and dithering [alistarh2017qsgd] were proposed. The basic idea is for each worker to apply a randomized transformation $Q:\mathbb{R}^d\to\mathbb{R}^d$ to $g_i^k$, resulting in a vector which is still an unbiased estimator of the gradient, but one that can be communicated with fewer bits. Mathematically, this amounts to injecting additional noise into the already noisy stochastic gradient $g_i^k$. The field of quantized SGD is still young, and even some basic questions remained open until recently. For instance, there was no distributed quantized SGD capable of provably solving (1) until the DIANA algorithm [mishchenko2019distributed] was introduced. DIANA applies quantization to gradient differences, and in so doing is able to learn the gradients at the optimum, which makes it able to work for any regularizer $R$. DIANA has some structural similarities with SEGA [hanzely2018sega], the first coordinate descent type method which works for non-separable regularizers, but a more precise relationship remains elusive.

When the functions $f_i$ are of a finite-sum structure as in (4), one can apply variance reduction to reduce the variance of the stochastic gradients together with quantization, resulting in the VR-DIANA method [horvath2019stochastic]. This is the first distributed quantized SGD method which provably converges to the solution of (1)+(4) with a fixed stepsize.
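To make the idea of an unbiased randomized compression concrete, here is a minimal sketch of one such operator. It uses random-$k$ sparsification purely as an illustrative choice (the text does not single out a specific operator); the function names are hypothetical. Keeping $k$ of $d$ coordinates uniformly at random and rescaling by $d/k$ gives $\mathbb{E}[Q(x)] = x$ and $\mathbb{E}\|Q(x) - x\|^2 = (d/k - 1)\|x\|^2$.

```python
import random


def rand_k(x, k, rng):
    """Keep k coordinates chosen uniformly at random, rescaled by d/k.

    Each coordinate survives with probability k/d, so E[Q(x)] = x.
    """
    d = len(x)
    keep = set(rng.sample(range(d), k))
    scale = d / k
    return [scale * xi if i in keep else 0.0 for i, xi in enumerate(x)]


def empirical_mean(x, k, trials, seed=0):
    """Average many compressed copies of x to check unbiasedness numerically."""
    rng = random.Random(seed)
    acc = [0.0] * len(x)
    for _ in range(trials):
        q = rand_k(x, k, rng)
        acc = [a + qi for a, qi in zip(acc, q)]
    return [a / trials for a in acc]
```

Only the $k$ surviving values and their positions need to be communicated, which is where the saving in bits comes from; the price is the extra variance noted above.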

Randomized coordinate descent (RCD). Lastly, in a distinctly separate strain, there are SGD methods of the coordinate/subspace descent variety [RCDM]. While it is possible to see some RCD methods as special cases of (5)+(6), most of them do not follow this algorithmic template. First, standard RCD methods use different stepsizes for updating different coordinates [ALPHA], and this seems to be crucial to their success. Second, until the recent discovery of the SEGA method, RCD methods were not able to converge with non-separable regularizers. Third, RCD methods are naturally variance-reduced in the $R = 0$ case, as partial derivatives at the optimum are all zero. As a consequence, attempts at creating variance-reduced RCD methods seem to be futile. Lastly, RCD methods are typically analyzed using different techniques. While there are deep links between standard SGD and RCD methods, these are often indirect and rely on duality [SDCA; FACE-OFF; SDA].

### 3 Contributions

As outlined in the previous section, the world of SGD is vast and beautiful. It is formed by many largely disconnected islands populated by elegant and efficient methods, with their own applications, intuitions, and convergence analysis techniques. While some links already exist (e.g., the unification of importance sampling and mini-batching variants under the arbitrary sampling umbrella), there is no comprehensive general theory. It is becoming increasingly difficult for the community to understand the relationships between these variants, both in theory and practice. New variants are yet to be discovered, but it is not clear what tangible principles one should adopt beyond intuition to aid the discovery. This situation is exacerbated by the fact that a number of different assumptions on the stochastic gradient, of various levels of strength, is being used in the literature.

The main contributions of this work include:

Unified analysis. In this work we propose a unifying theoretical framework which covers all of the variants of SGD outlined in Section 2. As a by-product, we obtain the first unified analysis of vanilla and variance-reduced SGD methods. For instance, our analysis covers as special cases vanilla SGD methods from [nguyen2018sgd] and [SGD_AS], and variance-reduced SGD methods such as SAGA [SAGA], L-SVRG [hofmann2015variance; Loopless] and JacSketch [gower2018stochastic]. Another by-product is the first unified analysis of SGD methods which include RCD. For instance, our theory covers the subspace descent method SEGA [hanzely2018sega] as a special case. Lastly, our framework is general enough to capture the phenomenon of quantization. For instance, we obtain the DIANA and VR-DIANA methods in special cases.

Generalization of existing methods. An important yet relatively minor contribution of our work is that it enables generalization of known methods. For instance, some particular methods we consider, such as L-SVRG (Algorithm 10) [Loopless], were not analyzed in the proximal ($R \neq 0$) case before. To illustrate how this can be done within our framework, we do it here for L-SVRG. Further, all methods we analyze can be extended to the arbitrary sampling paradigm.

Sharp rates. In all known special cases, the rates obtained from our general theorem (Theorem 4.1) are the best known rates for these methods.

New methods. Our general analysis provides estimates for a possibly infinite array of new and yet-to-be-developed variants of SGD. One only needs to verify that Assumption 4.1 holds, and a complexity estimate is readily furnished by Theorem 4.1. Selected existing and new methods that fit our framework are summarized in Table 1. This list is for illustration only; we believe that future work by us and others will lead to its rapid expansion.

Experiments. We show through extensive experimentation that some of the new and generalized methods proposed here and analyzed via our framework have some intriguing practical properties when compared against appropriately selected existing methods.

### 4 Main Result

We first introduce the key assumption on the stochastic gradients $g^k$ enabling our general analysis (Assumption 4.1), then state our assumptions on $f$ (Assumption 4.2), and finally state and comment on our unified convergence result (Theorem 4.1).

Notation. We use the following notation. $\langle x, y\rangle$ is the standard Euclidean inner product, and $\|x\| := \langle x, x\rangle^{1/2}$ is the induced norm. For simplicity we assume that (1) has a unique minimizer, which we denote $x^*$. Let $D_f(x,y)$ denote the Bregman divergence associated with $f$: $D_f(x,y) := f(x) - f(y) - \langle \nabla f(y), x - y\rangle$. We often write $[n] := \{1,2,\dots,n\}$.

#### 4.1 Key assumption

Our first assumption is of key importance. It is mainly an assumption on the sequence of stochastic gradients generated by an arbitrary randomized algorithm. Besides unbiasedness (see (7)), we require two recursions to hold for the iterates and the stochastic gradients of a randomized method. We allow for flexibility by casting these inequalities in a parametric manner.

###### Assumption 4.1.

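Let $\{x^k\}$ be the random iterates produced by proximal SGD (Algorithm in Eq (5)). We first assume that the stochastic gradients $g^k$ are unbiased

$$\mathbb{E}\left[g^k \mid x^k\right] = \nabla f(x^k), \qquad (7)$$

for all $k \ge 0$. Further, we assume that there exist non-negative constants $A, B, C, D_1, D_2 \ge 0$, $\rho \in (0,1]$ and a (possibly) random sequence $\{\sigma_k^2\}_{k\ge 0}$ such that the following two relations hold (footnote 3: for convex and $L$-smooth $f$, one can show that $\|\nabla f(x) - \nabla f(y)\|^2 \le 2L D_f(x,y)$; hence, $D_f$ can be used as a measure of proximity for the gradients):

$$\mathbb{E}\left[\|g^k - \nabla f(x^*)\|^2 \mid x^k\right] \le 2A D_f(x^k, x^*) + B\sigma_k^2 + D_1, \qquad (8)$$

$$\mathbb{E}\left[\sigma_{k+1}^2 \mid \sigma_k^2\right] \le (1-\rho)\sigma_k^2 + 2C D_f(x^k, x^*) + D_2. \qquad (9)$$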

The expectation above is with respect to the randomness of the algorithm.

The unbiasedness assumption (7) is standard. The key innovation we bring is inequality (8) coupled with (9). We argue, and justify this statement by furnishing many examples in Section 5, that these inequalities capture the essence of a wide array of existing and some new SGD methods, including vanilla, variance reduced, arbitrary sampling, quantized and coordinate descent variants. Note that in the case when $\nabla f(x^*) = 0$ (e.g., when $R = 0$), the inequalities in Assumption 4.1 reduce to

$$\mathbb{E}\left[\|g^k\|^2 \mid x^k\right] \le 2A\left(f(x^k) - f(x^*)\right) + B\sigma_k^2 + D_1, \qquad (10)$$

$$\mathbb{E}\left[\sigma_{k+1}^2 \mid \sigma_k^2\right] \le (1-\rho)\sigma_k^2 + 2C\left(f(x^k) - f(x^*)\right) + D_2. \qquad (11)$$

Similar inequalities can be found in the analysis of stochastic first-order methods. However, this is the first time that such inequalities are generalized, equipped with parameters, and elevated to the status of an assumption that can be used on its own, independently from any other details defining the underlying method that generated them.

#### 4.2 Main theorem

For simplicity, we shall assume throughout that $f$ is $\mu$-strongly quasi-convex, which is a generalization of $\mu$-strong convexity. We leave an analysis under different assumptions on $f$ to future work.

###### Assumption 4.2 (μ-strong quasi-convexity).

There exists $\mu > 0$ such that $f:\mathbb{R}^d\to\mathbb{R}$ is $\mu$-strongly quasi-convex. That is, the following inequality holds:

$$f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x\rangle + \frac{\mu}{2}\|x^* - x\|^2, \qquad \forall x\in\mathbb{R}^d. \qquad (12)$$

We are now ready to present our main convergence result.

###### Theorem 4.1.

Let Assumptions 4.1 and 4.2 be satisfied. Choose a constant $M$ such that $M > \frac{B}{\rho}$. Choose a stepsize satisfying

$$0 < \gamma \le \min\left\{\frac{1}{\mu}, \frac{1}{A + CM}\right\}. \qquad (13)$$

Then the iterates $\{x^k\}_{k\ge 0}$ of proximal SGD (Algorithm (5)) satisfy

$$\mathbb{E}\left[V^k\right] \le \max\left\{(1-\gamma\mu)^k, \left(1 + \frac{B}{M} - \rho\right)^k\right\} V^0 + \frac{(D_1 + M D_2)\gamma^2}{\min\left\{\gamma\mu, \rho - \frac{B}{M}\right\}}, \qquad (14)$$

where the Lyapunov function $V^k$ is defined by $V^k := \|x^k - x^*\|^2 + M\gamma^2\sigma_k^2$.

This theorem establishes a linear rate for a wide range of proximal SGD methods up to a certain oscillation radius, controlled by the additive term in (14), namely by the parameters $D_1$ and $D_2$. As we shall see in Section A (refer to Table 2), the main difference between the vanilla and variance-reduced SGD methods is that while the former satisfy inequalities (8) and (9) with $D_1 > 0$ or $D_2 > 0$, which in view of (14) prevents them from reaching the optimum (using a fixed stepsize), the latter methods satisfy them with $D_1 = D_2 = 0$, which in view of (14) enables them to reach the optimum.
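The right-hand side of (14) is easy to evaluate numerically, which helps when comparing how the parameters trade rate against oscillation radius. The helper below is a minimal sketch; the function name and argument order are hypothetical, and the check on $M$ encodes the requirement $\rho - B/M > 0$ that the bound needs in order to be meaningful.

```python
def theorem_bound(k, gamma, mu, A, B, C, D1, D2, rho, M, V0):
    """Evaluate the right-hand side of (14) after k iterations."""
    if not (M > 0 and B / M < rho):
        raise ValueError("need M > B / rho")
    if not (0 < gamma <= min(1.0 / mu, 1.0 / (A + C * M))):
        raise ValueError("stepsize violates (13)")
    rate = max(1.0 - gamma * mu, 1.0 + B / M - rho)
    radius = (D1 + M * D2) * gamma ** 2 / min(gamma * mu, rho - B / M)
    return rate ** k * V0 + radius


# Made-up numbers for a method with B = C = D2 = 0 and rho = 1:
# the bound decays linearly to the radius D1 * gamma / mu.
limit_with_noise = theorem_bound(10_000, 0.1, 0.5, 2.0, 0.0, 0.0, 0.2, 0.0, 1.0, 1.0, 3.0)
# With D1 = D2 = 0 the additive term vanishes and the bound goes to zero.
limit_without_noise = theorem_bound(10_000, 0.1, 0.5, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 3.0)
```

The two calls differ only in $D_1$, which mirrors the distinction drawn above between methods that stall in an oscillation region and methods that reach the optimum.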

### 5 The Classic, The Recent and The Brand New

In this section we deliver on the promise from the introduction and show how many existing and some new variants of SGD fit our general framework (see Table 1).

An overview. As claimed, our framework is powerful enough to include vanilla methods (✗ in the “VR” column) as well as variance-reduced methods (✓ in the “VR” column), methods which generalize to arbitrary sampling (✓ in the “AS” column), methods supporting gradient quantization (✓ in the “Quant” column) and finally, also RCD type methods (✓ in the “RCD” column).

For existing methods we provide a citation; new methods developed in this paper are marked accordingly. Due to space restrictions, all algorithms are described (in detail) in the Appendix; we provide a link to the appropriate section for easy navigation. While these details are important, the main message of this paper, i.e., the generality of our approach, is captured by Table 1. The “Result” column of Table 1 points to a corollary of Theorem 4.1; these corollaries state in detail the convergence statements for the various methods. In all cases where known methods are recovered, these corollaries of Theorem 4.1 recover the best known rates.

Parameters. From the point of view of Assumption 4.1, the methods listed in Table 1 exhibit certain patterns. To shed some light on this, in Table 2 we summarize the values of these parameters.

Note, for example, that for all methods the parameter $A$ is non-zero. Typically, this is a multiple of an appropriately defined smoothness parameter (e.g., $L$ is the Lipschitz constant of the gradient of $f$, and $\mathcal{L}$ in SGD-SR (footnote 4), SGD-star and JacSketch are expected smoothness parameters). In the three variants of the DIANA method, $\omega$ captures the variance of the quantization operator $Q$. That is, one assumes that $\mathbb{E}[Q(x)] = x$ and $\mathbb{E}\|Q(x) - x\|^2 \le \omega\|x\|^2$ for all $x\in\mathbb{R}^d$. In view of (13), large $A$ means a smaller stepsize, which slows down the rate. Likewise, the variance $\omega$ also affects the parameter $B$, which in view of (14) also has an adverse effect on the rate. Further, as predicted by Theorem 4.1, whenever either $D_1 > 0$ or $D_2 > 0$, the corresponding method converges to an oscillation region only. These methods are not variance-reduced. All symbols used in Table 2 are defined in the appendix, in the same place where the methods are described and analyzed.

Footnote 4: SGD-SR is the first SGD method analyzed in the arbitrary sampling paradigm. It was developed using the stochastic reformulation approach (whence the “SR”) pioneered in [ASDA] in a numerical linear algebra setting, and later extended to develop the JacSketch variance-reduction technique for finite-sum optimization [gower2018stochastic].

Five new methods. To illustrate the usefulness of our general framework, we develop 5 new variants of SGD never explicitly considered in the literature before (see Table 1). Here we briefly motivate them; details can be found in the Appendix.

SGD-MB (Algorithm 3). This method is specifically designed for functions of the finite-sum structure (4). As we show through experiments, this is a powerful mini-batch SGD method, with mini-batches formed with replacement as follows: in each iteration, we repeatedly ($\tau$ times) and independently pick $i\in[n]$ with probability $p_i > 0$. The stochastic gradient $g^k$ is then formed by averaging the stochastic gradients $\frac{1}{n p_i}\nabla f_i(x^k)$ (rescaled as in Section A.3) for all selected indices $i$ (including each $i$ as many times as this index was selected).

SGD-star (Algorithm 4). This new method forms a bridge between vanilla and variance-reduced SGD methods. While not practical, it sheds light on the role of variance reduction. Again, we consider functions of the finite-sum form (4). This method answers the following question: assuming that the gradients $\nabla f_i(x^*)$, $i\in[n]$, are known, can they be used to design a more powerful SGD variant? The answer is yes, and SGD-star is the method. In its most basic form, SGD-star constructs the stochastic gradient via $g^k = \nabla f_i(x^k) - \nabla f_i(x^*) + \nabla f(x^*)$, where $i\in[n]$ is chosen uniformly at random. That is, the standard stochastic gradient $\nabla f_i(x^k)$ is perturbed by the stochastic gradient at the same index $i$ evaluated at the optimal point $x^*$. Inferring from Table 2, where $D_1 = D_2 = 0$, this method converges to $x^*$, and not merely to some oscillation region. Variance-reduced methods essentially work by iteratively constructing increasingly more accurate estimates of $\nabla f_i(x^*)$. Typically, the term $\sigma_k^2$ in the Lyapunov function of variance reduced methods will contain a term of the form $\sum_i \|g_i^k - \nabla f_i(x^*)\|^2$, with $g_i^k$ being the estimators maintained by the method. Remarkably, SGD-star was never explicitly considered in the literature before.
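The effect of shifting by the gradient at the optimum can be seen on a tiny example. The sketch below is illustrative only: it assumes $R = 0$ (so the proximal step is the identity), uniform sampling, and a hypothetical one-dimensional least-squares problem $f_i(x) = \frac{1}{2}(a_i x - b_i)^2$ whose minimizer is available in closed form; all names are made up for this example.

```python
import random

A_COEF = [1.0, 2.0, 3.0]
B_COEF = [1.0, 1.0, 1.0]  # inconsistent data, so grad f_i(x*) != 0 in general


def grad_i(i, x):
    """Gradient of f_i(x) = 0.5 * (a_i * x - b_i)^2."""
    return A_COEF[i] * (A_COEF[i] * x - B_COEF[i])


def full_grad(x):
    """Gradient of f, the average of the f_i."""
    return sum(grad_i(i, x) for i in range(len(A_COEF))) / len(A_COEF)


# Closed-form minimizer of the average of the f_i.
X_STAR = (sum(a * b for a, b in zip(A_COEF, B_COEF))
          / sum(a * a for a in A_COEF))


def run_sgd(shifted, gamma=1.0 / 18.0, iters=2000, x0=5.0, seed=0):
    """Vanilla SGD (shifted=False) or the SGD-star estimator (shifted=True)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(iters):
        i = rng.randrange(len(A_COEF))
        g = grad_i(i, x)
        if shifted:
            # g = grad f_i(x) - grad f_i(x*) + grad f(x*)
            g = g - grad_i(i, X_STAR) + full_grad(X_STAR)
        x -= gamma * g
    return x


x_vanilla = run_sgd(shifted=False)
x_star_method = run_sgd(shifted=True)
```

In this toy problem the shifted estimator equals $a_i^2 (x - x^*)$, so every step contracts the distance to $x^*$, while the unshifted iterate keeps being pushed around by the nonzero $\nabla f_i(x^*)$.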

N-SAGA (Algorithm 6). This is a novel variant of SAGA [SAGA], one in which one does not have access to the gradients of $f_i$, but instead only has access to noisy stochastic estimators thereof (with noise $\sigma^2$). Like SAGA, N-SAGA is able to reduce the variance inherent in the finite sum structure (4) of the problem. However, it necessarily pays the price of noisy estimates of $\nabla f_i$, and hence, just like vanilla SGD methods, is ultimately unable to converge to $x^*$. The oscillation region is governed by the noise level $\sigma^2$ (refer to $D_1$ and $D_2$ in Table 2). This method will be of practical importance for problems where each $f_i$ is of the form (2), i.e., for problems of the “average of expectations” structure. Batch versions of N-SAGA would be well suited for distributed optimization, where each $f_i$ is owned by a different worker, as in such a case one wants the workers to work in parallel.

N-SEGA (Algorithm 8). This is a noisy extension of the RCD-type method SEGA, in complete analogy with the relationship between SAGA and N-SAGA. Here we assume that we only have noisy estimates of partial derivatives (with noise $\sigma^2$). This situation is common in derivative-free optimization, where such a noisy estimate can be obtained by taking (a random) finite difference approximation [nesterov2017randomDFO]. Unlike SEGA, N-SEGA only converges to an oscillation region, the size of which is governed by $\sigma^2$.

Q-SGD-SR (Algorithm 13). This is a quantized version of SGD-SR, which is the first SGD method analyzed in the arbitrary sampling paradigm. As such, Q-SGD-SR is a vast generalization of the celebrated QSGD method [alistarh2017qsgd].

### 6 Experiments

In this section we numerically verify the claims from the paper. We present only a fraction of the experiments here; the rest are contained in Appendix B.

In Section A.3, we describe in detail the SGD-MB method already outlined before. The main advantage of SGD-MB is that the sampling procedure it employs can be implemented in just $\mathcal{O}(\tau\log n)$ time. In contrast, even the simplest without-replacement sampling which selects each function into the minibatch with a prescribed probability independently (we will refer to it as independent SGD) requires $n$ calls of a uniform random generator. We demonstrate numerically that SGD-MB has essentially identical iteration complexity to independent SGD in practice. We consider logistic regression with Tikhonov regularization. For a fixed expected sampling size $\tau$, consider two options for the probability of sampling the $i$-th function:

1. , or

2. , where is such that . (Footnote 5: An RCD version of this sampling was proposed in [AccMbCd]; it was shown to be superior to uniform sampling both in theory and practice.)

The results can be found in Figure 1, where we also report the choice of stepsize $\gamma$ and the choice of $\tau$ in the legend and title of the plot, respectively.

Figure 1: SGD-MB and independent SGD applied on LIBSVM [chang2011libsvm]. Title label “unif” corresponds to probabilities chosen by option 1, while label “imp” corresponds to probabilities chosen by option 2. Lastly, legend label “r” corresponds to “replacement”, with value “True” for SGD-MB and value “False” for independent SGD.

Indeed, the iteration complexity of SGD-MB and independent SGD is almost identical. Since each iteration of SGD-MB is cheaper (footnote 6), we conclude superiority of SGD-MB to independent SGD.

Footnote 6: The relative difference between iteration costs of SGD-MB and independent SGD can be arbitrary, especially for the case when the cost of evaluating $\nabla f_i(x)$ is cheap, $n$ is huge and $n \gg \tau$. In such a case, the cost of one iteration of SGD-MB is $\mathcal{O}(\tau\log n)$ while the cost of one iteration of independent SGD is $\mathcal{O}(n)$.

### 7 Limitations and Extensions

Although our approach is rather general, we still see several possible directions for future extensions, including:

- We believe our results can be extended to weakly convex functions. However, producing a comparable result in the nonconvex case remains a major open problem.

- It would be further interesting to unify our theory with biased gradient estimators. If this were possible, one could recover methods such as SAG [SAG] in special cases, or obtain rates for zero-order optimization. We have some preliminary results in this direction already.

- Although our theory allows for non-uniform stochasticity, it does not recover the best known rates for RCD type methods with importance sampling. It would thus be interesting to provide a more refined analysis capable of capturing importance sampling phenomena more accurately.

- An extension of Assumption 4.1 to iteration dependent parameters would enable an array of new methods, such as SGD with decreasing stepsizes.

- It would be interesting to provide a unified analysis of stochastic methods with acceleration and momentum. In fact, [kulunchakov2019estimate] provide (separately) a unification of some methods with and without variance reduction. Hence, an attempt to combine our insights with their approach seems to be a promising starting point in these efforts.

### Appendix A Special Cases

#### A.1 Proximal SGD for stochastic optimization

We start with stating the problem, the assumptions on the objective and on the stochastic gradients for SGD. Consider the expectation minimization problem

$$\min_{x\in\mathbb{R}^d} f(x) + R(x), \qquad f(x) := \mathbb{E}_{\mathcal{D}}\left[f_\xi(x)\right] \qquad (15)$$

where $\xi\sim\mathcal{D}$, and $f_\xi(x)$ is differentiable and $L$-smooth almost surely in $\xi$.

Lemma A.1 shows that the stochastic gradient satisfies Assumption 4.1. The corresponding choice of parameters can be found in Table 2.

###### Lemma A.1 (Generalization of Lemmas 1,2 from ).

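Assume that $f_\xi(x)$ is convex in $x$ for every $\xi$. Then for every $x\in\mathbb{R}^d$

$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f(x^*)\|^2\right] \le 4L D_f(x, x^*) + 2\sigma^2, \qquad (16)$$

where $\sigma^2 := \mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right]$. If further $f(x)$ is $\mu$-strongly convex with possibly non-convex $f_\xi$, then for every $x\in\mathbb{R}^d$

$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f(x^*)\|^2\right] \le 4L\kappa D_f(x, x^*) + 2\sigma^2, \qquad (17)$$

where $\kappa := \frac{L}{\mu}$.

###### Corollary A.1.

Assume that $f_\xi(x)$ is convex in $x$ for every $\xi$ and $f$ is $\mu$-strongly quasi-convex. Then SGD with $\gamma \le \frac{1}{2L}$ satisfies

$$\mathbb{E}\left[\|x^k - x^*\|^2\right] \le (1-\gamma\mu)^k \|x^0 - x^*\|^2 + \frac{2\gamma\sigma^2}{\mu}. \qquad (18)$$

If we further assume that $f(x)$ is $\mu$-strongly convex with possibly non-convex $f_\xi$, SGD with $\gamma \le \frac{1}{2L\kappa}$ satisfies (18) as well.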

###### Proof.

It suffices to plug parameters from Table 2 into Theorem 4.1. ∎
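As a worked check of this step, the parameter values below are read off from (16) rather than copied from Table 2, so they should be treated as an assumption of this sketch. With $\sigma_k^2 \equiv 0$ the Lyapunov function reduces to $\|x^k - x^*\|^2$, and the bound (14) collapses to (18):

```latex
% Parameters inferred from (16), with \sigma_k^2 = 0 for all k:
A = 2L, \quad B = 0, \quad D_1 = 2\sigma^2, \quad \rho = 1, \quad C = 0, \quad D_2 = 0.

% Stepsize condition (13), using \mu \le L:
0 < \gamma \le \min\left\{\frac{1}{\mu}, \frac{1}{2L}\right\} = \frac{1}{2L}.

% Bound (14), using \gamma\mu \le 1 and 1 + B/M - \rho = 0:
\mathbb{E}\left[\|x^k - x^*\|^2\right]
  \le (1-\gamma\mu)^k \|x^0 - x^*\|^2 + \frac{2\sigma^2\gamma^2}{\gamma\mu}
  = (1-\gamma\mu)^k \|x^0 - x^*\|^2 + \frac{2\gamma\sigma^2}{\mu}.
```

The second claim of the corollary follows in the same way from (17), with $A = 2L\kappa$ in place of $A = 2L$.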

##### Proof of Lemma A.1

The proof is a direct generalization to the one from . Note that

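$$\begin{aligned}
\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f(x^*)\|^2\right] - \mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right]
&= \mathbb{E}_{\mathcal{D}}\left[\frac{1}{2}\|\nabla f_\xi(x) - \nabla f(x^*)\|^2 - \|\nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right] \\
&\le \mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f_\xi(x^*)\|^2\right] \\
&\le 2L D_f(x, x^*).
\end{aligned}$$

The first inequality uses $\frac{1}{2}\|a\|^2 - \|b\|^2 \le \|a - b\|^2$. It remains to rearrange the above to get (16). To obtain (17), we shall proceed similarly:

$$\begin{aligned}
\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f(x^*)\|^2\right] - \mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right]
&= \mathbb{E}_{\mathcal{D}}\left[\frac{1}{2}\|\nabla f_\xi(x) - \nabla f(x^*)\|^2 - \|\nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right] \\
&\le \mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f_\xi(x^*)\|^2\right] \\
&\le L^2\|x - x^*\|^2 \\
&\le \frac{2L^2}{\mu} D_f(x, x^*).
\end{aligned}$$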

Again, it remains to rearrange the terms.

#### A.2 SGD-SR

In this section, we recover the convergence result of SGD under the expected smoothness property from . This setup allows obtaining tight convergence rates of SGD under an arbitrary stochastic reformulation of finite sum minimization. (Footnote 7: For technical details on how to exploit expected smoothness for specific reformulations, see .)

The stochastic reformulation is a special instance of (15):

$$\min_{x\in\mathbb{R}^d} f(x) + R(x), \qquad f(x) = \mathbb{E}_{\mathcal{D}}\left[f_\xi(x)\right], \qquad f_\xi(x) := \frac{1}{n}\sum_{i=1}^{n} \xi_i f_i(x) \qquad (19)$$

where $\xi$ is a random vector from distribution $\mathcal{D}$ such that for all $i$: $\mathbb{E}_{\mathcal{D}}[\xi_i] = 1$, and $f_i$ (for all $i$) is a smooth, possibly non-convex function. We next state the expected smoothness assumption. Specific instances of this assumption allow one to get tight convergence rates of SGD, which we recover in this section.

###### Assumption A.1 (Expected smoothness).

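We say that $f$ is $\mathcal{L}$-smooth in expectation with respect to distribution $\mathcal{D}$ if there exists $\mathcal{L} = \mathcal{L}(f, \mathcal{D}) > 0$ such that

$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f_\xi(x^*)\|^2\right] \le 2\mathcal{L} D_f(x, x^*), \qquad (20)$$

for all $x\in\mathbb{R}^d$. For simplicity, we will write $(f, \mathcal{D}) \sim ES(\mathcal{L})$ to say that (20) holds.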

Next, we present Lemma A.2 which shows that choice of constants for Assumption 4.1 from Table 2 is valid.

###### Lemma A.2 (Generalization of Lemma 2.4, ).

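If $(f, \mathcal{D}) \sim ES(\mathcal{L})$, then

$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x) - \nabla f(x^*)\|^2\right] \le 4\mathcal{L} D_f(x, x^*) + 2\sigma^2, \qquad (21)$$

where $\sigma^2 := \mathbb{E}_{\mathcal{D}}\left[\|\nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right]$.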

A direct consequence of Theorem 4.1 in this setup is Corollary A.2.

###### Corollary A.2.

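Assume that $f$ is $\mu$-strongly quasi-convex and $(f, \mathcal{D}) \sim ES(\mathcal{L})$. Then SGD-SR with $\gamma \le \frac{1}{2\mathcal{L}}$ satisfies

$$\mathbb{E}\left[\|x^k - x^*\|^2\right] \le (1-\gamma\mu)^k \|x^0 - x^*\|^2 + \frac{2\gamma\sigma^2}{\mu}. \qquad (22)$$

##### Proof of Lemma A.2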

Here we present the generalization of the proof of Lemma 2.4 from  for the case when . In this proof all expectations are conditioned on .

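$$\begin{aligned}
\mathbb{E}\left[\|\nabla f_\xi(x) - \nabla f(x^*)\|^2\right]
&= \mathbb{E}\left[\|\nabla f_\xi(x) - \nabla f_\xi(x^*) + \nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right] \\
&\le 2\mathbb{E}\left[\|\nabla f_\xi(x) - \nabla f_\xi(x^*)\|^2\right] + 2\mathbb{E}\left[\|\nabla f_\xi(x^*) - \nabla f(x^*)\|^2\right] \\
&\le 4\mathcal{L} D_f(x, x^*) + 2\sigma^2.
\end{aligned}$$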

#### A.3 SGD-MB

In this section, we present a specific practical formulation of (19) which was not considered in . The resulting algorithm (Algorithm 3) is novel; it was not considered in  as a specific instance of SGD-SR. The key idea behind SGD-MB is constructing unbiased gradient estimate via with-replacement sampling.

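Consider a random variable $\nu\sim\mathcal{D}$ such that

$$\mathbb{P}(\nu = i) = p_i; \qquad \sum_{i=1}^{n} p_i = 1. \qquad (23)$$

Notice that if we define

$$\psi_i(x) := \frac{1}{n p_i} f_i(x), \qquad i = 1, 2, \dots, n, \qquad (24)$$

then

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) = \sum_{i=1}^{n} p_i \psi_i(x) = \mathbb{E}_{\mathcal{D}}\left[\psi_\nu(x)\right]. \qquad (25)$$

So, we have rewritten the finite sum problem (3) into the equivalent stochastic optimization problem

$$\min_{x\in\mathbb{R}^d} \mathbb{E}_{\mathcal{D}}\left[\psi_\nu(x)\right]. \qquad (26)$$

We are now ready to describe our method. At each iteration $k$ we sample $\nu_i^k\sim\mathcal{D}$ independently ($i = 1, \dots, \tau$), and define $g^k := \frac{1}{\tau}\sum_{i=1}^{\tau}\nabla\psi_{\nu_i^k}(x^k)$. Further, we use $g^k$ as a stochastic gradient, resulting in Algorithm 3.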
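The sampling step and the resulting estimator can be sketched as follows. This is a minimal illustration under stated assumptions, not Algorithm 3 itself: indices are drawn with replacement by binary search over precomputed running sums of the $p_i$, and for simplicity the per-function gradients are passed in as a precomputed list `grads` (in practice only the sampled ones would be evaluated). All function names are hypothetical.

```python
import bisect
import itertools
import random


def sample_with_replacement(cum_p, tau, rng):
    """Draw tau i.i.d. indices with P(nu = i) = p_i.

    cum_p holds the running sums of p, so each draw is one binary search.
    """
    top = len(cum_p) - 1
    return [min(bisect.bisect_right(cum_p, rng.random() * cum_p[-1]), top)
            for _ in range(tau)]


def sgd_mb_gradient(grads, p, indices):
    """Average of (1 / (n * p_i)) * grad f_i over the drawn indices.

    Repeated indices are counted as many times as they were drawn.
    """
    n = len(p)
    d = len(grads[0])
    g = [0.0] * d
    for i in indices:
        w = 1.0 / (n * p[i] * len(indices))
        for j in range(d):
            g[j] += w * grads[i][j]
    return g


# Hypothetical usage with made-up probabilities and gradients.
p = [0.2, 0.3, 0.5]
grads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
cum_p = list(itertools.accumulate(p))
drawn = sample_with_replacement(cum_p, tau=5, rng=random.Random(0))
g_k = sgd_mb_gradient(grads, p, drawn)
```

Building `cum_p` costs one pass over the probabilities and can be done once if they stay fixed; after that each mini-batch needs only $\tau$ binary searches.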

To remain in full generality, consider the following Assumption.

###### Assumption A.2.

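There exist constants $A' > 0$ and $D' \ge 0$ such that

$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla\psi_\nu(x)\|^2\right] \le 2A'\left(f(x) - f(x^*)\right) + D' \qquad (27)$$

for all $x\in\mathbb{R}^d$.

Note that it is sufficient to have convex and smooth $f_i$ in order to satisfy Assumption A.2, as Lemma A.3 states.

###### Lemma A.3.

Let $\sigma^2 := \mathbb{E}_{\mathcal{D}}\left[\|\nabla\psi_\nu(x^*)\|^2\right]$. If $f_i$ are convex and $L_i$-smooth, then Assumption A.2 holds for $A' = 2\mathcal{L}$ and $D' = 2\sigma^2$, where

$$\mathcal{L} \le \max_i \frac{L_i}{n p_i}. \qquad (28)$$

If moreover $\nabla f_i(x^*) = 0$ for all $i$, then Assumption A.2 holds for $A' = \mathcal{L}$ and $D' = 0$.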

Next, Lemma A.4 states that Algorithm 3 indeed satisfies Assumption 4.1.

###### Lemma A.4.

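Suppose that Assumption A.2 holds. Then $g^k$ is unbiased; i.e. $\mathbb{E}_{\mathcal{D}}\left[g^k\right] = \nabla f(x^k)$. Further,

$$\mathbb{E}_{\mathcal{D}}\left[\|g^k\|^2\right] \le \frac{2A' + 2L(\tau - 1)}{\tau}\left(f(x^k) - f(x^*)\right) + \frac{D'}{\tau}.$$

Thus, parameters from Table 2 are validated. As a direct consequence of Theorem 4.1 we get Corollary A.3.

###### Corollary A.3.

As long as $0 < \gamma \le \min\left\{\frac{1}{\mu}, \frac{\tau}{A' + L(\tau - 1)}\right\}$, we have

$$\mathbb{E}\|x^k - x^*\|^2 \le (1-\gamma\mu)^k \|x^0 - x^*\|^2 + \frac{\gamma D'}{\mu\tau}. \qquad (29)$$

###### Remark A.1.

For $\tau = 1$, SGD-MB is a special case of the method from , Section 3.2. However, for $\tau > 1$, this is a different method; the difference lies in the with-replacement sampling. Note that the with-replacement trick allows for an efficient implementation of sampling, with complexity $\mathcal{O}(\tau\log n)$, in place of independent importance sampling (footnote 8: a distribution of random sets $S\subseteq[n]$ for which the random variables $1_{i\in S}$ and $1_{j\in S}$ are independent for $i\neq j$). In contrast, an implementation of without-replacement importance sampling has complexity $\mathcal{O}(n)$, which can be significantly more expensive than the cost of evaluating the $\tau$ sampled gradients.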

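##### Proof of Lemma A.4

Notice first that

$$\mathbb{E}_{\mathcal{D}}\left[g^k\right] = \frac{1}{\tau}\sum_{i=1}^{\tau}\mathbb{E}_{\mathcal{D}}\left[\frac{1}{n p_{\nu_i^k}}\nabla f_{\nu_i^k}(x^k)\right] = \mathbb{E}_{\mathcal{D}}\left[\frac{1}{n p_\nu}\nabla f_\nu(x^k)\right] = \sum_{i=1}^{n} p_i \frac{1}{n p_i}\nabla f_i(x^k) = \nabla f(x^k).$$

So, $g^k$ is an unbiased estimator of the gradient $\nabla f(x^k)$. Next,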

 ED[∥∥gk∥∥2] = ED⎡⎣∥∥ ∥∥1ττ∑i=1∇ψνki(xk)∥∥ ∥∥2⎤⎦ = 1τ2ED[τ∑i=1∥∥∇ψνki(xk)∥∥2+2∑i
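The remaining steps of this computation can be sketched as follows. This is a reconstruction under stated assumptions, not the original display: the samples $\nu_1^k, \dots, \nu_\tau^k$ are i.i.d., $\nabla f(x^*) = 0$ (e.g. $R = 0$), and $f$ is convex and $L$-smooth so that $\|\nabla f(x)\|^2 \le 2L(f(x) - f(x^*))$.

```latex
\begin{aligned}
\mathbb{E}_{\mathcal{D}}\left[\|g^k\|^2\right]
&= \frac{1}{\tau^2}\,\mathbb{E}_{\mathcal{D}}\left[\sum_{i=1}^{\tau}\|\nabla\psi_{\nu_i^k}(x^k)\|^2
   + 2\sum_{i<j}\left\langle \nabla\psi_{\nu_i^k}(x^k), \nabla\psi_{\nu_j^k}(x^k)\right\rangle\right] \\
% independence: each cross term has expectation \|\nabla f(x^k)\|^2,
% and there are \tau(\tau-1)/2 of them
&= \frac{1}{\tau}\,\mathbb{E}_{\mathcal{D}}\left[\|\nabla\psi_{\nu}(x^k)\|^2\right]
   + \frac{\tau-1}{\tau}\,\|\nabla f(x^k)\|^2 \\
% Assumption A.2 for the first term, L-smoothness for the second
&\le \frac{2A'\left(f(x^k)-f(x^*)\right) + D'}{\tau}
   + \frac{2L(\tau-1)}{\tau}\left(f(x^k)-f(x^*)\right) \\
&= \frac{2A' + 2L(\tau-1)}{\tau}\left(f(x^k)-f(x^*)\right) + \frac{D'}{\tau}.
\end{aligned}
```

The final line agrees with the bound stated in Lemma A.4.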

##### Proof of Lemma A.3

Let $\mathcal{L}$ be any constant for which

$$\mathbb{E}_{\xi\sim\mathcal{D}}\|\nabla\phi_\xi(x) - \nabla\phi_\xi(x^*)\|^2 \le 2\mathcal{L}\left(f(x) - f(x^*)\right) \qquad (30)$$

holds for all $x\in\mathbb{R}^d$. This is the expected smoothness property (for a single item sampling) from . It was shown in [6, Proposition 3.7] that (30) holds, and that $\mathcal{L}$ satisfies (28). The claim now follows by applying [6, Lemma 2.4].

#### A.4 SGD-star

Consider problem (19). Suppose that $\nabla f_i(x^*)$ is known for all $i$. In this section we present a novel algorithm, SGD-star, which is SGD-SR shifted by the stochastic gradient at the optimum. The method is presented under the Expected Smoothness Assumption (20), obtaining general rates under arbitrary sampling. The algorithm is presented as Algorithm 4.

Suppose that