In this paper we are interested in the optimization problem
min_{x ∈ ℝ^d} f(x) + R(x),  (1)
where f is convex and differentiable with Lipschitz gradient, and R is a proximable (proper, closed, convex) regularizer. In particular, we focus on situations when it is prohibitively expensive to compute the gradient of f, while an unbiased estimator of the gradient can be computed efficiently. This is typically the case for stochastic optimization problems, i.e., when
f(x) = E_{ξ∼D}[f_ξ(x)],  (2)
where ξ is a random variable, and f_ξ is smooth for all ξ.
Stochastic optimization problems are of key importance in statistical supervised learning theory. In this setup, x represents a machine learning model described by d parameters, D is an unknown distribution of labelled examples, f_ξ(x) represents the loss of model x on datapoint ξ, and f is the generalization error. Problem (1) seeks to find the model x minimizing the generalization error. In statistical learning theory one assumes that while D is not known, i.i.d. samples ξ ∼ D are available. In such a case, ∇f(x) is not computable, while ∇f_ξ(x), which is an unbiased estimator of the gradient of f at x, is easily computable.
Another prominent example, one of special interest in this paper, are functions which arise as averages of a very large number of smooth functions:
f(x) = (1/n) Σ_{i=1}^n f_i(x).  (3)
This problem often arises as an approximation of the stochastic optimization loss function (2) via Monte Carlo integration, and is in this context known as the empirical risk minimization (ERM) problem. ERM is currently the dominant paradigm for solving supervised learning problems shai_book . If the index i is chosen uniformly at random from {1, 2, …, n}, ∇f_i(x) is an unbiased estimator of ∇f(x). Typically, ∇f(x) is about n times more expensive to compute than ∇f_i(x).
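For intuition, the unbiasedness claim can be verified numerically on a toy finite sum of quadratics (the data below is synthetic and illustrative): averaging ∇f_i over all indices recovers ∇f exactly, so a uniformly sampled index yields an unbiased estimator.

```python
import numpy as np

# Synthetic finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2, f = (1/n) * sum_i f_i.
rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(x, i):
    return A[i] * (A[i] @ x - b[i])   # gradient of f_i at x

def full_grad(x):
    return A.T @ (A @ x - b) / n      # gradient of f at x

x = rng.normal(size=d)
# The average of grad_i over all i equals the full gradient,
# so a uniformly sampled index gives an unbiased estimator.
avg = np.mean([grad_i(x, i) for i in range(n)], axis=0)
```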
Lastly, in some applications, especially in distributed training of supervised models, one considers problem (3), with n being the number of machines, and each f_i also having a finite-sum structure, i.e.,
f_i(x) = (1/m) Σ_{j=1}^m f_{ij}(x),  (4)
where m corresponds to the number of training examples stored on machine i.
2 The Many Faces of Stochastic Gradient Descent
Stochastic gradient descent (SGD) RobbinsMonro:1951 ; Nemirovski-Juditsky-Lan-Shapiro-2009 ; Vaswani2019-overparam is a state-of-the-art algorithmic paradigm for solving optimization problem (1) in situations when f is either of structure (2) or (3). In its generic form, (proximal) SGD defines the new iterate by subtracting a multiple of a stochastic gradient from the current iterate, and subsequently applying the proximal operator of R:
x^{k+1} = prox_{γR}(x^k − γ g^k).  (5)
Here, g^k is an unbiased estimator of the gradient (i.e., a stochastic gradient),
E[g^k | x^k] = ∇f(x^k),  (6)
and γ > 0 is the stepsize. However, and this is the starting point of our journey in this paper, there are infinitely many ways of obtaining a random vector g^k satisfying (6). On the one hand, this gives algorithm designers the flexibility to construct stochastic gradients in various ways in order to target desirable properties such as convergence speed, iteration cost, parallelizability and generalization. On the other hand, this poses considerable challenges in terms of convergence analysis. Indeed, if one aims to, as one should, obtain the sharpest bounds possible, dedicated analyses are needed to handle each particular variant of SGD.
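For concreteness, here is a minimal sketch of one iteration of (5), with R taken to be a scaled ℓ1 norm so that the proximal operator is the familiar soft-thresholding map (the choice of R here is ours, for illustration only).

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of R(x) = t * ||x||_1: coordinate-wise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd_step(x, g, gamma, lam):
    # One iteration of (5): a stochastic gradient step on f, then the prox of gamma * R.
    return prox_l1(x - gamma * g, gamma * lam)
```

Any unbiased estimator g of ∇f(x) can be plugged in as `g`; the sections below differ only in how that estimator is constructed.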
Vanilla SGD. (In this paper, by vanilla SGD we refer to SGD variants with or without importance sampling and mini-batching, but excluding variance-reduced variants such as SAGA SAGA and SVRG SVRG .) The flexibility in the design of efficient strategies for constructing g^k has led to a creative renaissance in the optimization and machine learning communities, yielding a large number of immensely powerful new variants of SGD, such as those employing importance sampling IProx-SDCA ; NeedellWard2015 and mini-batching mS2GD . These efforts are subsumed by the recently developed and remarkably sharp analysis of SGD under the arbitrary sampling paradigm SGD_AS , first introduced in the study of randomized coordinate descent methods in NSync . The arbitrary sampling paradigm covers virtually all stationary mini-batch and importance sampling strategies in a unified way, thus making headway towards a theoretical unification of two separate strategies for constructing stochastic gradients. For strongly convex f, the SGD methods analyzed in SGD_AS converge linearly to a neighbourhood of the solution for a fixed stepsize γ. The size of the neighbourhood is proportional to the second moment of the stochastic gradient at the optimum (σ²) and to the stepsize (γ), and inversely proportional to the modulus of strong convexity. The effect of various sampling strategies, such as importance sampling and mini-batching, is twofold: i) improvement of the linear convergence rate by enabling larger stepsizes, and ii) modification of σ². However, none of these strategies (except for the full batch strategy, which is prohibitively expensive) is able to completely eliminate the adverse effect of σ². That is, SGD with a fixed stepsize does not reach the optimum, unless one happens to be in the overparameterized regime characterized by the identity σ² = 0.
Variance reduced SGD. While sampling strategies such as importance sampling and mini-batching reduce the variance of the stochastic gradient, in the finite-sum case (3) a new type of variance reduction strategy has been developed over the last few years SAG ; SAGA ; SVRG ; SDCA ; QUARTZ ; nguyen2017sarah ; Loopless . These variance-reduced SGD methods differ from the sampling strategies discussed before in a significant way: they iteratively learn the stochastic gradients at the optimum, and in so doing are able to eliminate the adverse effect of the gradient noise which, as mentioned above, prevents the iterates of vanilla SGD from converging to the optimum. As a result, for strongly convex f, these new variance-reduced SGD methods converge linearly to x^*, with a fixed stepsize. At the moment, these variance-reduced variants require a markedly different convergence theory from the vanilla variants of SGD. An exception is the situation when σ² = 0, as then variance reduction is not needed; indeed, vanilla SGD already converges to the optimum, and with a fixed stepsize. We end the discussion here by remarking that this hints at the possible existence of a more unified theory, one that would include both vanilla and variance-reduced SGD.
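As an illustration, here is a minimal sketch of the L-SVRG gradient estimator Loopless on a toy least-squares instance (the data and hyperparameters are ours, for illustration only): the reference point w is refreshed with a coin flip of probability p, replacing the outer loop of SVRG, and with a fixed stepsize the method converges to the minimizer itself.

```python
import numpy as np

# Toy least-squares instance (synthetic data, illustrative hyperparameters).
rng = np.random.default_rng(1)
n, d = 40, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / n

def lsvrg(steps=8000, gamma=0.005, p=0.1):
    x = np.zeros(d)
    w, gw = x.copy(), full_grad(x)            # reference point and its full gradient
    for _ in range(steps):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(w, i) + gw  # unbiased, variance-reduced estimator
        x = x - gamma * g
        if rng.random() < p:                  # coin flip replaces SVRG's outer loop
            w, gw = x.copy(), full_grad(x)
    return x
```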
Distributed SGD, quantization and variance reduction. When SGD is implemented in a distributed fashion, the problem is often expressed in the form (3), where n is the number of workers/nodes, and f_i corresponds to the loss based on the data stored on node i. Depending on the number of data points stored on each node, it may or may not be efficient to compute the gradient of f_i in each iteration. In general, distributed SGD is implemented in this way: each node i first computes a stochastic gradient g_i^k of f_i at the current point x^k (maintained individually by each node). These gradients are then aggregated by a master node DANE ; RDME , in-network by a switch switchML , or by a different technique best suited to the architecture used. To alleviate the communication bottleneck, various lossy update compression strategies such as quantization 1bit ; Gupta:2015limited ; zipml , sparsification RDME ; alistarh2018convergence ; tonko and dithering alistarh2017qsgd were proposed. The basic idea is for each worker to apply a randomized transformation Q to g_i^k, resulting in a vector Q(g_i^k) which is still an unbiased estimator of the gradient, but one that can be communicated with fewer bits. Mathematically, this amounts to injecting additional noise into the already noisy stochastic gradient g_i^k. The field of quantized SGD is still young, and even some basic questions remained open until recently. For instance, there was no distributed quantized SGD capable of provably solving (1) until the DIANA algorithm mishchenko2019distributed was introduced. DIANA applies quantization to gradient differences, and in so doing is able to learn the gradients at the optimum, which makes it able to work for any regularizer R. DIANA has some structural similarities with SEGA hanzely2018sega , the first coordinate descent type method which works for non-separable regularizers, but a more precise relationship remains elusive.
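To illustrate the kind of compression operator Q involved, here is a sketch of random sparsification (the keep-probability q and the vector below are illustrative): keeping each coordinate with probability q and rescaling the survivors by 1/q preserves unbiasedness, at the price of extra variance.

```python
import numpy as np

def rand_sparsify(g, q, rng):
    # Keep each coordinate with probability q; rescale survivors by 1/q.
    # Then E[Q(g)] = g, so the compressed vector remains an unbiased estimator,
    # but with inflated variance (quantified by the omega parameter of such schemes).
    mask = rng.random(g.shape) < q
    return np.where(mask, g / q, 0.0)

rng = np.random.default_rng(0)
g = np.array([1.0, -2.0, 3.0, 0.5])
# Monte Carlo check of unbiasedness: the empirical mean approaches g.
est = np.mean([rand_sparsify(g, 0.25, rng) for _ in range(100000)], axis=0)
```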
When the functions f_i are of a finite-sum structure as in (4), one can apply variance reduction to reduce the variance of the stochastic gradients together with quantization, resulting in the VR-DIANA method horvath2019stochastic . This is the first distributed quantized SGD method which provably converges to the solution of (1)+(4) with a fixed stepsize.
Randomized coordinate descent (RCD). Lastly, in a distinctly separate strain, there are SGD methods of the coordinate/subspace descent variety RCDM . While it is possible to see some RCD methods as special cases of (5)+(6), most of them do not follow this algorithmic template. First, standard RCD methods use different stepsizes for updating different coordinates ALPHA , and this seems to be crucial to their success. Second, until the recent discovery of the SEGA method, RCD methods were not able to converge with non-separable regularizers. Third, RCD methods are naturally variance-reduced in the case R ≡ 0, as the partial derivatives at the optimum are all zero. As a consequence, attempts at creating variance-reduced RCD methods seem to be futile. Lastly, RCD methods are typically analyzed using different techniques. While there are deep links between standard SGD and RCD methods, these are often indirect and rely on duality SDCA ; FACE-OFF ; SDA .
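A minimal RCD sketch on a toy quadratic (the data and iteration count are ours, for illustration), showing the coordinate-wise stepsizes 1/L_j mentioned above; in the R ≡ 0 case the iterates converge to the minimizer with no residual noise.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T M x - c^T x with M positive definite (R = 0).
rng = np.random.default_rng(2)
d = 6
B = rng.normal(size=(d, d))
M = B @ B.T + np.eye(d)
c = rng.normal(size=d)
x_star = np.linalg.solve(M, c)

x = np.zeros(d)
for _ in range(5000):
    j = rng.integers(d)          # pick a coordinate uniformly at random
    grad_j = M[j] @ x - c[j]     # j-th partial derivative of f
    x[j] -= grad_j / M[j, j]     # coordinate stepsize 1/L_j, with L_j = M_jj
```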
As outlined in the previous section, the world of SGD is vast and beautiful. It is formed by many largely disconnected islands populated by elegant and efficient methods, with their own applications, intuitions, and convergence analysis techniques. While some links already exist (e.g., the unification of importance sampling and mini-batching variants under the arbitrary sampling umbrella), there is no comprehensive general theory. It is becoming increasingly difficult for the community to understand the relationships between these variants, both in theory and practice. New variants are yet to be discovered, but it is not clear what tangible principles one should adopt beyond intuition to aid the discovery. This situation is exacerbated by the fact that a number of different assumptions on the stochastic gradient, of various levels of strength, are being used in the literature.
The main contributions of this work include:
Unified analysis. In this work we propose a unifying theoretical framework which covers all of the variants of SGD outlined in Section 2. As a by-product, we obtain the first unified analysis of vanilla and variance-reduced SGD methods. For instance, our analysis covers as special cases vanilla SGD methods from nguyen2018sgd and SGD_AS , variance-reduced SGD methods such as SAGA SAGA , L-SVRG hofmann2015variance ; Loopless and JacSketch gower2018stochastic . Another by-product is the first unified analysis of SGD methods which include RCD. For instance, our theory covers the subspace descent method SEGA hanzely2018sega as a special case. Lastly, our framework is general enough to capture the phenomenon of quantization. For instance, we obtain the DIANA and VR-DIANA methods in special cases.
Generalization of existing methods. An important yet relatively minor contribution of our work is that it enables generalization of known methods. For instance, some particular methods we consider, such as L-SVRG (Alg 10) Loopless , were not analyzed in the proximal (R ≠ 0) case before. To illustrate how this can be done within our framework, we do it here for L-SVRG. Further, all methods we analyze can be extended to the arbitrary sampling paradigm.
Sharp rates. In all known special cases, the rates obtained from our general theorem (Theorem 4.1) are the best known rates for these methods.
New methods. Our general analysis provides estimates for a possibly infinite array of new and yet-to-be-developed variants of SGD. One only needs to verify that Assumption 4.1 holds, and a complexity estimate is readily furnished by Theorem 4.1. Selected existing and new methods that fit our framework are summarized in Table 1. This list is for illustration only; we believe that future work by us and others will lead to its rapid expansion.
Experiments. We show through extensive experimentation that some of the new and generalized methods proposed here and analyzed via our framework have some intriguing practical properties when compared against appropriately selected existing methods.
4 Main Result
We first introduce the key assumption on the stochastic gradients enabling our general analysis (Assumption 4.1), then state our assumptions on f (Assumption 4.2), and finally state and comment on our unified convergence result (Theorem 4.1).
Notation. We use the following notation: ⟨x, y⟩ denotes the standard Euclidean inner product, and ‖x‖ = ⟨x, x⟩^{1/2} the induced norm. For simplicity we assume that (1) has a unique minimizer, which we denote by x^*. Let D_f(x, y) denote the Bregman divergence associated with f: D_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩. We often write E[·] as a shorthand for the expectation conditioned on the information available at iteration k.
4.1 Key assumption
Our first assumption is of key importance. It is mainly an assumption on the sequence of stochastic gradients {g^k} generated by an arbitrary randomized algorithm. Besides unbiasedness (see (7)), we require two recursions to hold for the iterates and the stochastic gradients of a randomized method. We allow for flexibility by casting these inequalities in a parametric manner.
Assumption 4.1. Let {x^k} be the random iterates produced by proximal SGD (the algorithm in Eq. (5)). We first assume that the stochastic gradients g^k are unbiased:
E[g^k | x^k] = ∇f(x^k)  (7)
for all k ≥ 0. Further, we assume that there exist non-negative constants A, B, C, D₁, D₂, ρ and a (possibly) random sequence {σ_k²} such that the following two relations hold (for convex and L-smooth f, one can show that ‖∇f(x) − ∇f(x^*)‖² ≤ 2L D_f(x, x^*); hence, D_f(x, x^*) can be used as a measure of proximity for the gradients):
E[‖g^k‖² | x^k] ≤ 2A D_f(x^k, x^*) + B σ_k² + D₁,  (8)
E[σ_{k+1}² | x^k] ≤ (1 − ρ) σ_k² + 2C D_f(x^k, x^*) + D₂.  (9)
The expectation above is with respect to the randomness of the algorithm.
The unbiasedness assumption (7) is standard. The key innovation we bring is inequality (8) coupled with (9). We argue, and justify this statement by furnishing many examples in Section 5, that these inequalities capture the essence of a wide array of existing and some new SGD methods, including vanilla, variance-reduced, arbitrary sampling, quantized and coordinate descent variants. Note that in the case when σ_k² ≡ 0 (e.g., when B = 0), the inequalities in Assumption 4.1 reduce to the single inequality
E[‖g^k‖² | x^k] ≤ 2A D_f(x^k, x^*) + D₁.
Similar inequalities can be found in the analysis of stochastic first-order methods. However, this is the first time that such inequalities are generalized, equipped with parameters, and elevated to the status of an assumption that can be used on its own, independently from any other details defining the underlying method that generated them.
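For intuition, one standard instance of such an inequality holds for uniform sampling over the finite sum (3) with convex L_i-smooth summands: E‖∇f_i(x)‖² ≤ 4 L_max D_f(x, x^*) + 2σ*², where σ*² = E‖∇f_i(x^*)‖², i.e., A = 2L_max, B = 0 and D₁ = 2σ*² in the notation of Assumption 4.1. The sketch below is a numerical sanity check of this bound on random quadratics, not a proof.

```python
import numpy as np

# Sanity check (not a proof) on random least-squares data.
rng = np.random.default_rng(3)
n, d = 30, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

L_max = max(np.sum(A**2, axis=1))                     # L_i = ||a_i||^2
grads = lambda x: A * (A @ x - b)[:, None]            # row i is grad f_i(x)
sigma2_star = np.mean(np.sum(grads(x_star)**2, axis=1))
D_f = lambda x: 0.5 * np.mean((A @ (x - x_star))**2)  # Bregman divergence of f

ok = all(
    np.mean(np.sum(grads(x)**2, axis=1)) <= 4*L_max*D_f(x) + 2*sigma2_star + 1e-9
    for x in rng.normal(size=(100, d))
)
```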
4.2 Main theorem
For simplicity, we shall assume throughout that f is μ-strongly quasi-convex, which is a generalization of μ-strong convexity. We leave an analysis under different assumptions on f to future work.
Assumption 4.2 (μ-strong quasi-convexity).
There exists μ > 0 such that f is μ-strongly quasi-convex. That is, the following inequality holds for all x:
f(x^*) ≥ f(x) + ⟨∇f(x), x^* − x⟩ + (μ/2) ‖x^* − x‖².
We are now ready to present our main convergence result.
This theorem establishes a linear rate for a wide range of proximal SGD methods, up to a certain oscillation radius controlled by the additive term in (14), namely by the parameters D₁ and D₂. As we shall see in Section A (refer to Table 2), the main difference between the vanilla and variance-reduced SGD methods is the following: while the former satisfy inequalities (8)–(9) with D₁ > 0 or D₂ > 0, which in view of (14) prevents them from reaching the optimum (using a fixed stepsize), the latter satisfy them with D₁ = D₂ = 0, which in view of (14) enables them to reach the optimum.
5 The Classic, The Recent and The Brand New
In this section we deliver on the promise from the introduction and show how many existing and some new variants of SGD fit our general framework (see Table 1).
An overview. As claimed, our framework is powerful enough to include vanilla methods (✗ in the “VR” column) as well as variance-reduced methods (✓ in the “VR” column), methods which generalize to arbitrary sampling (✓ in the “AS” column), methods supporting gradient quantization (✓ in the “Quant” column) and finally, also RCD type methods (✓ in the “RCD” column).
| Problem | Method | Alg # | Reference | VR | AS | Quant | RCD | Section | Result |
| (1)+(2) | SGD | Alg 1 | nguyen2018sgd | ✗ | ✗ | ✗ | ✗ | A.1 | Cor A.1 |
| (1)+(3) | SGD-SR | Alg 2 | SGD_AS | ✗ | ✓ | ✗ | ✗ | A.2 | Cor A.2 |
| (1)+(3) | SGD-MB | Alg 3 | NEW | ✗ | ✗ | ✗ | ✗ | A.3 | Cor A.3 |
| (1)+(3) | SGD-star | Alg 4 | NEW | ✓ | ✓ | ✗ | ✗ | A.4 | Cor A.4 |
| (1)+(3) | SAGA | Alg 5 | SAGA | ✓ | ✗ | ✗ | ✗ | A.5 | Cor A.5 |
| (1)+(3) | N-SAGA | Alg 6 | NEW | ✗ | ✗ | ✗ | ✗ | A.6 | Cor A.6 |
| (1) | SEGA | Alg 7 | hanzely2018sega | ✓ | ✗ | ✗ | ✓ | A.7 | Cor A.7 |
| (1) | N-SEGA | Alg 8 | NEW | ✗ | ✗ | ✗ | ✓ | A.8 | Cor A.8 |
| (1)+(3) | SVRG | Alg 9 | SVRG | ✓ | ✗ | ✗ | ✗ | A.9 | Cor A.9 |
| (1)+(3) | L-SVRG | Alg 10 | hofmann2015variance ; Loopless | ✓ | ✗ | ✗ | ✗ | A.10 | Cor A.10 |
| (1)+(3) | DIANA | Alg 11 | mishchenko2019distributed ; horvath2019stochastic | ✗ | ✗ | ✓ | ✗ | A.11 | Cor A.11 |
| (1)+(3) | DIANA | Alg 12 | mishchenko2019distributed ; horvath2019stochastic | ✓ | ✗ | ✓ | ✗ | A.11 | Cor A.12 |
| (1)+(3) | Q-SGD-SR | Alg 13 | NEW | ✗ | ✓ | ✓ | ✗ | A.12 | Cor A.13 |
| (1)+(3)+(4) | VR-DIANA | Alg 14 | horvath2019stochastic | ✓ | ✗ | ✓ | ✗ | A.13 | Cor A.15 |
| (1)+(3) | JacSketch | Alg 15 | gower2018stochastic | ✓ | ✓ | ✗ | ✗ | A.14 | Cor A.16 |
For existing methods we provide a citation; new methods developed in this paper are marked accordingly. Due to space restrictions, all algorithms are described (in detail) in the Appendix; we provide a link to the appropriate section for easy navigation. While these details are important, the main message of this paper, i.e., the generality of our approach, is captured by Table 1. The “Result” column of Table 1 points to a corollary of Theorem 4.1; these corollaries state in detail the convergence statements for the various methods. In all cases where known methods are recovered, these corollaries of Theorem 4.1 recover the best known rates.
Note, for example, that for all methods the parameter A is non-zero. Typically, A is a multiple of an appropriately defined smoothness parameter (e.g., for SGD it is the Lipschitz constant of the gradient of f, while for SGD-SR, SGD-star and JacSketch it is an expected smoothness parameter). (SGD-SR is the first SGD method analyzed in the arbitrary sampling paradigm. It was developed using the stochastic reformulation approach (whence the "SR") pioneered in ASDA in a numerical linear algebra setting, and later extended to develop the JacSketch variance-reduction technique for finite-sum optimization gower2018stochastic .) In the three variants of the DIANA method, the parameter ω captures the variance of the quantization operator Q; that is, one assumes that E[Q(x)] = x and E‖Q(x) − x‖² ≤ ω‖x‖² for all x. In view of (13), a large ω means a smaller stepsize, which slows down the rate. Likewise, the quantization variance also affects the parameters D₁ and D₂, which in view of (14) also has an adverse effect on the rate. Further, as predicted by Theorem 4.1, whenever either D₁ > 0 or D₂ > 0, the corresponding method converges to an oscillation region only; these methods are not variance-reduced. All symbols used in Table 2 are defined in the appendix, in the same place where the methods are described and analyzed.
Five new methods. To illustrate the usefulness of our general framework, we develop 5 new variants of SGD never explicitly considered in the literature before (see Table 1). Here we briefly motivate them; details can be found in the Appendix.
SGD-MB (Algorithm 3). This method is specifically designed for functions of the finite-sum structure (3). As we show through experiments, this is a powerful mini-batch SGD method, with mini-batches formed with replacement as follows: in each iteration, we repeatedly (τ times) and independently pick an index i ∈ {1, 2, …, n} with probability p_i > 0. The stochastic gradient g^k is then formed by averaging the stochastic gradients ∇f_i(x^k) over all selected indices (including each index as many times as it was selected).
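A sketch of the resulting estimator under one natural choice of weighting (we use the standard importance weight 1/(n p_i) to make the average unbiased; the exact weighting used by Algorithm 3 is spelled out in Section A.3, and the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, tau = 20, 3, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / n

p = np.sum(A**2, axis=1)
p = p / p.sum()              # sampling probabilities (here: proportional to L_i)

def sgd_mb_grad(x):
    idx = rng.choice(n, size=tau, p=p)                    # tau draws, with replacement
    return np.mean([grad_i(x, i) / (n * p[i]) for i in idx], axis=0)

# Monte Carlo check of unbiasedness: E[g] = sum_i p_i * grad_i/(n p_i) = full gradient.
x = rng.normal(size=d)
est = np.mean([sgd_mb_grad(x) for _ in range(50000)], axis=0)
```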
SGD-star (Algorithm 4). This new method forms a bridge between vanilla and variance-reduced SGD methods. While not practical, it sheds light on the role of variance reduction. Again, we consider functions of the finite-sum form (3). This method answers the following question: assuming that the gradients ∇f_i(x^*), i = 1, 2, …, n, at the optimum are known, can they be used to design a more powerful SGD variant? The answer is yes, and SGD-star is the method. In its most basic form, SGD-star constructs the stochastic gradient via g^k = ∇f_i(x^k) − ∇f_i(x^*) + ∇f(x^*), where i is chosen uniformly at random. That is, the standard stochastic gradient is perturbed by the stochastic gradient at the same index evaluated at the optimal point x^*. Inferring from Table 2, where D₁ = D₂ = 0 for this method, it converges to x^*, and not merely to some oscillation region. Variance-reduced methods essentially work by iteratively constructing increasingly more accurate estimates of ∇f_i(x^*). Typically, the Lyapunov function of a variance-reduced method will contain a term of the form ‖h_i^k − ∇f_i(x^*)‖², with h_i^k being the estimators maintained by the method. Remarkably, SGD-star was never explicitly considered in the literature before.
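A minimal sketch of this construction on a toy least-squares problem (the data is ours; note that in this unregularized example ∇f(x^*) = 0, so that term drops out). With a fixed stepsize the iterates converge to x^* itself, with no oscillation region:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # here grad f(x_star) = 0

grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
g_star = [grad_i(x_star, i) for i in range(n)]   # gradients at the optimum, assumed known

x = np.zeros(d)
gamma = 0.5 / max(np.sum(A**2, axis=1))          # fixed stepsize, of order 1/L_max
for _ in range(20000):
    i = rng.integers(n)
    x -= gamma * (grad_i(x, i) - g_star[i])      # shifted stochastic gradient
```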
N-SAGA (Algorithm 6). This is a novel variant of SAGA SAGA , one in which one does not have access to the gradients of f_i, but instead only has access to noisy stochastic estimators thereof (with noise σ²). Like SAGA, N-SAGA is able to reduce the variance inherent in the finite-sum structure (3) of the problem. However, it necessarily pays the price of noisy estimates of ∇f_i, and hence, just like vanilla SGD methods, is ultimately unable to converge to x^*. The oscillation region is governed by the noise level σ² (refer to D₁ and D₂ in Table 2). This method will be of practical importance for problems where each f_i is of the form (2), i.e., for problems of the "average of expectations" structure. Batch versions of N-SAGA would be well suited for distributed optimization, where each f_i is owned by a different worker, as in such a case one wants the workers to work in parallel.
N-SEGA (Algorithm 8). This is a noisy extension of the RCD-type method SEGA, in complete analogy with the relationship between SAGA and N-SAGA. Here we assume that we only have access to noisy estimates of the partial derivatives (with noise σ²). This situation is common in derivative-free optimization, where such a noisy estimate can be obtained by taking a (random) finite difference approximation nesterov2017randomDFO . Unlike SEGA, N-SEGA only converges to an oscillation region, the size of which is governed by σ².
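For illustration, a noisy partial-derivative oracle of the kind N-SEGA consumes can be built from a forward finite difference (the test function and additive-noise model below are ours, purely illustrative):

```python
import numpy as np

def fd_partial(f, x, j, h=1e-5, noise=0.0, rng=None):
    # Forward finite-difference estimate of the j-th partial derivative of f at x,
    # optionally corrupted by additive Gaussian noise (illustrative noise model).
    e = np.zeros_like(x)
    e[j] = h
    est = (f(x + e) - f(x)) / h
    if rng is not None and noise > 0:
        est += rng.normal(scale=noise)
    return est

f = lambda x: x[0]**2 + 3.0 * x[1]
x = np.array([2.0, 1.0])
```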
6 Experiments
In this section we numerically verify the claims of the paper. We present only a fraction of the experiments here; the rest can be found in Appendix B.
In Section A.3, we describe in detail the SGD-MB method outlined above. The main advantage of SGD-MB is that the sampling procedure it employs can be implemented in just O(τ log n) time. In contrast, even the simplest without-replacement sampling, which selects each function into the minibatch with a prescribed probability independently (we will refer to it as independent SGD), requires n calls of a uniform random generator. We demonstrate numerically that SGD-MB has essentially identical iteration complexity to independent SGD in practice. We consider logistic regression with Tikhonov regularization. For a fixed expected minibatch size τ, we consider two options for the probability of sampling the i-th function: uniform probabilities, and importance sampling probabilities p_i increasing with the smoothness constant L_i, normalized so that the expected minibatch size is τ. (An RCD version of the latter sampling was proposed in AccMbCd ; it was shown to be superior to uniform sampling both in theory and practice.)
The results can be found in Figure 1, where we also report the choice of stepsize and the choice of τ in the legend and title of each plot, respectively.
Indeed, the iteration complexity of SGD-MB and independent SGD is almost identical. Since each iteration of SGD-MB is cheaper, we conclude that SGD-MB is superior to independent SGD. (The relative difference between the iteration costs of SGD-MB and independent SGD can be arbitrarily large, especially when the cost of evaluating ∇f_i is cheap, n is huge, and τ ≪ n. In such a case, the cost of one iteration of SGD-MB is O(τ log n), while the cost of one iteration of independent SGD is O(n).)
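The O(log n) per-draw cost comes from binary search over the cumulative probabilities, a standard technique (the probabilities below are illustrative):

```python
import bisect
import numpy as np

p = np.array([0.1, 0.4, 0.2, 0.3])   # illustrative sampling probabilities
cum = np.cumsum(p)                   # O(n) one-off precomputation

def draw(u):
    # Map a uniform u in [0, 1) to index i with probability p[i], in O(log n).
    return bisect.bisect_right(cum, u)

rng = np.random.default_rng(6)
batch = [draw(rng.random()) for _ in range(8)]   # a minibatch of 8 with-replacement draws
```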
7 Limitations and Extensions
Although our approach is rather general, we still see several possible directions for future extensions, including:
We believe our results can be extended to weakly convex functions. However, producing a comparable result in the nonconvex case remains a major open problem.
It would be further interesting to unify our theory with biased gradient estimators. If this were possible, one could recover methods such as SAG SAG as special cases, or obtain rates for zeroth-order optimization. We have some preliminary results in this direction already.
Although our theory allows for non-uniform stochasticity, it does not recover the best known rates for RCD type methods with importance sampling. It would be thus interesting to provide a more refined analysis capable of capturing importance sampling phenomena more accurately.
An extension of Assumption 4.1 to iteration dependent parameters would enable an array of new methods, such as SGD with decreasing stepsizes.
It would be interesting to provide a unified analysis of stochastic methods with acceleration and momentum. In fact, kulunchakov2019estimate provide (separately) a unification of some methods with and without variance reduction. Hence, an attempt to combine our insights with their approach seems to be a promising starting point in these efforts.
-  Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
-  Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5977–5987, 2018.
-  Chih-Chung Chang and Chih-Jen Lin. LibSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
-  Dominik Csiba and Peter Richtárik. Coordinate descent face-off: primal or dual? In JMLR Workshop and Conference Proceedings, The 29th International Conference on Algorithmic Learning Theory, 2018.
-  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
-  Robert M Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.
-  Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.
-  Robert M Gower and Peter Richtárik. Stochastic dual ascent for solving linear systems. arXiv:1512.06890, 2015.
-  Robert M Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.
-  Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 1737–1746. JMLR.org, 2015.
-  Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. SEGA: Variance reduction via gradient sketching. In Advances in Neural Information Processing Systems 31, pages 2082–2093, 2018.
-  Filip Hanzely and Peter Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 304–312. PMLR, 16–18 Apr 2019.
-  Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, 2015.
-  Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
-  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
-  Jakub Konečný, Jie Lu, Peter Richtárik, and Martin Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.
-  Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: accuracy vs communication. Frontiers in Applied Mathematics and Statistics, 4(62):1–11, 2018.
-  Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. arXiv preprint arXiv:1901.08689, 2019.
-  Andrei Kulunchakov and Julien Mairal. Estimate sequences for variance-reduced stochastic composite optimization. arXiv preprint arXiv:1905.02374, 2019.
-  Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
-  Konstantin Mishchenko, Filip Hanzely, and Peter Richtárik. 99% of parallel optimization is inevitably a waste of time. arXiv preprint arXiv:1901.09437, 2019.
-  Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1–2):549–573, 2015.
-  Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
-  Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
-  Yurii Nesterov. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
-  Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takáč. SGD and Hogwild! Convergence without the bounded gradients assumption. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3750–3758, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621. PMLR, 2017.
-  Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.
-  Zheng Qu, Peter Richtárik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pages 865–873, 2015.
-  Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.
-  Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.
-  H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
-  Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
-  Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, and Peter Richtárik. Scaling distributed machine learning with in-network aggregation. arXiv preprint ArXiv:1903.06701, 2019.
-  Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech 2014, pages 1058–1062. ISCA, 2014.
-  Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: from theory to algorithms. Cambridge University Press, 2014.
-  Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
-  Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning, PMLR, volume 32, pages 1000–1008, 2014.
-  Quoc Tran-Dinh, Nhan H Pham, Dzung T Phan, and Lam M Nguyen. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920, 2019.
-  Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 1195–1204, 2019.
-  Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1306–1316, 2018.
-  Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
-  Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4035–4043, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, PMLR, volume 37, pages 1–9, 2015.
Appendix A Special Cases
A.1 Proximal SGD for stochastic optimization
We start by stating the problem and the assumptions on the objective and on the stochastic gradients for SGD. Consider the expectation minimization problem
where , is differentiable and -smooth almost surely in .
Lemma A.1 (Generalization of Lemmas 1 and 2 from ).
Assume that is convex in for every . Then for every
where . If further is -strongly convex with possibly non-convex , then for every
Assume that is convex in for every and is -strongly quasi-convex. Then SGD with satisfies
If we further assume that is -strongly convex with possibly non-convex , SGD with satisfies (18) as well.
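For concreteness, the generic proximal SGD update from Section 2, x_{k+1} = prox_{γR}(x_k − γ g_k), can be sketched as follows. This is a minimal illustration only, not the exact algorithm analyzed above: the ℓ1 proximal operator (soft thresholding) and the names `prox_sgd`, `soft_threshold` are illustrative choices, not taken from the paper.

```python
import numpy as np

def prox_sgd(prox, grad_estimate, x0, stepsize, n_iters, rng=None):
    """Proximal SGD sketch: x_{k+1} = prox_{gamma R}(x_k - gamma * g_k),
    where g_k is an unbiased estimate of grad f(x_k)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        g = grad_estimate(x, rng)             # stochastic gradient g_k
        x = prox(x - stepsize * g, stepsize)  # proximal step on R
    return x

def soft_threshold(z, gamma, lam=0.1):
    """Prox of gamma * lam * ||.||_1 -- an illustrative choice of R."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)
```

For example, with f(x) = (1/2n)‖Ax − b‖², a single uniformly sampled row gives the unbiased estimate a_i(a_iᵀx − b_i).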
Proof of Lemma A.1
In this section, we recover the convergence result of SGD under the expected smoothness property from . This setup allows one to obtain tight convergence rates of SGD under arbitrary stochastic reformulations of finite-sum minimization. (For technical details on how to exploit expected smoothness for specific reformulations, see .)
The stochastic reformulation is a special instance of (15):
where is a random vector from distribution such that for all : and (for all ) is a smooth, possibly non-convex function. We next state the expected smoothness assumption; specific instances of this assumption allow one to obtain tight convergence rates of SGD, which we recover in this section.
Assumption A.1 (Expected smoothness).
We say that is -smooth in expectation with respect to distribution if there exists such that
for all . For simplicity, we will write to say that (20) holds.
Lemma A.2 (Generalization of Lemma 2.4, ).
If , then
Assume that is -strongly quasi-convex and . Then SGD-SR with satisfies
Proof of Lemma A.2
Here we present the generalization of the proof of Lemma 2.4 from  to the case when . In this proof, all expectations are conditioned on .
In this section, we present a specific practical formulation of (19) which was not considered in . The resulting method, SGD-MB (Algorithm 3), is novel and does not arise as a specific instance of SGD-SR. The key idea behind SGD-MB is to construct an unbiased gradient estimate via with-replacement sampling.
Consider random variable such that
Notice that if we define
Thus, we have rewritten the finite-sum problem (3) as the equivalent stochastic optimization problem
We are now ready to describe our method. At each iteration we sample independently (), and define . Further, we use as a stochastic gradient, resulting in Algorithm 3.
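The estimator underlying SGD-MB can be sketched as follows. Since the exact displayed formulas are not reproduced here, the sketch assumes the standard importance-sampling form g = (1/τ) Σ_j ∇f_{S_j}(x)/(n p_{S_j}) with indices S_1, …, S_τ drawn i.i.d. (with replacement) from probabilities p; the reweighting by 1/(n p_i) is what makes the estimate unbiased. The function name `sgd_mb_estimator` is our own.

```python
import numpy as np

def sgd_mb_estimator(grads, x, probs, tau, rng):
    """Unbiased gradient estimate via with-replacement importance sampling.

    grads: list of n callables, grads[i](x) = grad f_i(x)
    probs: sampling probabilities p_i > 0 summing to 1
    tau:   minibatch size (indices drawn i.i.d., with replacement)
    """
    n = len(grads)
    idx = rng.choice(n, size=tau, p=probs)  # i.i.d. draws, with replacement
    # Reweight by 1/(n p_i) so that E[g] = (1/n) sum_i grad f_i(x).
    return sum(grads[i](x) / (n * probs[i]) for i in idx) / tau
```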
To remain in full generality, consider the following assumption.
There exist constants and such that
for all .
Suppose that Assumption A.2 holds. Then is unbiased, i.e., . Further,
As long as , we have
For , SGD-MB is a special case of the method from , Section 3.2. However, for , this is a different method; the difference lies in the with-replacement sampling. Note that the with-replacement trick allows for an efficient implementation of independent importance sampling (i.e., a distribution of random sets for which the random variables and are independent for ) with complexity . In contrast, an implementation of without-replacement importance sampling has complexity , which can be significantly more expensive than the cost of evaluating .
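The logarithmic per-draw cost can be realized in the standard way: precompute the cumulative probabilities once (linear time), then locate each sample by binary search over them. A minimal sketch (the name `make_sampler` is ours):

```python
import bisect
import random
from itertools import accumulate

def make_sampler(probs):
    """Categorical sampler: O(n) preprocessing, O(log n) per draw."""
    cdf = list(accumulate(probs))  # cumulative probabilities, computed once

    def draw(rng=random):
        u = rng.random() * cdf[-1]         # uniform in [0, sum(probs))
        return bisect.bisect_right(cdf, u)  # binary search for the index

    return draw
```

Drawing τ independent indices then costs O(τ log n), matching the complexity discussed above.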
Proof of Lemma A.4
Notice first that
So, is an unbiased estimator of the gradient . Next,
Proof of Lemma A.3
Consider problem (19). Suppose that is known for all . In this section we present a novel algorithm, SGD-star, which is SGD-SR shifted by the stochastic gradient at the optimum. The method is analyzed under the expected smoothness assumption (20), yielding general rates under arbitrary sampling. It is stated as Algorithm 4.
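The shift described above can be sketched as follows, assuming the update takes the form x_{k+1} = x_k − γ(∇f_i(x_k) − ∇f_i(x_*)); since the exact displayed update is not reproduced here, this is an illustrative reading, and the name `sgd_star` is ours. Subtracting the same sample's gradient at the optimum removes the gradient noise at x_*, which is why such methods can converge linearly without variance reduction tables.

```python
import numpy as np

def sgd_star(grads, x0, x_star, stepsize, n_iters, rng=None):
    """SGD-star sketch: SGD shifted by the stochastic gradient at x*.

    Assumed update: x <- x - gamma * (grad f_i(x) - grad f_i(x*)),
    with i sampled uniformly at each step (uniform sampling is an
    illustrative special case of arbitrary sampling).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    n = len(grads)
    for _ in range(n_iters):
        i = rng.integers(n)
        x -= stepsize * (grads[i](x) - grads[i](x_star))
    return x
```

On quadratics f_i(x) = a_i(x − c_i)²/2 the shifted step contracts the error by |1 − γa_i| at every iteration, regardless of which index is sampled.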