1 Introduction
Empirical risk minimization (a.k.a. finite-sum) problems form the dominant paradigm for training supervised machine learning models such as ridge regression, support vector machines, logistic regression, and neural networks. In its most general form, a finite-sum problem has the form
(1)   \min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right],
where n refers to the number of training data points (e.g., videos, images, molecules, text corpora), x ∈ ℝ^d is the vector representation of a model using d features, and f_i(x) is the loss of model x on data point i.
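To make the setup concrete, here is a minimal illustrative sketch (not taken from the paper) of a finite-sum objective of the form (1): ridge regression, where the data matrix A, targets b and regularization weight lam are hypothetical placeholders.

```python
import numpy as np

# Illustrative instance of problem (1): ridge regression.
# A is a hypothetical (n, d) data matrix, b the targets, lam a regularization weight.
def f_i(x, A, b, lam, i):
    """Loss of model x on data point i (one summand of the finite sum)."""
    return 0.5 * (A[i] @ x - b[i]) ** 2 + 0.5 * lam * np.dot(x, x)

def f(x, A, b, lam):
    """Full objective f(x) = (1/n) * sum_i f_i(x)."""
    n = A.shape[0]
    return np.mean([f_i(x, A, b, lam, i) for i in range(n)])
```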
1.1 Variancereduced methods
One of the most remarkable algorithmic breakthroughs in recent years was the development of variance-reduced stochastic gradient algorithms for solving (1). These methods are significantly faster than SGD (Nemirovsky & Yudin, 1983; Nemirovski et al., 2009; Takáč et al., 2013) in theory and practice on convex and strongly convex problems, and faster in theory on several classes of nonconvex problems (unfortunately, variance-reduced methods are not, as of yet, state-of-the-art methods for training production-grade neural networks).
Two of the most notable and popular methods belonging to the family of variance-reduced methods are SVRG (Johnson & Zhang, 2013) and its accelerated variant known as Katyusha (Allen-Zhu, 2017). The latter method accelerates the former via the employment of a novel “negative momentum” idea. Both of these methods have a double-loop design. At the beginning of the outer loop, a full pass over the training data is made to compute the gradient ∇f(w) of f at a reference point w, which is chosen as the freshest iterate (for SVRG) or a weighted average of recent iterates (for Katyusha). This gradient is then used in the inner loop to adjust the stochastic gradient ∇f_i(x), where i is sampled uniformly at random from {1, 2, …, n} and x is the current iterate, so as to reduce its variance. In particular, both SVRG and Katyusha perform the adjustment
g = \nabla f_i(x) - \nabla f_i(w) + \nabla f(w).
Note that, like ∇f_i(x), the new search direction g is an unbiased estimator of ∇f(x). Indeed,
(2)   \mathbb{E}[g] = \frac{1}{n} \sum_{i=1}^{n} \left( \nabla f_i(x) - \nabla f_i(w) + \nabla f(w) \right) = \nabla f(x),
where the expectation is taken over the random choice of i. However, it turns out that as the methods progress, the variance of g, unlike that of ∇f_i(x), progressively decreases to zero. The total effect of this is significantly faster convergence.
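As a minimal illustration of this estimator (with a hypothetical per-sample gradient function grad_f_i), the adjusted search direction can be formed as below; averaging it over all i recovers the full gradient, confirming unbiasedness numerically.

```python
import numpy as np

def vr_gradient(grad_f_i, x, w, full_grad_w, i):
    """Variance-reduced search direction g = grad f_i(x) - grad f_i(w) + grad f(w)."""
    return grad_f_i(x, i) - grad_f_i(w, i) + full_grad_w

def check_unbiased(grad_f_i, n, x, w):
    """Average g over all i and compare with grad f(x): the -grad f_i(w) terms
    average to -grad f(w) and cancel with +grad f(w), so the average equals grad f(x)."""
    full_grad_w = np.mean([grad_f_i(w, i) for i in range(n)], axis=0)
    avg_g = np.mean([vr_gradient(grad_f_i, x, w, full_grad_w, i) for i in range(n)], axis=0)
    full_grad_x = np.mean([grad_f_i(x, i) for i in range(n)], axis=0)
    return np.allclose(avg_g, full_grad_x)
```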
1.2 Convergence of SVRG and Katyusha for L-smooth and μ-strongly convex functions
For instance, consider the regime where f_i is L-smooth for each i, and f is μ-strongly convex:
Assumption 1.1 (L-smoothness).
Functions f_i are L-smooth for some L > 0. That is, for all x, y ∈ ℝ^d,
(3)   \|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|.
Assumption 1.2 (μ-strong convexity).
Function f is μ-strongly convex for μ > 0. That is, for all x, y ∈ ℝ^d,
(4)   f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2.
In this regime, the iteration complexity of SVRG is
\mathcal{O}\left( \left( n + \frac{L}{\mu} \right) \log \frac{1}{\epsilon} \right),
which is a vast improvement on the linear rate of gradient descent (GD), which is \mathcal{O}\left( n \frac{L}{\mu} \log \frac{1}{\epsilon} \right), and on the sublinear rate of SGD, which is \mathcal{O}\left( \frac{\sigma^2}{\mu \epsilon} \right), where \sigma^2 := \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x_*)\|^2 and x_* is the (necessarily unique) minimizer of f. On the other hand, Katyusha enjoys the accelerated rate
\mathcal{O}\left( \left( n + \sqrt{\frac{nL}{\mu}} \right) \log \frac{1}{\epsilon} \right),
which is superior to that of SVRG in the ill-conditioned regime where L/μ ≥ n. This rate has been shown to be optimal in a certain precise sense (Nesterov, 2013).
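To get a feel for these rates, the following back-of-the-envelope computation (an illustration with made-up values of n and L/μ, not from the paper) compares the leading terms of the GD, SVRG and Katyusha complexities in an ill-conditioned setting; the log(1/ε) factor is common to all three and is dropped.

```python
import math

# Hypothetical problem sizes: n data points, condition number kappa = L / mu.
n, kappa = 10**5, 10**6

gd       = n * kappa                   # O(n * L/mu) gradient evaluations per log(1/eps)
svrg     = n + kappa                   # O(n + L/mu)
katyusha = n + math.sqrt(n * kappa)    # O(n + sqrt(n * L/mu))

print(f"GD:       {gd:.2e}")           # 1.00e+11
print(f"SVRG:     {svrg:.2e}")         # 1.10e+06
print(f"Katyusha: {katyusha:.2e}")     # 4.16e+05
```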
In the past several years, an enormous effort of the machine learning and optimization communities has been devoted to designing new efficient variance-reduced algorithms to tackle problem (1). These developments have brought about a renaissance in the field. The historically first provably variance-reduced method, the stochastic average gradient (SAG) method of Roux et al. (2012); Schmidt et al. (2017), was awarded the Lagrange prize in continuous optimization in 2018. The SAG method was later modified to an unbiased variant called SAGA (Defazio et al., 2014a), achieving the same theoretical rates.
Alternative variance-reduced methods include MISO (Mairal, 2015), FINITO (Defazio et al., 2014b), SDCA (Shalev-Shwartz, 2016), dfSDCA (Csiba & Richtárik, 2015), AdaSDCA (Csiba et al., 2015), QUARTZ (Qu et al., 2015), SBFGS (Gower et al., 2016), SDNA (Qu et al., 2016), SARAH (Nguyen et al., 2017), S2GD (Konečný & Richtárik, 2017), mS2GD (Konečný et al., 2016), RBCN (Doikov & Richtárik, 2018), JacSketch (Gower et al., 2018) and SAGD (Bibi et al., 2018). Accelerated variance-reduced methods were developed by Shalev-Shwartz & Zhang (2014), Defazio (2016), Zhou (2018) and Zhou et al. (2018).
2 Contributions
As explained in the introduction, a trademark structural feature of SVRG and its accelerated variant, Katyusha, is the presence of the outer loop in which a full pass over the data is made. However, the presence of this outer loop is the source of several issues. First, the methods are harder to analyze. Second, one needs to decide at which point the inner loop is terminated and the outer loop entered. For SVRG, the theoretically optimal inner loop size depends on both L and μ. However, μ is not always known. Moreover, even when an estimate is available, as is the case in regularized problems with an explicit strongly convex regularizer, the estimate can often be very loose. Because of these issues, one often chooses the inner loop size in a suboptimal way, such as by setting it to n or O(n).
2.1 Two loopless methods
In this paper we address the above issues by developing loopless variants of both SVRG and Katyusha; we refer to them as LSVRG and LKatyusha, respectively. In these methods, we dispose of the outer loop and replace its role by a biased coin flip, performed in every step of the methods, which is used to trigger the computation of the full gradient via a pass over the data. In particular, in each step, with (a small) probability p we perform a full pass over the data and update the reference gradient ∇f(w). With probability 1 − p we keep the previous reference gradient. This procedure can alternatively be interpreted as having an outer loop of a random length. However, the resulting methods are easier to write down, comprehend and analyze.
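A quick numerical check of this interpretation (an illustration, not from the paper): the number of steps between two consecutive full passes is geometrically distributed with mean 1/p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.01
# Number of steps between two consecutive full passes is geometric with mean 1/p.
gaps = rng.geometric(p, size=100_000)
print(gaps.mean())  # approximately 100 = 1/p
```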
2.2 Fast rates are preserved
We show that LSVRG and LKatyusha enjoy the same fast theoretical rates as their loopy forefathers. However, the proofs are different and the complexity results more insightful.
Convergence of LSVRG. For LSVRG run with a fixed stepsize η and probability p ∈ (0, 1], we show (see Theorem 4.5) that for the Lyapunov function
(5)
which combines the squared distance of x^k from the optimum x_* with the gradient-learning quantity D^k defined later in (10), we get E[Φ^k] ≤ εΦ^0 as long as
k ≥ O((1/p + L/μ) log(1/ε)).
In contrast, the classical SVRG result shows convergence of the expected functional suboptimality to zero at the same rate. Note that the classical result follows from our result by utilizing the inequality
f(x^k) - f(x_*) \le \frac{L}{2} \|x^k - x_*\|^2,
which is a simple consequence of L-smoothness. However, our result provides a deeper insight into the behavior of the method. In particular, it follows that the gradients ∇f_i(w^k) at the reference points converge to the gradients at the optimum. This is a key intuition behind the workings of SVRG, one not revealed by the classical analysis. Hereby we close the gap in the theoretical understanding of the SVRG convergence mechanism.
Our theory predicts that as long as p lies in the interval
(6)
which, up to constant factors, spans the range between 1/n and μ/L, LSVRG will enjoy the optimal complexity O((n + L/μ) log(1/ε)). In the ill-conditioned regime L/μ ≥ n, for instance, we roughly have p ∈ [μ/L, 1/n]. This is in contrast with the (loopy/standard) SVRG method, the outer loop of which needs to be of size ≈ L/μ. To the best of our knowledge, SVRG does not enjoy this rate for an outer loop of size O(n), even though this is the setting most often used in practice.
Convergence of LKatyusha. For LKatyusha with stepsize η, we show convergence of the Lyapunov function
(7) 
where
(8) 
and
(9) 
and where y^k, z^k and w^k are iterates produced by the method, with the parameters defined as in Section 5. Our main result (Theorem 5.6) states that E[Ψ^k] ≤ εΨ^0 once the number of iterations k is large enough; with the choice p = 1/n and appropriately chosen parameters, this recovers the accelerated O((n + √(nL/μ)) log(1/ε)) complexity of the original Katyusha.
2.3 Simplified Analysis
An advantage of the loopless approach is that the analysis of the decrease of the Lyapunov function over a single iteration is sufficient to establish convergence. In contrast, one needs to perform elaborate aggregation across the inner loop to prove the convergence of the original loopy methods.
2.4 Superior empirical behaviour
We show through extensive numerical testing on both synthetic and real data that our loopless methods are superior to their loopy variants.
We show through experiments that LSVRG is very robust to the choice of p from the optimal interval (6) predicted by our theory. Moreover, even the worst case for LSVRG outperforms the best case for SVRG. This shows how further randomization can significantly speed up and stabilize the algorithm.
3 Notations
Throughout the whole paper we use expectation conditional on the current iterates (on x^k and w^k for LSVRG; on y^k, z^k and w^k for LKatyusha), but for simplicity we will denote these expectations as E[·]. If E[·] refers to the unconditional expectation, this is mentioned explicitly.
4 Loopless SVRG (LSVRG)
In this section we describe in detail the Loopless SVRG method (LSVRG), and its convergence properties.
4.1 The algorithm
The LSVRG method, formalized as Algorithm 1, is inspired by the original SVRG (Johnson & Zhang, 2013) method. We remove the outer loop present in SVRG and instead use a probabilistic update of the full gradient. (This idea was independently explored by Hofmann et al. (2015); we learned about this work after the first draft of our paper was finished.)
This update can also be seen as generating the outer loop length from a geometric distribution, similarly to (Konečný & Richtárik, 2017; Lei et al., 2017). Note that the reference point w^k (at which a full gradient is computed) is updated in each iteration with probability p to the current iterate x^k, and is left unchanged with probability 1 − p. Alternatively, the probability p can be seen as a parameter that controls the expected time before the next full pass over the data. To be more precise, the expected time before the next full pass over the data is 1/p. Intuitively, we wish to keep p small so that full passes over the data are computed rarely enough. As we shall see next, the simple choice p = 1/n leads to complexity identical to that of the original SVRG.
4.2 Convergence theory
A key role in the analysis is played by the gradient-learning quantity D^k, defined as
(10)
and the Lyapunov function Φ^k, defined as
(11)
The analysis involves four lemmas, followed by the main theorem. We state the lemmas here as they highlight the way in which the argument works. All lemmas combined, together with the main theorem, can be proved on a single page, which underlines the simplicity of our approach.
Our first lemma upper bounds the expected squared distance of x^{k+1} from x_* in terms of the same distance but for x^k, the function suboptimality, and the variance of the gradient estimator g^k.
Lemma 4.1.
We have
(12) 
In our next lemma, we further bound the variance of g^k in terms of the function suboptimality and the gradient-learning quantity D^k.
Lemma 4.2.
We have
(13) 
Finally, we bound E[D^{k+1}] in terms of D^k and the function suboptimality.
Lemma 4.3.
We have
(14) 
Putting the above three lemmas together naturally leads to the following result involving Lyapunov function (5).
Lemma 4.4.
Let the step size satisfy η ≤ 1/(6L). Then for all k ≥ 0 the following inequality holds:
(15) 
In order to obtain a recursion involving the Lyapunov function on the right-hand side of (15), one more step is needed; it is carried out in the proof of our main theorem below.
Theorem 4.5.
Let η = 1/(6L) and p ∈ (0, 1]. Then E[Φ^k] ≤ εΦ^0 as long as k ≥ O((1/p + L/μ) log(1/ε)).
Proof.
As a corollary of Lemma 4.4, we have
Setting the parameters as in the statement of the theorem and unrolling the conditional expectations, one obtains
which concludes the proof. ∎
Note that the step size η does not depend on the strong convexity parameter μ, and yet the resulting complexity adapts to it.
4.3 Discussion
Examining (15), we can see that the contraction factor of the Lyapunov function is max{1 − ημ, 1 − p/2}. Due to the limitation η ≤ 1/(6L), the first term is at least 1 − μ/(6L), thus the iteration complexity cannot be better than O((L/μ) log(1/ε)). In terms of total complexity (number of stochastic gradient calls), LSVRG calls the stochastic gradient oracle in expectation pn + 2 times in each iteration: two calls to form the estimator g^k, plus a full pass over the data with probability p. Combining these two complexities together, one gets the total complexity
O((pn + 2)(1/p + L/μ) log(1/ε)).
Note that any choice of p in the interval (6), and in particular the simple choice p = 1/n, leads to the optimal total complexity O((n + L/μ) log(1/ε)). This fills the gap in SVRG theory, where the outer loop length (in our case 1/p in expectation) needs to be proportional to L/μ. Moreover, the analysis of LSVRG is much simpler and provides more insight.
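As a back-of-the-envelope check of this accounting (an illustration assuming the pn + 2 expected per-iteration cost stated above, with the log(1/ε) factor dropped):

```python
# Expected stochastic-gradient evaluations per iteration and total cost,
# assuming 2 gradients per step plus a full pass (n gradients) with probability p.
def expected_total_cost(n, kappa, p):
    per_iter = p * n + 2
    iters = 1.0 / p + kappa        # O((1/p + L/mu) log(1/eps)), log factor dropped
    return per_iter * iters

n, kappa = 10**5, 10**4            # hypothetical n and condition number L/mu
for p in (1.0 / n, 1.0 / kappa, 0.5):
    print(f"p = {p:.0e}: ~{expected_total_cost(n, kappa, p):.2e} gradient calls")
# Any p between 1/n and mu/L (here 1e-5 .. 1e-4) gives ~(n + kappa) up to constants,
# while a large p such as 0.5 degrades the bound toward n * kappa.
```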
5 Loopless Katyusha (LKatyusha)
In this section we describe in detail the Loopless Katyusha method (LKatyusha), and its convergence properties.
5.1 The algorithm
The LKatyusha method, formalized as Algorithm 2, is inspired by the original Katyusha (Allen-Zhu, 2017) method. We use the same technique as for Algorithm 1, where we remove the outer loop present in Katyusha and instead use a probabilistic update of the full gradient.
The same reasoning applies to the reference point (at which a full gradient is computed) as for LSVRG. Instead of updating this point in a deterministic way in every iteration, we use a probabilistic update with parameter p: the reference point is updated to the current iterate with probability p, and is left unchanged with probability 1 − p. As we shall see next, the same choice p = 1/n as for LSVRG leads to complexity identical to that of the original Katyusha.
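For intuition only, the following sketch shows one way a loopless accelerated (Katyusha-style) scheme can be organized: three coupled sequences plus the same coin-flip refresh of the reference point as in LSVRG. The parameter couplings (theta1, theta2, eta, sigma) and the exact point to which w is reset are illustrative assumptions, not a verbatim transcription of Algorithm 2.

```python
import numpy as np

def loopless_katyusha_sketch(grad_f_i, full_grad, x0, n, L, mu, p, num_iters,
                             theta1=None, theta2=0.5, rng=None):
    """Illustrative loopless accelerated scheme in the spirit of Algorithm 2.
    grad_f_i(x, i) and full_grad(x) are hypothetical helpers; parameter choices
    below are assumptions for the sketch, not the paper's exact constants."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = mu / L
    if theta1 is None:
        theta1 = min(np.sqrt(2.0 * n * sigma / 3.0), 0.5)   # assumed coupling
    eta = theta2 / ((1.0 + theta2) * theta1)                 # assumed step size

    y = x0.copy()
    z = x0.copy()
    w = x0.copy()
    g_ref = full_grad(w)
    for _ in range(num_iters):
        # theta2 * w is the "negative momentum" term pulling toward the reference point.
        x = theta1 * z + theta2 * w + (1.0 - theta1 - theta2) * y
        i = rng.integers(n)
        g = grad_f_i(x, i) - grad_f_i(w, i) + g_ref           # variance-reduced gradient
        z_new = (eta * sigma * x + z - (eta / L) * g) / (1.0 + eta * sigma)
        y = x + theta1 * (z_new - z)
        z = z_new
        if rng.random() < p:                                  # loopless reference refresh
            w = y.copy()
            g_ref = full_grad(w)
    return y
```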
5.2 Convergence theory
In comparison to LSVRG, we do not use gradient mapping as the key component of our analysis. Instead, we prove convergence of functional values at the iterates y^k and w^k, and pointwise convergence of the iterates z^k. This is summarized in the following Lyapunov function:
(16) 
where
(17) 
and
(18) 
Note that even though x^k does not appear in this function, its convergence is directly implied by the convergence of the other iterates, due to the definition of x^k in Algorithm 2 and the smoothness of f.
The analysis involves five lemmas, followed by the iteration complexity summarized in the main theorem. The lemmas highlight important steps of our analysis. The simplicity of our approach is preserved: all lemmas combined, together with the main theorem, can be proved in no more than two pages.
Our first lemma upper bounds the variance of the gradient estimator g^k, which eventually goes to zero as our algorithm progresses.
Lemma 5.1.
We have
(19) 
The next two lemmas are more technical, but essential for proving the convergence.
Lemma 5.2.
We have
(20) 
Lemma 5.3.
We have
(21) 
Finally, we use the update rule of Algorithm 2 to obtain a decomposition relating consecutive iterates, which is one of the main components that allows for a simpler analysis than that of the original Katyusha.
Lemma 5.4.
We have
(22) 
Putting all lemmas together, we obtain the following contraction of the Lyapunov function defined in (7).
Lemma 5.5.
Let the parameters be chosen as in Algorithm 2. Then we have
(23) 
In order to obtain a recursion involving the Lyapunov function on the right-hand side of (23), one more step is needed; it is carried out in the proof of our main theorem below.
Theorem 5.6.
Let the parameters be chosen as in Lemma 5.5. Then E[Ψ^k] ≤ εΨ^0 after the following number of iterations:
Proof.
As a corollary of Lemma 5.5, we have
Setting the parameters as in the statement of the theorem and unrolling the conditional expectations, one obtains the claimed iteration bound, which concludes the proof. ∎
5.3 Discussion
6 Numerical Experiments
In this section, we perform experiments with logistic regression for binary classification with an ℓ₂ regularizer, where our loss function has the form
f_i(x) = \log\left(1 + \exp(-b_i a_i^\top x)\right) + \frac{\lambda}{2} \|x\|^2,
where a_i ∈ ℝ^d and b_i ∈ {−1, +1}. Hence, f is smooth and strongly convex. We use the following datasets from the LIBSVM library (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): a9a, w8a, mushrooms, phishing, cod-rna.
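A minimal sketch of the per-sample loss and gradient just described (our own illustration, with the regularization weight lam playing the role of λ); these functions can be plugged into the LSVRG sketch in Section 4.1 as the grad_f_i and full_grad helpers.

```python
import numpy as np

def logreg_loss_i(x, A, b, lam, i):
    """f_i(x) = log(1 + exp(-b_i * a_i^T x)) + (lam/2) * ||x||^2."""
    margin = b[i] * (A[i] @ x)
    return np.log1p(np.exp(-margin)) + 0.5 * lam * np.dot(x, x)

def logreg_grad_i(x, A, b, lam, i):
    """Gradient of f_i: -b_i * a_i / (1 + exp(b_i * a_i^T x)) + lam * x."""
    margin = b[i] * (A[i] @ x)
    return -b[i] * A[i] / (1.0 + np.exp(margin)) + lam * x

def logreg_full_grad(x, A, b, lam):
    """Full gradient: average of the per-sample gradients (one pass over the data)."""
    n = A.shape[0]
    return np.mean([logreg_grad_i(x, A, b, lam, i) for i in range(n)], axis=0)
```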
We compare our methods LSVRG and LKatyusha with their original versions. It is well known that, whenever practical, SAGA is a bit faster than SVRG. While a comparison to SAGA seems natural as it also does not have a double-loop structure, we position our loopless methods for applications where the high memory requirements of SAGA prevent it from being applied. Thus, we do not compare to SAGA.
For Section 6.1, the parameters are chosen as suggested by theory. For LSVRG and LKatyusha they are chosen based on Theorems 4.5 and 5.6, respectively. For SVRG and Katyusha we also choose the parameters based on the theory, as described in the original papers. The initial point is chosen to be the origin. In Section 6.2, we provide several choices of parameters. Plots are constructed in such a way that the vertical axis displays the functional suboptimality of the current iterate for LSVRG and LKatyusha, where the optimal value was obtained by running gradient descent for a large number of epochs. The horizontal axis displays the number of epochs (full gradient evaluations); that is, n computations of stochastic gradients ∇f_i equals one epoch.
6.1 Superior practical behaviour of the loopless approach
In this section, we show that the replacement of the outer loop in SVRG and Katyusha brings not just a simpler analysis but also speed up in the experiments.
In theory, both the loopy and the loopless methods are the same. However, as we can see from Figure 9, the improvement of the loopless approach can be significant. One can see that for these datasets, LSVRG is always better than SVRG, and can be faster by several orders of magnitude!
Looking at Figure 16, we see that the performance of LKatyusha is at least as good as that of Katyusha, and can be significantly faster in some cases. These experiments support the claim that not only is the theoretical analysis simpler, but the practical performance is also affected in a positive way.
6.2 Different choices of probability / outer loop size
In this section, we compare several choices of the probability p of updating the full gradient for LSVRG and several outer loop sizes for SVRG. Since our analysis guarantees the optimal rate for any choice of p between 1/n and μ/L for well-conditioned problems, we decided to perform experiments with p within this range. More precisely, we choose several values of p, uniformly distributed on a logarithmic scale across this interval; the specific values are indicated in the figures (a sketch of how such a grid can be generated is given after this paragraph). Since the expected “outer loop” length (the number of iterations for which the reference point stays the same) is 1/p, for SVRG we choose the corresponding outer loop sizes 1/p. Looking at Figure 20, one can see that LSVRG is very robust to the choice of p from the “optimal interval” predicted by our theory. Moreover, even the worst case for LSVRG outperforms the best case for SVRG.
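For illustration, such a logarithmically spaced grid of probabilities could be generated as follows (the endpoints 1/n and μ/L, the constants, and the number of grid points are assumptions for the sketch, not the paper's exact values).

```python
import numpy as np

n, L, mu = 32561, 1.0, 1e-4          # hypothetical constants (n roughly the size of a9a)
p_low, p_high = sorted((1.0 / n, mu / L))
probs = np.logspace(np.log10(p_low), np.log10(p_high), num=5)  # log-uniform grid of p
outer_loop_sizes = np.round(1.0 / probs).astype(int)           # matching SVRG epoch lengths
print(probs)
print(outer_loop_sizes)
```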
6.3 All methods together
As the last visualization, we plot all algorithms together in one graph for different datasets and different regularizer weights, and thus different condition numbers; see Figure 39. As in the previous experiments, the loopless methods are never worse and are sometimes significantly better.
References

Allen-Zhu (2017) Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.
 Bibi et al. (2018) Bibi, A., Sailanbayev, A., Ghanem, B., Gower, R. M., and Richtárik, P. Improving SAGA via a probabilistic interpolation with gradient descent. arXiv:1806.05633, 2018.
 Csiba & Richtárik (2015) Csiba, D. and Richtárik, P. Primal method for ERM with flexible minibatching schemes and nonconvex losses. arXiv:1506.02227, 2015.
 Csiba et al. (2015) Csiba, D., Qu, Z., and Richtárik, P. Stochastic dual coordinate ascent with adaptive probabilities. In Proceedings of the 32nd International Conference on Machine Learning, pp. 674–683, 2015.
 Defazio (2016) Defazio, A. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pp. 676–684, 2016.
 Defazio et al. (2014a) Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014a.
 Defazio et al. (2014b) Defazio, A., Caetano, T., and Domke, J. Finito: A faster, permutable incremental gradient method for Big Data problems. The 31st International Conference on Machine Learning, 2014b.
 Doikov & Richtárik (2018) Doikov, N. and Richtárik, P. Randomized block cubic Newton method. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Gower et al. (2016) Gower, R. M., Goldfarb, D., and Richtárik, P. Stochastic block BFGS: squeezing more curvature out of data. In 33rd International Conference on Machine Learning, pp. 1869–1878, 2016.
 Gower et al. (2018) Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasigradient methods: variance reduction via Jacobian sketching. arXiv:1805.02632, 2018.

Hofmann et al. (2015) Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pp. 2305–2313, 2015.
 Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
 Konečný & Richtárik (2017) Konečný, J. and Richtárik, P. S2GD: Semistochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, pp. 1–14, 2017.
 Konečný et al. (2016) Konečný, J., Lu, J., Richtárik, P., and Takáč, M. Minibatch semistochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.
 Lei et al. (2017) Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2348–2358, 2017.
 Mairal (2015) Mairal, J. Incremental majorizationminimization optimization with application to largescale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
 Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
 Nemirovsky & Yudin (1983) Nemirovsky, A. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley, New York, 1983.
 Nesterov (2013) Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pp. 2613–2621, 2017.
 Qu et al. (2015) Qu, Z., Richtárik, P., and Zhang, T. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pp. 865–873, 2015.
 Qu et al. (2016) Qu, Z., Richtárik, P., Takáč, M., and Fercoq, O. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In The 33rd International Conference on Machine Learning, pp. 1823–1832, 2016.
 Roux et al. (2012) Roux, N. L., Schmidt, M., and Bach, F. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.
 Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(12):83–112, 2017.
 Shalev-Shwartz (2016) Shalev-Shwartz, S. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pp. 747–754, 2016.
 Shalev-Shwartz & Zhang (2014) Shalev-Shwartz, S. and Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pp. 64–72, 2014.
 Takáč et al. (2013) Takáč, M., Bijral, A., Richtárik, P., and Srebro, N. Minibatch primal and dual methods for SVMs. In 30th International Conference on Machine Learning, pp. 537–552, 2013.
 Zhou (2018) Zhou, K. Direct acceleration of SAGA using sampled negative momentum. arXiv preprint arXiv:1806.11048, 2018.
 Zhou et al. (2018) Zhou, K., Shang, F., and Cheng, J. A simple stochastic variance reduced algorithm with fast convergence rates. arXiv preprint arXiv:1806.11027, 2018.
Appendix A Auxiliary Lemmas
Lemma A.1.
For a random vector X ∈ ℝ^d and any y ∈ ℝ^d, the variance can be decomposed as
(24)   \mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X - y\|^2 - \|\mathbb{E}X - y\|^2.
Lemma A.2.
For any vectors a_1, …, a_k ∈ ℝ^d, we have, as a consequence of Jensen’s inequality applied to the convex function x ↦ ‖x‖²,
(25)   \left\| \frac{1}{k} \sum_{i=1}^{k} a_i \right\|^2 \le \frac{1}{k} \sum_{i=1}^{k} \|a_i\|^2.
Appendix B Proofs for Algorithm 1 (LSVRG)
Let .
Proof of Lemma 4.1.
The definition of the update in Algorithm 1 and the unbiasedness of the gradient estimator g^k guarantee that
∎
Proof of Lemma 4.2.
Using the definition of the gradient estimator g^k,
∎
Proof of Lemma 4.3.
∎
Appendix C Proofs for Algorithm 2 (LKatyusha)
Proof of Lemma 5.3.
where the last inequality is an application of Young’s inequality, which concludes the proof. ∎
Proof of Lemma 5.4.
Proof of Lemma 5.5.
Combining all the previous lemmas together we obtain