Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop

01/24/2019 · by Dmitry Kovalev, et al. · King Abdullah University of Science and Technology

The stochastic variance-reduced gradient method (SVRG) and its accelerated variant (Katyusha) have attracted enormous attention in the machine learning community in the last few years due to their superior theoretical properties and empirical behaviour on training supervised machine learning models via the empirical risk minimization paradigm. A key structural element in both of these methods is the inclusion of an outer loop at the beginning of which a full pass over the training data is made in order to compute the exact gradient, which is then used to construct a variance-reduced estimator of the gradient. In this work we design loopless variants of both of these methods. In particular, we remove the outer loop and replace its function by a coin flip performed in each iteration designed to trigger, with a small probability, the computation of the gradient. We prove that the new methods enjoy the same superior theoretical convergence properties as the original methods. However, we demonstrate through numerical experiments that our methods have substantially superior practical behavior.


1 Introduction

Empirical risk minimization (aka finite-sum) problems form the dominant paradigm for training supervised machine learning models such as ridge regression, support vector machines, logistic regression, and neural networks. In its most general form, a finite-sum problem has the form

min_{x ∈ ℝ^d}  f(x) := (1/n) Σ_{i=1}^n f_i(x),    (1)

where n refers to the number of training data points (e.g., videos, images, molecules, text corpora), x ∈ ℝ^d is the vector representation of a model using d features, and f_i(x) is the loss of model x on data point i.

1.1 Variance-reduced methods

One of the most remarkable algorithmic breakthroughs in recent years was the development of variance-reduced stochastic gradient algorithms for solving (1). These methods are significantly faster than SGD (Nemirovsky & Yudin, 1983; Nemirovski et al., 2009; Takáč et al., 2013) in theory and practice on convex and strongly convex problems, and faster in theory on several classes of nonconvex problems (unfortunately, variance-reduced methods are not, as of yet, state-of-the-art methods for training production-grade neural networks).

Two of the most notable and popular methods belonging to the family of variance-reduced methods are SVRG Johnson & Zhang (2013) and its accelerated variant known as Katyusha Allen-Zhu (2017). The latter method accelerates the former via the employment of a novel “negative momentum” idea. Both of these methods have a double loop design. At the beginning of the outer loop, a full pass over the training data is made to compute the gradient ∇f(w^k) of f at a reference point w^k, which is chosen as the freshest iterate (for SVRG) or a weighted average of recent iterates (for Katyusha). This gradient is then used in the inner loop to adjust the stochastic gradient ∇f_i(x^k), where i is sampled uniformly at random from {1, 2, …, n} and x^k is the current iterate, so as to reduce its variance. In particular, both SVRG and Katyusha perform the adjustment

g^k = ∇f_i(x^k) − ∇f_i(w^k) + ∇f(w^k).

Note that, like ∇f_i(x^k), the new search direction g^k is an unbiased estimator of ∇f(x^k). Indeed,

E[g^k] = (1/n) Σ_{i=1}^n (∇f_i(x^k) − ∇f_i(w^k)) + ∇f(w^k) = ∇f(x^k) − ∇f(w^k) + ∇f(w^k) = ∇f(x^k),    (2)

where the expectation is taken over the random choice of i. However, it turns out that, as the methods progress, the variance of g^k, unlike that of ∇f_i(x^k), progressively decreases to zero. The total effect of this is significantly faster convergence.
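To make the construction concrete, here is a small Python sketch (with an illustrative synthetic quadratic finite sum; the names grad_fi and grad_f are ours, not from the paper) that forms the variance-reduced direction and numerically verifies the unbiasedness property (2).

```python
import numpy as np

# Illustrative synthetic finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # gradient of the i-th component f_i
    return A[i] * (A[i] @ x - b[i])

def grad_f(x):
    # full gradient of f = (1/n) sum_i f_i
    return A.T @ (A @ x - b) / n

x = rng.standard_normal(d)   # current iterate x^k
w = rng.standard_normal(d)   # reference point w^k

i = rng.integers(n)
g = grad_fi(x, i) - grad_fi(w, i) + grad_f(w)   # variance-reduced direction g^k

# Unbiasedness (2): averaging g over all n choices of i recovers the exact gradient.
g_avg = np.mean([grad_fi(x, j) - grad_fi(w, j) + grad_f(w) for j in range(n)], axis=0)
assert np.allclose(g_avg, grad_f(x))
```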

1.2 Convergence of SVRG and Katyusha for L-smooth and μ-strongly convex functions

For instance, consider the regime where each f_i is L-smooth and f is μ-strongly convex:

Assumption 1.1 (L-smoothness).

Functions f_1, …, f_n are L-smooth for some L > 0. That is, for all x, y ∈ ℝ^d

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖.    (3)

Assumption 1.2 (μ-strong convexity).

Function f is μ-strongly convex for μ > 0. That is, for all x, y ∈ ℝ^d

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖².    (4)
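As a concrete illustration (our own example, not one used in the paper), the ridge regression losses below satisfy both assumptions; the constants follow from the Hessian computation indicated in the comments.

```latex
% Illustrative example (not from the paper): ridge regression.
\[
  f_i(x) = \tfrac{1}{2}\left(a_i^\top x - b_i\right)^2 + \tfrac{\lambda}{2}\|x\|^2,
  \qquad
  \nabla^2 f_i(x) = a_i a_i^\top + \lambda I .
\]
% Since 0 <= nabla^2 f_i(x) <= (||a_i||^2 + lambda) I, each f_i satisfies
% Assumption 1.1 with L = max_i ||a_i||^2 + lambda; and since
% nabla^2 f(x) >= lambda I, f satisfies Assumption 1.2 with mu >= lambda.
```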

In this regime, the iteration complexity of SVRG is

O((n + L/μ) log(1/ε)),

which is a vast improvement on the linear rate of gradient descent (GD), which is O(n (L/μ) log(1/ε)), and on the sublinear rate of SGD, which is O(σ²/(μ²ε)), where σ² := (1/n) Σ_{i=1}^n ‖∇f_i(x^*)‖² and x^* is the (necessarily unique) minimizer of f. On the other hand, Katyusha enjoys the accelerated rate

O((n + √(nL/μ)) log(1/ε)),

which is superior to that of SVRG in the ill-conditioned regime where L/μ ≥ n. This rate has been shown to be optimal in a certain precise sense (Nesterov, 2013).

In the past several years, an enormous effort of the machine learning and optimization communities was exerted into designing new efficient variance-reduced algorithms to tackle problem (1). These developments have brought about a renaissance in the field. The historically first provably variance-reduced method, the stochastic average gradient (SAG) method of Roux et al. (2012); Schmidt et al. (2017), was awarded the Lagrange prize in continuous optimization in 2018. The SAG method was later modified to an unbiased variant called SAGA (Defazio et al., 2014a), achieving the same theoretical rates.

Alternative variance-reduced methods include MISO (Mairal, 2015), FINITO (Defazio et al., 2014b), SDCA (Shalev-Shwartz, 2016), dfSDCA (Csiba & Richtárik, 2015), AdaSDCA (Csiba et al., 2015), QUARTZ (Qu et al., 2015), SBFGS (Gower et al., 2016), SDNA (Qu et al., 2016), SARAH (Nguyen et al., 2017), S2GD (Konečný & Richtárik, 2017), mS2GD (Konečný et al., 2016), RBCN (Doikov & Richtárik, 2018), JacSketch (Gower et al., 2018) and SAGD (Bibi et al., 2018). Accelerated variance-reduced methods were developed by Shalev-Shwartz & Zhang (2014), Defazio (2016), Zhou (2018) and Zhou et al. (2018).

2 Contributions

As explained in the introduction, a trademark structural feature of SVRG and its accelerated variant, Katyusha, is the presence of the outer loop in which a full pass over the data is made. However, the presence of this outer loop is the source of several issues. First, the methods are harder to analyze. Second, one needs to decide at which point the inner loop is terminated and the outer loop entered. For SVRG, the theoretically optimal inner loop size depends on both L and μ. However, μ is not always known. Moreover, even when an estimate is available, as is the case in regularized problems with an explicit strongly convex regularizer, the estimate can often be very loose. Because of these issues, one often chooses the inner loop size in a suboptimal way, such as by setting it to n or 2n.

2.1 Two loopless methods

In this paper we address the above issues by developing loopless variants of both SVRG and Katyusha; we refer to them as L-SVRG and L-Katyusha, respectively. In these methods, we dispose of the outer loop and replace its role by a biased coin flip, to be performed in every step of the methods, used to trigger the computation of the gradient via a pass over the data. In particular, in each step, with (a small) probability p we perform a full pass over the data and update the reference gradient ∇f(w^k). With probability 1 − p we keep the previous reference gradient. This procedure can alternatively be interpreted as having an outer loop of a random length, as illustrated below. However, the resulting methods are easier to write down, comprehend and analyze.
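As a quick sanity check on this interpretation (illustrative code, not part of the paper), the number of iterations between two consecutive reference-point updates is geometrically distributed with mean 1/p:

```python
import numpy as np

# Simulate the coin flips and check that the random "outer loop" length has mean 1/p.
rng = np.random.default_rng(0)
p, num_iters = 0.01, 200_000            # illustrative values

refresh_iters = np.flatnonzero(rng.random(num_iters) < p)  # iterations with a full pass
loop_lengths = np.diff(refresh_iters)                      # lengths of the random outer loops
print(np.mean(loop_lengths), 1 / p)                        # both close to 100
```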

2.2 Fast rates are preserved

We show that L-SVRG and L-Katyusha enjoy the same fast theoretical rates as their loopy forefathers. However, the proofs are different and the complexity results more insightful.

Convergence of L-SVRG. For L-SVRG with fixed stepsize γ = 1/(6L) and probability p ∈ (0, 1], we show (see Theorem 4.5) that for the Lyapunov function

Φ^k := ‖x^k − x^*‖² + (4γ²/(pn)) Σ_{i=1}^n ‖∇f_i(w^k) − ∇f_i(x^*)‖²    (5)

we get E[Φ^k] ≤ εΦ^0 as long as k ≥ O((1/p + L/μ) log(1/ε)).

In contrast, the classical SVRG result shows convergence of the expected functional suboptimality E[f(x^k) − f(x^*)] to zero at the same rate. Note that the classical result follows from our result by utilizing the inequality

f(x^k) − f(x^*) ≤ (L/2)‖x^k − x^*‖²,

which is a simple consequence of L-smoothness. However, our result provides a deeper insight into the behavior of the method. In particular, it follows that the gradients ∇f_i(w^k) at the reference points converge to the gradients ∇f_i(x^*) at the optimum. This is a key intuition behind the workings of SVRG, one not revealed by the classical analysis. Hereby we close the gap in the theoretical understanding of the SVRG convergence mechanism.

Our theory predicts that as long as

p ∈ [min{c/n, c/κ}, max{c/n, c/κ}],    (6)

where κ := L/μ and c is a universal constant, L-SVRG will enjoy the optimal complexity O((n + L/μ) log(1/ε)). In the ill-conditioned regime L/μ ≥ n, for instance, we roughly have p ∈ [μ/L, 1/n]. This is in contrast with the (loopy/standard) SVRG method, the outer loop of which needs to be of size ≈ L/μ. To the best of our knowledge, SVRG does not enjoy this rate for an outer loop of size n, even though this is the setting most often used in practice.

Convergence of L-Katyusha. For L-Katyusha with stepsize η = θ_2/((1 + θ_2)θ_1) we show convergence of the Lyapunov function

Ψ^k = Z^k + Y^k + W^k,    (7)

where

Z^k = (L(1 + ησ)/(2η)) ‖z^k − x^*‖²    (8)

and

Y^k = (1/θ_1)(f(y^k) − f(x^*)),    W^k = (θ_2/(pθ_1))(f(w^k) − f(x^*)),    (9)

and where x^k, y^k, z^k and w^k are the iterates produced by the method, with the parameters σ := μ/L, η = θ_2/((1 + θ_2)θ_1), and θ_1, θ_2 ∈ (0, 1) to be specified in Theorem 5.6. Our main result (Theorem 5.6) states that E[Ψ^k] ≤ εΨ^0 as long as k ≥ O((n + √(nL/μ)) log(1/ε)).

2.3 Simplified Analysis

An advantage of the loopless approach is that the analysis of the decrease of the Lyapunov function over a single iteration is sufficient to establish convergence. In contrast, one needs to perform elaborate aggregation across the inner loop to prove the convergence of the original loopy methods.

2.4 Superior empirical behaviour

We show through extensive numerical testing on both synthetic and real data that our loopless methods are superior to their loopy variants.

We show through experiments that L-SVRG is very robust to the choice of p from the optimal interval (6) predicted by our theory. Moreover, even the worst case for L-SVRG outperforms the best case for SVRG. This shows how further randomization can significantly speed up and stabilize the algorithm.

3 Notations

Throughout the whole paper we use the conditional expectation E[· | x^k, w^k] for L-SVRG and E[· | x^k, y^k, z^k, w^k] for L-Katyusha, but for simplicity we denote these expectations by E[·]. When E[·] refers to the unconditional expectation, this is mentioned explicitly.

4 Loopless SVRG (L-SVRG)

In this section we describe in detail the Loopless SVRG method (L-SVRG), and its convergence properties.

4.1 The algorithm

The L-SVRG method, formalized as Algorithm 1, is inspired by the original SVRG (Johnson & Zhang, 2013) method. We remove the outer loop present in SVRG and instead use a probabilistic update of the full gradient. (This idea was independently explored by Hofmann et al. (2015); we learned about this work after a first draft of our paper was finished.)

This update can also be seen as generating the outer loop length from a geometric distribution, similar to (Konečný & Richtárik, 2017; Lei et al., 2017).

  Parameters: stepsize γ > 0, probability p ∈ (0, 1]
  Initialization: x^0 = w^0 ∈ ℝ^d
  for k = 0, 1, 2, … do
     Sample i ∈ {1, …, n} uniformly at random
     g^k = ∇f_i(x^k) − ∇f_i(w^k) + ∇f(w^k)
     x^{k+1} = x^k − γ g^k
     w^{k+1} = x^k with probability p, and w^{k+1} = w^k with probability 1 − p
  end for
Algorithm 1 Loopless SVRG (L-SVRG)

Note that the reference point w^k (at which a full gradient is computed) is updated in each iteration with probability p to the current iterate x^k, and is left unchanged with probability 1 − p. Alternatively, the probability p can be seen as a parameter that controls the expected time before the next full pass over the data. To be more precise, the expected time before the next full pass over the data is 1/p. Intuitively, we wish to keep p small so that full passes over the data are computed rarely enough. As we shall see next, the simple choice p = 1/n leads to a complexity identical to that of the original SVRG.
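For concreteness, a minimal Python sketch of Algorithm 1 follows; the gradient oracles grad_fi and grad_f are assumed to be supplied by the user, and all variable names are ours.

```python
import numpy as np

def l_svrg(grad_fi, grad_f, x0, n, gamma, p, num_iters, rng=None):
    """Loopless SVRG (Algorithm 1) as a single flat loop.

    grad_fi(x, i): gradient of f_i at x;  grad_f(x): full gradient of f at x.
    gamma: stepsize;  p: probability of refreshing the reference point.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    w = x0.copy()             # reference point w^k
    full_grad = grad_f(w)     # reference gradient, recomputed only on coin flips
    for _ in range(num_iters):
        i = rng.integers(n)
        g = grad_fi(x, i) - grad_fi(w, i) + full_grad   # variance-reduced estimator g^k
        x_prev, x = x, x - gamma * g
        if rng.random() < p:  # coin flip replaces the outer loop
            w = x_prev        # reference point set to the current iterate x^k
            full_grad = grad_f(w)
    return x
```

With γ = 1/(6L) and p = 1/n (the choices suggested by Theorem 4.5), each iteration costs two stochastic gradients plus, in expectation, one full pass every n iterations.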

4.2 Convergence theory

A key role in the analysis is played by the gradient learning quantity

D^k := (4γ²/(pn)) Σ_{i=1}^n ‖∇f_i(w^k) − ∇f_i(x^*)‖²    (10)

and the Lyapunov function

Φ^k := ‖x^k − x^*‖² + D^k.    (11)

The analysis involves four lemmas, followed by the main theorem. We wish to mention the lemmas as they highlight the way in which the argument works. All lemmas combined, together with the main theorem, can be proved on a single page, which underlines the simplicity of our approach.

Our first lemma upper bounds the expected squared distance of x^{k+1} from x^* in terms of the same distance for x^k, the function suboptimality, and the second moment of g^k.

Lemma 4.1.

We have

E[‖x^{k+1} − x^*‖²] ≤ (1 − γμ)‖x^k − x^*‖² + γ² E[‖g^k‖²] − 2γ(f(x^k) − f(x^*)).    (12)

In our next lemma, we further bound the second moment of g^k in terms of the function suboptimality and D^k.

Lemma 4.2.

We have

E[‖g^k‖²] ≤ 4L(f(x^k) − f(x^*)) + (p/(2γ²)) D^k.    (13)

Finally, we bound E[D^{k+1}] in terms of D^k and the function suboptimality.

Lemma 4.3.

We have

E[D^{k+1}] ≤ (1 − p) D^k + 8Lγ² (f(x^k) − f(x^*)).    (14)

Putting the above three lemmas together naturally leads to the following result involving the Lyapunov function (5).

Lemma 4.4.

Let the step size γ ≤ 1/(6L). Then for all k ≥ 0 the following inequality holds:

E[Φ^{k+1}] ≤ (1 − γμ)‖x^k − x^*‖² + (1 − p/2) D^k.    (15)

In order to obtain a recursion involving the Lyapunov function on the right-hand side of (15), it suffices to bound it by max{1 − γμ, 1 − p/2} Φ^k, which leads to our main convergence result.

Theorem 4.5.

Let γ = 1/(6L) and p ∈ (0, 1]. Then E[Φ^k] ≤ εΦ^0 as long as k ≥ O((1/p + L/μ) log(1/ε)).

Proof.

As a corollary of Lemma 4.4 we have

E[Φ^{k+1}] ≤ max{1 − γμ, 1 − p/2} Φ^k.

Setting γ = 1/(6L) and unrolling the conditional expectation, one obtains

E[Φ^k] ≤ (max{1 − μ/(6L), 1 − p/2})^k Φ^0,

which concludes the proof. ∎

Note that the step size γ does not depend on the strong convexity parameter μ, and yet the resulting complexity adapts to it.

4.3 Discussion

Examining (15), we can see that the contraction factor of the Lyapunov function is max{1 − γμ, 1 − p/2}. Due to the limitation γ ≤ 1/(6L), the first term is at least 1 − μ/(6L), thus the iteration complexity cannot be better than O((L/μ) log(1/ε)). In terms of total complexity (number of stochastic gradient calls), L-SVRG calls the stochastic gradient oracle 2 + pn times in expectation in each iteration. Combining these two complexities together, one gets the total complexity

O((2 + pn)(1/p + L/μ) log(1/ε)).

Note that any choice of

p ∈ [min{c/n, c μ/L}, max{c/n, c μ/L}],

where c is a universal constant, leads to the optimal total complexity O((n + L/μ) log(1/ε)). This fills the gap in SVRG theory, where the outer loop length (in our case 1/p in expectation) needs to be proportional to L/μ. Moreover, the analysis of L-SVRG is much simpler and provides more insight.
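The accounting behind this claim can be made explicit; the following back-of-the-envelope computation (constants suppressed, based on the rate stated above) shows where the interval for p comes from.

```latex
% Expected cost of one L-SVRG iteration: 2 stochastic gradients,
% plus a full pass (n gradients) with probability p.
\[
  \underbrace{(2 + pn)}_{\text{expected cost per iteration}}
  \cdot
  \underbrace{\mathcal{O}\!\Big(\big(\tfrac{1}{p} + \tfrac{L}{\mu}\big)\log\tfrac{1}{\epsilon}\Big)}_{\text{number of iterations}}
  \;=\;
  \mathcal{O}\!\Big(\big(\tfrac{1}{p} + n + \tfrac{L}{\mu} + pn\tfrac{L}{\mu}\big)\log\tfrac{1}{\epsilon}\Big).
\]
% The two p-dependent terms are O(n + L/mu) precisely when
%   1/p <= c(n + L/mu)   and   p <= c(1/n + mu/L),
% which, up to constant factors, is the interval in (6).
```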

5 Loopless Katyusha (L-Katyusha)

In this section we describe in detail the Loopless Katyusha method (L-Katyusha), and its convergence properties.

5.1 The algorithm

The L-Katyusha method, formalized as Algorithm 2, is inspired by the original Katyusha (Allen-Zhu, 2017) method. We use the same technique as for Algorithm 1, where we remove the outer loop present in Katyusha and instead use a probabilistic update of the full gradient.

  Parameters: θ_1, θ_2 ∈ (0, 1), probability p ∈ (0, 1]
  Initialization: Choose y^0 = w^0 = z^0 ∈ ℝ^d, stepsize η = θ_2/((1 + θ_2)θ_1) and set σ = μ/L
  for k = 0, 1, 2, … do
     x^k = θ_1 z^k + θ_2 w^k + (1 − θ_1 − θ_2) y^k
     Sample i ∈ {1, …, n} uniformly at random
     g^k = ∇f_i(x^k) − ∇f_i(w^k) + ∇f(w^k)
     z^{k+1} = (1/(1 + ησ)) (ησ x^k + z^k − (η/L) g^k)
     y^{k+1} = x^k + θ_1 (z^{k+1} − z^k)
     w^{k+1} = y^k with probability p, and w^{k+1} = w^k with probability 1 − p
  end for
Algorithm 2 Loopless Katyusha (L-Katyusha)

The exact analogy applies to the reference point w^k (at which a full gradient is computed) as for L-SVRG. Instead of updating this point in a deterministic way every iteration, we use a probabilistic update with parameter p: we update w^{k+1} to the current iterate y^k with this probability, and leave it unchanged with probability 1 − p. As we shall see next, the same choice p = 1/n as for L-SVRG leads to a complexity identical to that of the original Katyusha.
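Since the listing of Algorithm 2 above is only partially legible in this rendering, the following Python sketch should be read as an assumption-laden reconstruction: it keeps the loopless structure (coupling step, variance-reduced gradient, coin-flip refresh of w), but the exact coefficient formulas are standard Katyusha-style choices rather than a verbatim transcription of the paper's listing.

```python
import numpy as np

def l_katyusha_sketch(grad_fi, grad_f, x0, n, L, mu, p, theta1, theta2, num_iters, rng=None):
    """Loopless Katyusha-style sketch (structure of Algorithm 2).

    The coupling/step formulas below are standard Katyusha-type choices and are
    assumptions of this sketch, not a transcription of the paper's listing.
    """
    rng = rng or np.random.default_rng(0)
    sigma = mu / L
    eta = theta2 / ((1 + theta2) * theta1)        # assumed stepsize parameter
    y = x0.copy(); z = x0.copy(); w = x0.copy()
    full_grad = grad_f(w)
    for _ in range(num_iters):
        x = theta1 * z + theta2 * w + (1 - theta1 - theta2) * y   # coupling step
        i = rng.integers(n)
        g = grad_fi(x, i) - grad_fi(w, i) + full_grad             # variance-reduced gradient
        z_new = (eta * sigma * x + z - (eta / L) * g) / (1 + eta * sigma)
        y_prev, y = y, x + theta1 * (z_new - z)   # "negative momentum" interpolation
        z = z_new
        if rng.random() < p:                      # coin flip replaces the outer loop
            w = y_prev                            # refresh reference point (assumed choice)
            full_grad = grad_f(w)
    return y
```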

5.2 Convergence theory

In comparison with L-SVRG, we do not use the gradient mapping as the key component of our analysis. Instead, we prove convergence of the functional values at y^k and w^k and point-wise convergence of z^k. This is summarized in the following Lyapunov function:

Ψ^k = Z^k + Y^k + W^k,    (16)

where

Z^k = (L(1 + ησ)/(2η)) ‖z^k − x^*‖²    (17)

and

Y^k = (1/θ_1)(f(y^k) − f(x^*)),    W^k = (θ_2/(pθ_1))(f(w^k) − f(x^*)).    (18)

Note that even though x^k does not appear in this function, its convergence is directly implied by the convergence of Ψ^k, due to the definition of x^k in Algorithm 2 and the L-smoothness of f.

The analysis involves five lemmas, followed by the iteration complexity summarized in the main theorem. The lemmas highlight the important steps of our analysis. The simplicity of our approach is preserved: all lemmas combined, together with the main theorem, can be proved on no more than two pages.

Our first lemma upper bounds the variance of the gradient estimator g^k, which eventually goes to zero as our algorithm progresses.

Figure 9: Comparison of SVRG and L-SVRG for different datasets and regularizer weights λ.
Lemma 5.1.

We have

(19)

The next two lemmas are more technical, but essential for proving the convergence.

Lemma 5.2.

We have

(20)
Lemma 5.3.

We have

(21)

Finally, we use the probabilistic update of w^{k+1} in Algorithm 2 to decompose E[W^{k+1}] in terms of W^k and the functional suboptimality of y^k, which is one of the main components that allow for a simpler analysis than that of the original Katyusha.

Lemma 5.4.

We have

(22)

Putting all lemmas together, we obtain the following contraction of the Lyapunov function defined in (7).

Lemma 5.5.

Let η = θ_2/((1 + θ_2)θ_1), with θ_1, θ_2 ∈ (0, 1) satisfying θ_1 + θ_2 ≤ 1, and p ∈ (0, 1]. Then we have

(23)

In order to obtain a recursion involving the Lyapunov function on the right-hand side of (23), we upper bound it by a contraction factor times Ψ^k, which leads to the following iteration complexity result.

Theorem 5.6.

Let θ_2 = 1/2, θ_1 = min{√(2σn/3), 1/2} and p = 1/n. Then E[Ψ^k] ≤ εΨ^0 after the following number of iterations: k = O((n + √(nL/μ)) log(1/ε)).

Proof.

As a corollary of Lemma 5.5, we have

E[Ψ^{k+1}] ≤ (1 − θ) Ψ^k.

Setting θ_2 = 1/2, θ_1 = min{√(2σn/3), 1/2}, p = 1/n, and unrolling the conditional expectation, one obtains E[Ψ^k] ≤ (1 − θ)^k Ψ^0, where 1/θ = O(n + √(nL/μ)). Choosing k = O((n + √(nL/μ)) log(1/ε)) concludes the proof. ∎

5.3 Discussion

One can show by analyzing (23) that for ill-conditioned problems (n ≤ L/μ), the iteration complexity is O(√(nL/μ) log(1/ε)). Algorithm 2 calls the stochastic gradient oracle 2 + pn times per iteration in expectation. Thus, the total complexity is O((2 + pn)√(nL/μ) log(1/ε)). One can see that p = 1/n leads to the optimal rate O((n + √(nL/μ)) log(1/ε)).

6 Numerical Experiments

Figure 16: Comparison of Katyusha and L-Katyusha for different datasets and regularizer weights λ.
Figure 20: Comparison of SVRG (S) and L-SVRG (L) for several choices of the expected outer loop length 1/p (L-SVRG) or deterministic outer loop length (SVRG); the numbers in the legend correspond to the respective loop lengths.

In this section, we perform experiments with ℓ2-regularized logistic regression for binary classification, where our loss functions have the form

f_i(x) = log(1 + exp(−b_i a_iᵀx)) + (λ/2)‖x‖²,

where a_i ∈ ℝ^d, b_i ∈ {−1, +1} are the data points and λ > 0 is the regularization parameter. Hence, f is smooth and λ-strongly convex. We use the following datasets from the LIBSVM collection (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): a9a, w8a, mushrooms, phishing, cod-rna.
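For reference, a short Python sketch of this objective and the smoothness/strong-convexity constants it induces (illustrative; the function names are ours):

```python
import numpy as np

def loss_i(x, a_i, b_i, lam):
    # f_i(x) = log(1 + exp(-b_i * a_i^T x)) + (lam/2) * ||x||^2
    return np.log1p(np.exp(-b_i * (a_i @ x))) + 0.5 * lam * (x @ x)

def grad_i(x, a_i, b_i, lam):
    # gradient of the i-th regularized logistic loss
    s = 1.0 / (1.0 + np.exp(b_i * (a_i @ x)))   # equals sigmoid(-b_i * a_i^T x)
    return -b_i * s * a_i + lam * x

def smoothness_and_strong_convexity(A, lam):
    # Each f_i is L_i-smooth with L_i = ||a_i||^2 / 4 + lam (logistic curvature <= 1/4),
    # and f is lam-strongly convex, so mu = lam.
    L = np.max(np.sum(A * A, axis=1)) / 4 + lam
    return L, lam
```

These constants can then be plugged into the theoretical parameter choices, e.g. γ = 1/(6L) and p = 1/n for L-SVRG.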

We compare our methods L-SVRG and L-Katyusha with their original versions. It is well known that, whenever practical, SAGA is a bit faster than SVRG. While a comparison to SAGA seems natural, as it also does not have a double-loop structure, we position our loopless methods for applications where the high memory requirements of SAGA prevent it from being applied. Thus, we do not compare to SAGA.

For Section 6.1, the parameters are chosen as suggested by theory. For L-SVRG and L-Katyusha they are chosen based on Theorems 4.5 and 5.6, respectively. For SVRG and Katyusha we also choose the parameters based on the theory, as described in the original papers. The initial point x^0 is chosen to be the origin. In Section 6.2, we provide several choices of parameters. Plots are constructed in such a way that the y-axis displays the suboptimality f(x^k) − f(x^*) for L-SVRG and f(y^k) − f(x^*) for L-Katyusha, where the optimal values f(x^*) were obtained by running gradient descent for a large number of epochs. The x-axis displays the number of epochs (full gradient evaluations); that is, n computations of stochastic gradients ∇f_i equal one epoch.

6.1 Superior practical behaviour of the loopless approach

In this section, we show that the replacement of the outer loop in SVRG and Katyusha brings not just a simpler analysis, but also a speed-up in the experiments.

In theory, both the loopy and the loopless methods enjoy the same rates. However, as we can see from Figure 9, the improvement of the loopless approach can be significant in practice. One can see that for these datasets, L-SVRG is always better than SVRG, and can be faster by several orders of magnitude!

Looking at Figure 16, we see that the performance of L-Katyusha is at least as good as that of Katyusha, and can be significantly faster in some cases. These experiments support the claim that not only is the theoretical analysis simpler, but the practical performance is also affected in a positive way.

Figure 39: All methods together for different datasets and different regularizer weights.

6.2 Different choices of the probability / outer loop size

In this section, we compare several choices of the probability p of updating the full gradient for L-SVRG and several outer loop sizes for SVRG. Since our analysis guarantees the optimal rate for any choice of p between 1/n and μ/L for well-conditioned problems, we decided to perform experiments for p within this range. More precisely, we choose five values of p, uniformly distributed on a logarithmic scale across this interval; the figures are labelled by the corresponding loop lengths. Since the expected “outer loop” length (the number of iterations for which the reference point stays the same) is 1/p, for SVRG we choose the matching deterministic outer loop sizes 1/p.

Looking at Figure 20, one can see that L-SVRG is very robust to the choice of p from the “optimal interval” predicted by our theory. Moreover, even the worst case for L-SVRG outperforms the best case for SVRG.

6.3 All methods together

As the last visualization, we provide all algorithms together in one graph for different datasets with different regularizer weights, and thus different condition numbers, displayed in Figure 39. As in the previous experiments, the loopless methods are never worse and are sometimes significantly better.

References

  • Allen-Zhu (2017) Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.
  • Bibi et al. (2018) Bibi, A., Sailanbayev, A., Ghanem, B., Gower, R. M., and Richtárik, P. Improving SAGA via a probabilistic interpolation with gradient descent. arXiv: 1806.05633, 2018.
  • Csiba & Richtárik (2015) Csiba, D. and Richtárik, P. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv:1506.02227, 2015.
  • Csiba et al. (2015) Csiba, D., Qu, Z., and Richtárik, P. Stochastic dual coordinate ascent with adaptive probabilities. In Proceedings of the 32nd International Conference on Machine Learning, pp. 674–683, 2015.
  • Defazio (2016) Defazio, A. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pp. 676–684, 2016.
  • Defazio et al. (2014a) Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014a.
  • Defazio et al. (2014b) Defazio, A., Caetano, T., and Domke, J. Finito: A faster, permutable incremental gradient method for Big Data problems. The 31st International Conference on Machine Learning, 2014b.
  • Doikov & Richtárik (2018) Doikov, N. and Richtárik, P. Randomized block cubic Newton method. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • Gower et al. (2016) Gower, R. M., Goldfarb, D., and Richtárik, P. Stochastic block BFGS: squeezing more curvature out of data. In 33rd International Conference on Machine Learning, pp. 1869–1878, 2016.
  • Gower et al. (2018) Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching. arXiv:1805.02632, 2018.
  • Hofmann et al. (2015) Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pp. 2305–2313, 2015.
  • Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
  • Konečný & Richtárik (2017) Konečný, J. and Richtárik, P. S2GD: Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, pp. 1–14, 2017.
  • Konečný et al. (2016) Konečný, J., Lu, J., Richtárik, P., and Takáč, M. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.
  • Lei et al. (2017) Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2348–2358, 2017.
  • Mairal (2015) Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
  • Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Nemirovsky & Yudin (1983) Nemirovsky, A. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley, New York, 1983.
  • Nesterov (2013) Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
  • Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pp. 2613–2621, 2017.
  • Qu et al. (2015) Qu, Z., Richtárik, P., and Zhang, T. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pp. 865–873, 2015.
  • Qu et al. (2016) Qu, Z., Richtárik, P., Takáč, M., and Fercoq, O. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In The 33rd International Conference on Machine Learning, pp. 1823–1832, 2016.
  • Roux et al. (2012) Roux, N. L., Schmidt, M., and Bach, F. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.
  • Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
  • Shalev-Shwartz (2016) Shalev-Shwartz, S. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pp. 747–754, 2016.
  • Shalev-Shwartz & Zhang (2014) Shalev-Shwartz, S. and Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pp. 64–72, 2014.
  • Takáč et al. (2013) Takáč, M., Bijral, A., Richtárik, P., and Srebro, N. Mini-batch primal and dual methods for SVMs. In 30th International Conference on Machine Learning, pp. 537–552, 2013.
  • Zhou (2018) Zhou, K. Direct acceleration of SAGA using sampled negative momentum. arXiv preprint arXiv:1806.11048, 2018.
  • Zhou et al. (2018) Zhou, K., Shang, F., and Cheng, J. A simple stochastic variance reduced algorithm with fast convergence rates. arXiv preprint arXiv:1806.11027, 2018.

Appendix A Auxiliary Lemmas

Lemma A.1.

For a random vector X ∈ ℝ^d and any a ∈ ℝ^d, the variance can be decomposed as

E[‖X − a‖²] = E[‖X − E[X]‖²] + ‖E[X] − a‖².    (24)

Lemma A.2.

For any vectors a_1, …, a_m ∈ ℝ^d, we have, as a consequence of Jensen's inequality applied to x ↦ ‖x‖²,

‖(1/m) Σ_{i=1}^m a_i‖² ≤ (1/m) Σ_{i=1}^m ‖a_i‖².    (25)

Appendix B Proofs for Algorithm 1 (L-SVRG)

Throughout this appendix, E[·] denotes the conditional expectation introduced in Section 3.

Proof of Lemma 4.1.

The definition of x^{k+1} and the unbiasedness of g^k guarantee that

Proof of Lemma 4.2.

Using the definition of g^k, we have

Proof of Lemma 4.3.

Proof of Lemma 4.4.

Combining Lemmas 4.1 and 4.3 we obtain

Now we use the fact that γ ≤ 1/(6L) and obtain the desired inequality:

Appendix C Proofs for Algorithm 2 (L-Katyusha)

Proof of Lemma 5.1.

To upper bound the variance of g^k, we first use its definition:

Proof of Lemma 5.2.

We start with the definition of z^{k+1} from Algorithm 2,

which implies

which further implies that

Proof of Lemma 5.3.

where the last inequality uses Young's inequality in the form ⟨a, b⟩ ≤ (β/2)‖a‖² + (1/(2β))‖b‖² for any β > 0, which concludes the proof. ∎

Proof of Lemma 5.4.

From the definition of w^{k+1} in Algorithm 2 we have

(26)

The rest of the proof follows from the definition of W^k in (18). ∎

Proof of Lemma 5.5.

Combining all the previous lemmas together we obtain