Direct Acceleration of SAGA using Sampled Negative Momentum

06/28/2018 ∙ by Kaiwen Zhou, et al. ∙ The Chinese University of Hong Kong

Variance reduction is a simple and effective technique that accelerates convex (or non-convex) stochastic optimization. Among existing variance reduction methods, SVRG and SAGA adopt unbiased gradient estimators and have become the most popular variance reduction methods in recent years. Although various accelerated variants of SVRG (e.g., Katyusha, Acc-Prox-SVRG) have been proposed, a direct acceleration of SAGA has remained unknown. In this paper, we propose a direct accelerated variant of SAGA using Sampled Negative Momentum (SSNM), which achieves the best known oracle complexities for strongly convex problems. Consequently, our work fills the void of a directly accelerated SAGA.


1 Introduction

In this paper, we consider optimizing the following composite finite-sum problem, which arises frequently in machine learning and statistics, for example in supervised learning and regularized empirical risk minimization (ERM):

$$\min_{x \in \mathbb{R}^d}\ \Big\{ F(x) \triangleq f(x) + h(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) + h(x) \Big\}, \tag{1}$$

where $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ is an average of the smooth and convex functions $f_i(\cdot)$, and $h(\cdot)$ is a simple and convex (but possibly non-differentiable) function. Here, we also define $x^\star \triangleq \arg\min_{x} F(x)$, which will be used in the paper.

We focus on achieving very high accuracy for Problem (1), although for practical optimization tasks, such as supervised learning, a very low empirical risk does not necessarily translate into a low generalization error. In this paper, we treat Problem (1) as a pure optimization problem.

When $F(\cdot)$ in Problem (1) is strongly convex, traditional analysis shows that gradient descent (GD) yields a fast linear convergence rate but with a high per-iteration cost, and thus may not be suitable for problems with a very large $n$. As an alternative for large-scale problems, SGD (Robbins and Monro, 1951) uses only one or a mini-batch of gradients in each iteration, and thus enjoys a significantly lower per-iteration complexity than GD. However, due to the variance of its gradient estimator, vanilla SGD only yields a sub-linear convergence rate. Recently, stochastic variance reduced methods (e.g., SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), and their proximal variants, such as (Schmidt et al., 2017), (Xiao and Zhang, 2014) and (Konečný et al., 2016)) were proposed to solve Problem (1). All these methods are equipped with various variance reduction techniques, which help them achieve per-iteration complexities comparable with SGD while maintaining the fast linear convergence rate of GD. In terms of oracle complexity (measured in this paper as the number of calls to the Incremental First-order Oracle (IFO) plus the Proximal Oracle (PO)), these methods all achieve an $\mathcal{O}\big((n+\kappa)\log\frac{1}{\epsilon}\big)$ complexity, where $\kappa \triangleq L/\mu$ denotes throughout the paper the condition number of an $L$-smooth and $\mu$-strongly convex function, as compared with $\mathcal{O}\big(n\sqrt{\kappa}\log\frac{1}{\epsilon}\big)$ for accelerated deterministic methods (e.g., Nesterov's accelerated gradient descent (Nesterov, 2004)).
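To ground the discussion, the following is a minimal sketch of one SAGA iteration in the smooth case (our own illustration, not taken from any of the cited papers); `grad_i(i, x)`, assumed to return $\nabla f_i(x)$, is a hypothetical user-supplied callable:

```python
import numpy as np

def saga_step(grad_i, grads, avg, x, eta, rng):
    """One SAGA iteration (smooth case, no prox). The estimator
    g = grad f_i(x) - grads[i] + avg is unbiased, E_i[g] = grad f(x),
    and its variance shrinks as the stored gradients approach their
    values at the optimum."""
    n = grads.shape[0]
    i = rng.integers(n)
    g_new = grad_i(i, x)
    g = g_new - grads[i] + avg           # unbiased gradient estimator
    avg = avg + (g_new - grads[i]) / n   # maintain the running average
    grads[i] = g_new                     # refresh the stored gradient
    return x - eta * g, grads, avg
```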

Inspired by the acceleration technique of Nesterov's accelerated gradient descent (Nesterov, 2004), accelerated variants of stochastic methods have been proposed in recent years, such as Acc-Prox-SVRG (Nitanda, 2014), APCG (Lin et al., 2014), APPA (Frostig et al., 2015), Catalyst (Lin et al., 2015), SPDC (Zhang and Xiao, 2015) and Katyusha (Allen-Zhu, 2017). Among these accelerated algorithms, APPA and Catalyst use certain reduction techniques, which result in additional log factors in their overall oracle complexities. Katyusha, as the first direct accelerated variant of SVRG, introduced the idea of negative momentum in stochastic optimization and combined it with Nesterov's momentum, which results in the best-known oracle complexity $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$. More recent work (Zhou et al., 2018) shows that adding only negative momentum to SVRG is enough to achieve the best known oracle complexity for strongly convex problems, which results in a simple and scalable algorithm called MiG.

Although a considerable amount of work has been done on accelerating SVRG, another popular stochastic variance reduced method, SAGA, has not had a direct accelerated variant until now. Acceleration frameworks such as APPA or Catalyst can be used to accelerate SAGA, but the reduction techniques proposed in these works are often difficult to implement and may also introduce additional log factors into the overall oracle complexity. A notable variant of SAGA is Point-SAGA (Defazio, 2016). Point-SAGA requires the proximal oracle of each component of the objective and, with the help of that oracle, it can adopt a much larger learning rate than SAGA, which results in the same accelerated complexity $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$. Accelerated variants of SVRG and SAGA are summarized in Table 1. However, such proximal oracles may not be efficiently computable in practice. Even for logistic regression, we need to run an individual inner loop (Newton's method) for its proximal oracle. Therefore, a direct accelerated variant of SAGA is of real interest.

                     Indirect           Direct
SVRG (or Prox-SVRG)  APPA & Catalyst    Katyusha & MiG
SAGA                 Point-SAGA         this work

Table 1: Comparison of accelerated variants of SVRG and SAGA (we regard using reductions or proximal point variants as “Indirect” acceleration).
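To illustrate the inner loop mentioned above, here is a hedged sketch (our own illustration, with hypothetical helper names) of evaluating the proximal oracle of a single logistic loss via a one-dimensional Newton's method; since $f(x) = \log(1+\exp(-b\,a^\top x))$ depends on $x$ only through $u = a^\top x$, the prox reduces to a scalar root-finding problem:

```python
import numpy as np

def prox_logistic(v, a, b, gamma, iters=20):
    """Prox of gamma * f with f(x) = log(1 + exp(-b * a @ x)), b in {-1, +1}.
    The optimality condition gives x = v + gamma * b * p * a with
    p = 1 / (1 + exp(b * u)) and u = a @ x, so we Newton-solve for u."""
    u = a @ v                           # initial guess for u = a @ x
    c = gamma * (a @ a)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(b * u))
        g = u - a @ v - c * b * p       # scalar optimality condition
        dg = 1.0 + c * p * (1.0 - p)    # derivative, always >= 1
        u -= g / dg
    p = 1.0 / (1.0 + np.exp(b * u))
    return v + gamma * b * p * a        # map the scalar solution back to x
```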

Following the idea of adding only negative momentum to SVRG (Zhou et al., 2018), we consider adding negative momentum to SAGA. However, unlike SVRG, which keeps a constant snapshot in each inner loop, the “snapshot” of SAGA is a table of points, each corresponding to the point at which the component function gradient was last evaluated. Thus, directly accelerating SAGA requires some non-trivial effort, which partially explains why the direct acceleration of SAGA remained unsolved for a long time. In this paper, we propose a novel Sampled Negative Momentum for SAGA. We further show that adding such a momentum has the same acceleration effect as adding negative momentum to SVRG.

2 Preliminaries

In this paper, we consider Problem (1) in a standard Euclidean space with the Euclidean norm denoted by $\|\cdot\|$. We use $\mathbb{E}[\cdot]$ to denote that the expectation is taken with respect to all randomness in one epoch. In order to further categorize the objective functions, we say that a convex function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if for all $x, y \in \mathbb{R}^d$, it holds that

$$f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{L}{2}\|y - x\|^2, \tag{2}$$

and $\mu$-strongly convex if for all $x, y \in \mathbb{R}^d$,

$$f(y) \ge f(x) + \langle \mathcal{G},\, y - x\rangle + \frac{\mu}{2}\|y - x\|^2, \tag{3}$$

where $\mathcal{G} \in \partial f(x)$, the set of sub-gradients of $f$ at $x$ for non-differentiable $f$. If $f$ is differentiable, we can simply replace $\mathcal{G} \in \partial f(x)$ with $\mathcal{G} = \nabla f(x)$. Then we make the following assumption to identify the main objective condition (strong convexity) that is the focus of this paper:

Assumption 1 (Strongly Convex).

In Problem (1), each $f_i(\cdot)$ is $L$-smooth and convex, and $h(\cdot)$ is $\mu$-strongly convex. (In fact, if each $f_i$ is $L$-smooth, the averaged function $f$ is itself smooth, but possibly with a smaller smoothness constant; we keep using $L$ as the smoothness constant for a consistent analysis.)

3 Direct Acceleration of SAGA

Input: number of iterations $K$, initial point $x_0$, learning rate $\eta$, parameter $\tau$.
Initialize: “points” table $\phi^0$ with $\phi_i^0 = x_0$ for all $i \in [n]$, and a running average $\frac{1}{n}\sum_{j=1}^{n}\nabla f_j(\phi_j^0)$ for the gradients of the points table.
for $k = 0, 1, \ldots, K-1$ do
   1. Sample $i_k$ uniformly in $[n]$ and compute the gradient estimator using the running average:
       $\bar{x}_{i_k}^k = \tau x_k + (1-\tau)\phi_{i_k}^k$;
       $\tilde{\nabla}_k = \nabla f_{i_k}(\bar{x}_{i_k}^k) - \nabla f_{i_k}(\phi_{i_k}^k) + \frac{1}{n}\sum_{j=1}^{n}\nabla f_j(\phi_j^k)$;
   2. Perform a proximal gradient step:
       $x_{k+1} = \arg\min_{x}\big\{\frac{1}{2\eta}\|x - (x_k - \eta\tilde{\nabla}_k)\|^2 + h(x)\big\}$;
   3. Sample $j_k$ uniformly in $[n]$, take $\phi_{j_k}^{k+1} = \tau x_{k+1} + (1-\tau)\phi_{j_k}^k$ (other entries unchanged), and then update the running average corresponding to the change in the “points” table.
end for
Output: $x_K$.
Algorithm 1 SSNM

Our proposed algorithm, SSNM (SAGA with Sampled Negative Momentum), is formally given in Algorithm 1. As we can see, Algorithm 1 uses a few unusual tricks, so we elaborate the ideas behind it in the following remarks (a code sketch is given after the remarks):

  • Coupled point correlates with the randomness of $i_k$. Unlike the negative momentum used for SVRG, which comes from a fixed snapshot $\tilde{x}$, the negative momentum of SAGA can only be found in a “points” table that changes over time. Thus, in SSNM, we choose the $i_k$-th entry of the “points” table to provide the negative momentum, which makes the coupled point $\bar{x}_{i_k}^k$ correlate with the randomness of sample $i_k$. In fact, all the possible coupled points form a “coupled table”. Although this table is never explicitly computed, we shall see that the concept of a “coupled table” is critical in the proof of SSNM. The 3rd step in Algorithm 1 can thus be regarded as sampling a point from such a table.

  • Independent samples $i_k$ and $j_k$. The additional sample $j_k$ is crucial for the convergence analysis of Algorithm 1. It chooses the index at which the updated point is stored in the “points” table. The major insight of this choice is that it separates the randomness of the coupled point $\bar{x}_{i_k}^k$ from the update index in the “points” table, so as to make certain inequalities valid.

  • Two learning rates for two cases. Using different parameter settings for different objective conditions (ill-conditioned and well-conditioned) is common for accelerated methods (Shalev-Shwartz and Zhang, 2014; Allen-Zhu, 2017; Zhou et al., 2018). If parameters such as $L$ and $\mu$ are unknown, SSNM is still a practical algorithm that requires tuning only $\eta$ and $\tau$, as compared with Katyusha, which has potentially 4 parameters to tune. Note that we have tried to make the parameter settings of SSNM similar to those of Katyusha and MiG; we believe this helps in conducting fair experimental comparisons with these methods.

  • Only one variable vector with a simple algorithm structure. As with MiG (Zhou et al., 2018), SSNM keeps only one variable vector in the main loop. The coupled point $\bar{x}_{i_k}^k$ can be computed whenever used and does not need to be explicitly stored. Moreover, SSNM has a one-loop structure, in contrast to the variants of SVRG. Such a structure is good for asynchronous implementation, since two-loop algorithms in this setting always require a synchronization after each inner loop (Mania et al., 2017). The algorithm structure of SSNM is also more elegant than those of Katyusha and MiG, both of which require a tricky weighted averaging scheme at the end of each inner loop (these two algorithms can adopt a uniform averaging scheme, but in that case both require certain restarting tricks, which make them less implementable).
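To make the structure concrete, below is a minimal NumPy sketch of Algorithm 1 (our reading of the pseudocode, not the authors' released C++ implementation). Here `grad_i(i, x)`, returning $\nabla f_i(x)$, and `prox_h(z, eta)`, computing $\mathrm{prox}_{\eta h}(z)$, are assumed user-supplied callables; the gradient table `grads` is kept purely for convenience and can instead be recomputed from the points table at the cost of extra IFO calls:

```python
import numpy as np

def ssnm(grad_i, prox_h, n, d, K, eta, tau, x0=None, seed=0):
    """Sketch of SSNM (Algorithm 1): SAGA with sampled negative momentum."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d) if x0 is None else x0.copy()
    phi = np.tile(x, (n, 1))          # "points" table, one row per component
    grads = np.stack([grad_i(i, x) for i in range(n)])
    avg = grads.mean(axis=0)          # running average of the stored gradients
    for _ in range(K):
        # 1. sample i_k, form the coupled point and the gradient estimator
        i = rng.integers(n)
        x_cpl = tau * x + (1.0 - tau) * phi[i]
        g = grad_i(i, x_cpl) - grads[i] + avg
        # 2. proximal gradient step
        x = prox_h(x - eta * g, eta)
        # 3. independent sample j_k; update the points table and the average
        j = rng.integers(n)
        phi[j] = tau * x + (1.0 - tau) * phi[j]
        g_new = grad_i(j, phi[j])
        avg += (g_new - grads[j]) / n
        grads[j] = g_new
    return x
```

For $\ell_2$-regularized problems with $h(x) = \frac{\lambda}{2}\|x\|^2$, the proximal operator has the closed form $\mathrm{prox}_{\eta h}(z) = z/(1+\eta\lambda)$, so `prox_h = lambda z, eta: z / (1.0 + eta * lam)` suffices.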

Since Point-SAGA and SAGA are closely related to SSNM, in the next subsection we compare these different variants of SAGA in detail.

3.1 Comparison with SAGA and Point-SAGA

            Complexity                                               Requirements              Memory
SAGA        $\mathcal{O}\big((n+\kappa)\log\frac{1}{\epsilon}\big)$          IFO of $f_i$, PO of $h$   $\mathcal{O}(nd)$, or $\mathcal{O}(n)$ for linear models
Point-SAGA  $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$  PO of each $f_i$          $\mathcal{O}(nd)$, or $\mathcal{O}(n)$ for linear models
SSNM        $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$  IFO of $f_i$, PO of $h$   $\mathcal{O}(nd)$

Table 2: Comparison of variants of SAGA (complexity is for strongly convex objectives).

As summarized in Table 2, SSNM enjoys the same fast convergence rate as Point-SAGA while keeping the same objective assumption as SAGA, which is the advantage of direct acceleration: weaker assumptions on the objective function make the algorithm easier to deploy. However, since SSNM requires storing the “points” table, the memory complexity of SSNM is always $\mathcal{O}(nd)$. This may be a disadvantage when the objective is a linear model such as logistic regression or ridge regression. It is known that for these linear models, each gradient is just a scalar weighting of the corresponding data vector, so we can simply store a scalar to represent a gradient, which gives SAGA and Point-SAGA an $\mathcal{O}(n)$ memory complexity for these problems (see the sketch below).

For a general objective, all three methods have the same memory complexity. In such a case, SSNM is apparently superior to the other two algorithms, since the proximal oracle of a general objective is typically hard to evaluate efficiently.
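For instance (a sketch under the assumption of the logistic loss; the function and variable names are our own), each component gradient is a scalar times the data vector, so SAGA's gradient table reduces to $n$ scalars:

```python
import numpy as np

# For the logistic loss f_i(x) = log(1 + exp(-b_i * a_i @ x)),
# the gradient is grad f_i(x) = s_i * a_i with the scalar
#   s_i = -b_i / (1 + exp(b_i * a_i @ x)).
def grad_scalar(a_i, b_i, x):
    return -b_i / (1.0 + np.exp(b_i * (a_i @ x)))

# SAGA for linear models: store only the n scalars s, not n full gradients.
# The stored gradient for example i is recovered as s[i] * A[i] when needed,
# and the running average is maintained as (A.T @ s) / n.
```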

4 Theory

In this section, we theoretically analyze the performance of SSNM. First, we give a variance bound in Lemma 1. Since the stochastic gradient estimator of SSNM is computed at a coupled point that itself contains randomness, the variance bound for SSNM is, as we shall see, unlike the variance bounds in all previous work.

Lemma 1 (Variance Bound).

Using the same notation as in Algorithm 1, we can bound the variance of the stochastic gradient estimator as

Proof.

where (a) follows from $\mathbb{E}\big[\|X - \mathbb{E}X\|^2\big] \le \mathbb{E}\big[\|X\|^2\big]$ and (b) uses Theorem 2.1.5 in (Nesterov, 2004). ∎
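For reference, the two standard facts invoked here are, in their usual forms,

$$\mathbb{E}\big[\|X - \mathbb{E}X\|^2\big] = \mathbb{E}\big[\|X\|^2\big] - \|\mathbb{E}X\|^2 \le \mathbb{E}\big[\|X\|^2\big],$$

and, for an $L$-smooth convex $f_i$ (Theorem 2.1.5 in (Nesterov, 2004)),

$$\|\nabla f_i(x) - \nabla f_i(y)\|^2 \le 2L\big(f_i(x) - f_i(y) - \langle \nabla f_i(y),\, x - y\rangle\big).$$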

Now we can formally present the main theorem for SSNM. As stated in (Allen-Zhu, 2017), the major task of the negative momentum is to cancel the additional inner product term in the variance bound so as to keep a tight connection between consecutive iterates. As we shall see shortly, our proposed sampled negative momentum effectively cancels this inner product term, which is where the acceleration comes from.

Main Theorem.

Let $x^\star$ be the solution of Problem (1). If Assumption 1 holds, then by running Algorithm 1 for $K$ iterations, we have the following guarantees in expectation for the corresponding cases:

(I) (For ill-conditioned problems). If $\kappa$ is large relative to $n$, then with the parameter choices $\tau = \Theta\big(\sqrt{n/\kappa}\big)$ and a matching learning rate $\eta$, the expected suboptimality $\mathbb{E}[F(x_K)] - F(x^\star)$ contracts linearly with a per-iteration factor of $\big(1 + \Theta(1/\sqrt{n\kappa})\big)^{-1}$.

Moreover, if we use the same objective assumption as in Point-SAGA (Defazio, 2016), where each $f_i(\cdot)$ is $L$-smooth and $\mu$-strongly convex, the analogous guarantee holds for the squared norm distance $\mathbb{E}\big[\|x_K - x^\star\|^2\big]$.

The above two inequalities both imply that, in order to reduce the squared norm distance to $\epsilon$, we need an oracle complexity of $\mathcal{O}\big(\sqrt{n\kappa}\log\frac{1}{\epsilon}\big)$ in expectation.

(II) (For well-conditioned problems). If $n$ is large relative to $\kappa$, by choosing $\tau$ and $\eta$ for this regime we obtain a linear contraction with a per-iteration factor of $\big(1 + \Theta(1/n)\big)^{-1}$, either for the objective suboptimality or, with the same stronger assumption on smoothness and strong convexity as in case (I), for the squared norm distance.

These two inequalities both imply that in this case we have an oracle complexity of $\mathcal{O}\big(n\log\frac{1}{\epsilon}\big)$ in expectation.

Thus, for strongly convex objectives, SSNM yields a fast $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$ oracle complexity, which matches the best known complexity achieved by accelerated variants of SVRG (Frostig et al., 2015; Allen-Zhu, 2017).

4.1 Proof of Main Theorem

In order to prove the Main Theorem, we need the following useful lemma to bound the new iterate after the proximal gradient step:

Lemma 2.

If two vectors $x_k$ and $x_{k+1}$ satisfy $x_{k+1} = \arg\min_{x}\big\{\frac{1}{2\eta}\|x - x_k\|^2 + \langle \zeta, x\rangle + h(x)\big\}$ with a constant vector $\zeta$ and a $\mu$-strongly convex function $h(\cdot)$, then for all $u \in \mathbb{R}^d$, we have

$$\langle \zeta,\, x_{k+1} - u\rangle + h(x_{k+1}) - h(u) \le -\frac{1}{2\eta}\|x_{k+1} - x_k\|^2 + \frac{1}{2\eta}\|u - x_k\|^2 - \frac{1+\eta\mu}{2\eta}\|u - x_{k+1}\|^2.$$

Proof.

This lemma is identical to Lemma 3.5 in (Allen-Zhu, 2017). ∎

First, we analyze Algorithm 1 at the $k$-th iteration, given that the randomness from previous iterations is fixed.

We start with the convexity of $f_{i_k}(\cdot)$ at the coupled point $\bar{x}_{i_k}^k$. By definition, we have

where (a) uses the definition of the $i_k$-th entry of the “coupled table”, i.e., $\bar{x}_{i_k}^k = \tau x_k + (1-\tau)\phi_{i_k}^k$.

As we will see, the first term on the right-hand side is used to cancel the inner product term in the variance bound.

By taking expectation with respect to sample $i_k$ and using the unbiasedness of the estimator, $\mathbb{E}_{i_k}[\tilde{\nabla}_k] = \mathbb{E}_{i_k}\big[\nabla f_{i_k}(\bar{x}_{i_k}^k)\big]$, we obtain

(4)

In order to bound $\mathbb{E}\big[f_{j_k}(\phi_{j_k}^{k+1})\big]$, we use the $L$-smoothness of $f_{j_k}(\cdot)$ at $\bar{x}_{j_k}^k$, which is

Taking expectation with respect to sample $j_k$ and using our choice of $\phi_{j_k}^{k+1} = \tau x_{k+1} + (1-\tau)\phi_{j_k}^k$ as well as the definition of the “coupled table”, we conclude that

Taking expectation with respect to sample $i_k$, we obtain

(5)

Here we see the effect of the independent sample $j_k$: it decouples the randomness of the coupled point $\bar{x}_{i_k}^k$ from the update position in the “points” table, so as to make the above inequalities valid.

By upper bounding (4) using (5) and Lemma 2 (with the $\mu$-strongly convex $h(\cdot)$ and $u = x^\star$), we obtain

Here we add a constraint coupling the parameters $\eta$ and $\tau$, which is identical to the constraint used in (Zhou et al., 2018). Using Young's inequality, $\langle a, b\rangle \le \frac{1}{2\beta}\|a\|^2 + \frac{\beta}{2}\|b\|^2$ for any $\beta > 0$, to upper bound the remaining inner product term, we can simplify the above inequality as

By applying Lemma 1 to upper bound the variance term, we see that the additional inner product term in the variance bound is canceled by the sampled negative momentum, which gives

(6)

Using the convexity of $f_{i_k}(\cdot)$ and the fact that $\bar{x}_{i_k}^k = \tau x_k + (1-\tau)\phi_{i_k}^k$, we have

After taking expectation with respect to sample $i_k$ and sample $j_k$, we obtain

Combining the above inequality with (6) and using the definition $F(\cdot) = f(\cdot) + h(\cdot)$, we can rewrite (6) as

Dividing the above inequality by $\tau$ and adding the same quantity to both sides, we obtain

In order to give a clean proof, we denote the resulting potential quantities at iterations $k$ and $k+1$ by $\mathcal{T}_k$ and $\mathcal{T}_{k+1}$; then we can write the contraction as

(7)

Case I: Consider the first case (ill-conditioned problems). Choosing $\tau$ and $\eta$ as stated in the Main Theorem for this case, we first evaluate the parameter constraint:

which means that the constraint is satisfied by our parameter choices.

Moreover, with these parameter choices, we have

Thus, the contraction (7) can be written as

After telescoping the above contraction from $k = 0$ to $K - 1$ and taking expectation with respect to all randomness, we have

Note that the initial potential $\mathcal{T}_0$ depends only on the initial point $x_0$. After substituting the parameter choices, we have

If we further have the $L$-smoothness and $\mu$-strong convexity of each $f_i(\cdot)$ (as in Point-SAGA), then by using $\frac{\mu}{2}\|x - x^\star\|^2 \le F(x) - F(x^\star)$ together with the smoothness upper bound at $x_0$, we can write the above inequality as

Case II: Consider the other case (well-conditioned problems), choosing the corresponding $\tau$ and $\eta$. Again, we first evaluate the constraint:

Then, by rewriting the contraction (7), telescoping from $k = 0$ to $K - 1$ and taking expectation with respect to all randomness, we obtain

By substituting the parameter choices, we have

or

with a slightly stronger assumption on smoothness.

5 Experiments

In this section, we perform experiments to examine the practical performance of SSNM and to corroborate the theory. All the algorithms were implemented in C++ and executed through a MATLAB interface for fair comparison. We ran the experiments on an HP Z440 machine with a single Intel Xeon E5-1630v4 CPU (3.70GHz), 16GB RAM, Ubuntu 16.04 LTS with GCC 4.9.0, and MATLAB R2017b.

We optimize the following $\ell_2$-regularized logistic regression problem for binary classification, with training examples $a_i \in \mathbb{R}^d$ and labels $b_i \in \{-1, +1\}$:

$$F(x) = \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-b_i\, a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2,$$

where $\lambda$ is the regularization parameter and all the datasets used were normalized before the experiments.

The experiments were designed as ill-conditioned problems (with very small $\lambda$), since the ill-conditioned regime is where accelerated first-order methods take effect. We test the following algorithms with their corresponding parameter settings:

  • SAGA. We set the learning rate to $\frac{1}{3L}$, which is the value analyzed theoretically in (Defazio et al., 2014).

  • SSNM. We use the settings suggested by Algorithm 1, i.e., the choices of $\eta$ and $\tau$ stated in the Main Theorem for the corresponding regime.

  • Katyusha. As suggested by the author, we fixed $\tau_2 = \frac{1}{2}$, set $\tau_1 = \sqrt{\frac{m\sigma}{3L}}$ and chose $\alpha = \frac{1}{3\tau_1 L}$ (Allen-Zhu, 2017) (in the notation of the original work).

  • MiG. We set the parameters $\theta$ and $\eta$ as analyzed in (Zhou et al., 2018).

Experimental results are shown in Figure 1. As the results show, SSNM converges very fast on the ill-conditioned problems, which corroborates the theoretical improvement from $\mathcal{O}(\kappa)$ to $\mathcal{O}(\sqrt{n\kappa})$ in the dominant term of the oracle complexity. In fact, we were surprised by how well SSNM performs in the experiments on the covtype dataset. In these two results, SSNM is significantly faster even than Katyusha and MiG in terms of the number of epochs, although both of those methods enjoy the same convergence rate as SSNM. These results may be explained by the choice of the momentum-related parameter in Katyusha and MiG, which is typically set to $\frac{1}{2}$ empirically; in fact, the tuning of this parameter greatly affects the performance of Katyusha and MiG.

Figure 1: Evaluations of SAGA, SSNM, Katyusha and MiG on a9a (the first two figures) and covtype (the last two figures), each with two small values of the regularization parameter $\lambda$.
Figure 2: Comparison of using sample $i_k$ (SSNM-i) or an independent sample $j_k$ (SSNM-I) in the 3rd step of SSNM on covtype.

However, as shown in the results, the convergence of SSNM, although very fast, is somewhat unstable compared with the other three methods. This can be partially explained by the double sampling trick used in SSNM, which greatly increases the uncertainty inside each iteration.

5.1 Effectiveness of the independent sample $j_k$

A natural question is: can we use sample $i_k$ (the sample used for the stochastic gradient) instead of an independent sample $j_k$ in the 3rd step of Algorithm 1? We empirically evaluate the effect of this choice in Figure 2. As we can see, reusing sample $i_k$ makes the algorithm even more unstable and slower to converge compared with using an independent sample $j_k$. This effect can probably be explained by some kind of variance accumulation when the sample $i_k$ is reused.
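In the sketch from Section 3, this variant corresponds to a one-line change (the names SSNM-i and SSNM-I follow the figure; this is our own illustration):

```python
# Variant compared in Figure 2, relative to the ssnm() sketch above.
# Step 3 of Algorithm 1 draws an independent index j; SSNM-i instead
# reuses the gradient index i:
#
#     j = i                 # SSNM-i: reuses the gradient sample
#     j = rng.integers(n)   # SSNM-I: independent draw, as in Algorithm 1
#
# The convergence analysis requires the independent draw; empirically the
# coupled variant is slower and less stable, as shown in Figure 2.
```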

6 Conclusions

In this paper, we proposed SSNM, an accelerated variant of SAGA that uses the negative momentum trick in a novel way. Our theoretical results show that SSNM enjoys the best known oracle complexity $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$ for strongly convex problems, and our experiments justify this improvement on ill-conditioned problems. Admittedly, the memory consumption of SSNM is somewhat high, but considering its good performance and general objective assumption, SSNM is still a valuable algorithm with high potential. We hope this method will inspire researchers to further develop and utilize acceleration tricks in stochastic optimization.

References

  • Allen-Zhu [2017] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In STOC, 2017.
  • Defazio [2016] A. Defazio. A simple practical accelerated method for finite sums. In NIPS, pages 676–684, 2016.
  • Defazio et al. [2014] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
  • Frostig et al. [2015] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, pages 2540–2548, 2015.
  • Johnson and Zhang [2013] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
  • Konečný et al. [2016] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Sign. Proces., 10(2):242–255, 2016.
  • Lin et al. [2015] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In NIPS, pages 3366–3374, 2015.
  • Lin et al. [2014] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In NIPS, pages 3059–3067, 2014.
  • Mania et al. [2017] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim., 27(4):2202–2229, 2017.
  • Nesterov [2004] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publ., Boston, 2004.
  • Nitanda [2014] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In NIPS, pages 1574–1582, 2014.
  • Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.
  • Roux et al. [2012] N. L. Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.
  • Schmidt et al. [2017] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162:83–112, 2017.
  • Shalev-Shwartz and Zhang [2014] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In ICML, pages 64–72, 2014.
  • Xiao and Zhang [2014] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim., 24(4):2057–2075, 2014.
  • Zhang and Xiao [2015] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In ICML, pages 353–361, 2015.
  • Zhou et al. [2018] K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In ICML, 2018.