1 Introduction
In this paper, we consider optimizing the following composite finite-sum problem, which arises frequently in machine learning and statistics, for example in supervised learning and regularized empirical risk minimization (ERM):
$$\min_{x \in \mathbb{R}^d} \Big\{ F(x) \triangleq f(x) + h(x) \Big\}, \tag{1}$$
where $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ is an average of $n$ smooth and convex functions $f_i(x)$, and $h(x)$ is a simple and convex (but possibly non-differentiable) function. Here, we also define $F_i(x) = f_i(x) + h(x)$, which will be used in the paper.
We focus on achieving very high accuracy for Problem (1), although for practical optimization tasks, such as supervised learning, low empirical risk may result in high generalization error. In this paper, we treat Problem (1) as a pure optimization problem.
When $F$ in Problem (1) is strongly convex, traditional analysis shows that gradient descent (GD) yields a fast linear convergence rate but with a high per-iteration cost, and thus may not be suitable for problems with a very large $n$. As an alternative for large problems, SGD (Robbins and Monro, 1951) uses only one or a mini-batch of gradients in each iteration, and thus enjoys a significantly lower per-iteration complexity than GD. However, due to the variance of the gradient estimator, vanilla SGD is shown to yield only a sublinear convergence rate. Recently, stochastic variance reduced methods (e.g., SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), and their proximal variants, such as (Schmidt et al., 2017), (Xiao and Zhang, 2014) and (Konečný et al., 2016)) were proposed to solve Problem (1). All these methods are equipped with various variance reduction techniques, which help them achieve low per-iteration complexities comparable with SGD while maintaining the fast linear convergence rate of GD. In terms of oracle complexity (in this paper, denoted by $\mathcal{O}(\cdot)$, the number of calls to the Incremental First-order Oracle (IFO) plus the Proximal Oracle (PO)), these methods all achieve an $\mathcal{O}\big((n+\kappa)\log\frac{1}{\epsilon}\big)$ complexity, as compared with $\mathcal{O}\big(n\sqrt{\kappa}\log\frac{1}{\epsilon}\big)$ for accelerated deterministic methods (e.g., Nesterov’s accelerated gradient descent (Nesterov, 2004)). We denote $\kappa = L/\mu$ throughout the paper, known as the condition number of an $L$-smooth and $\mu$-strongly convex function.
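To make the variance-reduction idea concrete, the following minimal sketch checks numerically that a SAGA-style gradient estimator is unbiased: averaged over the sampled index, it equals the full gradient. The least-squares components, the random data and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import random

def grad_i(a, b, x):
    """Gradient of the illustrative component f_i(x) = 0.5 * (a.x - b)^2."""
    r = sum(aj * xj for aj, xj in zip(a, x)) - b
    return [r * aj for aj in a]

def mean_vectors(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def saga_estimator(i, x, data, table):
    """grad f_i(x) - grad f_i(phi_i) + average of the stored gradients."""
    a, b = data[i]
    fresh = grad_i(a, b, x)
    stored = grad_i(a, b, table[i])
    avg = mean_vectors([grad_i(aj, bj, phi) for (aj, bj), phi in zip(data, table)])
    return [f - s + m for f, s, m in zip(fresh, stored, avg)]

random.seed(0)
n, d = 5, 3
data = [([random.gauss(0, 1) for _ in range(d)], random.gauss(0, 1)) for _ in range(n)]
table = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # stored points phi_i
x = [random.gauss(0, 1) for _ in range(d)]

# Full gradient of f(x) = (1/n) * sum_i f_i(x).
full_grad = mean_vectors([grad_i(a, b, x) for a, b in data])
# Expectation of the estimator over a uniformly sampled index i.
expected_estimate = mean_vectors([saga_estimator(i, x, data, table) for i in range(n)])
max_gap = max(abs(u - v) for u, v in zip(full_grad, expected_estimate))
print("largest coordinate gap:", max_gap)
```

The gap is zero up to floating-point error because the correction term, the stored gradient of the sampled component minus the table average, has zero mean over the sampled index.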
Inspired by the acceleration technique proposed in Nesterov’s accelerated gradient descent (Nesterov, 2004), accelerated variants of stochastic methods have been proposed in recent years, such as Acc-Prox-SVRG (Nitanda, 2014), APCG (Lin et al., 2014), APPA (Frostig et al., 2015), Catalyst (Lin et al., 2015), SPDC (Zhang and Xiao, 2015) and Katyusha (Allen-Zhu, 2017). Among these accelerated algorithms, APPA and Catalyst rely on reduction techniques, which result in additional log factors in their overall oracle complexities. Katyusha, as the first direct accelerated variant of SVRG, introduced the idea of negative momentum in stochastic optimization and combined it with Nesterov’s momentum, which results in the best-known oracle complexity $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$. More recent work (Zhou et al., 2018) shows that adding only negative momentum to SVRG is enough to achieve the best-known oracle complexity for strongly convex problems, which results in a simple and scalable algorithm called MiG.
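As a rough illustration of how these oracle complexities compare, the sketch below evaluates the three standard bounds with all hidden constants set to 1. The problem sizes and the function name `oracle_calls` are hypothetical, so the numbers indicate orders of magnitude only.

```python
import math

def oracle_calls(n, kappa, eps, method):
    """Leading-order oracle complexity, ignoring constants hidden in O(.)."""
    log_term = math.log(1.0 / eps)
    if method == "variance_reduced":      # SAG / SVRG / SAGA
        return (n + kappa) * log_term
    if method == "accelerated_gd":        # accelerated deterministic methods
        return n * math.sqrt(kappa) * log_term
    if method == "accelerated_vr":        # Katyusha / MiG style bound
        return (n + math.sqrt(n * kappa)) * log_term
    raise ValueError("unknown method: " + method)

# Hypothetical ill-conditioned problem: kappa is much larger than n.
n, kappa, eps = 10**4, 10**8, 1e-6
vr = oracle_calls(n, kappa, eps, "variance_reduced")
agd = oracle_calls(n, kappa, eps, "accelerated_gd")
avr = oracle_calls(n, kappa, eps, "accelerated_vr")
print("variance reduced :", vr)
print("accelerated GD   :", agd)
print("accelerated VR   :", avr)
print("speed-up of accelerated VR over plain VR:", vr / avr)
```

With these values the accelerated variance reduced bound is about two orders of magnitude smaller than the other two, which is the regime where acceleration is expected to matter.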
Although a considerable amount of work has been done on accelerated SVRG, another popular stochastic variance reduced method, SAGA, did not have a direct accelerated variant until recently. Accelerating frameworks such as APPA or Catalyst can be used to accelerate SAGA, but the reduction techniques proposed in these works are often difficult to implement and may also result in additional log factors in the overall oracle complexity. A notable variant of SAGA is Point-SAGA (Defazio, 2016). Point-SAGA requires the proximal oracle of the entire objective and, with the help of that, it can adopt a much larger learning rate than SAGA, which results in the same accelerated complexity $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$. However, the proximal oracle of the entire objective may not be efficiently computed in practice. Even for logistic regression, we need to run a separate inner loop (Newton’s method) for its proximal oracle. Therefore, a direct accelerated variant of SAGA is of real interest. Accelerated variants of SVRG and SAGA are summarized in Table 1.
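Table 1: Accelerated variants of SVRG and SAGA.

| Method | Indirect | Direct |
| --- | --- | --- |
| SVRG (or Prox-SVRG) | APPA & Catalyst | Katyusha & MiG |
| SAGA | APPA & Catalyst | this work, Point-SAGA |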
Following the idea of adding only negative momentum to SVRG (Zhou et al., 2018), we consider adding negative momentum to SAGA. However, unlike SVRG, which keeps a constant snapshot in each inner loop, the “snapshot” of SAGA is a table of points, each corresponding to the position at which the gradient of a component function was last evaluated. Thus, directly accelerating SAGA requires some non-trivial effort, which partially explains why the direct acceleration of SAGA was left unsolved for a long time. In this paper, we propose a novel Sampled Negative Momentum for SAGA. We further show that adding such a momentum has the same acceleration effect as adding negative momentum to SVRG.
2 Preliminaries
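In this paper, we consider Problem (1) in standard Euclidean space with the Euclidean norm denoted by $\|\cdot\|$. We use $\mathbb{E}$ to denote that the expectation is taken with respect to all randomness in one epoch. In order to further categorize the objective functions, we define that a convex function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be $L$-smooth if for all $x, y \in \mathbb{R}^d$, it holds that
$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|^2, \tag{2}$$
and $\mu$-strongly convex if for all $x, y \in \mathbb{R}^d$,
$$f(x) \ge f(y) + \langle \mathcal{G}, x - y \rangle + \frac{\mu}{2}\|x - y\|^2, \tag{3}$$
where $\mathcal{G} \in \partial f(y)$, the set of subgradients of $f$ at $y$ for non-differentiable $f$. If $f$ is differentiable, we can simply replace $\mathcal{G} \in \partial f(y)$ with $\mathcal{G} = \nabla f(y)$. Then we make the following assumption to identify the main objective condition (strongly convex) that is the focus of this paper: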
3 Direct Acceleration of SAGA
Our proposed algorithm SSNM (SAGA with Sampled Negative Momentum) is formally given in Algorithm 1. As we can see, there are some unusual tricks used in Algorithm 1. Thus we elaborate some ideas behind Algorithm 1 by making the following remarks:

The coupled point correlates to the randomness of the gradient sample $i_k$. Unlike the negative momentum used for SVRG, which comes from a fixed snapshot, the negative momentum of SAGA can only be found in a “points” table that changes over time. Thus, in SSNM, we choose to use the $i_k$th entry of the “points” table to provide the negative momentum, which makes the coupled point correlate to the randomness of the sample $i_k$. In fact, all the possible coupled points form a “coupled table”. Although the table is never explicitly computed, we shall see that the concept of the “coupled table” is critical in the proof of SSNM. The 3rd step in Algorithm 1 can thus be regarded as sampling a point in such a table.

Independent samples $i_k$ and $I_k$. The additional sample $I_k$ is crucial for the convergence analysis of Algorithm 1. It chooses an index at which to store the updated point in the “points” table. The major insight of this choice is that it separates the randomness of the gradient sample $i_k$ from the update index in the “points” table so as to make certain inequalities valid.

Two learning rates for two cases. Using different parameter settings for different objective conditions (ill-conditioned and well-conditioned) is common for accelerated methods (Shalev-Shwartz and Zhang, 2014; Allen-Zhu, 2017; Zhou et al., 2018). If parameters such as $L$ and $\mu$ are unknown, SSNM is still a practical algorithm that requires tuning only the learning rate $\eta$ and the momentum parameter $\tau$, as compared with Katyusha, which has potentially 4 parameters that need to be tuned. Note that we have tried to make the parameter settings in SSNM similar to those of Katyusha and MiG. We believe that this helps in conducting fair experimental comparisons with these methods.

Only one variable vector with simple algorithm structure. As with MiG (Zhou et al., 2018), SSNM has only one variable vector in the main loop. The coupled point can be computed whenever it is used and does not need to be explicitly stored. Moreover, SSNM has a one-loop structure, in contrast to the variants of SVRG. Such a structure is good for asynchronous implementation, since algorithms with two loops in this setting always require a synchronization after each inner loop (Mania et al., 2017). Finally, the algorithm structure of SSNM is more elegant than that of Katyusha and MiG, both of which require a tricky weighted averaging scheme at the end of each inner loop. (These two algorithms can adopt a uniform averaging scheme, but in that case both require certain restarting tricks, which make them less implementable.)
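The remarks above can be sketched in code. Algorithm 1 itself is not reproduced in this text, so the following is only one plausible reading of the description: the coupled point is assumed to be a convex combination, with weight `tau`, of the iterate and the sampled entry of the “points” table, and the entry chosen by the independent sample is assumed to be overwritten with the same kind of combination using the new iterate. The least-squares components, the regularizer $h(x) = \frac{\mu}{2}\|x\|^2$ (whose proximal step has a closed form), the values of `eta` and `tau`, and all names are illustrative assumptions, not the paper's settings.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def grad_i(a, b, x):
    """Gradient of the illustrative component f_i(x) = 0.5 * (a.x - b)^2."""
    r = dot(a, x) - b
    return [r * aj for aj in a]

def objective(data, mu, x):
    """F(x) = (1/n) * sum_i f_i(x) + (mu/2) * ||x||^2."""
    loss = sum(0.5 * (dot(a, x) - b) ** 2 for a, b in data) / len(data)
    return loss + 0.5 * mu * dot(x, x)

def ssnm_sketch(data, mu, eta, tau, iters, seed=0):
    """Hypothetical SSNM-style loop; eta and tau are user-chosen, not the paper's."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0][0])
    x = [0.0] * d                                   # the single variable vector
    points = [list(x) for _ in range(n)]            # "points" table
    grads = [grad_i(a, b, p) for (a, b), p in zip(data, points)]
    avg = [sum(col) / n for col in zip(*grads)]     # running average of stored gradients
    for _ in range(iters):
        i = rng.randrange(n)                        # sample used for the gradient
        a, b = data[i]
        # Coupled point: computed on the fly, never stored (assumed form).
        y = [tau * xj + (1 - tau) * pj for xj, pj in zip(x, points[i])]
        g = [gy - gs + m for gy, gs, m in zip(grad_i(a, b, y), grads[i], avg)]
        # Proximal step for h(x) = (mu/2)||x||^2, which has a closed form.
        x = [(xj - eta * gj) / (1 + eta * mu) for xj, gj in zip(x, g)]
        j = rng.randrange(n)                        # independent sample for the table update
        points[j] = [tau * xj + (1 - tau) * pj for xj, pj in zip(x, points[j])]
        new_grad = grad_i(data[j][0], data[j][1], points[j])
        avg = [m + (gn - go) / n for m, gn, go in zip(avg, new_grad, grads[j])]
        grads[j] = new_grad
    return x

# Small synthetic problem with normalized data vectors.
rng = random.Random(1)
n, d, mu = 20, 5, 0.1
data = []
for _ in range(n):
    a = [rng.gauss(0, 1) for _ in range(d)]
    norm = dot(a, a) ** 0.5
    data.append(([aj / norm for aj in a], rng.gauss(0, 1)))

start_value = objective(data, mu, [0.0] * d)
x_final = ssnm_sketch(data, mu, eta=0.05, tau=0.5, iters=4000)
final_value = objective(data, mu, x_final)
print("objective at start:", start_value)
print("objective at end  :", final_value)
```

The loop keeps only `x` as a variable vector, uses two separate samples per iteration, and has no inner loop, which mirrors the structural points made in the remarks. It is meant as a reading aid rather than a tuned implementation.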
Since algorithms such as Point-SAGA and SAGA are closely related to SSNM, in the next subsection, we compare these different variants of SAGA in detail.
3.1 Comparison with SAGA and Point-SAGA
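Table 2: Comparison of SAGA, Point-SAGA and SSNM.

| Method | Complexity | Requirements | Memory |
| --- | --- | --- | --- |
| SAGA | $\mathcal{O}\big((n+\kappa)\log\frac{1}{\epsilon}\big)$ | IFO of $f_i$, PO of $h$ | $\mathcal{O}(nd)$, or $\mathcal{O}(n)$ for linear models |
| Point-SAGA | $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$ | PO of each $F_i$ | $\mathcal{O}(nd)$, or $\mathcal{O}(n)$ for linear models |
| SSNM | $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$ | IFO of $f_i$, PO of $h$ | $\mathcal{O}(nd)$ |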
As summarized in Table 2, SSNM yields the same fast convergence rate as Point-SAGA while keeping the same objective assumption as SAGA, which is the advantage of direct acceleration. Weaker assumptions on the objective function make the algorithm more implementable. However, since SSNM requires storing the “points” table, the memory complexity of SSNM is always $\mathcal{O}(nd)$. This may be a disadvantage when the objective is a linear model such as linear logistic regression or ridge regression. It is known that for these linear models, each gradient is just a weighting of the corresponding data vector. Thus, we can simply store a scalar to represent a gradient, which gives SAGA and Point-SAGA an $\mathcal{O}(n)$ memory complexity for these problems. For a general objective, all three methods have the same memory complexity. In such a case, SSNM compares favorably with the other two algorithms: it is accelerated, unlike SAGA, and it avoids the Proximal Oracle of a general objective, which is generally hard to evaluate efficiently.
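The scalar-storage remark for linear models can be illustrated as follows. For a logistic loss component $f_i(x) = \log(1 + \exp(-b_i a_i^\top x))$, the gradient is a scalar multiple of the data vector $a_i$, so a gradient table only needs that scalar. The sketch below, with hypothetical data and helper names, checks this against a finite-difference gradient and compares the two table sizes.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def logistic_loss(a, b, x):
    """Component loss f_i(x) = log(1 + exp(-b * a.x))."""
    return math.log(1.0 + math.exp(-b * dot(a, x)))

def gradient_scalar(a, b, x):
    """Scalar s such that grad f_i(x) = s * a."""
    return -b / (1.0 + math.exp(b * dot(a, x)))

# Hypothetical data point and iterate.
a, b = [0.5, -1.0, 2.0], 1.0
x = [0.1, 0.2, -0.3]

s = gradient_scalar(a, b, x)
grad_from_scalar = [s * aj for aj in a]      # rebuilt from one stored float

# Central finite differences as an independent check of the gradient.
h = 1e-6
numeric_grad = []
for j in range(len(x)):
    xp = list(x)
    xm = list(x)
    xp[j] += h
    xm[j] -= h
    numeric_grad.append((logistic_loss(a, b, xp) - logistic_loss(a, b, xm)) / (2 * h))

# Table sizes (in stored floats) for a hypothetical problem size.
n, d = 1000, 50
floats_full_table = n * d      # one d-dimensional vector per component
floats_scalar_table = n        # one scalar per component
print(grad_from_scalar, numeric_grad)
print(floats_full_table, floats_scalar_table)
```

This trick applies to tables of gradients. A table of points, as used by SSNM, has no such structure in general, which is why its memory cost stays at $n \times d$ floats.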
4 Theory
In this section, we theoretically analyze the performance of SSNM. First, we give a variance bound shown in Lemma 1. Since the stochastic gradient estimator of SSNM is computed at a coupled point that contains randomness, the variance bound for SSNM, as we can see, is unlike all the variance bounds in previous work.
Lemma 1 (Variance Bound).
Using the same notations as in Algorithm 1, we can bound the variance of stochastic gradient estimator as
Proof.
where follows from and uses Theorem 2.1.5 in (Nesterov, 2004). ∎
Now we can formally present the main theorem of SSNM below. As stated in (Allen-Zhu, 2017), the major task of the negative momentum is to cancel the additional inner product term shown in the variance bound so as to keep a close connection in each iteration. As we shall see shortly, our proposed sampled negative momentum effectively cancels the inner product term, which is where the acceleration comes from.
Main Theorem.
Let $x^\star$ be the solution of Problem (1). If Assumption 1 holds, then by running Algorithm 1 for $K$ iterations, we have the following inequalities in expectation for the corresponding cases:
(I) (For ill-conditioned problems). If , with it holds that
Moreover, if we use the same objective assumption as in Point-SAGA (Defazio, 2016), where each $f_i$ is $L$-smooth and $\mu$-strongly convex, we have the following inequality:
The above two inequalities both imply that in order to reduce the squared norm distance to $\epsilon$, we have an oracle complexity of $\mathcal{O}\big(\sqrt{n\kappa}\log\frac{1}{\epsilon}\big)$ in expectation.
(II) (For well-conditioned problems). If , by choosing , we have
or
with the same stronger assumption on smoothness as in case (I).
These two inequalities both imply that in this case we have an oracle complexity of $\mathcal{O}\big(n\log\frac{1}{\epsilon}\big)$ in expectation.
That is, for strongly convex objectives, SSNM yields a fast $\mathcal{O}\big((n+\sqrt{n\kappa})\log\frac{1}{\epsilon}\big)$ oracle complexity, which keeps up with the best known oracle complexity achieved by accelerated SVRG (Frostig et al., 2015; Allen-Zhu, 2017).
4.1 Proof of Main Theorem
In order to prove the Main Theorem, we need the following useful lemma to bound the new iterate after the proximal gradient step:
Lemma 2.
If two vectors , satisfy with a constant vector and a strongly convex function , then for all , we have
Proof.
This lemma is identical to Lemma 3.5 in (Allen-Zhu, 2017). ∎
First, we analyze Algorithm 1 at the $k$th iteration, given that the randomness from previous iterations is fixed.
We start with the convexity of at . By definition, we have
where uses the definition of the th entry of “coupled table” that .
As we will see, the first term in the right hand side is used to cancel the inner product term in the variance bound.
By taking expectation with respect to sample and using the unbiasedness that , we obtain
(4) 
In order to bound , we use the smoothness of at , which is
Taking expectation with respect to sample and using our choice of as well as the definition of “coupled table”, we conclude that
Taking expectation with respect to sample , we obtain
(5) 
Here we see the effect of the independent sample $I_k$. It decouples the randomness of the gradient sample $i_k$ from the update position so as to make the above inequalities valid.
Here we add a constraint that , which is identical to the constraint used in (Zhou et al., 2018). Using Young’s inequality to upper bound with , we can simplify the above inequality as
By applying Lemma 1 to upper bound the variance term, we see that the additional variance term in the variance bound is canceled by the sampled momentum, which comes to
(6) 
Using the convexity of and that , we have
After taking expectation with respect to sample and sample , we obtain
Dividing the above inequality by and adding both sides by , we obtain
In order to give a clean proof, we denote and , then we can write the contraction as
(7) 
Case I: Consider the first case with , choosing and , we first evaluate the parameter constraint:
which means that the constraint is satisfied by our parameter choices.
Moreover, with this choice of , we have
Thus, the contraction (7) can be written as
After telescoping the above contraction from and taking expectation with respect to all randomness, we have
Note that . After substituting the parameter choices, we have
If we further have the smoothness of , by using , we can write the above inequality as
Case II: Consider another case with , choosing , . Again, we first evaluate the constraint:
Then by rewriting the contraction (7), telescoping from and taking expectation with respect to all randomness, we obtain
By substituting the parameter choices, we have
or
with a slightly stronger assumption on smoothness.
5 Experiments
In this section, we perform some experiments to examine the practical performance of SSNM as well as to justify the proofs. All the algorithms were implemented in C++ and executed through a MATLAB interface for fair comparison. We ran experiments on an HP Z440 machine with a single Intel Xeon E5-1630 v4 (3.70 GHz cores), 16 GB RAM, Ubuntu 16.04 LTS with GCC 4.9.0, and MATLAB R2017b.
We are optimizing the following binary problems with , , :
where is the regularization parameter and all the datasets used were normalized before the experiments.
The experiments were designed as ill-conditioned problems (with a very small regularization parameter), since the ill-conditioned regime is where accelerated first-order methods take effect. We test the following algorithms with their corresponding parameter settings:

SAGA. We set the learning rate as , which is analyzed theoretically in (Defazio et al., 2014).

SSNM. We use the same settings as suggested in Algorithm 1, which are and .

Katyusha. As suggested by the author, we fixed , set and chose (Allen-Zhu, 2017) (in the notation of the original work).

MiG. We set and chose as analyzed in (Zhou et al., 2018).
Experimental results are shown in Figure 1. As we can see in the results, SSNM converges very fast on the ill-conditioned problems, which justifies the theoretical improvement of . In fact, we were surprised that SSNM performs so well in the experiments with the covtype dataset. In these two results, SSNM is even significantly faster than Katyusha and MiG in terms of the number of epochs, both of which yield the same convergence rate as SSNM. These results may be explained by the choice of in Katyusha and MiG, which is always empirically set to but actually, the tuning of will greatly affect the performance of Katyusha and MiG.
However, as shown in the results, the convergence of SSNM, although very fast, is somewhat unstable compared with the other three methods. This can be partially explained by the double sampling trick used in SSNM, which greatly increases the uncertainty inside each iteration.
5.1 Effectiveness of sample $I_k$
A natural question is: can we use the sample $i_k$ (the sample of the stochastic gradient) instead of an independent sample $I_k$ in the table-update step of Algorithm 1? We empirically evaluate the effect of the sample $I_k$ as shown in Figure 2. As we can see, using the sample $i_k$ makes the algorithm even more unstable and slower in convergence compared with using an independent sample $I_k$. This effect can probably be explained by some kind of variance accumulation when using the sample $i_k$.
6 Conclusions
In this paper, we proposed SSNM, an accelerated variant of SAGA, which uses the negative momentum trick in a novel way. Theoretical results show that SSNM enjoys the best known bound for strongly convex problems, and our experiments justified such improvement for the ill-conditioned problems. Admittedly, the memory consumption of SSNM is somewhat high, but considering the good performance and the general objective assumption, SSNM is still a valuable algorithm with high potential. We hope such a method will inspire researchers to further develop and utilize the acceleration tricks in stochastic optimization.
References
Allen-Zhu [2017] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In STOC, 2017.
Defazio [2016] A. Defazio. A simple practical accelerated method for finite sums. In NIPS, pages 676–684, 2016.
Defazio et al. [2014] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
Frostig et al. [2015] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, pages 2540–2548, 2015.

Johnson and Zhang [2013] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
Konečný et al. [2016] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Sign. Proces., 10(2):242–255, 2016.
Lin et al. [2015] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In NIPS, pages 3366–3374, 2015.
Lin et al. [2014] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In NIPS, pages 3059–3067, 2014.
Mania et al. [2017] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim., 27(4):2202–2229, 2017.
Nesterov [2004] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publ., Boston, 2004.
Nitanda [2014] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In NIPS, pages 1574–1582, 2014.
Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.
Roux et al. [2012] N. L. Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.
Schmidt et al. [2017] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162:83–112, 2017.
Shalev-Shwartz and Zhang [2014] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In ICML, pages 64–72, 2014.
Xiao and Zhang [2014] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim., 24(4):2057–2075, 2014.
Zhang and Xiao [2015] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In ICML, pages 353–361, 2015.
Zhou et al. [2018] K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In ICML, 2018.