Several machine learning and optimization problems involve the minimization of a smooth, convex and separable cost function :
where the -dimensional variable represents model parameters, and each of the functions
depends on a single data point. Linear regression is such an example: given pointsin , one seeks that minimizes the sum of ,
. Training of neural networks[7, 13]
, multi-class logistic regression[13, 23], image classification , matrix factorization  and many more tasks in machine learning entail an optimization of similar form.
Batch gradient descent schemes can effectively solve small- or moderate-scale instances of (1). Often though, the volume of input data outgrows our computational capacity, posing major challenges. Classic batch optimization methods [19, 6] perform several passes over the entire input dataset to compute the full gradient, or even the Hessian111In this work, we will focus on first-order methods only. Extensions to higher-order schemes is left for future work., in each iteration, incurring a prohibitive cost for very large problems.
Stochastic optimization methods overcome this hurdle by computing only a surrogate of the full gradient , based on a small subset of the input data. For instance, the popular SGD  scheme in each iteration takes a small step in a direction determined by a single, randomly selected data point. This imperfect gradient step results in smaller progress per-iteration, though manyfold in the time it would take for a batch gradient descent method to compute a full gradient .
Nevertheless, the approximate ‘gradients’ of stochastic methods introduce variance in the course of the optimization. Notably, vanilla SGD methods can deviate from the optimum, even if the initialization point is the optimum . To ensure convergence, the learning rate has to decay to zero, which results to sublinear convergence rates , a significant degradation from the linear rate achieved by batch gradient methods.
A recent line of work [23, 13, 14, 15] has made promising steps towards the middle ground of these two extremes. A full gradient computation is occasionally interleaved with the inexpensive steps of SGD, dividing the course of the optimization in epochs
. Within an epoch, descent directions are formed as a linear combination of an approximate gradient (as in vanillaSGD
) and a full gradient vector computed at the beginning of the epoch. Though not always up-to-date, the full gradient information reduces the variance of gradient estimates and provably speeds up the convergence.
Yet, as the size of the problem grows, even an infrequent computation of the full gradient may severely impede the progress of these variance-reduction approaches. For instance, when training large neural networks [20, 29, 7]), the volume of the input data rules out the possibility of computing a full gradient within any reasonable time window. Moreover, in a distributed setting, accessing the entire dataset may incur significant tail latencies . On the other hand, traditional stochastic methods exhibit low convergence rates and in practice frequently fail to come close to the optimal solution in reasonable amount of time.
The above motivate to design algorithms that try to compromise the two extremes circumventing the costly computation of the full gradient, while admitting favorable convergence rate guarantees. In this work, we reconsider the computational resource allocation problem in stochastic variance-reduction schemes: given a limited budget of atomic gradient computations, how can we utilize those resources in the course of the optimization to achieve faster convergence? Our contributions can be summarized as follows:
We propose CheapSVRG, a variant of the popular Svrg scheme . Similarly to Svrg, our algorithm divides time into epochs, but at the beginning of each epoch computes only a surrogate of the full gradient using a subset of the input data. Then, it computes a sequence of estimates using a modified version of SGD steps. Overall, CheapSVRG can be seen as a family of stochastic optimization schemes encompassing Svrg and vanilla SGD. It exposes a set of tuning knobs that control trade-offs between the per-iteration computational complexity and the convergence rate.
We supplement our theoretical analysis with experiments on synthetic and real data. Empirical evaluation supports our claims for linear convergence and shows that CheapSVRG performs at least competitively with the state of the art.
2 Related work
There is extensive literature on classic SGD approaches. We refer the reader to [4, 3] and references therein for useful pointers. Here, we focus on works related to variance reduction using gradients, and consider only primal methods; see [27, 9, 17] for dual.
Roux et al. in  are among the first that considered variance reduction methods in stochastic optimization. Their proposed scheme, Sag, achieves linear convergence under smoothness and strong convexity assumptions and is computationally efficient: it performs only one atomic gradient calculation per iteration. However, it is not memory efficient222The authors show how to reduce memory requirements in the case where depends on a linear combination of . as it requires storing all intermediate atomic gradients to generate approximations of the full gradient and, ultimately, achieve variance reduction.
In , Johnson and Zhang improve upon  with their Stochastic Variance-Reduced Gradient (Svrg) method, which both achieves linear convergence rates and does not require the storage of the full history of atomic gradients. However, Svrg requires a full gradient computation per epoch. The S2gd method of  follows similar steps with Svrg, with the main difference lying in the number of iterations within each epoch, which is chosen according to a specific geometric law. Both  and  rely on the assumptions that is strongly convex and ’s are smooth.
Recently, Defazio et al. propose Saga , a fast incremental gradient method in the spirit of Sag and Svrg. Saga works for both strongly and plain convex objective functions, as well as in proximal settings. However, similarly to its predecessor , it does not admit low storage cost.
While writing this paper, we also became aware of the independent work of  and , where similar ideas are developed. In the first case, the authors consider a streaming version of Svrg algorithm and, the purpose of that research is different from the present work, as well as the theoretical analysis followed. The results presented in the latter work are of the same flavor with ours.
3 Our variance reduction scheme
We consider the minimization in (1). In the th iteration, vanilla SGD generates a new estimate
based on the previous estimate and the atomic gradient of a component , where index is selected uniformly at random from . The intuition behind SGD is that in expectation its update direction aligns with the gradient descent update. But, contrary to gradient descent, SGD is not guaranteed to move towards the optimum in each single iteration. To guarantee convergence, it employs a decaying sequence of step sizes , which in turn impacts the rate at which convergence occurs.
Svrg  alleviates the need for decreasing step size by dividing time into epochs and interleaving a computation of the full gradient between consecutive epochs. The full gradient information where is the estimate available at the beginning of the th epoch, is used to steer the subsequent steps and counterbalance the variance introduced by the randomness of the stochastic updates. Within the th epoch, Svrg computes a sequence of estimates where , and
is a linear combination of full and atomic gradient information. Based on this sequence, it computes the next estimate , which is passed down to the next epoch. Note that is an unbiased estimator of the gradient , i.e., .
As the number of components grows large, the computation of the full gradient , at the beginning of each epoch, becomes a computational bottleneck. A natural alternative is to compute a surrogate , using only a small subset of the input data.
We propose CheapSVRG, a variance-reduction stochastic optimization scheme. Our algorithm can be seen as a unifying scheme of existing stochastic methods including Svrg and vanilla SGD. Its steps are outlined in Alg. 1.
CheapSVRG divides time into epochs. The th epoch begins at an estimate , inherited from the previous epoch. For the first epoch, that estimate is given as input, . The algorithm selects a set uniformly at random, with cardinality , for some parameter . Using only the components of indexed by , it computes
a surrogate of the full-gradient .
Within the th epoch, the algorithm generates a sequence of estimates , , through an equal number of SGD-like iterations, using a modified, ‘biased’ update rule. Similarly to Svrg, starting from , in the th iteration, it computes
where is a constant step-size and
The index is selected uniformly at random from , independently across iterations.333In the Appendix, we also consider the case where the inner loop uses a mini-batch instead of a single component . The cardinality is a user parameter. The estimates obtained from the iterations of the inner loop (lines -), are averaged to yield the estimate of the current epoch, and is used to initialize the next.
Note that during this SGD phase, the index set is fixed. Taking the expectation w.r.t. index , we have
Unless , the update direction is a biased estimator of . This is a key difference from the update direction used by Svrg in . Of course, since is selected uniformly at random in each epoch, then across epochs where the expectation is with respect to the random choice of . Hence, on expectation, the update direction can be considered an unbiased surrogate of .
Our algorithm can be seen as a unifying framework, encompassing existing stochastic optimization methods. If the tunning parameter is set equal to , the algorithm reduces to vanilla SGD, while for , we recover Svrg. Intuitively, establishes a trade-off between the quality of the full-gradient surrogate generated at the beginning of each epoch and the associated computational cost.
4 Convergence analysis
In this section, we provide a theoretical analysis of our algorithm under standard assumptions, along the lines of [25, 18]. We begin by defining those assumptions and the notation used in the remainder of this section.
We use to denote the set . For an index in , denotes the atomic gradient on the th component . We use
to denote the expectation with respect the random variable. With a slight abuse of notation, we use to denote the expectation with respect to .
Our analysis is based on the following assumptions, which are common across several works in the stochastic optimization literature.
Assumption 1 (Lipschitz continuity of ).
Each in (1) has -Lipschitz continuous gradients, i.e., there exists a constant such that for any ,
Assumption 2 (Strong convexity of ).
The function is -strongly convex for some constant , i.e., for any ,
Assumption 3 (Component-wise bounded gradient).
There exists such that , in the domain of , for all .
Observe that Asm. 3 is satisfied if the components are -Lipschitz functions. Alternatively, Asm. 3 is satisfied when is -Lipschitz function and . This is known as the strong growth condition .444This condition is rarely satisfied in many practical cases. However, similar assumptions have been used to show convergence of Gauss-Newton-based schemes , as well as deterministic incremental gradient methods [28, 30].
Assumption 4 (Bounded Updates).
For each of the estimates , we assume that the expected distance is upper bounded by a constant. Equivalently, there exists such that
We note that Asm. 4 is non-standard, but was required for our analysis. An analysis without this assumption is an interesting open problem.
We show that, under Asm. 1-4, the algorithm will converge –in expectation– with respect to the objective value, achieving a linear rate, up to a constant neighborhood of the optimal, depending on the configuration parameters and the problem at hand. Similar results have been reported for SGD , as well as deterministic incremental gradient methods .
Theorem 4.1 (Convergence).
We remark the following:
The condition ensures convergence up to a neighborhood around . In turn, we require that
for appropriate .
The value of in Thm. 4.1 is similar to that of : for sufficiently large , there is a -factor deterioration in the convergence rate, due to the parameter . We note, however, that our result differs from  in that Thm. 4.1 guarantees convergence up to a neighborhood around . To achieve the same convergence rate with , we require , which in turn implies that . To see this, consider a case where the condition number is constant and . Based on the above, we need . This further implies that, in order to bound the additive term in Thm. 4.1, is required for .
The following theorem establishes the analytical complexity of CheapSVRG; the proof is provided in the Appendix.
Theorem 4.2 (Complexity).
For some accuracy parameter , if , then for suitable , , and
Alg. 1 outputs such that . Moreover, the total complexity is atomic gradient computations.
4.2 Proof of Theorem 4.1
We proceed with an analysis of Alg. 1, and in turn the proof of Thm. 4.1, starting from its core inner loop (Lines -) and show that in expectation, the steps of the inner loop make progress towards the optimum point. Then, we move outwards to the ‘wrapping’ loop that defines consecutive epochs.
Inner loop. Fix an epoch, say the th iteration of the outer loop. Starting from a point (which is effectively the estimate of the previous epoch), the inner loop performs steps, using the partial gradient information vector , for a fixed set .
Consider the th iteration of the inner loop. For now, let the estimates generated during previous iterations be known history. By the update in Line , we have
where the expectation is with respect to the choice of index . We develop an upper bound for the right hand side of (3). By the definition of in Line 10,
Inequality (6) establishes an upper bound on the distance of the -th iteration estimate from the optimum, conditioned on the random choices of previous iterations. Taking expectation over , for any , we have
Note that , , , and, are constants w.r.t. . Summing over ,
For the second term on the right hand side, we have:
By the convexity of ,
Also, by the strong convexity of (Asm. 2),
Continuing from (10), taking into account the above and recalling that , we obtain
The last sum in (11) can be further upper bounded:
The first inequality follows from Cauchy-Schwarz, the second from the fact that is independent from the random variables ( and are fixed), while the last one follows from Asm. 4. Incorporating the above upper bound in (11), we obtain
where The inequality in (12) effectively establishes a recursive bound on using only the estimate sequence produced by the epochs.
Outer Loop. Taking expectation over , assuming that is such that , we have
To further bound the right-hand side, note that:
Let . Also, let
Finally, unfolding the above recursion, we obtain where is defined in Thm. 4.1. This completes the proof of the theorem.
We empirically evaluate CheapSVRG on synthetic and real data and compare mainly with Svrg . We show that in some cases it improves upon existing stochastic optimization methods, and discuss its properties, strengths and weaknesses.
5.1 Properties of CheapSVRG
We consider a synthetic linear regression problem: given a set of training samples , , , where and , we seek the solution to
We generate an instance of the problem as follows. First, we randomly select a point
from a spherical Gaussian distribution and rescale to unit-norm; this point serves as our ‘ground truth’. Then, we randomly generate a sequence of ’s i.i.d. according to a Gaussian distribution. Let be the matrix formed by stacking the samples , . We compute , where is a noise term drawn from , with -norm rescaled to a desired value controlling the noise level.
We set and . Let where
denotes the maximum singular value of. We run the classic SGD method with decreasing step size , the Svrg method of Johnson and Zhang  and, our CheapSVRG for parameter values , which covers a wide spectrum of possible configurations for .
Step size selection.
We study the effect of the step size on the performance of the algorithms; see Figure 1. The horizontal axis represents the number effective passes over the data: evaluating component gradients, or computing a single full gradient is considered as one effective pass. The vertical axis depicts the progress of the objective in (15).
We plot the performance for three step sizes: , for and . Observe that Svrg becomes slower if the step size is either too big or too small, as also reported in [13, 31]. The middle value was the best555Determined via binary search. for Svrg in the range we considered. Note that each algorithm achieves its peak performance for a different value of the step size. In subsequent experiments, however, we will use the above value which was best for Svrg.
Overall, we observed that CheapSVRG is more ‘flexible’ in the choice of the step size. In Figure 1 (right), with a suboptimal choice of step size, Svrg oscillates and progresses slowly. On the contrary, CheapSVRG converges nice even for . It is also worth noting CheapSVRG with , i.e., effectively combining two datapoints in each stochastic update, achieves a substantial improvement compared to vanilla SGD.
Resilience to noise.
We study the behavior of the algorithms with respect to the noise magnitude. We consider the cases . In Figure 3, we focus on four distinct noise levels and plot the distance of the estimate from the ground truth vs the number of effective data passes. For SGD, we use the sequence of step sizes .
We also note the following surprising result: in the noiseless case, it appears that is sufficient for linear convergence in practice; see Figure 3. In contrast, CheapSVRG is less resilient to noise than Svrg – however, we can still get to a good solution with less computational complexity per iteration.
with standard deviation(noiseless), , , and . Each scheme is executed times/instance. We plot the median over the Monte Carlo iterations.
Number of inner loop iterations.
Let denote a budget of atomic gradient computations. We study how the objective value decreases with respect to percentage perc of the budget allocated to the inner loop. We first run a classic gradient descent with step size which converges within iterations. Based on this, we choose our global budget to be . We consider the following values for perc: . E.g., when perc %, only atomic gradient calculations are spent in outer loop iterations. The results are depicted in Fig. 2.
We observe that convergence is slower as fewer computations are spent in outer iterations. Also, in contrast to Svrg, our algorithm appears to be sensitive to the choice of perc: for , our scheme diverges, while Svrg finds relatively good solution.
5.2 -regularized logistic regression
We consider the regularized logistic regression problem, i.e., the minimization
Here, , where indicates the binary label in a classification problem, represents the predictor, and is a regularization parameter.
We focus on the training loss in such a task.666By , we already know that such two-stage SGD schemes perform better than vanilla SGD. We use the real datasets listed in Table 1. We pre-process the data so that , as in . This leads to an upper bound on Lipschitz constants for each such that . We set for all algorithms under consideration, according to [13, 31], perc % and, for all problem cases.
Fig. LABEL:fig:exp4 depicts the convergence results for the marti0, reged0 and sido0 datasets. CheapSVRG achieves comparable performance to Svrg, while requiring less computational ‘effort’ per epoch: though smaller values of , such that or , lead to slower convergence, CheapSVRG still performs steps towards the solution, while the complexity per epoch is significantly diminished.
We proposed CheapSVRG, a new variance-reduction scheme for stochastic optimization, based on . The main difference is that instead of computing a full gradient in each epoch, our scheme computes a surrogate utilizing only part of the data, thus, reducing the per-epoch complexity. CheapSVRG comes with convergence guarantees: under assumptions, it achieves a linear convergence rate up to some constant neighborhood of the optimal. We empirically evaluated our method and discussed its strengths and weaknesses.
There are several future directions. In the theory front, it would be interesting to maintain similar convergence guarantees under fewer assumptions, extend our results beyond the smooth convex optimization, e.g., to the proximal setting, or develop distributed variants. Finally, we seek to apply our CheapSVRG to large-scale problems, e.g., for training large neural networks. We hope that this will help us better understand the properties of CheapSVRG and the trade-offs associated with its various configuration parameters.
-  Zeyuan Allen-Zhu and Yang Yuan. Univr: A universal variance reduction framework for proximal stochastic gradient method. arXiv preprint arXiv:1506.01972, 2015.
-  Dimitri P Bertsekas. Nonlinear programming. 1999.
-  Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.
-  Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
-  Guillaume Bouchard, Théo Trouillon, Julien Perez, and Adrien Gaidon. Accelerating stochastic gradient descent via online learning to sample. arXiv preprint arXiv:1506.09016, 2015.
-  Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
-  Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, and Quoc V Le. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
-  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
-  Aaron J Defazio, Tibério S Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. arXiv preprint arXiv:1407.2710, 2014.
-  Roy Frostig, Rong Ge, Sham M Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. arXiv preprint arXiv:1412.6606, 2014.
-  Isabelle Guyon. A pharmacology dataset. 2008.
-  Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konecny, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2242–2250, 2015.
-  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
-  Jakub Konecny and Peter Richtarik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.
-  Jakub Konevcny, Jie Liu, Peter Richtarik, and Martin Takac. ms2gd: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv preprint arXiv:1410.4744, 2014.
-  Jason Lee, Tengyu Ma, and Qihang Lin. Distributed stochastic variance reduced gradient methods. arXiv preprint arXiv:1507.07595, 2015.
-  Julien Mairal. Optimization with first-order surrogate functions. arXiv preprint arXiv:1305.3120, 2013.
-  Angelia Nedić and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pages 223–264. Springer, 2001.
-  Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.
Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Quoc V Le, and
Andrew Y Ng.
On optimization methods for deep learning.In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 265–272, 2011.
-  Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. arXiv preprint arXiv:1506.06840, 2015.
-  Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
-  Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
-  Christopher D Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2332–2341, 2015.
-  Mark Schmidt. Convergence rate of stochastic gradient with constant step size. 2014.
-  Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
-  Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
-  Mikhail V Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
-  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1139–1147, 2013.
-  Paul Tseng. An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
-  Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
-  Ruiliang Zhang, Shuai Zheng, and James T Kwok. Fast distributed asynchronous sgd with variance reduction. arXiv preprint arXiv:1508.01633, 2015.
-  Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Proof of Theorem 4.2
By assumptions of the theorem, we have:
As mentioned in the remarks of Section 4, the above conditions are sufficient to guarantee , for some . Further, for given accuracy parameter , we assume .
Let us define
as in the proof of Theorem 4.1. In order to satisfy , it is sufficient to find the required number of iterations such that . In particular: