1 Introduction
Several machine learning and optimization problems involve the minimization of a smooth, convex and separable cost function :
(1) 
where the dimensional variable represents model parameters, and each of the functions
depends on a single data point. Linear regression is such an example: given points
in , one seeks that minimizes the sum of ,. Training of neural networks
[7, 13], multiclass logistic regression
[13, 23], image classification [5], matrix factorization [24] and many more tasks in machine learning entail an optimization of similar form.Batch gradient descent schemes can effectively solve small or moderatescale instances of (1). Often though, the volume of input data outgrows our computational capacity, posing major challenges. Classic batch optimization methods [19, 6] perform several passes over the entire input dataset to compute the full gradient, or even the Hessian^{1}^{1}1In this work, we will focus on firstorder methods only. Extensions to higherorder schemes is left for future work., in each iteration, incurring a prohibitive cost for very large problems.
Stochastic optimization methods overcome this hurdle by computing only a surrogate of the full gradient , based on a small subset of the input data. For instance, the popular SGD [22] scheme in each iteration takes a small step in a direction determined by a single, randomly selected data point. This imperfect gradient step results in smaller progress periteration, though manyfold in the time it would take for a batch gradient descent method to compute a full gradient [4].
Nevertheless, the approximate ‘gradients’ of stochastic methods introduce variance in the course of the optimization. Notably, vanilla SGD methods can deviate from the optimum, even if the initialization point is the optimum [13]. To ensure convergence, the learning rate has to decay to zero, which results to sublinear convergence rates [22], a significant degradation from the linear rate achieved by batch gradient methods.
A recent line of work [23, 13, 14, 15] has made promising steps towards the middle ground of these two extremes. A full gradient computation is occasionally interleaved with the inexpensive steps of SGD, dividing the course of the optimization in epochs
. Within an epoch, descent directions are formed as a linear combination of an approximate gradient (as in vanilla
SGD) and a full gradient vector computed at the beginning of the epoch. Though not always uptodate, the full gradient information reduces the variance of gradient estimates and provably speeds up the convergence.
Yet, as the size of the problem grows, even an infrequent computation of the full gradient may severely impede the progress of these variancereduction approaches. For instance, when training large neural networks [20, 29, 7]), the volume of the input data rules out the possibility of computing a full gradient within any reasonable time window. Moreover, in a distributed setting, accessing the entire dataset may incur significant tail latencies [33]. On the other hand, traditional stochastic methods exhibit low convergence rates and in practice frequently fail to come close to the optimal solution in reasonable amount of time.
Contributions.
The above motivate to design algorithms that try to compromise the two extremes circumventing the costly computation of the full gradient, while admitting favorable convergence rate guarantees. In this work, we reconsider the computational resource allocation problem in stochastic variancereduction schemes: given a limited budget of atomic gradient computations, how can we utilize those resources in the course of the optimization to achieve faster convergence? Our contributions can be summarized as follows:

We propose CheapSVRG, a variant of the popular Svrg scheme [13]. Similarly to Svrg, our algorithm divides time into epochs, but at the beginning of each epoch computes only a surrogate of the full gradient using a subset of the input data. Then, it computes a sequence of estimates using a modified version of SGD steps. Overall, CheapSVRG can be seen as a family of stochastic optimization schemes encompassing Svrg and vanilla SGD. It exposes a set of tuning knobs that control tradeoffs between the periteration computational complexity and the convergence rate.

We supplement our theoretical analysis with experiments on synthetic and real data. Empirical evaluation supports our claims for linear convergence and shows that CheapSVRG performs at least competitively with the state of the art.
2 Related work
There is extensive literature on classic SGD approaches. We refer the reader to [4, 3] and references therein for useful pointers. Here, we focus on works related to variance reduction using gradients, and consider only primal methods; see [27, 9, 17] for dual.
Roux et al. in [23] are among the first that considered variance reduction methods in stochastic optimization. Their proposed scheme, Sag, achieves linear convergence under smoothness and strong convexity assumptions and is computationally efficient: it performs only one atomic gradient calculation per iteration. However, it is not memory efficient^{2}^{2}2The authors show how to reduce memory requirements in the case where depends on a linear combination of . as it requires storing all intermediate atomic gradients to generate approximations of the full gradient and, ultimately, achieve variance reduction.
In [13], Johnson and Zhang improve upon [23] with their Stochastic VarianceReduced Gradient (Svrg) method, which both achieves linear convergence rates and does not require the storage of the full history of atomic gradients. However, Svrg requires a full gradient computation per epoch. The S2gd method of [14] follows similar steps with Svrg, with the main difference lying in the number of iterations within each epoch, which is chosen according to a specific geometric law. Both [13] and [14] rely on the assumptions that is strongly convex and ’s are smooth.
Recently, Defazio et al. propose Saga [8], a fast incremental gradient method in the spirit of Sag and Svrg. Saga works for both strongly and plain convex objective functions, as well as in proximal settings. However, similarly to its predecessor [23], it does not admit low storage cost.
While writing this paper, we also became aware of the independent work of [10] and [12], where similar ideas are developed. In the first case, the authors consider a streaming version of Svrg algorithm and, the purpose of that research is different from the present work, as well as the theoretical analysis followed. The results presented in the latter work are of the same flavor with ours.
3 Our variance reduction scheme
We consider the minimization in (1). In the th iteration, vanilla SGD generates a new estimate
based on the previous estimate and the atomic gradient of a component , where index is selected uniformly at random from . The intuition behind SGD is that in expectation its update direction aligns with the gradient descent update. But, contrary to gradient descent, SGD is not guaranteed to move towards the optimum in each single iteration. To guarantee convergence, it employs a decaying sequence of step sizes , which in turn impacts the rate at which convergence occurs.
Svrg [13] alleviates the need for decreasing step size by dividing time into epochs and interleaving a computation of the full gradient between consecutive epochs. The full gradient information where is the estimate available at the beginning of the th epoch, is used to steer the subsequent steps and counterbalance the variance introduced by the randomness of the stochastic updates. Within the th epoch, Svrg computes a sequence of estimates where , and
is a linear combination of full and atomic gradient information. Based on this sequence, it computes the next estimate , which is passed down to the next epoch. Note that is an unbiased estimator of the gradient , i.e., .
As the number of components grows large, the computation of the full gradient , at the beginning of each epoch, becomes a computational bottleneck. A natural alternative is to compute a surrogate , using only a small subset of the input data.
Our scheme.
We propose CheapSVRG, a variancereduction stochastic optimization scheme. Our algorithm can be seen as a unifying scheme of existing stochastic methods including Svrg and vanilla SGD. Its steps are outlined in Alg. 1.
CheapSVRG divides time into epochs. The th epoch begins at an estimate , inherited from the previous epoch. For the first epoch, that estimate is given as input, . The algorithm selects a set uniformly at random, with cardinality , for some parameter . Using only the components of indexed by , it computes
(2) 
a surrogate of the fullgradient .
Within the th epoch, the algorithm generates a sequence of estimates , , through an equal number of SGDlike iterations, using a modified, ‘biased’ update rule. Similarly to Svrg, starting from , in the th iteration, it computes
where is a constant stepsize and
The index is selected uniformly at random from , independently across iterations.^{3}^{3}3In the Appendix, we also consider the case where the inner loop uses a minibatch instead of a single component . The cardinality is a user parameter. The estimates obtained from the iterations of the inner loop (lines ), are averaged to yield the estimate of the current epoch, and is used to initialize the next.
Note that during this SGD phase, the index set is fixed. Taking the expectation w.r.t. index , we have
Unless , the update direction is a biased estimator of . This is a key difference from the update direction used by Svrg in [13]. Of course, since is selected uniformly at random in each epoch, then across epochs where the expectation is with respect to the random choice of . Hence, on expectation, the update direction can be considered an unbiased surrogate of .
Our algorithm can be seen as a unifying framework, encompassing existing stochastic optimization methods. If the tunning parameter is set equal to , the algorithm reduces to vanilla SGD, while for , we recover Svrg. Intuitively, establishes a tradeoff between the quality of the fullgradient surrogate generated at the beginning of each epoch and the associated computational cost.
4 Convergence analysis
In this section, we provide a theoretical analysis of our algorithm under standard assumptions, along the lines of [25, 18]. We begin by defining those assumptions and the notation used in the remainder of this section.
Notation.
We use to denote the set . For an index in , denotes the atomic gradient on the th component . We use
to denote the expectation with respect the random variable
. With a slight abuse of notation, we use to denote the expectation with respect to .Assumptions.
Our analysis is based on the following assumptions, which are common across several works in the stochastic optimization literature.
Assumption 1 (Lipschitz continuity of ).
Each in (1) has Lipschitz continuous gradients, i.e., there exists a constant such that for any ,
Assumption 2 (Strong convexity of ).
The function is strongly convex for some constant , i.e., for any ,
Assumption 3 (Componentwise bounded gradient).
There exists such that , in the domain of , for all .
Observe that Asm. 3 is satisfied if the components are Lipschitz functions. Alternatively, Asm. 3 is satisfied when is Lipschitz function and . This is known as the strong growth condition [26].^{4}^{4}4This condition is rarely satisfied in many practical cases. However, similar assumptions have been used to show convergence of GaussNewtonbased schemes [2], as well as deterministic incremental gradient methods [28, 30].
Assumption 4 (Bounded Updates).
For each of the estimates , we assume that the expected distance is upper bounded by a constant. Equivalently, there exists such that
We note that Asm. 4 is nonstandard, but was required for our analysis. An analysis without this assumption is an interesting open problem.
4.1 Guarantees
We show that, under Asm. 14, the algorithm will converge –in expectation– with respect to the objective value, achieving a linear rate, up to a constant neighborhood of the optimal, depending on the configuration parameters and the problem at hand. Similar results have been reported for SGD [25], as well as deterministic incremental gradient methods [18].
Theorem 4.1 (Convergence).
We remark the following:

[leftmargin=0.5cm]

The condition ensures convergence up to a neighborhood around . In turn, we require that
for appropriate .

The value of in Thm. 4.1 is similar to that of [13]: for sufficiently large , there is a factor deterioration in the convergence rate, due to the parameter . We note, however, that our result differs from [13] in that Thm. 4.1 guarantees convergence up to a neighborhood around . To achieve the same convergence rate with [13], we require , which in turn implies that . To see this, consider a case where the condition number is constant and . Based on the above, we need . This further implies that, in order to bound the additive term in Thm. 4.1, is required for .
The following theorem establishes the analytical complexity of CheapSVRG; the proof is provided in the Appendix.
Theorem 4.2 (Complexity).
For some accuracy parameter , if , then for suitable , , and
Alg. 1 outputs such that . Moreover, the total complexity is atomic gradient computations.
4.2 Proof of Theorem 4.1
We proceed with an analysis of Alg. 1, and in turn the proof of Thm. 4.1, starting from its core inner loop (Lines ) and show that in expectation, the steps of the inner loop make progress towards the optimum point. Then, we move outwards to the ‘wrapping’ loop that defines consecutive epochs.
Inner loop. Fix an epoch, say the th iteration of the outer loop. Starting from a point (which is effectively the estimate of the previous epoch), the inner loop performs steps, using the partial gradient information vector , for a fixed set .
Consider the th iteration of the inner loop. For now, let the estimates generated during previous iterations be known history. By the update in Line , we have
(3) 
where the expectation is with respect to the choice of index . We develop an upper bound for the right hand side of (3). By the definition of in Line 10,
(4) 
Similarly,
(5) 
The first inequality follows from the fact that , for any , . The second inequality is due to eq. (8) in [13]. Continuing from (3) and taking into account (4) and (5), we have
(6) 
Inequality (6) establishes an upper bound on the distance of the th iteration estimate from the optimum, conditioned on the random choices of previous iterations. Taking expectation over , for any , we have
(7) 
Note that , , , and, are constants w.r.t. . Summing over ,
(8) 
For the second term on the right hand side, we have:
(9) 
where the inequality follows from the convexity of . Continuing from (8), taking into account (9) and the fact that we obtain
(10) 
By the convexity of ,
Also, by the strong convexity of (Asm. 2),
Continuing from (10), taking into account the above and recalling that , we obtain
(11) 
The last sum in (11) can be further upper bounded:
The first inequality follows from CauchySchwarz, the second from the fact that is independent from the random variables ( and are fixed), while the last one follows from Asm. 4. Incorporating the above upper bound in (11), we obtain
(12) 
where The inequality in (12) effectively establishes a recursive bound on using only the estimate sequence produced by the epochs.
Outer Loop. Taking expectation over , assuming that is such that , we have
(13) 
To further bound the righthand side, note that:
where is due to and due to eq. (8) in [13]. By Asm. 3, . Using similar reasoning, under Asm. 3, Combining the above with (13), we get
(14) 
Let . Also, let
and as defined in Thm. 4.1. Taking expectation with respect to , (14) becomes
Finally, unfolding the above recursion, we obtain where is defined in Thm. 4.1. This completes the proof of the theorem.
5 Experiments
We empirically evaluate CheapSVRG on synthetic and real data and compare mainly with Svrg [13]. We show that in some cases it improves upon existing stochastic optimization methods, and discuss its properties, strengths and weaknesses.
5.1 Properties of CheapSVRG
We consider a synthetic linear regression problem: given a set of training samples , , , where and , we seek the solution to
(15) 
We generate an instance of the problem as follows. First, we randomly select a point
from a spherical Gaussian distribution and rescale to unit
norm; this point serves as our ‘ground truth’. Then, we randomly generate a sequence of ’s i.i.d. according to a Gaussian distribution. Let be the matrix formed by stacking the samples , . We compute , where is a noise term drawn from , with norm rescaled to a desired value controlling the noise level.We set and . Let where
denotes the maximum singular value of
. We run the classic SGD method with decreasing step size , the Svrg method of Johnson and Zhang [13] and, our CheapSVRG for parameter values , which covers a wide spectrum of possible configurations for .Step size selection.
We study the effect of the step size on the performance of the algorithms; see Figure 1. The horizontal axis represents the number effective passes over the data: evaluating component gradients, or computing a single full gradient is considered as one effective pass. The vertical axis depicts the progress of the objective in (15).
We plot the performance for three step sizes: , for and . Observe that Svrg becomes slower if the step size is either too big or too small, as also reported in [13, 31]. The middle value was the best^{5}^{5}5Determined via binary search. for Svrg in the range we considered. Note that each algorithm achieves its peak performance for a different value of the step size. In subsequent experiments, however, we will use the above value which was best for Svrg.
Overall, we observed that CheapSVRG is more ‘flexible’ in the choice of the step size. In Figure 1 (right), with a suboptimal choice of step size, Svrg oscillates and progresses slowly. On the contrary, CheapSVRG converges nice even for . It is also worth noting CheapSVRG with , i.e., effectively combining two datapoints in each stochastic update, achieves a substantial improvement compared to vanilla SGD.
Resilience to noise.
We study the behavior of the algorithms with respect to the noise magnitude. We consider the cases . In Figure 3, we focus on four distinct noise levels and plot the distance of the estimate from the ground truth vs the number of effective data passes. For SGD, we use the sequence of step sizes .
We also note the following surprising result: in the noiseless case, it appears that is sufficient for linear convergence in practice; see Figure 3. In contrast, CheapSVRG is less resilient to noise than Svrg – however, we can still get to a good solution with less computational complexity per iteration.
with standard deviation
(noiseless), , , and . Each scheme is executed times/instance. We plot the median over the Monte Carlo iterations.Number of inner loop iterations.
Let denote a budget of atomic gradient computations. We study how the objective value decreases with respect to percentage perc of the budget allocated to the inner loop. We first run a classic gradient descent with step size which converges within iterations. Based on this, we choose our global budget to be . We consider the following values for perc: . E.g., when perc %, only atomic gradient calculations are spent in outer loop iterations. The results are depicted in Fig. 2.
We observe that convergence is slower as fewer computations are spent in outer iterations. Also, in contrast to Svrg, our algorithm appears to be sensitive to the choice of perc: for , our scheme diverges, while Svrg finds relatively good solution.
5.2 regularized logistic regression
We consider the regularized logistic regression problem, i.e., the minimization
(16) 
Here, , where indicates the binary label in a classification problem, represents the predictor, and is a regularization parameter.
We focus on the training loss in such a task.^{6}^{6}6By [13], we already know that such twostage SGD schemes perform better than vanilla SGD. We use the real datasets listed in Table 1. We preprocess the data so that , as in [31]. This leads to an upper bound on Lipschitz constants for each such that . We set for all algorithms under consideration, according to [13, 31], perc % and, for all problem cases.
Fig. LABEL:fig:exp4 depicts the convergence results for the marti0, reged0 and sido0 datasets. CheapSVRG achieves comparable performance to Svrg, while requiring less computational ‘effort’ per epoch: though smaller values of , such that or , lead to slower convergence, CheapSVRG still performs steps towards the solution, while the complexity per epoch is significantly diminished.
Dataset  

marti0  
reged0  
sido0 
6 Conclusions
We proposed CheapSVRG, a new variancereduction scheme for stochastic optimization, based on [13]. The main difference is that instead of computing a full gradient in each epoch, our scheme computes a surrogate utilizing only part of the data, thus, reducing the perepoch complexity. CheapSVRG comes with convergence guarantees: under assumptions, it achieves a linear convergence rate up to some constant neighborhood of the optimal. We empirically evaluated our method and discussed its strengths and weaknesses.
There are several future directions. In the theory front, it would be interesting to maintain similar convergence guarantees under fewer assumptions, extend our results beyond the smooth convex optimization, e.g., to the proximal setting, or develop distributed variants. Finally, we seek to apply our CheapSVRG to largescale problems, e.g., for training large neural networks. We hope that this will help us better understand the properties of CheapSVRG and the tradeoffs associated with its various configuration parameters.
References
 [1] Zeyuan AllenZhu and Yang Yuan. Univr: A universal variance reduction framework for proximal stochastic gradient method. arXiv preprint arXiv:1506.01972, 2015.
 [2] Dimitri P Bertsekas. Nonlinear programming. 1999.
 [3] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.
 [4] Léon Bottou. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 [5] Guillaume Bouchard, Théo Trouillon, Julien Perez, and Adrien Gaidon. Accelerating stochastic gradient descent via online learning to sample. arXiv preprint arXiv:1506.09016, 2015.
 [6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 [7] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, and Quoc V Le. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
 [8] Aaron Defazio, Francis Bach, and Simon LacosteJulien. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
 [9] Aaron J Defazio, Tibério S Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. arXiv preprint arXiv:1407.2710, 2014.
 [10] Roy Frostig, Rong Ge, Sham M Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. arXiv preprint arXiv:1412.6606, 2014.
 [11] Isabelle Guyon. A pharmacology dataset. 2008.
 [12] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konecny, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2242–2250, 2015.
 [13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
 [14] Jakub Konecny and Peter Richtarik. Semistochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.
 [15] Jakub Konevcny, Jie Liu, Peter Richtarik, and Martin Takac. ms2gd: Minibatch semistochastic gradient descent in the proximal setting. arXiv preprint arXiv:1410.4744, 2014.
 [16] Jason Lee, Tengyu Ma, and Qihang Lin. Distributed stochastic variance reduced gradient methods. arXiv preprint arXiv:1507.07595, 2015.
 [17] Julien Mairal. Optimization with firstorder surrogate functions. arXiv preprint arXiv:1305.3120, 2013.
 [18] Angelia Nedić and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pages 223–264. Springer, 2001.
 [19] Yurii Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.

[20]
Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Quoc V Le, and
Andrew Y Ng.
On optimization methods for deep learning.
In Proceedings of the 28th International Conference on Machine Learning (ICML11), pages 265–272, 2011.  [21] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. arXiv preprint arXiv:1506.06840, 2015.
 [22] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 [23] Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
 [24] Christopher D Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some nonconvex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 2332–2341, 2015.
 [25] Mark Schmidt. Convergence rate of stochastic gradient with constant step size. 2014.
 [26] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
 [27] Shai ShalevShwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
 [28] Mikhail V Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
 [29] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML13), pages 1139–1147, 2013.
 [30] Paul Tseng. An incremental gradient (projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
 [31] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
 [32] Ruiliang Zhang, Shuai Zheng, and James T Kwok. Fast distributed asynchronous sgd with variance reduction. arXiv preprint arXiv:1508.01633, 2015.
 [33] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Proof of Theorem 4.2
By assumptions of the theorem, we have:
As mentioned in the remarks of Section 4, the above conditions are sufficient to guarantee , for some . Further, for given accuracy parameter , we assume .
Let us define
as in the proof of Theorem 4.1. In order to satisfy , it is sufficient to find the required number of iterations such that . In particular:
Comments
There are no comments yet.