1 Introduction
The problem we are interested in is to minimize a sum of two convex functions,
(1)  \min_{x \in \mathbb{R}^d} \{ P(x) := F(x) + R(x) \},
where F is the average of a large number of smooth convex functions f_i, i.e.,
(2)  F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x).
We further make the following assumptions:
Assumption 1.
The regularizer R : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is convex and closed. The functions f_i : \mathbb{R}^d \to \mathbb{R} are differentiable and have Lipschitz continuous gradients with constant L > 0. That is,
\| \nabla f_i(x) - \nabla f_i(y) \| \le L \| x - y \|
for all x, y \in \mathbb{R}^d, where \| \cdot \| is the L2 norm.
Hence, the gradient of F is also Lipschitz continuous with the same constant L.
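The latter fact follows directly from the triangle inequality applied to the average defining F in (2):

```latex
\left\| \nabla F(x) - \nabla F(y) \right\|
  = \Bigl\| \tfrac{1}{n} \sum_{i=1}^{n} \bigl( \nabla f_i(x) - \nabla f_i(y) \bigr) \Bigr\|
  \le \tfrac{1}{n} \sum_{i=1}^{n} \left\| \nabla f_i(x) - \nabla f_i(y) \right\|
  \le L \, \| x - y \| .
```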
Assumption 2.
P is strongly convex with parameter \mu > 0, i.e., for all x, y \in \operatorname{dom}(R) and all \xi \in \partial P(x),
(3)  P(y) \ge P(x) + \xi^T (y - x) + \frac{\mu}{2} \| y - x \|^2,
where \partial P(x) is the subdifferential of P at x.
Let \mu_F and \mu_R be the strong convexity parameters of F and R, respectively (we allow both of them to be equal to 0, so this is not an additional assumption). We assume that lower bounds on \mu_F and \mu_R are available.
Related work
There has been intensive interest and activity in solving problems of the form (1) in the past years. An algorithm that made its way into many applied areas is FISTA [1]. However, this method is impractical in the large-scale setting (big n), as it needs to process all n functions in each iteration. Two classes of methods address this issue: randomized coordinate descent methods [13, 15, 16, 11, 5, 19, 9, 10, 14, 4] and stochastic gradient methods [22, 12, 6, 20]. This brief paper is closely related to the works on stochastic gradient methods that use a technique of explicit variance reduction of the stochastic approximation of the gradient. In particular, our method is a minibatch variant of S2GD [8]; the proximal setting was motivated by SVRG [7, 21].
A typical stochastic gradient descent (SGD) method will randomly sample a function f_i and then update the variable x using \nabla f_i(x), an estimate of \nabla F(x). An important limitation of SGD is that it is inherently sequential, and thus difficult to parallelize. In order to enable parallelism, minibatching (sampling multiple examples per iteration) is often employed [17, 3, 2, 23, 20, 18].
Our Contributions.
In this work, we combine the variance reduction ideas for stochastic gradient methods with minibatching. In particular, we develop and analyze mS2GD (Algorithm 1), a minibatch proximal variant of S2GD [8]. To the best of our knowledge, this is the first minibatch stochastic gradient method with reduced variance for problem (1). We show that the method enjoys a twofold benefit compared to previous methods. Apart from admitting a parallel implementation (and hence speedup in clock time in an HPC environment), our results show that in order to attain a specified accuracy, our minibatching scheme can get by with fewer gradient evaluations. This is formalized in Theorem 2, which predicts more than linear speedup up to a certain threshold minibatch size. Another advantage, compared to [21], is that we do not need to average the points in each loop; instead, we simply continue from the last one (this is the approach employed in S2GD [8]).
2 Proximal Algorithms
A popular proximal gradient approach to solving (1) is to form a sequence \{x_k\} via
x_{k+1} = \arg\min_{y \in \mathbb{R}^d} \left\{ F(x_k) + \nabla F(x_k)^T (y - x_k) + \frac{1}{2h} \| y - x_k \|^2 + R(y) \right\}.
Note that the minimized expression is an upper bound on P(y) if h > 0 is a stepsize parameter satisfying h \le 1/L. This procedure can be equivalently written using the proximal operator as follows:
x_{k+1} = \operatorname{prox}_{hR}\left( x_k - h \nabla F(x_k) \right), \quad \text{where} \quad \operatorname{prox}_{hR}(z) := \arg\min_{y} \left\{ \tfrac{1}{2} \| y - z \|^2 + h R(y) \right\}.
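For several common regularizers the proximal operator has a closed form; for instance, for R(x) = \lambda \|x\|_1 it is the soft-thresholding operator. A minimal NumPy sketch (the choice R = \lambda \|\cdot\|_1 is ours, for illustration only; it is not the regularizer used in our experiments):

```python
import numpy as np

def prox_l1(z, h, lam):
    """Proximal operator of R(x) = lam * ||x||_1 with stepsize h:
    argmin_y 0.5 * ||y - z||^2 + h * lam * ||y||_1  (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - h * lam, 0.0)

def prox_grad_step(x, grad_F, h, lam):
    """One deterministic proximal gradient step: x+ = prox_{hR}(x - h * grad F(x))."""
    return prox_l1(x - h * grad_F(x), h, lam)
```

For example, with h = lam = 1 the point (2, -0.5) is mapped to (1, 0): each coordinate shrinks toward zero by h * lam and small coordinates vanish.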
In the large-scale setting it is more efficient to instead consider the stochastic proximal gradient approach, in which the proximal operator is applied to a stochastic gradient step:
(4)  x_{k+1} = \operatorname{prox}_{hR}\left( x_k - h G_k \right),
where G_k is a stochastic estimate of the gradient \nabla F(x_k). Of particular relevance to our work are the SVRG [7], S2GD [8] and Prox-SVRG [21] methods, where the stochastic estimate of \nabla F(x_k) is of the form
(5)  G_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla F(\tilde{x}),
where \tilde{x} is an “old” reference point for which the gradient \nabla F(\tilde{x}) was already computed in the past, and i_k is a random index equal to i \in \{1, \ldots, n\} with probability 1/n. Notice that G_k is an unbiased estimate of the gradient:
\mathbb{E}\left[ G_k \mid x_k \right] = \nabla F(x_k).
Methods such as SVRG [7], S2GD [8] and Prox-SVRG [21] update the points x_k in an inner loop, and the reference point \tilde{x} in an outer loop. This ensures that G_k has low variance, which ultimately leads to extremely fast convergence.
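In code, the estimate (5) can be sketched as follows (naming is ours: grad_f(i, x) returns \nabla f_i(x), and full_grad is \nabla F(\tilde{x}) precomputed at the reference point):

```python
import numpy as np

def vr_estimate(grad_f, x, x_ref, full_grad, i):
    """Variance-reduced gradient estimate (5):
    G = grad f_i(x) - grad f_i(x_ref) + grad F(x_ref).
    Averaging over i = 1..n recovers grad F(x), so the estimate is unbiased;
    as x approaches x_ref, its variance shrinks."""
    return grad_f(i, x) - grad_f(i, x_ref) + full_grad
```

Unbiasedness can be checked numerically: averaging vr_estimate over all indices i equals the exact gradient of the average.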
3 Minibatch S2GD
We now describe the mS2GD method (Algorithm 1).
The main step of our method (Step 8) is given by the update (4), but with the stochastic estimate of the gradient G_k instead formed using a minibatch of examples of size b:
G_k = \nabla F(\tilde{x}) + \frac{1}{b} \sum_{i \in A_k} \left( \nabla f_i(x_k) - \nabla f_i(\tilde{x}) \right),
where A_k \subseteq \{1, \ldots, n\}, with |A_k| = b, is chosen uniformly at random. We run the inner loop for t iterations, where t \in \{1, 2, \ldots, m\} with probability q_t given by
(6)  q_t = \frac{(1 - \nu h)^{m - t}}{\beta}, \qquad \beta := \sum_{t=1}^{m} (1 - \nu h)^{m - t},
where \nu is the available lower bound on the strong convexity parameter.
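One outer iteration of mS2GD can be sketched as follows. This is our own minimal NumPy rendering of Algorithm 1, not a reference implementation: grad_f(i, x) returns \nabla f_i(x), prox(z, h) computes \operatorname{prox}_{hR}(z), and nu is the lower bound used in (6).

```python
import numpy as np

def ms2gd_epoch(grad_f, n, prox, x_ref, h, m, b, nu, rng):
    """One outer iteration of mS2GD (sketch)."""
    # Full gradient at the reference point: grad F(x_ref).
    full_grad = sum(grad_f(i, x_ref) for i in range(n)) / n
    # Draw the inner-loop length t in {1, ..., m} with probability
    # proportional to (1 - nu*h)^(m - t), as in (6).
    weights = (1.0 - nu * h) ** (m - np.arange(1, m + 1))
    t = rng.choice(np.arange(1, m + 1), p=weights / weights.sum())
    x = x_ref.copy()
    for _ in range(t):
        # Minibatch variance-reduced estimate of grad F(x).
        batch = rng.choice(n, size=b, replace=False)
        G = full_grad + sum(grad_f(i, x) - grad_f(i, x_ref) for i in batch) / b
        # Proximal step (4).
        x = prox(x - h * G, h)
    return x  # becomes the next reference point x_ref
```

On a toy smooth problem (f_i(x) = 0.5 * ||x - c_i||^2, R = 0, so prox is the identity), repeated epochs converge to the mean of the points c_i, the minimizer of F.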
4 Complexity Result
In this section, we state our main complexity result and comment on how to optimally choose the parameters of the method.
Theorem 1.
Remark 1.
4.1 Minibatch speedup
In order to see the speedup we can gain from the minibatch strategy, and due to the many parameters in the complexity result (Theorem 1), we need to fix some of the parameters. For simplicity, we will use lower bounds equal to the true strong convexity parameters, so we can analyse (8) instead of (7). Let us consider the case when we also fix the number of outer iterations. Once the minibatch size b is fixed, and in order to reach a given accuracy, we obtain the value of m which will guarantee the result.
Let us now fix a target decrease in a single epoch, \rho. For any minibatch size b, define (h^*(b), m_b) to be the optimal pair of stepsize and inner-loop size such that the per-epoch decrease equals \rho. This pair is optimal in the sense that m_b is the smallest possible, because we are interested in minimizing the computational effort, and hence minimizing m_b. If we set b = 1, we recover the optimal choice of parameters without minibatching. If m_b \le m_1 / b, then we can reach the same accuracy with fewer evaluations of the gradient of a function f_i. The following theorem states the formula for h^*(b) and m_b. Equation (9) shows that as long as the minibatch size stays below a certain threshold, m_b is decreasing at a rate faster than 1/b. Hence, we can attain the same accuracy with less work, compared to the case b = 1.
Theorem 2.
Fix a target decrease \rho, where the per-epoch decrease is given by (8). Then, if we consider the minibatch size b to be fixed, the choice of stepsize h^*(b) and size of the inner loop m_b that minimizes the work done (the number of gradients evaluated) while achieving the target decrease is given by the following formulas:
If then and
(9) 
Otherwise and
5 Experiments
In this section we present a preliminary experiment and an insight into the possible speedup achievable by parallelism. Figure 2 shows experiments on L2-regularized logistic regression on the RCV1 dataset.^1 We compare S2GD (blue, squares) and mS2GD (green, circles) with a fixed minibatch size, without any parallelism. The figure demonstrates that one can achieve the same accuracy with less work. The green dashed line is the ideal (most likely practically unachievable) result with parallelism (we divide the number of passes through the data by the minibatch size). For comparison, we also include SGD with a constant stepsize (purple, stars), chosen in hindsight to optimize performance. Figure 2 also shows the possible speedup in terms of work done, formalized in Theorem 2. Notice that up to a certain threshold, we do not need any more work to achieve the same accuracy (the red straight line is the ideal speedup; the blue curved line is what mS2GD achieves).
^1 Available at http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/.
References
 [1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
 [2] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pages 1647–1655, 2011.
 [3] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. JMLR, 13(1):165–202, 2012.

 [4] Olivier Fercoq, Zheng Qu, Peter Richtárik, and Martin Takáč. Fast distributed coordinate descent for non-strongly convex losses. In IEEE Workshop on Machine Learning for Signal Processing, 2014.
 [5] Olivier Fercoq and Peter Richtárik. Accelerated, parallel and proximal coordinate descent. arXiv:1312.5799, 2013.
 [6] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. NIPS, 2014.
 [7] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS, pages 315–323, 2013.
 [8] Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.
 [9] Jakub Mareček, Peter Richtárik, and Martin Takáč. Distributed block coordinate descent for minimizing partially separable functions. arXiv:1406.0238, 2014.
 [10] Ion Necoara and Dragos Clipici. Distributed coordinate descent methods for composite minimization. arXiv:1312.5302, 2013.
 [11] Ion Necoara and Andrei Patrascu. A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comp. Optimization and Applications, 57(2):307–337, 2014.
 [12] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optimization, 19(4):1574–1609, 2009.
 [13] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optimization, 22:341–362, 2012.
 [14] Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013.
 [15] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
 [16] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
 [17] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming (Series A and B, Special Issue on Optimization and Machine Learning), pages 3–30, 2011.
 [18] Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In NIPS, pages 378–385, 2013.
 [19] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14(1):567–599, 2013.
 [20] Martin Takáč, Avleen Singh Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML, 2013.
 [21] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. arXiv:1403.4699, 2014.
 [22] Tong Zhang. Solving large scale linear prediction using stochastic gradient descent algorithms. In ICML, 2004.
 [23] Peilin Zhao and Tong Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv:1405.3080, 2014.