On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants

06/23/2015 ∙ by Sashank J. Reddi, et al. ∙ MIT ∙ Carnegie Mellon University

We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through development of algorithms like SAG, SVRG, SAGA. These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithms---a crucial requirement for modern large-scale applications---have not been studied. We bridge this gap by presenting a unifying framework for many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct. Our method achieves near linear speedup in sparse settings common to machine learning. We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.


1 Introduction

There has been a steep rise in recent work [6, 10, 25, 29, 11, 12, 27, 7, 9] on “variance reduced” stochastic gradient algorithms for convex problems of the finite-sum form:

min_{x ∈ ℝ^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).   (1.1)

Under strong convexity assumptions, such variance reduced (VR) stochastic algorithms attain better convergence rates (in expectation) than stochastic gradient descent (SGD) [24, 18], both in theory and practice. (Though we should note that SGD also applies to the harder stochastic optimization problem min_x E_ξ[f(x; ξ)], which need not be a finite sum.) The key property of these VR algorithms is that, by exploiting problem structure and by making suitable space-time tradeoffs, they reduce the variance incurred due to stochastic gradients. This variance reduction has powerful consequences: it helps VR stochastic methods attain linear convergence rates, and thereby circumvents the slowdowns that usually afflict SGD.
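To make the variance-reduction idea concrete, here is a minimal Python sketch contrasting the plain SGD estimate with an SVRG/SAGA-style variance-reduced estimate; grad_fi is an assumed gradient oracle for the components f_i, and the sketch is illustrative rather than part of any implementation discussed in this paper:

def sgd_estimate(grad_fi, x, i):
    # Plain SGD estimate: unbiased, but its variance does not vanish
    # even when x reaches the optimum.
    return grad_fi(x, i)

def vr_estimate(grad_fi, x, i, alpha_i, avg_grad_at_alphas):
    # Variance-reduced estimate: the correction has zero mean (so the estimate
    # stays unbiased), while its variance shrinks as x and the stored anchor
    # points alpha_i approach the optimum.
    return grad_fi(x, i) - grad_fi(alpha_i, i) + avg_grad_at_alphas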

Although these advances have great value in general, for large-scale problems we still require parallel or distributed processing. And in this setting, asynchronous variants of SGD remain indispensable [21, 30, 2, 28, 8, 13]. Therefore, a key question is how to extend the synchronous finite-sum VR algorithms to asynchronous parallel and distributed settings.

We answer one part of this question by developing new asynchronous parallel stochastic gradient methods that provably converge at a linear rate for smooth strongly convex finite-sum problems. Our methods are inspired by the influential Svrg [10], S2gd [12], Sag [25] and Saga [6] family of algorithms. We list our contributions more precisely below.

Contributions. Our paper makes two core contributions: (i) a formal general framework for variance reduced stochastic methods based on discussions in [6]; and (ii) asynchronous parallel VR algorithms within this framework. Our general framework presents a formal unifying view of several VR methods (e.g., it includes SAGA and SVRG as special cases) while expressing key algorithmic and practical tradeoffs concisely. Thus, it yields a broader understanding of VR methods, which helps us obtain asynchronous parallel variants of VR methods. Under sparse-data settings common to machine learning problems, our parallel algorithms attain speedups that scale near linearly with the number of processors.

As a concrete illustration, we present a specialization to an asynchronous Svrg-like method. We compare this specialization with non-variance reduced asynchronous SGD methods, and observe strong empirical speedups that agree with the theory.

Related work. As already mentioned, our work is closest to (and generalizes) Sag [25], Saga [6], Svrg [10] and S2gd [12], which are primal methods. Also closely related are dual methods such as sdca [27] and Finito [7], and in its convex incarnation Miso [16]; a more precise relation between these dual methods and VR stochastic methods is described in Defazio’s thesis [5]. By their algorithmic structure, these VR methods trace back to classical non-stochastic incremental gradient algorithms [4], but by now it is well-recognized that randomization helps obtain much sharper convergence results (in expectation). Proximal [29] and accelerated VR methods have also been proposed [26, 20]; we leave a study of such variants of our framework as future work. Finally, there is recent work on lower-bounds for finite-sum problems [1].

Within asynchronous SGD algorithms, both parallel [21] and distributed [17, 2] variants are known. In this paper, we focus our attention on the parallel setting. A different line of methods is that of (primal) coordinate descent methods, and their parallel and distributed variants [19, 23, 14, 15, 22]. Our asynchronous methods share some structural assumptions with these methods. Finally, the recent work [11] generalizes S2GD to the mini-batch setting, thereby also permitting parallel processing, albeit with more synchronization and allowing only small mini-batches.

2 A General Framework for VR Stochastic Methods

We focus on instances of (1.1) where each f_i has an L-Lipschitz gradient, so that ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖, and where the function f is λ-strongly convex, i.e., for all x, y ∈ ℝ^d,

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (λ/2) ‖x − y‖².   (2.1)

While our analysis focuses on strongly convex functions, we can extend it to just smooth convex functions along the lines of [6, 29].

Inspired by the discussion on a general view of variance reduction techniques in [6], we now describe a formal general framework for variance reduction in stochastic gradient descent. We denote the collection of functions that make up f in (1.1) by {f_1, …, f_n}. For our algorithm, we maintain an additional parameter α_i^t ∈ ℝ^d for each f_i. We use A^t to denote {α_1^t, …, α_n^t}. The general iterative framework for updating the parameters is presented as Algorithm 1. Observe that the algorithm is still abstract, since it does not specify the subroutine ScheduleUpdate. This subroutine determines the crucial update mechanism of A^t (and thereby of the α_i^t). As we will see, different schedules give rise to different fast first-order methods proposed in the literature. The part of the update based on A^t is the key to these approaches and is responsible for the variance reduction.

Data: x^0 ∈ ℝ^d, α_i^0 = x^0 for all i ∈ [n], step size η > 0
Randomly pick i_t ∈ {1, …, n} for t ∈ {0, …, T};
for t = 0 to T do
       Update iterate as x^{t+1} ← x^t − η( ∇f_{i_t}(x^t) − ∇f_{i_t}(α_{i_t}^t) + (1/n) Σ_i ∇f_i(α_i^t) );
       A^{t+1} = ScheduleUpdate({x^j}_{j=0}^{t+1}, A^t, t, i_t);
end for
return x^{T+1}
ALGORITHM 1 Generic Stochastic Variance Reduction Algorithm
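As a concrete rendering of Algorithm 1, the following minimal Python sketch illustrates the framework; it is not the implementation used in our experiments, grad_fi(x, i) is an assumed oracle for ∇f_i(x), and for simplicity all α_i are stored explicitly (Svrg in practice only needs the epoch snapshot). The two example schedules mirror the Svrg and Saga updates discussed next:

import numpy as np

def generic_vr(grad_fi, n, x0, eta, T, schedule_update, seed=0):
    # Sketch of Algorithm 1: variance-reduced SGD with a pluggable ScheduleUpdate.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    alphas = np.tile(x, (n, 1))                                   # alpha_i^0 = x^0
    avg_grad = np.mean([grad_fi(a, i) for i, a in enumerate(alphas)], axis=0)
    for t in range(T):
        i = int(rng.integers(n))
        v = grad_fi(x, i) - grad_fi(alphas[i], i) + avg_grad      # VR update direction
        x = x - eta * v
        alphas, avg_grad = schedule_update(x, alphas, avg_grad, t, i, grad_fi)
    return x

def svrg_schedule(m):
    # Refresh every alpha_i (and the stored average gradient) once every m steps.
    def update(x, alphas, avg_grad, t, i, grad_fi):
        if (t + 1) % m == 0:
            alphas = np.tile(x, (alphas.shape[0], 1))
            avg_grad = np.mean([grad_fi(a, j) for j, a in enumerate(alphas)], axis=0)
        return alphas, avg_grad
    return update

def saga_schedule():
    # Refresh only alpha_{i_t} at every step, keeping the average gradient in sync.
    # For clarity we recompute gradients at the stored points; real implementations cache them.
    def update(x, alphas, avg_grad, t, i, grad_fi):
        n = alphas.shape[0]
        avg_grad = avg_grad + (grad_fi(x, i) - grad_fi(alphas[i], i)) / n
        alphas = alphas.copy()
        alphas[i] = x
        return alphas, avg_grad
    return update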

Next, we provide different instantiations of the framework and construct a new algorithm derived from it. In particular, we consider the incremental methods Sag [25], Svrg [10] and Saga [6], and classic gradient descent (GradientDescent), to demonstrate our framework.

Figure 1 shows the schedules for the aforementioned algorithms. In the case of Svrg, ScheduleUpdate is triggered every m iterations (here m denotes precisely the number of inner iterations used in [10]); so A^t remains unchanged for m iterations, and all α_i are updated to the current iterate at the m-th iteration. For Saga, unlike Svrg, A^t changes at every iteration. This change is only to a single element of A^t, and is determined by the index i_t (the function chosen at iteration t). The update of Sag is similar to Saga insofar as only one of the α_i is updated at each iteration. However, the update for x^{t+1} is based on (1/n) Σ_i ∇f_i(α_i^{t+1}) rather than ∇f_{i_t}(x^t) − ∇f_{i_t}(α_{i_t}^t) + (1/n) Σ_i ∇f_i(α_i^t). This results in a biased estimate of the gradient, unlike Svrg and Saga. Finally, the schedule for gradient descent is similar to Sag, except that all the α_i's are updated at each iteration. Due to the full update we end up with the exact gradient at each iteration. This discussion highlights how the scheduler determines the resulting gradient method.

To motivate the design of another schedule, let us consider the computational and storage costs of each of these algorithms. For Svrg, since we update A^t only after every m iterations, it is enough to store a full gradient, and hence the storage cost is O(d). However, the running time is O(d) at each iteration and O(nd) at the end of each epoch (for calculating the full gradient at the end of each epoch). In contrast, both Sag and Saga have high storage costs of O(nd) and a running time of O(d) per iteration. Finally, GradientDescent has a low storage cost since it needs to store the gradient at O(d) cost, but a very high computational cost of O(nd) at each iteration.

Svrg has an additional computational overhead at the end of each epoch due to the calculation of the whole gradient. This is avoided in Sag and Saga at the cost of additional storage. When n is very large, the additional computational overhead of Svrg, amortized over all the iterations, is small. However, as we will later see, this comes at the expense of slower convergence to the optimal solution. The tradeoffs between the epoch size m, additional storage, frequency of updates, and convergence to the optimal solution are still not completely resolved.

SVRG: for i = 1 to n do
       α_i^{t+1} = 𝟙(m | t) x^t + 𝟙(m ∤ t) α_i^t ;
end for
return A^{t+1} = {α_i^{t+1}}_{i=1}^n
SAGA: for i = 1 to n do
       α_i^{t+1} = 𝟙(i_t = i) x^t + 𝟙(i_t ≠ i) α_i^t ;
end for
return A^{t+1} = {α_i^{t+1}}_{i=1}^n
SAG: for i = 1 to n do
       α_i^{t+1} = 𝟙(i_t = i) x^t + 𝟙(i_t ≠ i) α_i^t ;
end for
return A^{t+1} = {α_i^{t+1}}_{i=1}^n
GD:  for i = 1 to n do
       α_i^{t+1} = x^t ;
end for
return A^{t+1} = {α_i^{t+1}}_{i=1}^n
Figure 1: ScheduleUpdate function for Svrg (top left), Saga (top right), Sag (bottom left) and GradientDescent (bottom right). While Svrg is epoch-based, the rest of the algorithms perform updates at each iteration. Here m | t denotes that m divides t.

A straightforward approach to designing a new scheduler is to combine the schedules of the above algorithms. This allows us to trade off between the various aforementioned parameters of interest. We call this schedule hybrid stochastic average gradient (Hsag). Here, we use the schedules of Svrg and Saga to develop Hsag; however, in general, the schedules of any of these algorithms can be combined to obtain a hybrid algorithm. Consider some S ⊆ [n], the set of indices that follow the Saga schedule. We assume that the rest of the indices follow an Svrg-like schedule with schedule frequency s_i for all i ∉ S. Figure 2 shows the corresponding update schedule of Hsag. If S = [n], then Hsag is equivalent to Saga, while at the other extreme, for S = ∅ and s_i = m for all i, it corresponds to Svrg. Hsag exhibits interesting storage, computational, and convergence trade-offs that depend on S. In general, while a large cardinality of S likely incurs high storage costs, the computational cost per iteration is relatively low. On the other hand, when the cardinality of S is small and the s_i's are large, storage costs are low but convergence typically slows down.

Hsag:
for i = 1 to n do
       α_i^{t+1} = 𝟙(i ∈ S)[ 𝟙(i_t = i) x^t + 𝟙(i_t ≠ i) α_i^t ] + 𝟙(i ∉ S)[ 𝟙(s_i | t) x^t + 𝟙(s_i ∤ t) α_i^t ] ;
end for
return A^{t+1} = {α_i^{t+1}}_{i=1}^n
Figure 2: ScheduleUpdate for Hsag. This algorithm assumes access to some index set S and the schedule frequency vector {s_i}. Recall that s_i | t denotes that s_i divides t.
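For concreteness, the Hsag schedule of Figure 2 can be expressed as a ScheduleUpdate for the Python sketch given after Algorithm 1; this is an illustrative sketch only, and a practical implementation would touch only the indices actually due for a refresh rather than looping over all n:

def hsag_schedule(S, s):
    # S: indices following the Saga-style update (refresh alpha_{i_t} only).
    # s: length-n sequence of Svrg-style frequencies (entries for j in S are ignored).
    S = set(S)
    def update(x, alphas, avg_grad, t, i, grad_fi):
        n = alphas.shape[0]
        alphas = alphas.copy()
        for j in range(n):
            refresh = (j == i) if j in S else ((t + 1) % s[j] == 0)
            if refresh:
                avg_grad = avg_grad + (grad_fi(x, j) - grad_fi(alphas[j], j)) / n
                alphas[j] = x
        return alphas, avg_grad
    return update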

Before concluding our discussion on the general framework, we would like to draw the reader’s attention to the advantages of studying Algorithm 1. First, note that Algorithm 1 provides a unifying framework for many incremental/stochastic gradient methods proposed in the literature. Second, and more importantly, it provides a generic platform for analyzing this class of algorithms. As we will see in Section 3, this helps us develop and analyze asynchronous versions for different finite-sum algorithms under a common umbrella. Finally, it provides a mechanism to derive new algorithms by designing more sophisticated schedules; as noted above, one such construction gives rise to Hsag.

2.1 Convergence Analysis

In this section, we provide convergence analysis for Algorithm 1 with Hsag schedules. As observed earlier, Svrg and Saga are special cases of this setup. Our analysis assumes unbiasedness of the gradient estimates at each iteration, so it does not encompass Sag. For ease of exposition, we assume that s_i = m for all i ∉ S. Since Hsag is epoch-based, our analysis focuses on the iterates obtained after each epoch. Similar to [10] (see Option II of Svrg in [10]), our analysis is for the case where the iterate at the end of the k-th epoch is replaced with an element chosen randomly from that epoch's iterates with probabilities {p_i}_{i=1}^m. For brevity, we use x̃^k to denote the iterate chosen at the k-th epoch. We also need the following quantity for our analysis:

Theorem 1.

For any positive parameters, step size η, and epoch size m, we define the following quantities:

Suppose the probabilities {p_i} and the parameters above, together with the step size η and epoch size m, are chosen such that the following conditions are satisfied:

Then, for the iterates of Algorithm 1 under the Hsag schedule, we have

As a corollary, we immediately obtain an expected linear rate of convergence for Hsag.

Corollary 1.

Under the conditions specified in Theorem 1 and with the parameter choice above, we have

We emphasize that there exist values of the parameters for which the conditions in Theorem 1 and Corollary 1 are easily satisfied. For instance, with a suitable choice of these parameters, the conditions in Theorem 1 are satisfied for sufficiently large m. Additionally, in the high condition number regime of L/λ ∼ n, we can obtain a constant convergence factor per epoch with epoch size m = O(L/λ) (similar to [10, 6]). This leads to a computational complexity of O((n + L/λ) log(1/ε)) for Hsag to achieve ε-accuracy in the objective function, as opposed to O(n (L/λ) log(1/ε)) for the batch gradient descent method. Please refer to the appendix for more details on the parameters in Theorem 1.

3 Asynchronous Stochastic Variance Reduction

We are now ready to present asynchronous versions of the algorithms captured by our general framework. We first describe our setup before delving into the details of these algorithms. Our model of computation is similar to the ones used in Hogwild! [21] and AsySCD [14]. We assume a multicore architecture where each core makes stochastic gradient updates to a centrally stored vector in an asynchronous manner. There are four key components in our asynchronous algorithm; these are briefly described below.

  1. Read: Read the iterate x and compute the gradient ∇f_{i_t}(x) for a randomly chosen i_t.

  2. Read schedule iterate: Read the schedule iterate A and compute the gradients required for the update in Algorithm 1.

  3. Update: Update the iterate x with the variance-reduced incremental update of Algorithm 1.

  4. Schedule Update: Run a ScheduleUpdate step for updating A.

Each processor repeatedly runs these procedures concurrently, without any synchronization. Hence, x may change in between Steps 1 and 3. Similarly, A may change in between Steps 2 and 4. In fact, the x-iterate and the A-iterate used in a given update can correspond to different time-stamps. We maintain a global counter t to track the number of updates successfully executed. The x-iterate and A-iterate used for evaluating the update at the t-th iteration may be stale; we assume that the delay between the time of evaluation and the time of the update is bounded by a non-negative integer τ. The bound on the staleness captures the degree of parallelism in the method: such parameters are typical in asynchronous systems (see e.g., [3, 14]). Furthermore, we also assume that the system is synchronized after every epoch, i.e., the staleness does not carry over across epoch boundaries. We would like to emphasize that this assumption is not strong, since such a synchronization needs to be done only once per epoch.

For the purpose of our analysis, we assume a consistent read model. In particular, our analysis assumes that the vector x used for the evaluation of gradients is a valid iterate that existed at some point in time. Such an assumption typically amounts to using locks in practice. This problem can be avoided by using random coordinate updates as in [21] (see Section 4 of [21]), but such a procedure is computationally wasteful in practice. We leave the analysis of the inconsistent read model as future work. Nonetheless, we report results for both locked and lock-free implementations (see Section 4).
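To illustrate how the four components above fit together under an Svrg schedule, the following is a deliberately simplified lock-free Python sketch of one epoch; it is not the C++ implementation used in our experiments (which relies on atomic compare-and-swap), and Python's global interpreter lock prevents true parallelism, so the sketch only conveys the control flow and the staleness issue discussed above:

import threading
import numpy as np

def async_svrg_epoch(grad_fi, n, x, x_snap, eta, steps, num_threads, seed=0):
    # One Svrg epoch run by lock-free workers on the shared iterate x (updated in place).
    # x_snap is the schedule iterate (the epoch snapshot) and stays fixed for the epoch.
    mu = np.mean([grad_fi(x_snap, i) for i in range(n)], axis=0)  # full gradient at snapshot

    def worker(tid, x):
        rng = np.random.default_rng(seed + tid)
        for _ in range(steps):
            i = int(rng.integers(n))                  # Step 1: pick a component
            x_read = x.copy()                         # read a (possibly stale) iterate;
            # NOTE: this copy is not atomic, so reads may be inconsistent;
            # the locked variant in Section 4 guards this step.
            v = grad_fi(x_read, i) - grad_fi(x_snap, i) + mu   # Steps 1-2: VR direction
            x -= eta * v                              # Step 3: unsynchronized in-place write

    threads = [threading.Thread(target=worker, args=(t, x)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # Step 4 (ScheduleUpdate) and the once-per-epoch synchronization assumed in the
    # analysis are performed by the caller, e.g. x_snap[:] = x before the next epoch.
    return x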

3.1 Convergence Analysis

The key ingredients for the success of asynchronous algorithms for multicore stochastic gradient descent are sparsity and "disjointness" of the data matrix [21]. More formally, suppose f_i depends only on x_{e_i}, where e_i ⊆ [d]; i.e., f_i acts only on the components of x indexed by the set e_i. Let ‖x‖_i denote the norm of x restricted to the coordinates in e_i; then, the convergence depends on Δ, the smallest constant such that E_i[‖x‖_i²] ≤ Δ‖x‖². Intuitively, Δ denotes the average frequency with which a feature appears in the data matrix. We are interested in situations where Δ ≪ 1. As a warm up, let us first discuss the convergence analysis for asynchronous Svrg. The general case is similar, but much more involved; hence, it is instructive to first go through the analysis of asynchronous Svrg.

Theorem 2.

Suppose the step size η and the epoch size m are chosen such that the following condition holds:

Then, for the iterates of an asynchronous variant of Algorithm 1 with the Svrg schedule and probabilities p_i = 1/m for all i, we have

The bound obtained in Theorem 2 is useful when τ is small. To see this, as earlier, consider the indicative case where L/λ ∼ n. The synchronous version of Svrg obtains a linear convergence rate for a step size of order 1/L and an epoch size m of order L/λ. For the asynchronous variant of Svrg, by normalizing the step size by a factor that depends on τ and Δ, a simple calculation shows that we obtain a similar rate, up to a constant. This relies on the fact that Δ ≤ 1. Suppose τ ≤ 1/√Δ. Then we can achieve nearly the same guarantees as the synchronous version, but τ times faster, since we are running the algorithm asynchronously. For example, consider the sparse setting where Δ = O(1/n); then it is possible to get near linear speedup when τ = O(√n). More generally, we can obtain a theoretical speedup of up to 1/√Δ.
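Under the definition of Δ above (the smallest constant with E_i[‖x‖_i²] ≤ Δ‖x‖²), Δ equals the largest fraction of examples in which any single feature is nonzero; the following small numpy helper, included purely for illustration, estimates it from a dense data matrix:

import numpy as np

def sparsity_constant(Z):
    # Z: (n, d) data matrix whose i-th row has support e_i.
    # Returns the smallest Delta such that E_i[||x||_{e_i}^2] <= Delta * ||x||^2,
    # i.e. the maximum fraction of examples in which any single feature appears.
    return float((Z != 0).mean(axis=0).max())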

We finally provide the convergence result for the asynchronous algorithm in the general case. The proof is complicated by the fact that the set A, unlike in Svrg, changes during the epoch. The key idea is that only a single element of A changes at each iteration, and that it can only change to one of the iterates within the epoch. This control provides a handle on the error incurred due to the staleness. Due to space constraints, the proof is relegated to the appendix.

Theorem 3.

For any positive parameters, step size η, and epoch size m, we define the following quantities:

Suppose the probabilities {p_i} and the parameters above, together with the step size η and epoch size m, are chosen such that the following conditions are satisfied:

Then, for the iterates of the asynchronous variant of Algorithm 1 with the Hsag schedule, we have

Corollary 2.

Under the conditions specified in Theorem 3 and with the parameter choice above, we have

By using a step size normalized as in Theorem 2 and parameters similar to the ones specified after Theorem 1, we can show speedups similar to the ones obtained in Theorem 2. Please refer to the appendix for more details on the parameters in Theorem 3.

Before ending our discussion of the theoretical analysis, we would like to highlight an important point. Our emphasis throughout the paper is on generality. While the results are presented here in full generality, one can obtain stronger results in specific cases. For example, in the case of Saga, one can obtain per-iteration convergence guarantees (see [6]) rather than the per-epoch guarantees presented in this paper. Also, Saga can be analyzed without any additional synchronization per epoch. However, there is no qualitative difference between these guarantees when accumulated over an epoch. Furthermore, in this case, our analysis for both the synchronous and asynchronous settings can be easily modified to obtain convergence properties similar to those in [6].

4 Experiments

We present our empirical results in this section. For our experiments, we study the problem of binary classification via ℓ2-regularized logistic regression. More formally, we are interested in the following optimization problem:

min_x (1/n) Σ_{i=1}^n log(1 + exp(−y_i ⟨x, z_i⟩)) + λ‖x‖²,   (4.1)

where z_i ∈ ℝ^d and y_i ∈ {−1, 1} is the corresponding label for each z_i. In all our experiments, we set λ = 1/n. Note that such a choice leads to a high condition number.
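For reference, the following numpy sketch spells out the objective in (4.1) and the component gradients consumed by the algorithms above; it is an illustration rather than the experimental code:

import numpy as np

def objective(x, Z, y, lam):
    # f(x) = (1/n) * sum_i log(1 + exp(-y_i <x, z_i>)) + lam * ||x||^2
    margins = y * (Z @ x)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(x, x)

def grad_fi(x, i, Z, y, lam):
    # Gradient of f_i(x) = log(1 + exp(-y_i <x, z_i>)) + lam * ||x||^2.
    zi, yi = Z[i], y[i]
    sigma = 1.0 / (1.0 + np.exp(yi * np.dot(x, zi)))   # derivative factor of the logistic loss
    return -yi * sigma * zi + 2.0 * lam * x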

A careful implementation of Svrg is required for sparse gradients, since the implementation as stated in Algorithm 1 leads to dense updates at each iteration. For an efficient implementation, a scheme like the 'just-in-time' update scheme suggested in [25] is required. Due to lack of space, we provide the implementation details in the appendix.

Figure 3: ℓ2-regularized logistic regression. Speedup curves for Lock-Free Svrg and Locked Svrg on the rcv1 (left), real-sim (left center), news20 (right center) and url (right) datasets. We report the speedup achieved by increasing the number of threads.

We evaluate the following algorithms for our experiments:


  • Lock-Free Svrg: This is the lock-free asynchronous variant of Algorithm 1 using the Svrg schedule; all threads can read and update the parameters without any synchronization. Parameter updates are performed through atomic compare-and-swap instructions [21]. A constant step size that gives the best convergence is chosen for each dataset.

  • Locked Svrg: This is the locked version of the asynchronous variant of Algorithm 1 using the Svrg schedule. In particular, we use a concurrent-read, exclusive-write locking model, where all threads can read the parameters but only one thread can update the parameters at a given time. The step size is chosen as in Lock-Free Svrg.

  • Lock-Free Sgd: This is the lock-free asynchronous variant of the Sgd algorithm (see [21]). We compare two different versions of this algorithm: (i) Sgd with a constant step size (referred to as CSgd), and (ii) Sgd with a decaying step size (referred to as DSgd), where two constants specify the scale and speed of the decay. For each of these versions, the step size is tuned for each dataset to give the best convergence progress.

All the algorithms were implemented in C++. All experiments were conducted on a Google Compute Engine n1-highcpu-32 machine with 32 processors and 28.8 GB RAM. We run our experiments on datasets from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). Similar to [29], we normalize each example in the dataset so that ‖z_i‖ = 1 for all i. Such a normalization leads to an upper bound on the Lipschitz constant of the gradient of each f_i. The epoch size m is chosen as recommended in [10] in all our experiments. In the first experiment, we compare the speedup achieved by our asynchronous algorithm. To this end, for each dataset we first measure the time required for the algorithm to reach a fixed target accuracy in the objective. The speedup with a given number of threads is defined as the ratio of the runtime with a single thread to the runtime with that number of threads. Results in Figure 3 show the speedup on various datasets. As seen in the figure, we achieve significant speedups for all the datasets. Not surprisingly, the speedup achieved by Lock-Free Svrg is much higher than the one obtained with locking. Furthermore, the lowest speedup is achieved on the rcv1 dataset. Similar speedup behavior was reported for this dataset in [21]. It should be noted that this dataset is not sparse and hence is a bad case for the algorithm (similar to [21]).

For the second set of experiments, we compare the performance of Lock-Free Svrg with stochastic gradient descent. In particular, we compare with the variants of stochastic gradient descent, DSgd and CSgd, described earlier in this section. It is well established that the performance of variance reduced stochastic methods is better than that of Sgd; we would like to empirically verify that such benefits carry over to the asynchronous variants of these algorithms. Figure 4 shows the performance of Lock-Free Svrg, DSgd and CSgd. Since the computational complexity of each epoch of these algorithms is different, we directly plot the objective value versus the runtime for each of these algorithms. We use 10 cores for comparing the algorithms in this experiment. As seen in the figure, Lock-Free Svrg outperforms both DSgd and CSgd. The performance gains are qualitatively similar to those reported in [10] for the synchronous versions of these algorithms. It can also be seen that DSgd, not surprisingly, outperforms CSgd in all cases. In our experiments, we observed that Lock-Free Svrg, in comparison to Sgd, is much less sensitive to the step size and more robust to an increasing number of threads.

Figure 4: ℓ2-regularized logistic regression. Training loss residual versus time for Lock-Free Svrg, DSgd and CSgd on the rcv1 (left), real-sim (left center), news20 (right center) and url (right) datasets. The experiments are parallelized over 10 cores.

5 Discussion & Future Work

In this paper, we presented a unifying framework based on [6] that captures many popular variance reduction techniques for stochastic gradient descent. We used this framework to develop a simple hybrid variance reduction method. The primary purpose of the framework, however, was to provide a common platform for analyzing various variance reduction techniques. To this end, we provided convergence analysis for the framework under certain conditions. More importantly, we proposed an asynchronous algorithm for the framework with provable convergence guarantees. The key consequence of our approach is that we obtain asynchronous variants of several algorithms like Svrg, Saga and S2gd. Our asynchronous algorithms exploit sparsity in the data to obtain near linear speedup in settings that are typically encountered in machine learning.

For future work, it would be interesting to perform an empirical comparison of various schedules. In particular, it would be worth exploring the space-time-accuracy tradeoffs of these schedules. We would also like to analyze the effect of these tradeoffs on the asynchronous variants.

References

  • [1] A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. arXiv:1410.0723, 2014.
  • [2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • [3] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
  • [4] D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.
  • [5] A. Defazio. New Optimization Methods for Machine Learning. PhD thesis, Australian National University, 2014.
  • [6] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS 27, pages 1646–1654. 2014.
  • [7] A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. arXiv:1407.2710, 2014.
  • [8] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13(1):165–202, 2012.
  • [9] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.
  • [10] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS 26, pages 315–323. 2013.
  • [11] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting. arXiv:1504.04407, 2015.
  • [12] J. Konečný and P. Richtárik. Semi-Stochastic Gradient Descent Methods. arXiv:1312.1666, 2013.
  • [13] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication Efficient Distributed Machine Learning with the Parameter Server. In NIPS 27, pages 19–27, 2014.
  • [14] J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. In ICML 2014, pages 469–477, 2014.
  • [15] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.
  • [16] J. Mairal. Optimization with first-order surrogate functions. arXiv:1305.3120, 2013.
  • [17] A. Nedić, D. P. Bertsekas, and V. S. Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8:381–407, 2001.
  • [18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • [19] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • [20] A. Nitanda. Stochastic Proximal Gradient Descent with Acceleration Techniques. In NIPS 27, pages 1574–1582, 2014.
  • [21] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS 24, pages 693–701, 2011.
  • [22] S. Reddi, A. Hefny, C. Downey, A. Dubey, and S. Sra. Large-scale randomized-coordinate descent methods with non-separable linear constraints. In UAI 31, 2015.
  • [23] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
  • [24] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
  • [25] M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing Finite Sums with the Stochastic Average Gradient. arXiv:1309.2388, 2013.
  • [26] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In NIPS 26, pages 378–385, 2013.
  • [27] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
  • [28] O. Shamir and N. Srebro. On distributed stochastic optimization and learning. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, 2014.
  • [29] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
  • [30] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In NIPS, pages 2595–2603, 2010.

Appendix A Appendix

Notation: We use D_f(x, y) to denote the Bregman divergence of a function f, defined as

D_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩.

For ease of exposition, we write E[·] for the expectation of a random variable with respect to the random indices chosen up to step t, whenever the variable depends on just these indices; this dependence will be clear from the context. We use 𝟙(·) to denote the indicator function.

We would like to clarify the definition of x̃^k here. As noted in the main text, we assume that the iterate at the end of the k-th epoch is replaced with an element chosen randomly from that epoch's iterates with probabilities {p_i}. However, whenever this iterate appears in the analysis (proofs), it represents the iterate before this replacement.

Implementation Details

Since we are interested in sparse datasets, simply using the update of Algorithm 1 as written is not efficient, as it requires updating the whole vector x at each iteration. This is due to the ℓ2-regularization term in each of the f_i's. Instead, similar to [21], we rewrite the problem in (4.1) so that the regularizer is split across the supports of the examples:

where e_i represents the non-zero components of the vector z_i. While this leads to sparse gradients at each iteration, the updates in Svrg are still dense due to the part of the update that contains the average gradient over A^t. This problem can be circumvented by using the following update scheme. First, recall that for Svrg, A^t does not change during an epoch (see Figure 1). Therefore, during the epoch we have the following relationship:

We maintain each bracketed term separately. The updates to the first term in the above equation are sparse, while those to the second term are just a simple scalar addition, since we already maintain the average gradient at the epoch snapshot. When the gradient of f_i at x^t is needed, we compute only the components of x^t required for f_i on the fly by aggregating these two terms. Hence, each step of this update procedure can be implemented in a way that respects the sparsity of the data.
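A simplified Python rendition of this two-term bookkeeping (an illustration, not the C++ code used in our experiments): during an epoch the iterate is represented as x^t = y^t − t·η·μ, where y receives only sparse corrections and μ is the fixed average gradient at the epoch snapshot, so coordinates of x are materialized only when a sparse gradient actually needs them:

import numpy as np

class LazySvrgIterate:
    # Represents x^t = y^t - t * eta * mu during one Svrg epoch, where mu is the
    # (dense) average gradient at the epoch snapshot and y receives only sparse updates.
    def __init__(self, x0, mu, eta):
        self.y = np.array(x0, dtype=float)
        self.mu = np.asarray(mu, dtype=float)
        self.eta = eta
        self.t = 0

    def read(self, idx):
        # Materialize only the requested coordinates of the current iterate x^t.
        return self.y[idx] - self.t * self.eta * self.mu[idx]

    def sparse_step(self, idx, sparse_grad_diff):
        # Apply x^{t+1} = x^t - eta * (grad_f_i(x^t) - grad_f_i(x_snap)) - eta * mu:
        # the sparse correction touches only idx, and the dense -eta*mu term is
        # absorbed into the scalar counter t.
        self.y[idx] -= self.eta * sparse_grad_diff
        self.t += 1

    def dense(self):
        # Full iterate, e.g. to refresh the snapshot at the end of the epoch.
        return self.y - self.t * self.eta * self.mu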

Proof of Theorem 1

Proof.

We expand function as where and . Let the present epoch be . We define the following:

We first observe that . This follows from the unbiasedness of the gradient at each iteration. Using this observation, we have the following:

(A.1)

The last step follows from convexity of and the unbiasedness of . We have the following relationship between and .

(A.2)

This follows from the definition of the Hsag schedule for indices in S. Substituting the above relationship into Equation (A.1), we get the following.

We describe the bounds for (defined below).

The terms and can be bounded in the following fashion:

The bound on the first term is due to the strong convexity of the function f. The first and second inequalities follow directly from Lemma 3 of [6] and a simple application of Lemma 1, respectively. The third inequality follows from the definitions above.

Substituting these bounds and in , we get

(A.3)

The second inequality follows from Lemma 2. In particular, we use the fact that and . The third inequality follows from the following for the choice of our parameters:

Applying the above recursive relationship for m iterations, we get

where

Substituting the bound on from Equation (A.3) in the above equation we get the following inequality:

We now use the fact that x̃^k is chosen randomly from the epoch's iterates with probabilities proportional to {p_i}, which gives the following consequence of the above inequality.

To obtain the above inequality, we used the strong convexity of the function f. Again, using the Bregman-divergence-based inequality (see Lemma 2)