Fast Variance Reduction Method with Stochastic Batch Size

08/07/2018, by Xuanqing Liu et al.

In this paper we study a family of variance reduction methods with randomized batch sizes: at each step, the algorithm first randomly chooses the batch size and then selects a batch of samples to conduct a variance-reduced stochastic update. We prove a linear convergence rate for this framework on composite functions, and show that the strategy achieving the optimal convergence rate per data access is to always choose a batch size of 1, which is equivalent to the SAGA algorithm. However, because of cache/disk IO effects in computer architectures, the number of data accesses does not reflect the running time: 1) random memory access is much slower than sequential access, and 2) when the data is too large to fit into memory, disk seeking takes even longer. After taking these into account, choosing a batch size of 1 is no longer optimal, so we propose a new algorithm called SAGA++ and show how to calculate the optimal average batch size theoretically. Our algorithm outperforms SAGA and other existing batched and stochastic solvers on real datasets. In addition, we conduct a more precise analysis comparing different update rules for variance reduction methods, showing that SAGA++ converges faster than SVRG in theory.


1 Introduction

In this paper, we consider the following finite-sum composite optimization problem:

    min_{x ∈ R^d}  F(x) := (1/n) Σ_{i=1}^{n} f_i(x) + g(x).        (1)

Here we assume each f_i is a μ-strongly convex, L-smooth function, and the regularization term g(x) is convex but not necessarily differentiable. In machine learning applications, n is the number of training samples, each f_i is the loss on the i-th sample (e.g., the logistic loss), and g(x) is the regularization term, which can be non-smooth (e.g., ℓ1 regularization). For large data, SGD is preferred over gradient descent and has been widely used in large-scale applications. However, since the variance of the stochastic gradient does not go to zero even at the optimal solution x*, SGD has to gradually shrink its step size to guarantee convergence, at the cost of a suboptimal rate. To speed up convergence, there is a recent line of research on algorithms with linear convergence rates based on variance reduction techniques; representative works include SAG (Schmidt et al.), SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014), and S2GD (Konecnỳ & Richtárik, 2013). Further, this framework can be accelerated via ideas similar to Nesterov's momentum method (Lin et al., 2015; Allen-Zhu, 2017).

The motivation of this paper is to study the effect of batch size in variance reduction methods. The effect of batch size in SGD (without variance reduction) has been studied in the literature (Li et al., 2014; Bengio, 2012; Keskar et al., 2016). Assuming a subset of b samples is chosen at each SGD step, the analysis in (Dekel et al., 2012) gives an error bound after T iterations that improves with b, which was later tightened in (Li et al., 2014). For SVM hinge loss, (Takác et al., 2013) also shows that the number of iterations needed to reach an ε-suboptimal solution decreases with the batch size. Since each iteration takes time proportional to b, these bounds suggest that the acceleration in convergence exactly offsets the per-iteration overhead. It is thus interesting to see whether the same conclusion also holds for variance reduction methods.

To answer this question, we study a family of variance reduction methods with randomized batch sizes. At each iteration, the algorithm first randomly selects the batch size and then chooses a batch of samples to conduct a variance-reduced stochastic update. Our main findings and contributions are as follows:

  • We prove a linear convergence rate for this family of stochastic-batch-size variance reduction algorithms. Our result covers composite minimization problems with non-smooth regularization and arbitrary distributions of batch sizes.

  • Interestingly, with this unified analysis, we show theoretically that the convergence rate is maximized when the algorithm always chooses a batch size of 1. Therefore, increasing the batch size does not help in terms of the number of data accesses.

  • However, the number of data accesses does not precisely reflect the actual running time due to the memory hierarchy and cache/disk IO effects in computer architectures—accessing a contiguous block of memory is faster than accessing disjoint locations, and disk seeking costs even more. After taking these into account, we propose the SAGA++ algorithm and show how to calculate the optimal average batch size in practice. Our algorithm outperforms existing algorithms in terms of running time.

  • In addition, we develop a more precise analysis for comparing the convergence rates of variance reduction methods, and we develop a lazy-update scheme that universally accelerates stochastic methods for solving ℓ1-regularized problems; this independently rediscovers a technique of (Konečnỳ et al., 2016).

Related Work

We discuss the related variance reduction methods in the next section. Here we describe some other related work on stochastic optimization.

Stochastic optimization has become popular due to its vast and far-reaching applications in large-scale machine learning, which is also one of the main focuses of this paper. Among these methods, stochastic gradient descent has been widely used, and its variants (Duchi et al., 2011; Kingma & Ba, 2014) are popular for training deep neural networks. There are also other examples, such as stochastic coordinate descent (Nesterov, 2012) and stochastic dual coordinate descent (Shalev-Shwartz & Zhang, 2013). At each iteration, SGD selects one sample to conduct the update, but its gradient often contains large noise. To reduce the noise or variance, mini-batch SGD has been intensively studied in the literature, including (Li et al., 2014) and the recent work on big-batch SGD (De et al., 2016). Some theoretical results were discussed in the introduction (Li et al., 2014; Dekel et al., 2012).

Although some recent works have discussed mini-batch variance reduction algorithms (Hofmann et al., 2015; Harikandeh et al., 2015), there is no clear conclusion on whether increasing the batch size helps the convergence speed. Ideally the convergence rate should depend linearly on the batch size; if that were the case, simply computing the batched gradient in parallel would yield a linear speedup. (Hofmann et al., 2015) suggests that the rate grows with the batch size in the big-data regime but is independent of it in the ill-conditioned case; this can be regarded as an asymptotic version of our result, which shows that the rate is an increasing function of the batch size but that a larger batch is less useful when the Hessian is ill-conditioned. However, with a more precise bound in terms of the batch size, we are able to show that a batch size of 1 is always optimal in terms of the number of data accesses.

As for the sampling techniques, random batch sizes appear in (Richtárik & Takáč, 2016), where the authors consider partially separable functions and apply block coordinate descent by randomly generating a set of blocks of arbitrary size. A similar idea is later exploited in (Qu et al., 2015; Csiba & Richtárik, 2015). Our work differs from these in that we take computer architecture effects into account when deciding whether to use a full gradient or a stochastic gradient to update the parameters.

2 Framework: Variance Reduction with Stochastic Batch Size

Our proposed framework is shown in Algorithm 1: at each iteration, the algorithm first randomly chooses the batch size, ranging from 1 to n, and then samples a batch accordingly. We use B_t to denote the mini-batch chosen at step t; its size b_t = |B_t| is a random variable. Denote by α_i the previous gradient evaluated on sample i (the gradient memory) and by x_t the iterate at time t. The update rule is given by:

    x_{t+1} = prox_{γ g}( x_t − γ v_t ),        (2)

where γ is the step size, prox_{γ g}(z) := argmin_u { (1/2)‖u − z‖² + γ g(u) }, and v_t is the unbiased gradient estimator:

    v_t = (1/|B_t|) Σ_{i ∈ B_t} ( ∇f_i(x_t) − α_i ) + (1/n) Σ_{j=1}^{n} α_j,        (3)

where the average ᾱ := (1/n) Σ_{j=1}^{n} α_j is stored and maintained in memory. Similar to SAGA, in general the algorithm needs to store all the vectors α_1, …, α_n in memory, but for many commonly used models it only needs to store a scalar per sample index i. For example, in GLM problems where f_i(x) = ℓ_i(a_i^T x), we have ∇f_i(x) = ℓ_i'(a_i^T x) a_i, so we only need to store the scalar ℓ_i'(a_i^T x) for each i.
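As a quick sanity check, the estimator (3) is unbiased regardless of the batch-size distribution: conditioning on x_t and the stored gradients α_1, …, α_n, each index is equally likely to appear in the sampled batch, so

    E[ v_t | x_t, α_1, …, α_n ] = (1/n) Σ_{i=1}^{n} ( ∇f_i(x_t) − α_i ) + (1/n) Σ_{j=1}^{n} α_j = (1/n) Σ_{i=1}^{n} ∇f_i(x_t).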

  Input: training samples {f_i}_{i=1}^n, initial guess x_0
  Output: the final iterate x_t
  α_i ← ∇f_i(x_0) for i = 1, …, n;  ᾱ ← (1/n) Σ_j α_j;
  for iter = 1 to MAX_ITER do
     Choose a batch size b randomly according to some distribution;
     Sample a batch B ⊆ {1, …, n} with |B| = b;
     Calculate the variance-reduced gradient v_t by (3);
     Apply the proximal update (2);
     Update the gradient memory: α_i ← ∇f_i(x_t) for all i ∈ B;
     Update ᾱ ← (1/n) Σ_j α_j;
  end for
  Return x_t;
Algorithm 1 Variance Reduction Method with Stochastic Batch Size
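Below is a minimal Python sketch of Algorithm 1, specialized to the ℓ1-regularized logistic regression problem used later in Section 4. The function and variable names (vr_stochastic_batch, batch_size_sampler, and so on) are our own illustrative choices, not the paper's, and the sketch ignores the sparsity and lazy-update optimizations of Section 3.4:

  import numpy as np

  def soft_threshold(z, t):
      # Proximal operator of t * ||.||_1, used for the l1 regularizer g.
      return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

  def vr_stochastic_batch(A, y, lam, gamma, batch_size_sampler, max_iter=1000, seed=0):
      # A: (n, d) dense feature matrix, y: labels in {-1, +1}, lam: l1 weight,
      # gamma: step size, batch_size_sampler: callable returning a batch size in [1, n].
      rng = np.random.default_rng(seed)
      n, d = A.shape
      x = np.zeros(d)
      # Gradient memory: for GLM losses only the scalar l'(a_i^T x) is stored per sample.
      scal = -y / (1.0 + np.exp(y * (A @ x)))
      abar = A.T @ scal / n                                # average of stored gradients
      for _ in range(max_iter):
          b = batch_size_sampler(rng)                      # random batch size
          B = rng.choice(n, size=b, replace=False)         # sample a batch of indices
          new_scal = -y[B] / (1.0 + np.exp(y[B] * (A[B] @ x)))
          grad_B = A[B].T @ new_scal / b                   # fresh stochastic gradient
          mem_B = A[B].T @ scal[B] / b                     # stored gradients of the batch
          v = grad_B - mem_B + abar                        # variance-reduced estimator (3)
          x = soft_threshold(x - gamma * v, gamma * lam)   # proximal update (2)
          abar += A[B].T @ (new_scal - scal[B]) / n        # keep the average consistent
          scal[B] = new_scal                               # refresh gradient memory
      return x

Different choices of batch_size_sampler recover the special cases discussed below.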

Algorithm 1 is very general and includes most existing variance reduction methods, because it places no restriction on the choice of batch size, which can be either random or fixed. We discuss the connections between this framework and others (a short code illustration of these special cases follows the list):

  • When the batch size is n with probability 1, Algorithm 1 computes the full gradient at each iteration, which is equivalent to gradient descent.

  • When the batch size is always 1, the algorithm is equivalent to SAGA (Defazio et al., 2014). At each step, SAGA uniformly chooses one sample from the n samples and then updates the iterate with the same variance-reduced gradient defined in (3).

  • SVRG (Johnson & Zhang, 2013): This method adopts two layers of iterations. In each outer iteration, SVRG calculates the full gradient (also called the gradient snapshot), and in each inner iteration it chooses one sample to update. SVRG does not update the gradient snapshot within the inner iterations, so strictly speaking it does not fit into our framework. However, our algorithm SAGA++, which is partly based on SVRG, adopts a better update rule that will be discussed later.

  • S2GD, mS2GD: These are variants of SVRG in which the number of inner iterations follows a probability distribution determined by a lower bound on the strong convexity parameter, a normalizing factor, and the step size. (Konečnỳ et al., 2016) extends S2GD to a mini-batched version.
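To make the special cases above concrete, the batch-size samplers below (reusing the hypothetical batch_size_sampler interface from the sketch after Algorithm 1; n and p are illustrative values) recover gradient descent, SAGA, and the SAGA++ strategy of Section 3.2:

  n = 1_000_000     # number of training samples (illustrative)
  p = 0.05          # probability of a full pass (illustrative)

  gd_sampler     = lambda rng: n                             # always the full gradient
  saga_sampler   = lambda rng: 1                             # always a single sample
  sagapp_sampler = lambda rng: n if rng.random() < p else 1  # SAGA++ (Section 3.2)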

3 Theoretical Analysis and New Algorithms

We discuss our theoretical results and new insights in this section. First, we prove the linear convergence rate of Algorithm 1 in Section 3.1; then, in Section 3.2, we take the cache/disk IO effect into consideration to derive the new algorithm SAGA++. We show that the SAGA-style update used in this paper is more efficient than the SVRG-style update in Section 3.3, and then discuss a new technique for lazy updates with ℓ1 regularization in our framework. Proofs are left to the appendix.

3.1 Convergence rate analysis

We assume the objective function is μ-strongly convex and L-Lipschitz smooth, and κ = L/μ denotes the condition number. We will use the following bounds in our analysis:

(4a)
(4b)

Hereafter we use ‖·‖ to denote the ℓ2 norm unless stated explicitly. To simplify notation, we write B_f(x, y) for the Bregman divergence between x and y, and B̄^t for the averaged Bregman divergence between the gradient snapshots and the optimum x*.
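For reference, the standard definitions are recalled below; the snapshot notation φ_i^t is our own shorthand for the point at which the stored gradient α_i was last evaluated, not notation fixed by the excerpt above:

    B_f(x, y) := f(x) − f(y) − ⟨ ∇f(y), x − y ⟩,        B̄^t := (1/n) Σ_{i=1}^{n} B_{f_i}( φ_i^t, x* ).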

To show the convergence of Algorithm 1, we first calculate the expected change of B̄^t after each update:

Lemma 1

For the update rule (2) and the averaged Bregman divergence B̄^t defined above, we have

Note that unlike (Hofmann et al., 2015), where the batch size is deterministic, here we generalize their result to allow a random batch size. Similarly, the progress of the distance to the optimum can be bounded by:

Lemma 2

Define the iteration progress by the distance ‖x_t − x*‖² to the optimal solution x*; then:

(5)

where is an arbitrary constant.

Combining Lemma 1 with Lemma 2, we can build a contraction on a Lyapunov function as follows.

Theorem 3

Define the Lyapunov function as a weighted combination of the iterate distance and the averaged Bregman divergence B̄^t, with a predefined weight; then the Lyapunov function contracts linearly at each step, provided the step size satisfies the following conditions:

(6)

Recall that the constants appearing above are predefined.

However, the result above is too complex to interpret directly. To better understand how the averaged batch size, the step size, and the contraction factor relate to one another, we simplify the result along different directions.

Proposition 1

(Adaptive step size) In this case we want the step size to be independent of the strong convexity parameter μ; the remaining constants are set accordingly.

This result is the same as the adaptive step size of SAGA, which trades simplicity for tightness (the step size and convergence rate are independent of the batch size, so the benefit of a larger batch is not visible). To develop a more informative result, we resort to the following proposition:

Proposition 2

(-dependent step size) If we set step size to and , is a constant, our algorithm converges linearly with a contraction factor , i.e. and .

The selection of the step size in Proposition 2 is optimal in terms of maximizing the convergence rate, as proven in the appendix. Admittedly, the resulting convergence rate and step size are loose after many inequalities, so these bounds should be regarded as worst-case guarantees. Even so, as a quick verification, we can show that our result matches the bounds of gradient descent and SAGA in the following extreme cases:

  • Gradient descent: when the batch size is always n, our bound gives the same order as the standard rate of gradient descent.

  • SAGA: with batch size 1, our bounds in both the ill-conditioned case (where the condition number is comparable to n) and the well-conditioned case match the rates in the original SAGA paper (Defazio et al., 2014).

Note that the step size, unlike in SGD, is always bounded away from zero (we do not need to decrease the step size across epochs).

Next, we find the averaged batch size that achieves the best convergence rate per data access. Based on Proposition 2, returning an ε-accurate solution requires a number of iterations governed by the contraction factor, which translates into a number of epochs that depends on the averaged batch size. We then derive the following corollary, which shows that simply increasing the batch size slows down the convergence rate per data access:

Corollary 1

(Theoretically optimal batch size) Since the effective number of data accesses per iteration is proportional to the averaged batch size, the optimal batch size should maximize the decrement of the function value per gradient calculation, which can be formulated as follows:

(7)

By taking the derivative, it is easy to see that the objective is monotone (details in the appendix), so theoretically a batch size of 1 (which corresponds to the SAGA method) is optimal.

3.2 SAGA++: Optimal batch sizes when taking cache/disk IO effect into consideration

Figure 1: (a) Solution of (8) for a fixed cache/disk IO effect coefficient; the optimal batch size is the intersection of the two lines (blue and orange), plotted over a range of condition numbers. (b) To see the relation between the optimal batch size and the condition number more clearly, we solve (8) numerically; the optimal batch size drops rapidly as the condition number grows. At the same condition number, we should use a larger average batch when the cache/disk IO effect is strong (small ratio). (c, d) Experiments on the avazu dataset, with respect to both data accesses (gradient computations) and running time.

According to Corollary 1, one should always choose a batch size of 1 in order to minimize the number of data accesses. However, a small number of data accesses may not necessarily lead to a short running time in practice—in modern computer architectures, "sequential accesses" of data stored in memory are much faster than "random accesses", because accessing memory wildly can cause frequent cache misses or disk seeks. Therefore, calculating the full gradient takes less time than calculating n random gradient components (see Table 1 for measurements). This leads to a new variance reduction method with a non-deterministic batch size selection strategy (SAGA++) that combines full gradient passes and SAGA: at each step we choose the batch size n with probability p and the batch size 1 with probability 1 − p. In the first case, SAGA++ accesses the whole dataset, which is relatively fast due to the sequential memory access pattern, while in the second case it randomly accesses one sample. By changing p we can smoothly vary the average batch size from 1 to n. Next we show how to take the cache/disk IO effect into consideration and derive the "optimal" average batch size in theory, while in the experimental part we show that the optimal average batch size can be large, depending on the problem and data.

To derive the optimal average batch size that yields the minimal running time, we assume the computer needs a certain amount of time to sequentially access n samples and a longer time to randomly access the same number of samples; a single stochastic update is then charged the corresponding per-sample share of the random-access time. We call the ratio of the sequential-access time to the random-access time the cache effect ratio.
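As a rough illustration of how one might estimate this ratio on a given machine (the helper grad_scalar and the timing methodology are our own assumptions; the paper's measurements in Table 1 instead time one full-gradient computation versus n random gradient-component computations inside its solver):

  import time
  import numpy as np

  def estimate_cache_effect_ratio(A, grad_scalar, reps=3, seed=0):
      # Time one pass over the rows of A in sequential order versus random order;
      # grad_scalar(a_i, i) stands in for a single gradient-component evaluation.
      rng = np.random.default_rng(seed)
      n = A.shape[0]

      t0 = time.perf_counter()
      for _ in range(reps):
          for i in range(n):                 # sequential access pattern
              grad_scalar(A[i], i)
      t_seq = (time.perf_counter() - t0) / reps

      t0 = time.perf_counter()
      for _ in range(reps):
          for i in rng.permutation(n):       # random access pattern
              grad_scalar(A[i], i)
      t_rand = (time.perf_counter() - t0) / reps

      return t_seq / t_rand                  # < 1 when sequential access is faster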

Corollary 2

(Optimal batch size with cache/disk IO effect) If , , then the optimal average batch size will satisfy the following equations:

(8)

Note that the quantities in (8) are interdependent, which makes a closed-form solution intractable. However, if we know the condition number and the cache effect ratio, the optimal batch size can be computed numerically. To gain more insight, we plot the relationship in Figure 1(a), which shows the connection between the best batch size and the condition number: in the well-conditioned regime we can select a larger batch size, while in the ill-conditioned case a smaller batch size is better. Furthermore, Figure 1(b) reveals how the optimal average batch size changes with the condition number and the cache effect ratio: conceptually, if the ratio is smaller (sequential accesses are much faster than random accesses), we should perform full gradient passes more frequently.

Our algorithm looks similar to SVRG—sometimes it performs a full gradient pass, while at other times it selects a single instance. However, we use a different book-keeping strategy: SVRG does not update the gradient snapshot and control variate (defined in (3)) between two outer iterations, while SAGA++ keeps updating them even when the batch size is 1. Since we always keep the latest information, the convergence speed is always better than that of SVRG; we leave the detailed discussion to Section 3.3.

3.3 One step analysis: comparing Algorithm 1 with SVRG-style update

So far, we have only discussed the convergence speed under the SAGA-style framework. In this section, we analyze why Algorithm 1 has a faster convergence rate than SVRG-style updates. This also explains why SAGA++ (which shares SVRG's data access pattern) is faster than SVRG, since the two differ only in their update rules. Here, SVRG-style means that the control variate in (3) is not updated until a new gradient snapshot is calculated, whereas in Algorithm 1 we store and update each gradient memory, as well as their sum, whenever a fresher gradient is available. Since the proposed framework in Algorithm 1 includes SAGA and SAGA++ as special cases, we call it the "SAGA-style" update hereafter.

The main advantage of SVRG-style updates is that they need less memory. However, many machine learning problems can be formulated as generalized linear models (GLMs), f_i(x) = ℓ_i(a_i^T x), so the gradient ∇f_i(x) = ℓ_i'(a_i^T x) a_i is determined by the scalar ℓ_i'(a_i^T x), and SAGA-style updates need only store this scalar instead of the full gradient vector for each sample. Therefore, for GLM problems the memory overhead of SAGA-style algorithms is simply a length-n vector.

In terms of convergence rate, the following theory indicates that SAGA-style updates better control the variance of the gradient. First, we extend (3) to a more general variance-reduced gradient of the form v_t = ∇f_{i_t}(x_t) + c_t, where c_t can be any zero-mean control variate. The update rules for SVRG, SAGA, and SAGA++ can all be written in this form, where:

(9)

where the control variates are stored in memory, m denotes the number of inner iterations inside each outer iteration, and we suppose the algorithm has just finished an outer iteration. We only consider batch size 1 here, since we want to focus on the control variate rather than the batch size. For each sample, by regarding how long ago its stored gradient was last refreshed as a random variable, we can calculate its probability distribution as follows:

(10)

To see the difference in convergence rate between these update rules, we introduce the following lemmas:

Lemma 4

If we use the distance to the optimal solution as a measure of sub-optimality, then we have:

(11)

where the expectation is taken over the choice of the sampled index, conditioned on the σ-algebra F_t at time t, and B_f denotes the Bregman divergence.

The first two terms in (11) relate to the distance between the current and optimal solutions; only the last term involves the control variate, which differs across update rules and is exactly what we are interested in. It can be further bounded as follows:

Lemma 5

For an algorithm of the form above, we can upper bound the gradient-difference term:

(12)

To see how this quantity evolves over iterations, we note that these variance reduction methods are expected to decrease the objective at each step as long as the step size is sufficiently small (but it can be kept at a constant):

Proposition 3

For a strongly convex function, consider the update rule x_{t+1} = x_t − γ v_t (we ignore the regularization term for simplicity). If we want the function value to be a supermartingale, i.e., E[F(x_{t+1}) | F_t] ≤ F(x_t), then for SGD we must shrink the step size; but for variance reduction methods, since the variance of v_t goes to zero sufficiently fast (see the appendix for details), a constant step size is enough.

Finally, we can compare the update rules listed in (9) using Proposition 3. The upper bound on the distance improvement in (11) is determined by the variation of the control variate, which is further upper bounded by (12); this bound can be seen as a weighted sum of expected function sub-optimality, and from Corollary 3 we know this quantity is expected to decrease at each iteration. Consequently, more "weight" should be put on more recently refreshed gradients; in other words, a good update rule should keep all the stochastic gradients active rather than letting them go stale for too long. Therefore, from (10) we observe that the distribution in SAGA++ is strictly better than in both SAGA and SVRG, which indicates that SAGA++ has a faster convergence rate at the same computational cost.

3.4 Lazy update for ℓ1 regularization

Figure 2: Illustration of the lazy-update technique. We count the proximal operations that have been delayed (two in this figure) and recover them all at once.

For sparse datasets and ℓ1 regularization, the stochastic gradient ∇f_i(x) has the same zero pattern as the data vector a_i. However, the average ᾱ in update rule (3) is a dense vector, which would require updating all variables at every step. To reduce the per-step time complexity back to the number of nonzeros, a "lazy update" technique was discussed in (Schmidt et al., ) for ℓ2 regularization. The main idea is that, instead of performing an immediate update on all variables, we only update the variables associated with nonzero elements of a_i. In the following, we derive the lazy update technique for ℓ1 regularization. As an illustration, Figure 2 shows a simple case where coordinate j is zero in the data vectors chosen at times t and t+1. The updates of x_j at these two steps are:

    x_j^{t+1} = prox_{γλ}( x_j^t − γ ᾱ_j ),        x_j^{t+2} = prox_{γλ}( prox_{γλ}( x_j^t − γ ᾱ_j ) − γ ᾱ_j ),

where prox_{γλ}(z) = sign(z) · max(|z| − γλ, 0) is the soft-thresholding operator and ᾱ_j stays unchanged because coordinate j is untouched in between. It remains to calculate such nested proximal operations, which can be done effectively using the following theorem:

Theorem 6

Let the delayed update consist of nested proximal-gradient steps with the same constant gradient component; then for all starting points the composition can be written in closed form in terms of a simple, piecewise-linear function.

Due to the space limit, we leave the detailed formulation of this function to the appendix. Upon finishing this paper, we found that the lazy update for ℓ1 regularization has also been discussed recently in (Konečnỳ et al., 2016). However, we still include our formal proof for the completeness of this paper.
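As a minimal sketch of the delayed update for a single coordinate (assuming an ℓ1 regularizer with weight lam and step size gamma; catch_up is a hypothetical helper name, and a practical implementation would use the closed form of Theorem 6 instead of the explicit loop):

  import numpy as np

  def catch_up(xj, abar_j, k, gamma, lam):
      # Apply k delayed prox-gradient steps to coordinate j, valid as long as
      # abar_j stayed constant, i.e. the k skipped samples all had a zero in
      # coordinate j. Theorem 6 collapses this loop into one piecewise-linear map.
      for _ in range(k):
          z = xj - gamma * abar_j                            # gradient part reduces to abar_j
          xj = np.sign(z) * max(abs(z) - gamma * lam, 0.0)   # soft-thresholding (l1 prox)
      return xj

In a full implementation one would record, for each coordinate, the iteration at which it was last updated, call catch_up when the coordinate next appears in a sampled example, and perform one final catch-up pass before returning the iterate.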

3.5 Extension: parallel computing scenario

The fact that one full-gradient pass is faster than n stochastic gradient updates is due not only to cache/IO effects; a similar idea also applies, in a more implicit way, to a variety of parallel optimization algorithms: when doing full gradient descent, it is trivial to use multiprocessing to speed up the program. In contrast, for mini-batch stochastic gradient updates with a batch size much smaller than the number of available CPU cores, many computing resources sit idle. Although many first-order methods have asynchronous versions that alleviate this problem to some degree (Recht et al., 2011; Leblond et al., 2016; Reddi et al., 2015; Hsieh et al., 2015), the inconsistent paces of the workers make these algorithms suboptimal. So if we come back to synchronous updates, and given that only the full gradient computation can be significantly accelerated, the same trade-off becomes a deciding factor in how frequently one should perform a full-batch update.
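The snippet below illustrates this point under the logistic-loss setup of Section 4; the chunking scheme and ProcessPoolExecutor are our own illustrative choices, not part of the paper (on platforms that spawn worker processes, the call should be wrapped in an if __name__ == "__main__": guard):

  import numpy as np
  from concurrent.futures import ProcessPoolExecutor

  def _chunk_gradient(args):
      # Sum of logistic-loss gradients over one contiguous chunk of samples.
      A_chunk, y_chunk, x = args
      s = -y_chunk / (1.0 + np.exp(y_chunk * (A_chunk @ x)))
      return A_chunk.T @ s

  def full_gradient_parallel(A, y, x, n_workers=4):
      # The full gradient splits into independent contiguous chunks, so every core
      # stays busy; a single-sample stochastic step offers no such parallelism.
      chunks = np.array_split(np.arange(A.shape[0]), n_workers)
      args = [(A[c], y[c], x) for c in chunks]
      with ProcessPoolExecutor(max_workers=n_workers) as ex:
          parts = ex.map(_chunk_gradient, args)
      return sum(parts) / A.shape[0]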

4 Experimental Results

Dataset   Size (GB)   #Samples      #Features     nnz ratio   Sequential (sec)   Random (sec)
kddb      5.13        19,264,097    29,890,095    9.84e-7     3.91               11.43
avazu     5.04        25,832,830    999,962       1.50e-5     4.14               9.08
criteo    26.74       45,840,617    999,999       3.90e-5     14.07              30.51
(Datasets downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/)
Table 1: Dataset statistics. The time for sequential access is measured by one computation of the full gradient, while the time for random access is measured by computing the same number of random gradient components.
Figure 3: Running time comparison on the kddb dataset with different regularization parameters λ. The results show that our SAGA++ algorithm is faster than the competitors across regularization parameters.
Figure 4: Running time comparison among different datasets (the same λ for all datasets). Meta-information can be found in Table 1.

We compare SAGA++ with SAGA (Defazio, 2014), SVRG (Johnson & Zhang, 2013), and LIBLINEAR (Fan et al., 2008) (a proximal Newton method) on the ℓ1-regularized logistic regression problem:

    min_x  (1/n) Σ_{i=1}^{n} log( 1 + exp( −y_i a_i^T x ) ) + λ ‖x‖_1,        (13)

where (a_i, y_i) are the feature-response pairs. To make a fair comparison, all algorithms are implemented on top of the LIBLINEAR code base, and we tried to optimize each algorithm. For each outer iteration of SVRG/SAGA++ we choose a number of inner iterations close to the setting used in the experiments of (Johnson & Zhang, 2013). The lazy update for ℓ1 regularization is also implemented for all the variance reduction methods. All datasets can be downloaded from the LIBSVM website.

First, we compare all the algorithms on the kddb dataset with different regularization parameters. The results in Figure 3 show that SAGA++ outperforms the other algorithms for all three choices of the parameter. Indeed, the middle value of λ is the best parameter in terms of prediction accuracy, so our comparison covers both larger and smaller values of λ.

Next, we compare the running time of all the algorithms on three datasets in Figure 4. The results show that SAGA++ is faster than all competitors on these three datasets. We conclude our experimental results with the following observations: (1) Although SAGA has faster convergence in terms of the "number of data accesses" (see Figure 1(c)), SVRG often outperforms SAGA due to its faster sequential access. Our algorithm, SAGA++, has a sequential access stage like SVRG, while using the most up-to-date gradient information in the random update stage, and thus outperforms both SVRG and SAGA in all cases. (2) The lazy update (discussed in Section 3.4) accelerates SAGA/SVRG/SAGA++ substantially; without this technique, all the variance reduction methods would be much slower than LIBLINEAR, whereas with it they outperform LIBLINEAR's implementation of the proximal Newton method.

5 Conclusions and Discussions

We study a unified framework for variance reduction methods with stochastic batch sizes and prove a linear convergence rate for strongly convex finite-sum functions with a convex, possibly non-smooth regularizer. We show that always choosing a batch size of 1 (equivalent to SAGA) leads to the best rate in terms of the number of data accesses; however, it is not optimal in terms of running time, so we develop the new SAGA++ algorithm. We demonstrate that SAGA++ outperforms SAGA and SVRG in terms of running time, both in theory and in practice.

One reason that SAGA++ outperforms other VR methods is the cache/IO effect. Although we only show in-memory optimization, when the data is too large to fit in memory and an out-of-core solver is needed, the IO overhead becomes even more significant; in that setting we would expect an even greater advantage over SVRG/SAGA. Another important reason is that SAGA++ updates its control variate more frequently, making the stochastic gradient a lower-variance estimator, so intuitively it is closer to gradient descent.

Acknowledgements

The authors acknowledge the support of NSF via IIS-1719097 and the computing resources provided by Google cloud and Nvidia.

References

  • Allen-Zhu (2017) Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.
  • Bengio (2012) Bengio, Y. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pp. 437–478. Springer, 2012.
  • Csiba & Richtárik (2015) Csiba, D. and Richtárik, P. Primal method for erm with flexible mini-batching schemes and non-convex losses. arXiv preprint arXiv:1506.02227, 2015.
  • De et al. (2016) De, S., Yadav, A., Jacobs, D., and Goldstein, T. Big batch sgd: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, 2016.
  • Defazio (2014) Defazio, A. New Optimization Methods for Machine Learning. PhD thesis, PhD thesis, Australian National University, 2014.
  • Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Dekel et al. (2012) Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
  • Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Fan et al. (2008) Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
  • Harikandeh et al. (2015) Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konečnỳ, J., and Sallinen, S. Stopwasting my gradients: Practical svrg. In Advances in Neural Information Processing Systems, pp. 2251–2259, 2015.
  • Hofmann et al. (2015) Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pp. 2305–2313, 2015.
  • Hsieh et al. (2015) Hsieh, C.-J., Yu, H.-F., and Dhillon, I. Passcode: Parallel asynchronous stochastic dual co-ordinate descent. In International Conference on Machine Learning, pp. 2370–2379, 2015.
  • Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
  • Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Konecnỳ & Richtárik (2013) Konecnỳ, J. and Richtárik, P. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2(2.1):3, 2013.
  • Konečnỳ et al. (2016) Konečnỳ, J., Liu, J., Richtárik, P., and Takáč, M. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.
  • Leblond et al. (2016) Leblond, R., Pedregosa, F., and Lacoste-Julien, S. Asaga: asynchronous parallel saga. arXiv preprint arXiv:1606.04809, 2016.
  • Li et al. (2014) Li, M., Zhang, T., Chen, Y., and Smola, A. J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 661–670. ACM, 2014.
  • Lin et al. (2015) Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.
  • Nesterov (2012) Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • Qu et al. (2015) Qu, Z., Richtárik, P., and Zhang, T. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in neural information processing systems, pp. 865–873, 2015.
  • Recht et al. (2011) Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.
  • Reddi et al. (2015) Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. J. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pp. 2647–2655, 2015.
  • Richtárik & Takáč (2016) Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.
  • Schmidt et al. Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pp. 1–30.
  • Shalev-Shwartz & Zhang (2013) Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
  • Takác et al. (2013) Takác, M., Bijral, A. S., Richtárik, P., and Srebro, N. Mini-batch primal and dual methods for svms. In ICML (3), pp. 1022–1030, 2013.
  • Xiao & Zhang (2014) Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Appendix A Appendix

A.1 Proof of Lemma 1

It is straightforward to see:

The second equality follows from the law of total expectation, where the inner expectation is taken over the index set and the outer expectation over its cardinality.

A.2 Proof of Lemma 2

The proof technique is similar to SAGA's and uses the following useful inequality (Lemma 4 in (Defazio et al., 2014)):

(A14)

First of all, by the update rule (2):

(A15)

The inequality follows from the non-expansiveness of the proximal operator. Since our stochastic gradient is unbiased, taking the expectation of the second term, applying (A14) to each sample, and averaging over all samples gives:

(A16)

Next we bound the last term in (A15):

(A17)

In the equality we use the unbiasedness property noted above; we then apply a standard norm inequality to the first term:

(A18)

Next, we bound the first and second terms again by variance decomposition; for simplicity we only take the first term as an example:

(A19)

The first step is by the RMS-AM inequality, and in the second we drop the negative term. Similarly,