Convergence of Distributed Stochastic Variance Reduced Methods without Sampling Extra Data

05/29/2019 ∙ Shicong Cen, et al. ∙ Carnegie Mellon University and Microsoft

Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to their appealing runtime complexity. When the data size is large and disjointly stored on different machines, it becomes imperative to distribute the implementation of such variance reduced methods. In this paper, we consider a general framework that directly distributes popular stochastic variance reduced methods by assigning outer loops to the parameter server and inner loops to worker machines. This framework is natural as it does not require sampling extra data and is easy to implement, but its theoretical convergence is not well understood. We obtain a unified understanding of the convergence for algorithms under this framework by measuring the smoothness of the discrepancy between the local and global loss functions. We establish the linear convergence of distributed versions of a family of stochastic variance reduced algorithms, including those using accelerated and recursive gradient updates, for minimizing strongly convex losses. Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary. Furthermore, we show that when the smoothness discrepancy between local and global loss functions is large, regularization can be used to ensure convergence. Our analysis can be further extended to handle nonsmooth and nonconvex loss functions.


1 Introduction

Empirical risk minimization arises frequently in machine learning applications, where the objective function is the average of losses computed at different data points. Due to the increasing size of data, distributed computing architectures are in great demand to meet the scalability requirements in terms of both computation power and storage space by distributing the learning task over multiple computing nodes. In addition, distributed frameworks are suitable for problems where privacy concerns prevent transmitting and storing all the data in a central location, a scenario related to the nascent field of federated learning [11]. It is, therefore, necessary to develop distributed optimization frameworks that are tailored to solving large-scale empirical risk minimization problems with desirable communication-computation trade-offs, where the data are stored disjointly over different machines.

Due to the low per-iteration cost, a popular solution is distributed stochastic gradient descent (SGD) [19], where the parameter server aggregates gradients from each worker and performs mini-batch gradient updates. However, distributed SGD is not communication-efficient and requires many communication rounds to converge, which partially diminishes the benefit of distribution. Many deterministic optimization methods have been developed to achieve communication efficiency, including but not limited to DANE [25], AIDE [20], DiSCO [34], GIANT [29], CoCoA [26], one-shot averaging [39, 35], etc.

Recent breakthroughs in developing stochastic variance reduced methods such as SAG [21], SAGA [6], SVRG [10], SDCA [23], MiG [38], Katyusha [3], Catalyst [14], SCOPE [37, 36], SARAH [17], SPIDER [7], SpiderBoost [30], and many more, make it possible to achieve fast convergence and a small per-iteration cost at the same time. Yet, distributed schemes of such variance reduced methods that are both practical and theoretically sound are much less developed.

This paper focuses on a general framework of distributed stochastic variance reduced methods presented in Alg. 1, which is natural and easy to implement. On a high level, SVRG-type algorithms contain inner loops for parameter updates via variance-reduced SGD, and outer loops for global gradient and parameter updates. Alg. 1 assigns the outer loops to the parameter server and the inner loops to worker machines. The parameter server collects gradients from the worker machines and then distributes the global gradient to each machine. Each worker machine then runs its inner loop independently in parallel using variance reduction techniques, and returns the update to the parameter server at the end. Per iteration, Alg. 1 requires two communication rounds: one communication round is used to average the parameter estimates, and the other is used to average the gradients, which is the same as distributed synchronous SGD. A distributed SVRG method under this framework has been proposed in several works under different scenarios [11, 5, 20] with great empirical success. Surprisingly, a complete theoretical understanding is still missing at large. Moreover, distributed variants using accelerated variance reduction methods have not been developed. The main analysis difficulty is that the variance-reduced gradient at each worker is no longer an unbiased gradient estimator when sampling from re-used local data.

On the other hand, several variants of distributed SVRG, e.g., [12, 24, 28], have been proposed with performance guarantees. They bypass the biased gradient estimation issue by simulating the process of i.i.d. sampling from all the data, so a random data re-allocation is needed before any batch of samples is used again. Such algorithmic steps, i.e., sampling extra data with or without replacement, can be cumbersome and difficult to implement in practice.

1.1 Contributions of This Paper

This paper provides a convergence analysis of a family of naturally distributed stochastic variance reduced methods under the framework described in Alg. 1. By using different variance reduction schemes at the worker machines, we study the distributed variants of three representative algorithms: D-SVRG [10], D-SARAH with recursive gradient updates [17, 18], and D-MiG with accelerated gradient updates [38]. The contributions of this paper are summarized below.

  • We suggest a simple and intuitive metric, called distributed smoothness, to gauge the data balancedness among workers, defined as the smoothness of the difference between the local loss function $f_k$ and the global loss function $f$ (see the display following this list). The metric is deterministic, easy to compute, and applies to arbitrary dataset splitting. We establish the linear convergence of D-SVRG, D-SARAH, and D-MiG for minimizing strongly convex losses, as long as the distributed smoothness parameter is smaller than a constant fraction of the strong convexity parameter of $f$, where the fraction may differ across algorithms.

  • Under appropriate distributed smoothness, we show that D-SVRG, D-SARAH, and D-MiG reach $\epsilon$-accuracy with time complexities that depend on the total number of data points, the number of worker machines, and the condition number of the global loss function $f$. Compared to the time complexities of centralized SVRG, SARAH, and MiG, this leads to a speedup by roughly the number of machines when the local data size is sufficiently large. Furthermore, our bounds capture the phenomenon that the convergence improves as the local loss functions become more similar to the global loss function, i.e., as the distributed smoothness parameter decreases.

  • When the local data are highly unbalanced, the distributed smoothness parameter can become large, which might lead to divergence. We suggest regularization as an effective way to handle this situation, and show that by adding larger regularization to machines that are less smooth, one can still ensure linear convergence in a regularized version of D-SVRG, called D-RSVRG, though at a slower rate of convergence.

  • More generally, the notion of distributed smoothness can also be used to establish convergence under possibly nonsmooth and nonconvex losses. Under mild conditions, we characterize the time complexity and the number of communication rounds D-SARAH needs to reach $\epsilon$-accuracy in this setting.
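Concretely, with $f_k$ denoting the local loss at worker $k$ and $f$ the global loss (formal definitions follow in Section 2), one standard way to state the distributed smoothness of worker $k$ is that the deviation $f_k - f$ has a $c_k$-Lipschitz gradient,

$$\big\| \nabla f_k(x) - \nabla f(x) - \big( \nabla f_k(y) - \nabla f(y) \big) \big\| \;\le\; c_k \, \| x - y \| \quad \text{for all } x, y,$$

and our conditions require each $c_k$ to be small relative to the strong convexity of $f$; a restricted variant that fixes one argument at the minimizer suffices for several of our results (cf. Definition 1).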

1.2 Related Work

Many algorithms have been proposed for distributed (stochastic) optimization. For conciseness, Table 1 summarizes those most relevant to the current paper. Note that previous works considering variance reduction all concern distributed SVRG and its variants. The general framework of distributed variance-reduced methods in this paper, which covers SARAH and MiG as well as various loss settings, has not been studied before.

Algorithm                                   Assumptions
DSVRG [12]                                  sampling extra data
DASVRG [12]                                 sampling extra data
Distributed accelerated gradient (D-AGD)    none
ADMM                                        none
SCOPE [37]                                  uniform regularization
pSCOPE [36]                                 good data partition
D-SVRG*                                     distributed smoothness
D-SARAH*                                    distributed smoothness
D-MiG*                                      distributed smoothness
D-RSVRG*                                    large regularization
D-RSVRG*                                    small regularization
Table 1: Summary of the proposed and existing algorithms for strongly convex losses and their key assumptions. Algorithms with an asterisk are proposed or analyzed in this paper.

The D-SVRG algorithm, presented in Alg. 2, has been empirically studied in [20, 11] without a theoretical convergence analysis. The pSCOPE algorithm [36] is also a variant of distributed SVRG, and its convergence is studied under an assumption called good data partition in [36]. The SCOPE algorithm [37] is similar to the regularized algorithm D-RSVRG under large regularization; however, our analysis is more refined: it applies different regularizations to different local workers, characterizes the amount of regularization necessary to ensure convergence in terms of the distributed smoothness of the local data, and gracefully degenerates to the unregularized case when the distributed smoothness is benign.

There have also been many recent efforts on reducing the communication cost of distributed GD/SGD by gradient quantization [1, 4, 22, 32] and gradient compression and sparsification [2, 13, 15, 31, 27]. In comparison, we communicate exact gradients; combining gradient compression schemes with distributed variance reduced stochastic gradient methods is an interesting future direction.

Paper Organization.

The rest of this paper is organized as follows. Section 2 presents the problem setup and a general framework of distributed stochastic optimization. Section 3 presents the convergence guarantees of D-SVRG, D-SARAH and D-MiG under appropriate distributed smoothness assumptions. Section 4 introduces regularization to D-SVRG to handle unbalanced data when distributed smoothness does not hold. Section 5 presents extensions to nonsmooth and nonconvex losses. Section 6 presents numerical experiments for validation. Finally, we conclude in Section 7.

2 Problem Setup

Suppose we have a data set $\{z_i\}_{i=1}^{N}$ that contains $N$ data points. In particular, we do not make any assumptions on their statistical distribution. We consider the following empirical risk minimization problem

$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{N} \sum_{i=1}^{N} \ell(x; z_i), \qquad (1)$$

where $x \in \mathbb{R}^d$ is the parameter to be optimized and $\ell(x; z)$ is the sample loss function. For brevity, we write $\ell_i(x) := \ell(x; z_i)$ throughout the paper.

In a distributed setting, where the data are distributed to $n$ machines or workers, we define a partition of the data set as $\{z_i\}_{i=1}^{N} = \mathcal{D}_1 \cup \dots \cup \mathcal{D}_n$, where $\mathcal{D}_k \cap \mathcal{D}_l = \emptyset$ for $k \neq l$. The $k$th worker, correspondingly, is in possession of the data subset $\mathcal{D}_k$, $k = 1, \dots, n$. We assume there is a parameter server (PS) that coordinates the parameter sharing among the workers. The size of the data held by the $k$th worker is $N_k = |\mathcal{D}_k|$. When the data are split equally, we have $N_k = N/n$. The original problem (1) can be rewritten as minimizing the following objective function:

$$f(x) = \frac{1}{n} \sum_{k=1}^{n} f_k(x), \qquad f_k(x) := \frac{1}{N_k} \sum_{z_i \in \mathcal{D}_k} \ell_i(x), \qquad (2)$$

where $f_k$ is the local loss function at the $k$th worker machine. (It is straightforward to state our results under unequal data splitting with proper rescaling.)
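To make the setup concrete, the following minimal numpy sketch (using a hypothetical least-squares sample loss rather than the losses in our experiments) partitions a data set equally across workers and checks that the global empirical risk in (1) coincides with the average of the local losses in (2).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n_workers = 1000, 10, 4          # total samples, dimension, workers
A, b = rng.standard_normal((N, d)), rng.standard_normal(N)
x = rng.standard_normal(d)

def sample_loss(a_i, b_i, x):
    """Least-squares sample loss (illustrative choice)."""
    return 0.5 * (a_i @ x - b_i) ** 2

# Global empirical risk f(x) = (1/N) * sum_i loss_i(x), as in (1).
f_global = np.mean([sample_loss(A[i], b[i], x) for i in range(N)])

# Equal partition into n disjoint local data sets D_1, ..., D_n.
parts = np.array_split(np.arange(N), n_workers)
f_local = [np.mean([sample_loss(A[i], b[i], x) for i in idx]) for idx in parts]

# With an equal split, f(x) = (1/n) * sum_k f_k(x), as in (2).
assert np.isclose(f_global, np.mean(f_local))
```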

Alg. 1 presents a general framework for distributed stochastic variance reduced methods, which assigns the outer loops to PS and the inner loops to local workers. By using different variance reduction schemes at the worker machines (i.e. LocalUpdate), we obtain distributed variants of different algorithms, as described in later sections.

1:  Input: initial point $\tilde{x}_0$.
2:  Initialization: Compute $\nabla f(\tilde{x}_0)$ and distribute it to all machines.
3:  for $r = 1, 2, \dots$ do
4:     for workers $k = 1, \dots, n$ in parallel do
5:        $x^{(k)} \leftarrow$ LocalUpdate$\big(\tilde{x}_{r-1}, \nabla f(\tilde{x}_{r-1})\big)$; // options to use different variance reduced methods
6:        Send $x^{(k)}$ to the PS
7:     end for
8:     PS: randomly select $\tilde{x}_r$ from all $\{x^{(k)}\}_{k=1}^{n}$ and push it to all workers;
9:     for workers $k = 1, \dots, n$ in parallel do
10:        compute $\nabla f_k(\tilde{x}_r)$ and send it to the PS;
11:     end for
12:     PS: average $\nabla f(\tilde{x}_r) = \frac{1}{n} \sum_{k=1}^{n} \nabla f_k(\tilde{x}_r)$ and push it to all workers.
13:  end for
14:  return the final reference point $\tilde{x}_r$
Algorithm 1 A general distributed framework
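For concreteness, a minimal Python sketch of the framework in Alg. 1 is given below; the representation of a worker as a pair of gradient oracles and the `local_update` callback are illustrative assumptions, and in an actual deployment the loop over workers would execute in parallel on separate machines.

```python
import numpy as np

def distributed_vr(x0, workers, local_update, rounds, rng):
    """Sketch of Alg. 1: outer loops at the PS, inner loops at the workers.

    workers      : list of (grad_fk, sample_grad) callables, one pair per worker
    local_update : variance-reduced inner loop, e.g. SVRG, SARAH, or MiG
    """
    x_ref = x0
    # Initialization: average the local full gradients to get the global gradient.
    g_ref = np.mean([grad_fk(x_ref) for grad_fk, _ in workers], axis=0)
    for _ in range(rounds):
        # Workers: run the inner loop in parallel from the common reference point.
        local_outputs = [local_update(x_ref, g_ref, worker) for worker in workers]
        # PS: randomly select one worker's output as the new reference point.
        x_ref = local_outputs[rng.integers(len(workers))]
        # Workers: compute local full gradients; PS: average and push back.
        g_ref = np.mean([grad_fk(x_ref) for grad_fk, _ in workers], axis=0)
    return x_ref
```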

Throughout, we invoke one or several of the following standard assumptions of loss functions.

Assumption 1.

The sample loss function $\ell_i$ is $L$-smooth for all $i$.

Assumption 2.

The sample loss function $\ell_i$ is convex for all $i$.

Assumption 3.

The empirical risk $f$ is $\sigma$-strongly convex.

When $f$ is strongly convex, the condition number of $f$ is defined as $\kappa = L/\sigma$. Denote the minimizer of $f$ as $x^\star$ and the corresponding optimal value as $f(x^\star)$.

As it turns out, the smoothness of the deviation $f_k - f$ between the local loss function and the global loss function plays a key role in the convergence analysis, as it measures the balancedness of the local data in a simple and intuitive manner. We refer to this as the "distributed smoothness", which is central to our analysis. In some cases, a weaker notion called restricted smoothness suffices, which is defined below.

Definition 1 (Restricted smoothness).

A differentiable function $h$ is called $c$-restricted smooth with regard to $x^\star$ if $\|\nabla h(x) - \nabla h(x^\star)\| \le c \|x - x^\star\|$ for all $x$.

The restricted smoothness, compared to standard smoothness, fixes one of the arguments to $x^\star$, and is therefore a much weaker requirement. The following assumptions quantify the distributed smoothness using either restricted smoothness or standard smoothness.

Assumption 4.a.

The deviation $f_k - f$ is $c_k$-restricted smooth with regard to $x^\star$ (cf. Definition 1) for all $k$.

Assumption 4.b.

The deviation $f_k - f$ is $c_k$-smooth for all $k$.
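For intuition, when the sample losses are quadratic, the deviation $f_k - f$ is also quadratic and its smoothness constant is simply the spectral norm of the difference between the local and global Hessians. The following illustrative numpy sketch (a hypothetical least-squares split, not part of the algorithms) estimates these constants.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, n_workers = 1200, 20, 4
A = rng.standard_normal((N, d))

# For quadratic least-squares losses, the Hessian of the empirical risk is (1/N) A^T A.
H_global = A.T @ A / N
parts = np.array_split(np.arange(N), n_workers)

# Distributed smoothness of worker k: smoothness of f_k - f,
# i.e. the spectral norm of H_k - H for quadratic losses.
c = [np.linalg.norm(A[idx].T @ A[idx] / len(idx) - H_global, ord=2) for idx in parts]
print("distributed smoothness per worker:", np.round(c, 3))
```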

3 Convergence in the Strongly Convex Case

In this section, we describe and analyze three variance-reduced routines for LocalUpdate in Alg. 1, namely SVRG [10], SARAH with recursive gradient updates [17, 18], and MiG with accelerated gradient updates [38], for the case where $f$ is strongly convex. We establish the convergence guarantee for each algorithm, respectively.

1:  Input: step size $\eta$, number of iterations $T$.
2:  Set $x_0 = \tilde{x}$, $v_0 = \nabla f(\tilde{x})$.
3:  for $t = 1, \dots, T$ do
4:     $x_t = x_{t-1} - \eta v_{t-1}$;
5:     Sample $i_t$ from $\mathcal{D}_k$ uniformly at random;
6:     SVRG: $v_t = \nabla \ell_{i_t}(x_t) - \nabla \ell_{i_t}(\tilde{x}) + \nabla f(\tilde{x})$;  SARAH: $v_t = \nabla \ell_{i_t}(x_t) - \nabla \ell_{i_t}(x_{t-1}) + v_{t-1}$
7:  end for
Algorithm 2 LocalUpdate via SVRG/SARAH
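A minimal Python sketch of Alg. 2 at a single worker is shown below, with both estimator options; the gradient oracle `sample_grad(i, x)` and the returned iterate are illustrative assumptions of this sketch.

```python
def local_update_svrg_sarah(x_ref, g_ref, sample_grad, n_local, eta, T, rng, variant="svrg"):
    """Sketch of Alg. 2 at a single worker (one inner loop).

    x_ref, g_ref : reference point and global gradient pushed by the PS
    sample_grad  : sample_grad(i, x) -> gradient of the i-th local sample loss
    """
    x_prev, v = x_ref.copy(), g_ref.copy()     # x_0 = reference point, v_0 = global gradient
    for _ in range(T):
        x = x_prev - eta * v                   # gradient step with the current estimator
        i = rng.integers(n_local)              # sample a local data point uniformly
        if variant == "svrg":
            # SVRG: anchor the correction at the reference point.
            v = sample_grad(i, x) - sample_grad(i, x_ref) + g_ref
        else:
            # SARAH: recursive estimator built from consecutive iterates.
            v = sample_grad(i, x) - sample_grad(i, x_prev) + v
        x_prev = x
    return x_prev
```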

3.1 Distributed SVRG (D-SVRG)

The LocalUpdate of D-SVRG is described in Alg. 2. Theorem 1 provides the convergence guarantee of D-SVRG as long as the distributed smoothness parameter is small enough.

Theorem 1 (D-Svrg).

Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4.a holds with a sufficiently small distributed smoothness parameter $c_k$. With proper choices of the number of inner iterations and the step size, there exists a contraction factor $\rho < 1$ such that the expected optimality gap of D-SVRG contracts by a factor of $\rho$ after every round, and the time complexity of finding an $\epsilon$-optimal solution scales logarithmically in $1/\epsilon$. If Assumption 2 does not hold, the time complexity degenerates to a worse bound (see Appendix B.3).

Theorem 1 establishes the linear convergence of the function value in expectation for D-SVRG, as long as the distributed smoothness parameter $c_k$ is sufficiently small. From the time complexity and the rate expressions, we can see that the smaller $c_k$, the faster D-SVRG converges. With the number of inner iterations set so that the rate is bounded above by a constant smaller than one, the resulting time complexity improves over that of SVRG in the centralized setting.

Remark 1.

The algorithm and theorem above correspond to Option II (with respect to how the reference point is set) specified in [10]. Under similar assumptions, we also establish the convergence of D-SVRG using Option I, which is deferred to Appendix B.4 in the supplementary material.

3.2 Distributed SARAH (D-SARAH)

The LocalUpdate of D-SARAH is also described in Alg. 2; it differs from SVRG in the update of the stochastic gradient estimator, which uses a recursive formula. Theorem 2 provides the convergence guarantee of D-SARAH.

Theorem 2 (D-Sarah).

Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4.a holds with a sufficiently small distributed smoothness parameter $c_k$. With proper choices of the number of inner iterations and the step size, there exists a contraction factor smaller than one such that the expected squared gradient norm of the D-SARAH iterates contracts by this factor after every round, and the time complexity of finding an $\epsilon$-optimal solution scales logarithmically in $1/\epsilon$.

Theorem 2 establishes the linear convergence of the gradient norm in expectation for D-SARAH, as long as the parameter $c_k$ is small enough. Similar to D-SVRG, a smaller $c_k$ leads to faster convergence of D-SARAH, and with a proper number of inner iterations the time complexity improves over that of SARAH in the centralized setting. In particular, Theorem 2 suggests that D-SARAH may allow a larger $c_k$ than D-SVRG to guarantee convergence.

3.3 Distributed MiG (D-MiG)

The LocalUpdate of D-MiG is described in Alg. 3, which is inspired by the inner loop of the MiG algorithm [38], a recently proposed accelerated variance-reduced algorithm. Theorem 3 provides the convergence guarantee of D-MiG.

Theorem 3 (D-MiG).

Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4.b holds with a sufficiently small distributed smoothness parameter $c_k$. With proper choices of the number of inner iterations, the step size, and the momentum parameter, D-MiG achieves an $\epsilon$-accurate solution with a time complexity that scales logarithmically in $1/\epsilon$.

Theorem 3 establishes the linear convergence of D-MiG under the standard smoothness of the deviation $f_k - f$, which is needed to fully harness the power of acceleration. While we do not make it explicit in the theorem statement, the time complexity of D-MiG also decreases as $c_k$ gets smaller. Furthermore, the time complexity of D-MiG is smaller than that of D-SVRG and D-SARAH when the condition number is sufficiently large.

1:  Input: step size , number of iterations , and .
2:  if  then
3:     set
4:  else
5:     set
6:  end if
7:  for  do
8:     Set
9:     Sample from uniformly at random; ;
10:     ;
11:  end for
12:  set ;
Algorithm 3 LocalUpdate via MiG
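For intuition, here is a schematic Python sketch of a MiG-style inner loop in the spirit of Alg. 3 and [38]; the coupling between the auxiliary variable and the reference point, the hyperparameters `theta` and `eta`, and the returned point are assumptions of this sketch rather than the exact specification in Alg. 3.

```python
def local_update_mig(x_ref, g_ref, sample_grad, n_local, eta, theta, T, rng):
    """Sketch of a MiG-style accelerated inner loop at a single worker."""
    x = x_ref.copy()                            # auxiliary ("momentum") variable
    for _ in range(T):
        y = theta * x + (1.0 - theta) * x_ref   # coupled query point (acceleration)
        i = rng.integers(n_local)
        # Variance-reduced gradient evaluated at the coupled point.
        v = sample_grad(i, y) - sample_grad(i, x_ref) + g_ref
        x = x - eta * v                         # update the auxiliary variable
    # Return the coupled point as the worker's output.
    return theta * x + (1.0 - theta) * x_ref
```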
Remark 2.

Theorem 3 continues to hold for regularized empirical risk minimization, where the loss function is augmented with a convex and possibly non-smooth regularizer. In this case, the update in line 10 of Alg. 3 is replaced by the corresponding proximal update.

4 Adding Regularization Helps Unbalanced Data

So far, we have established convergence when the distributed smoothness parameters are not too large. Under i.i.d. and balanced data, this requirement can be justified for a large enough local data size; see Appendix E. However, when such conditions are violated, the algorithms might diverge. In this situation, adding a regularization term can ensure convergence, at the cost of possibly slowing down the convergence rate. We consider regularizing the local gradient update of SVRG in Alg. 2 as

$$v_t = \nabla \ell_{i_t}(x_t) - \nabla \ell_{i_t}(\tilde{x}) + \nabla f(\tilde{x}) + \mu_k (x_t - \tilde{x}), \qquad (3)$$

where the last regularization term, with weight $\mu_k \ge 0$, penalizes the deviation of the current iterate from the reference point $\tilde{x}$. We have the following theorem.

Theorem 4 (Distributed Regularized SVRG (D-RSVRG)).

Suppose that Assumptions 1, 2 and 3 hold, and that Assumption 4.a holds. Suppose the regularization weights $\mu_k$ are chosen sufficiently large relative to the distributed smoothness parameters $c_k$. With proper choices of the number of inner iterations and the step size, there exists a contraction factor smaller than one such that the expected optimality gap of D-RSVRG contracts by this factor after every round, and the time complexity of finding an $\epsilon$-optimal solution scales logarithmically in $1/\epsilon$, with a rate that degrades as the regularization weights increase.

Compared with Theorem 1, Theorem 4 relaxes the requirement on the distributed smoothness parameters: by adding a larger regularization to local workers that are not distributed smooth, i.e., those with large $c_k$, one can still guarantee the convergence of D-RSVRG. However, increasing $\mu_k$ leads to a slower convergence rate: a large $\mu_k$ yields an iteration complexity similar to that of gradient descent. Compared with SCOPE [37], which requires a uniform regularization across workers, our analysis applies tailored regularization to each local worker and potentially allows much smaller regularization to guarantee convergence, since the distributed smoothness parameters $c_k$ can be much smaller than the smoothness parameter $L$.
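As an illustration, the regularized local update (3) can be obtained from the SVRG option of the sketch given after Alg. 2 by adding the proximity term; the per-worker weight `mu_k` is an assumed input, set to zero for workers that are already distributed smooth.

```python
def local_update_rsvrg(x_ref, g_ref, sample_grad, n_local, eta, T, rng, mu_k=0.0):
    """Sketch of the regularized SVRG inner loop used by D-RSVRG, cf. (3)."""
    x_prev, v = x_ref.copy(), g_ref.copy()
    for _ in range(T):
        x = x_prev - eta * v                   # gradient step with the current estimator
        i = rng.integers(n_local)
        # SVRG estimator plus a proximity term pulling the iterate toward x_ref.
        v = sample_grad(i, x) - sample_grad(i, x_ref) + g_ref + mu_k * (x - x_ref)
        x_prev = x
    return x_prev
```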

5 Convergence in the Nonconvex Case

In this section, we extend the convergence analysis of distributed SARAH to nonconvex loss functions, since SARAH-type algorithms have recently been shown to achieve near-optimal performance for nonconvex problems [30, 18, 7]. Our result is summarized in the theorem below.

Theorem 5 (D-SARAH for non-convex losses).

Suppose that Assumption 1 holds and that Assumption 4.b holds with a sufficiently small distributed smoothness parameter. With a sufficiently small step size, D-SARAH (with a slightly modified local return step in the nonconvex case) satisfies a bound guaranteeing that the expected squared gradient norm, evaluated at the iterates of the agent selected in each round for the parameter update (cf. line 8 of Alg. 1), vanishes as the number of rounds grows. The time complexity of finding an $\epsilon$-accurate solution follows with a proper choice of the number of inner iterations.

Theorem 5 suggests that D-SARAH converges as long as the step size is small enough. Furthermore, a smaller distributed smoothness parameter allows a larger step size, and hence faster convergence. To gain further insights, assuming i.i.d. data, the distributed smoothness parameter can be bounded via concentration inequalities under mild conditions [16], and consequently the iteration complexity of finding an $\epsilon$-accurate solution using D-SARAH is comparable to the best known results for centralized SARAH-type algorithms [18, 7, 30].

6 Numerical Experiments

Though our focus in this paper is on the convergence analysis, we illustrate the performance of the proposed distributed variance reduced algorithms in various settings as a proof-of-concept.

Logistic regression.

Consider $\ell_2$-regularized logistic regression, where the sample loss is defined as $\ell_i(x) = \log\big(1 + \exp(-b_i \langle a_i, x \rangle)\big) + \frac{\lambda}{2} \|x\|_2^2$, with the data $z_i = (a_i, b_i)$ and $b_i \in \{\pm 1\}$. We evaluate the performance on the gisette dataset [8], splitting the data equally across all workers. We rescale the features so that the smoothness parameter of the loss can be easily estimated. We choose three values of the regularization parameter $\lambda$ to illustrate the performance under different condition numbers. We use the optimality gap, defined as $f(x) - f(x^\star)$, to illustrate convergence.
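For concreteness, a minimal numpy sketch of the sample loss and its gradient used in this experiment is given below (array shapes and the oracle interface are illustrative assumptions).

```python
import numpy as np

def logistic_loss_grad(x, a_i, b_i, lam):
    """l2-regularized logistic loss and gradient for one sample (a_i, b_i), b_i in {-1, +1}."""
    margin = b_i * (a_i @ x)
    loss = np.log1p(np.exp(-margin)) + 0.5 * lam * (x @ x)
    grad = -b_i * a_i / (1.0 + np.exp(margin)) + lam * x
    return loss, grad
```

Since the unregularized part of the loss is $(\|a_i\|^2/4)$-smooth in $x$, rescaling the features controls the smoothness estimate, and together with $\lambda$ the condition number.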

For D-SVRG and D-SARAH, the step size is set according to the theory. For D-MiG, although the theoretical choice of the momentum parameter requires knowledge of the strong convexity parameter, we simply ignore it and use fixed values of the momentum parameter and the step size to reflect the robustness of the practical performance to these parameters. We further adopt a slightly modified aggregation rule at the PS for better empirical performance. For D-AGD, the step size and the momentum parameter are set in the standard way for smooth and strongly convex problems. Following [10], which sets the number of inner iterations proportional to the data size, we choose the number of local inner iterations to keep the total number of inner iterations the same. We note that such parameters can be further tuned to achieve a better trade-off between communication cost and computation cost in practice.

Fig. 1 illustrates the optimality gap of various algorithms with respect to the number of communication rounds with 4 local workers under different conditioning, and Fig. 2 shows the corresponding results with different numbers of local workers at a fixed condition number. The distributed variance-reduced algorithms outperform distributed AGD, and D-MiG outperforms D-SVRG and D-SARAH when the condition number is large.

Figure 1: Optimality gap with respect to the number of communication rounds with 4 agents on the gisette dataset, under different conditioning, for different algorithms.
Figure 2: Optimality gap with respect to the number of communication rounds with different numbers of agents on the gisette dataset, for different algorithms, at a fixed condition number.
Figure 3: Regularized algorithms converge despite a highly unbalanced data allocation.

Dealing with Unbalanced data.

We justify the benefit of regularization by evaluating the proposed algorithms under unbalanced data allocation. We assign 50%, 30%, 19.9%, and 0.1% of the data to four workers, respectively, using the logistic regression loss. To deal with the unbalanced data, we perform the regularized update, given in (3), on the worker with the least amount of data, and keep the updates on the rest of the workers unchanged. A similar regularized update can be conceived for D-SARAH and D-MiG, resulting in the regularized variants D-RSARAH and D-RMiG; while our theory does not cover them, we still evaluate their numerical performance. We set the regularization weight on the regularized worker according to its amount of data, and use the same number of inner iterations on all agents. For D-AGD, we set the momentum parameter to a common choice for the convex setting. Fig. 3 shows the optimality gap with respect to the number of communication rounds for all algorithms. It can be seen that the unregularized D-SVRG and D-SARAH fail to converge, while the regularized algorithms still converge, verifying the role of regularization in addressing unbalanced data. It is interesting to note that D-MiG still manages to converge even with highly unbalanced data, a phenomenon not fully covered by the current theory and worth further investigation; in this case, the regularization slightly slows down the convergence speed. It is also worth mentioning that the regularization can be imposed flexibly depending on the local data size, rather than homogeneously across all workers.

7 Conclusions

In this paper, we have developed a convergence theory for a family of distributed stochastic variance reduced methods without sampling extra data, under a mild distributed smoothness assumption that measures the discrepancy between the local and global loss functions. Convergence guarantees are obtained for distributed stochastic variance reduced methods using acceleration and recursive gradient updates, and for minimizing both strongly convex and nonconvex losses. We also suggest regularization as a means of ensuring convergence when the local data are less balanced. We believe the analysis framework is useful for studying distributed variants of other stochastic variance-reduced methods such as Katyusha [3], and proximal variants such as [33].

Acknowledgements

The work of S. Cen was partly done when visiting MSRA. The work of S. Cen and Y. Chi is supported in part by National Science Foundation under the grant CCF-1806154, Office of Naval Research under the grants N00014-18-1-2142 and N00014-19-1-2404, and Army Research Office under the grant W911NF-18-1-0303.

References

  • [1] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
  • [2] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
  • [3] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
  • [4] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 559–568, 2018.
  • [5] S. De and T. Goldstein. Efficient distributed SGD with variance reduction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 111–120. IEEE, 2016.
  • [6] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
  • [7] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.
  • [8] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545–552, 2005.
  • [9] B. Hu, S. Wright, and L. Lessard. Dissipativity theory for accelerating stochastic variance reduction: A unified analysis of SVRG and Katyusha using semidefinite programs. International Conference on Machine Learning (ICML), 2018.
  • [10] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
  • [11] J. Konečnỳ, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
  • [12] J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446, 2017.
  • [13] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  • [14] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
  • [15] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
  • [16] S. Mei, Y. Bai, A. Montanari, et al. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.
  • [17] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
  • [18] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam. Finite-sum smooth optimization with SARAH. arXiv preprint arXiv:1901.07648, 2019.
  • [19] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
  • [20] S. J. Reddi, J. Konečnỳ, P. Richtárik, B. Póczós, and A. Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
  • [21] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
  • [22] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [23] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
  • [24] O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.
  • [25] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.
  • [26] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
  • [27] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. D²: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4855–4863, 2018.
  • [28] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Conference on Learning Theory, pages 1882–1919, 2017.
  • [29] S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney. GIANT: Globally improved approximate Newton method for distributed optimization. In Advances in Neural Information Processing Systems, pages 2338–2348, 2018.
  • [30] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
  • [31] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
  • [32] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
  • [33] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
  • [34] Y. Zhang and L. Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370, 2015.
  • [35] Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.
  • [36] S. Zhao, G.-D. Zhang, M.-W. Li, and W.-J. Li. Proximal scope for distributed sparse learning. In Advances in Neural Information Processing Systems, pages 6552–6561, 2018.
  • [37] S.-Y. Zhao, R. Xiang, Y.-H. Shi, P. Gao, and W.-J. Li. SCOPE: Scalable composite optimization for learning on Spark. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [38] K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In International Conference on Machine Learning, pages 5975–5984, 2018.
  • [39] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.

Appendix A Useful Lemma

We first establish the following lemma, which will be useful in the proofs for D-SVRG and D-MiG.

Lemma 1.

When Assumptions 1 and 2 and one of the distributed smoothness assumptions (Assumption 4.a or 4.b) hold, we have

where the sample index is selected uniformly at random from the local data set $\mathcal{D}_k$.

Proof.

Given that $\ell_i$ is $L$-smooth and convex, the Bregman divergence $\ell_i(x) - \ell_i(y) - \langle \nabla \ell_i(y), x - y \rangle$ is $L$-smooth and convex as a function of $x$. When Assumptions 1 and 2 hold, we have

Averaging the inequality over the local data set gives

(4)

Assumption 4.a allows us to compare the local gradient $\nabla f_k$ and the global gradient $\nabla f$ at the reference point:

Following similar arguments, using Assumption 4.b we obtain a tighter bound. Combining with the estimate in equation (4) proves the lemma. ∎

Appendix B Proof for distributed SVRG and its regularized variant

In this section, we outline the convergence proofs of D-SVRG and D-RSVRG in various setups. We adapt the dissipativity theory of [9] to the analysis of D-SVRG, which is briefly reviewed in Section B.1.

B.1 Dissipativity Theory for First-Order Methods

Consider the following linear time-invariant (LTI) system:

$$\xi_{t+1} = A \xi_t + B w_t,$$

where $\xi_t$ is the state and $w_t$ is the input. Dissipativity theory characterizes how the input drives the internal energy stored in the states via an energy function $E(\xi) = \xi^\top P \xi$, with $P \succeq 0$, and a supply rate $S(\xi, w)$. The theory aims to build the following exponential dissipation inequality:

$$E(\xi_{t+1}) \le \rho^2 E(\xi_t) + S(\xi_t, w_t),$$

where $0 \le \rho < 1$. The inequality indicates that at least a fraction $1 - \rho^2$ of the internal energy dissipates at every iteration. With the energy function $E$ and quadratic supply rates $S_j(\xi, w) = [\xi; w]^\top X_j [\xi; w]$, we have

$$E(\xi_{t+1}) \le \rho^2 E(\xi_t) + \sum_j \lambda_j S_j(\xi_t, w_t) \qquad (5)$$

as long as there exist a positive semidefinite matrix $P$ and non-negative scalars $\lambda_j$ such that

$$\begin{bmatrix} A^\top P A - \rho^2 P & A^\top P B \\ B^\top P A & B^\top P B \end{bmatrix} - \sum_j \lambda_j X_j \preceq 0. \qquad (6)$$

In fact, by left multiplying $[\xi_t; w_t]^\top$ and right multiplying $[\xi_t; w_t]$ to (6), we recover (5).

In the following analysis, we simplify the notation by dropping the worker index and the round index when the meaning is clear, since it suffices to analyze the convergence of a specific agent during a single round. Setting the input as the variance-reduced gradient $w_t = \nabla \ell_{i_t}(x_t) - \nabla \ell_{i_t}(\tilde{x}) + \nabla f(\tilde{x})$ and the state as $\xi_t = x_t - x^\star$, we can write the distributed SVRG update as the following linear time-invariant system [9]:

$$x_{t+1} = x_t - \eta w_t,$$

or equivalently,

$$\xi_{t+1} = A \xi_t + B w_t, \qquad A = I, \qquad B = -\eta I, \qquad (7)$$

where $i_t$ is selected uniformly at random from the local data points.

Recall that the supply rate is defined as a quadratic form in the state and the input. Since the analysis depends on the iterates only through $\xi_t$ and $w_t$, we write the supply rates as functions of $(\xi_t, w_t)$.

B.2 Proof of Theorem 1

Following [9], we consider the following three supply rates:

(8)

We have the following lemma and corollary which are proved in Appendix B.6 and B.7, respectively.

Lemma 2.

Suppose that Assumptions 1, 2, 3 and 4.a hold. For the supply rates defined in (8), we have

Corollary 1.

Suppose that Assumptions 1, 2, 3 and 4.a hold. If there exist non-negative scalars $\lambda_1, \lambda_2, \lambda_3$ such that

(9)

then distributed SVRG satisfies

(10)

where the final output is selected from the inner iterates uniformly at random.

By choosing , and , (9) becomes

which clearly holds. Then (10) can be written as

As a theoretical evaluation of time complexity, when , by choosing , , a convergence rate no more than is obtained. Therefore, as long as , we have

Hence, the number of communication rounds needed to obtain an $\epsilon$-optimal solution is logarithmic in $1/\epsilon$. Note that in the distributed setting, calculating the full-batch gradient only takes time proportional to the local data size rather than the total data size, due to parallel computation, so the overall time complexity of finding an $\epsilon$-optimal solution is

where .

B.3 Convergence of D-SVRG without Assumption 2

When Assumption 2 does not hold, we can still use similar arguments as in Appendix B.2 and establish convergence, though at a slower rate. Using the same supply rates (8), Lemma 2 can be modified as below.

Lemma 3.

Suppose that Assumptions 1, 3 and 4.a hold. For the supply rates defined in (8), we have

Proof.

With the $L$-smoothness of the sample losses, we have the following estimate:

So we have

and

The estimate of the remaining supply rate is identical to that in Lemma 2. ∎

Following the same process as in the proof of Corollary 1 in Appendix B.7, we have the following inequality with proper choices of the parameters:

By summing the inequality and choosing the parameters appropriately, we have

Therefore, the following convergence result can be established:

By choosing ,