1 Introduction
Due to the enormity of the training data used today, distributing the data and the computation over a network of worker nodes has attracted intensive research efforts in recent years. In this paper, we focus on parallelizing synchronous SGD in a decentralized setting without a central coordinator (i.e., without a parameter server). Given an arbitrary network topology, all nodes can only exchange parameters or gradients with their local neighbors. This scenario is common and useful when training in large-scale sensor networks, multi-agent systems, as well as federated learning on edge devices.
Error-Runtime Trade-off in Decentralized SGD. In the context of decentralized optimization, previous works have studied error convergence in terms of iterations or communication rounds for decentralized gradient descent [22, 7, 40, 42, 10, 27, 30, 13], mostly for (strongly) convex loss functions. Recent works have extended the analysis to decentralized SGD for non-convex loss functions and subsequently applied it to distributed deep learning in both synchronous [17, 11, 35] and asynchronous settings [1, 18]. However, most existing works do not explicitly consider how the topology affects the runtime, that is, the wall-clock time required to complete each SGD iteration. Well-connected networks encourage faster consensus and give better mean square error convergence rates, but they incur communication delays that increase with the node degree. To strike the best error-runtime trade-off, one can carefully design the network topology, for example, using expander graphs that are sparse while being well connected [6, 23]. However, system constraints such as locality may preclude us from designing such arbitrary network topologies. Other approaches that optimize the per-epoch rate of convergence of decentralized procedures through efficient link scheduling or by constraining the number of allowable links have also been proposed [4]. However, these design criteria do not account for the wall-clock time, which depends on the parallel versus sequential scheduling of the communication links, as we describe later. This raises a pertinent question: for a given topology of worker nodes, how can we achieve the fastest convergence in terms of mean square error versus wall-clock time for a synchronous decentralized SGD algorithm?

Related Works. There has been a massive amount of work in the context of algorithms [37, 33, 36, 5, 26] and systems [16, 43, 9] that improve the communication efficiency of synchronous distributed SGD in a fully connected network. However, it is still unclear whether these strategies can be directly applied to a general decentralized setting. Given an arbitrary network topology, recent works [14, 29] propose to compress the transmitted messages to reduce the required communication bandwidth. However, these methods may not help if the network latency (i.e., the time to establish handshakes) is high.
Other communication-efficient schemes [25, 31], which reduce the number of communication rounds by sparsifying communication over time, have also been proposed, but they do not take communication delays into account. Instead, we focus on the complementary idea of reducing the effective node degree so as to reduce the communication delay, which is suitable for both high-latency and low-bandwidth settings and can easily be combined with existing compression schemes.
Our Proposed Method Matcha. In this paper, we propose Matcha, a decentralized SGD method based on matching decomposition sampling that drastically reduces the communication delay per iteration for any given node topology while maintaining the same error convergence speed. The following key ideas allow us to achieve this: 1) we decompose the graph topology into matchings consisting of disjoint communication links that can operate in parallel, saving communication delay; 2) in each iteration, we carefully sample a subset of these matchings to construct a sparse subgraph of the base topology; and 3) this sequence of subgraphs results in more frequent communication over connectivity-critical links (ensuring fast error convergence) and less frequent communication over other links (saving communication delay).
An illustration of the advantages of using Matcha is presented in Figure 1. It shows that the reduction of communication time at different nodes is not uniform. In particular, when the communication budget is reduced relative to vanilla decentralized SGD, critical links end up being used for communication with high priority. As a result, the communication time at low-degree nodes does not change. On the other hand, links incident to the busiest node are used for communication infrequently, so the communication time at the highest-degree node, which is the bottleneck of the runtime per iteration, is directly reduced. We further validate the effectiveness of Matcha through theoretical analysis and extensive experiments (see Sections 4 and 5).
Besides a win-win in the wall-clock time versus error trade-off, Matcha has several practical benefits. First, the proposed algorithm is simple, in the sense that the communication schedule (i.e., the sequence of sparse subgraphs) of Matcha can be obtained a priori. There is no additional runtime overhead during training. Furthermore, Matcha provides a highly flexible communication scheme among nodes. By setting the communication budget, one can easily tailor the communication time to various system and problem settings, allowing a better trade-off between communication and computation. In our experiments on CIFAR-100, Matcha obtains a multi-fold reduction in communication delay per iteration, and a corresponding reduction in the wall-clock time needed to achieve the same training accuracy.
2 Problem Formulation and Preliminaries
Consider a network of m worker nodes. The communication links connecting the workers are represented by an arbitrary (possibly sparse) undirected connected graph G = (V, E) with vertex set V and edge set E. Each node can only communicate with its neighbors; that is, node i can communicate with node j only if (i, j) ∈ E.
Furthermore, each worker node i only has access to its own local data distribution D_i. Our objective is to use this network of nodes to train a model using the joint dataset. In other words, we seek to minimize the objective function F(x), which is defined as follows:

F(x) := (1/m) Σ_{i=1}^m F_i(x),  where F_i(x) := E_{s∼D_i}[ℓ(x; s)],   (1)

where x denotes the model parameters (for instance, the weights and biases of a neural network), F_i is the local objective function at node i, s denotes a single data sample, and ℓ(x; s) is the loss function for sample s, defined by the learning model.
Decentralized SGD (DecenSGD). Decentralized SGD (or consensus-based distributed SGD) [32, 22, 40, 42, 11, 17, 10] is an effective way to optimize the empirical risk (1) in the considered setting. The algorithm alternates between consensus and gradient steps as follows:

x_{k+1}^{(i)} = Σ_{j=1}^m W_{ij} [ x_k^{(j)} − η g(x_k^{(j)}; ξ_k^{(j)}) ],   (2)

(One can also use the update rule x_{k+1}^{(i)} = Σ_{j=1}^m W_{ij} x_k^{(j)} − η g(x_k^{(i)}; ξ_k^{(i)}); all insights and conclusions in this paper remain the same.)

where ξ_k^{(i)} denotes a mini-batch sampled uniformly at random from the local data distribution D_i at iteration k, g(·; ξ) denotes the stochastic gradient, η is the learning rate, and W_{ij} is the (i, j)-th element of the mixing matrix W. In particular, W_{ij} > 0 only if nodes i and j are connected, i.e., (i, j) ∈ E. In order to guarantee that all nodes reach consensus and converge to a common stationary point, the mixing matrix W is taken to be symmetric and doubly stochastic. For instance, if node 1 connects only with nodes 2 and 3, then the first row of W can be [1 − 2α, α, α, 0, …, 0], where α is a constant.
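The consensus-plus-gradient update above can be sketched in a few lines of NumPy. This is a minimal illustration of the DecenSGD iteration under our own toy setup (the path topology, quadratic local losses, and step size below are stand-ins, not taken from the paper):

```python
import numpy as np

def decen_sgd_step(X, W, grad_fn, lr):
    """One DecenSGD iteration: every node takes a local gradient step,
    then averages the updated models with its neighbors via W.
    X: (m, d) array, row i is the model of node i.
    W: (m, m) symmetric doubly stochastic mixing matrix."""
    G = np.stack([grad_fn(i, X[i]) for i in range(X.shape[0])])
    return W @ (X - lr * G)   # consensus over locally updated models

# Toy example: 3 nodes on a path 0-1-2, quadratic local losses.
alpha = 0.4
W = np.array([[1 - alpha, alpha, 0.0],
              [alpha, 1 - 2 * alpha, alpha],
              [0.0, alpha, 1 - alpha]])
targets = np.array([0.0, 1.0, 2.0])        # node i minimizes (x - t_i)^2 / 2
grad = lambda i, x: x - targets[i]         # noiseless "stochastic" gradient

X = np.array([[5.0], [-3.0], [9.0]])
for _ in range(200):
    X = decen_sgd_step(X, W, grad, lr=0.1)
# The node average converges to the global minimizer mean(targets) = 1.0;
# with a constant step size, a small disagreement between nodes remains.
```

Because W is doubly stochastic, the average of the rows of X evolves exactly like centralized SGD on the average objective, which is the mechanism behind the consensus step.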
Convergence in terms of Error Versus Wall-clock Time. The total training time of an optimization algorithm is the product of two factors: 1) the total number of iterations, and 2) the runtime per iteration. In a decentralized setup involving multiple worker nodes without a coordinating master node, both factors are closely related to the graph topology. While there is extensive literature studying the first factor [22, 7, 21], the second factor is less explored from a theoretical point of view.
In DecenSGD, each node needs to communicate with all of its neighbors at each iteration. The node with the highest degree in the graph (the busiest node) is therefore the bottleneck in completing one consensus step. Intuitively, the communication time per iteration monotonically increases with the maximal node degree. In general, the scaling is linear, as commonly assumed in previous works [9, 31, 6, 28, 18], since the bandwidth is limited and both the total transmitted message size and the number of handshakes are linear in the degree of the node. In this paper, we focus on this linear delay model, but the main idea can also be extended to other scaling rules. Without loss of generality, we assume that the communication (sending and receiving model parameters) over one link costs one unit of time. Then, the communication per iteration takes at least as many units of time as the maximal degree. Although a denser base graph may require fewer iterations to converge, it consumes more communication time per iteration, which can result in longer overall training time.
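Under this linear delay model, the per-iteration communication time of vanilla DecenSGD is governed by the maximal degree. The small helper below (our own illustration, not from the paper) makes this concrete:

```python
def max_degree(edges, num_nodes):
    """Communication time per consensus step under the linear delay
    model: the busiest node must serve all its neighbors one link at
    a time, so one iteration costs max-degree units of time."""
    deg = [0] * num_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return max(deg)

# Star on 5 nodes: node 0 talks to everyone, so one consensus step
# costs 4 units even though every other node is idle most of the time.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
# A ring on the same nodes costs only 2 units per iteration.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
```

Comparing the star (cost 4) with the ring (cost 2) shows why sparsifying the busiest node, rather than the whole graph, is the right target for delay reduction.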
Preliminaries on Graph Theory. The communication graph G can be described by an adjacency matrix A, where A_{ij} = 1 if (i, j) ∈ E and A_{ij} = 0 otherwise. The graph Laplacian is defined as L = D − A, where D = diag(d_1, …, d_m) and d_i denotes the i-th node's degree. When G is a connected graph, the second smallest eigenvalue λ_2(L) of the graph Laplacian is strictly greater than zero and is referred to as the algebraic connectivity [2]. A larger value of λ_2(L) implies a better-connected graph. Moreover, we will use the notion of a matching, defined as follows.

Definition (Matching). A matching in G is a subgraph of G in which each vertex is incident with at most one edge.

3 Matcha: Proposed Matching Decomposition Sampling Strategy
Following the intuition that it is beneficial to communicate more frequently over critical links and less frequently over other links, the algorithm consists of the following three key steps. A brief illustration is shown in Figure 2.
Step 1: Matching Decomposition. First, we decompose the base communication graph G into M disjoint matchings, i.e., G = G_1 ∪ G_2 ∪ ⋯ ∪ G_M, where each G_j is a matching. This decomposition can be achieved via the Misra & Gries edge-coloring algorithm [20], which guarantees that the number of disjoint matchings M equals either Δ(G) or Δ(G) + 1, where Δ(G) is the maximal degree of graph G.
The main benefit of using matchings is that they allow parallel communication, since the links within a matching are disjoint. Recall that a matching is a set of edges without common vertices; within a matching, each node has at most one neighbor. Thus, all links in a matching can be used for communication in parallel, and the communication time for each matching is exactly one unit. Given this decomposition, communicating over all matchings sequentially is a simple and efficient way to implement the consensus step of the decentralized training algorithm. The total communication time is linear in the number of matchings and bounded by Δ(G) + 1 units, which matches the communication time model discussed in Section 2 and in previous works [31, 6, 28, 18].
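The decomposition itself can be produced by any proper edge coloring. The sketch below uses a simple greedy coloring rather than the Misra & Gries algorithm [20], so it is not guaranteed to stay within Δ(G) + 1 matchings on every graph, but it illustrates the idea on small examples:

```python
def greedy_matchings(edges, num_nodes):
    """Split an edge list into matchings via greedy edge coloring:
    give each edge the smallest color unused at both endpoints.
    Edges sharing a color touch no common vertex, i.e. they form
    a matching whose links can all be used in parallel."""
    used = [set() for _ in range(num_nodes)]   # colors used at each node
    matchings = []
    for u, v in edges:
        c = 0
        while c in used[u] or c in used[v]:
            c += 1
        used[u].add(c)
        used[v].add(c)
        while len(matchings) <= c:
            matchings.append([])
        matchings[c].append((u, v))
    return matchings

# A triangle has maximal degree 2 but, being an odd cycle, needs
# Delta + 1 = 3 matchings -- one edge per matching.
tri = [(0, 1), (1, 2), (0, 2)]
ms = greedy_matchings(tri, 3)
```

Each returned list is one matching G_j; communicating over the lists sequentially costs one time unit per list, matching the delay model above.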
Step 2: Computing Matching Activation Probabilities.
In order to control the communication time, we assign each matching G_j a Bernoulli random variable B_j^{(k)}, which equals 1 with probability p_j and 0 otherwise, for j = 1, …, M. Then, at each iteration k, the links in matching G_j are used for information exchange between the corresponding worker nodes only when the realization of B_j^{(k)} is 1. As a result, when the B_j^{(k)}'s are independent of each other, the communication time per iteration can be written as

C^{(k)} = Σ_{j=1}^M B_j^{(k)},  so that  E[C^{(k)}] = Σ_{j=1}^M p_j.   (3)
We define p_j as the activation probability of matching G_j. By controlling the sum of all activation probabilities, one can easily change the expected communication time. When all p_j's equal 1, the algorithm reduces to vanilla DecenSGD and takes M units of time to finish one consensus step. We further define the communication budget (CB) as the fraction of the communication time of vanilla DecenSGD (e.g., a budget of one half means using only half the communication time per iteration of vanilla DecenSGD). Given a CB, there can be many feasible sets of activation probabilities. As mentioned before, a key contribution of this paper is that we give more importance to critical links. This is achieved by controlling the activation probabilities of the matchings. Formally, we choose a set of activation probabilities by solving the following optimization problem:
max_{p_1, …, p_M}  λ_2( Σ_{j=1}^M p_j L_j )
subject to  Σ_{j=1}^M p_j ≤ CB · M,   0 ≤ p_j ≤ 1 for all j,   (4)

where L_j denotes the Laplacian matrix of the j-th matching G_j, so that Σ_j p_j L_j can be considered the Laplacian of the expected graph, and CB is the predetermined communication budget. Moreover, recall that λ_2(·) represents the algebraic connectivity and is a concave function of the Laplacian [12, 2]. Thus, it directly follows that (4) is a convex problem and can be solved efficiently.
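Solving the activation-probability problem requires a convex solver, but evaluating its objective, the algebraic connectivity of the expected graph, takes only a few lines of NumPy. The sketch below (our own illustration; the path topology and probabilities are stand-ins) also shows why a zero probability on a bridge matching is fatal:

```python
import numpy as np

def laplacian(edges, m):
    """Graph Laplacian L = D - A for an edge list on m nodes."""
    L = np.zeros((m, m))
    for u, v in edges:
        L[u, u] += 1; L[v, v] += 1
        L[u, v] -= 1; L[v, u] -= 1
    return L

def algebraic_connectivity(p, matchings, m):
    """lambda_2 of the expected Laplacian sum_j p_j L_j -- the concave
    objective maximized when choosing activation probabilities."""
    L_bar = sum(pj * laplacian(Ej, m) for pj, Ej in zip(p, matchings))
    return np.sort(np.linalg.eigvalsh(L_bar))[1]   # second smallest eigenvalue

# Path 0-1-2-3 decomposed into two matchings.
matchings = [[(0, 1), (2, 3)], [(1, 2)]]
full = algebraic_connectivity([1.0, 1.0], matchings, 4)   # whole path active
half = algebraic_connectivity([1.0, 0.0], matchings, 4)   # middle link never active
# Dropping the bridge matching disconnects the expected graph: lambda_2 = 0.
```

Here `full` is positive (the path is connected) while `half` is zero, which is exactly why the optimization assigns high probability to connectivity-critical matchings.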
Step 3: Generating a Random Topology Sequence. At the k-th iteration, the communication among nodes happens only over links in the activated topology, which may be sparse or even disconnected. Given this activated topology, we need to further specify in what proportions the local models are averaged together in order to perform the consensus step in (2). A common practice is to use an equal-weight mixing matrix [38, 12, 7] as follows:

W^{(k)} = I − α L^{(k)},   (5)

where L^{(k)} = Σ_{j=1}^M B_j^{(k)} L_j denotes the graph Laplacian of the activated topology at the k-th iteration. The matrix W^{(k)} is symmetric and doubly stochastic by construction. The parameter α represents the weight of the neighbors' information in the consensus step. By setting a proper value of α, the convergence of Matcha to a stationary point can be guaranteed. In particular, we select the value of α that minimizes the optimization error upper bound. In Section 4.2, we show that optimizing α can be formulated as a semidefinite programming problem, which needs to be solved only once at the beginning of training.
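The equal-weight construction of Step 3, subtracting α times the activated Laplacian from the identity, yields a symmetric doubly stochastic matrix automatically, because the Laplacian's rows sum to zero. A quick NumPy check (our own sketch; the single-matching example is illustrative):

```python
import numpy as np

def mixing_matrix(L, alpha):
    """Equal-weight mixing matrix W = I - alpha * L for an activated
    topology with Laplacian L. Rows and columns sum to 1 because the
    rows of L sum to 0; symmetry is inherited from L."""
    return np.eye(L.shape[0]) - alpha * L

# Activated subgraph: the single matching {(0,1), (2,3)} on 4 nodes.
L = np.array([[ 1, -1,  0,  0],
              [-1,  1,  0,  0],
              [ 0,  0,  1, -1],
              [ 0,  0, -1,  1]], dtype=float)
W = mixing_matrix(L, alpha=0.5)
# With alpha = 0.5, W exactly averages each matched pair of models.
averaged = W @ np.array([2.0, 0.0, 4.0, 6.0])   # -> [1, 1, 5, 5]
```

Any 0 < α < 1 produces a valid convex combination here; the paper's point is that α is then tuned globally to minimize the error bound, not per matching.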
Extension to Other Design Choices. To sum up, the inputs of the proposed algorithm Matcha are a base communication topology and a target communication budget CB. Following Steps 1 to 3, the algorithm outputs a random topology sequence and a value of α that defines the inter-node information exchange. All of this information can be obtained and assigned a priori to worker nodes before starting the training procedure.
We note that this framework of randomly activating subgraphs is very general and can be extended to various other delay models and graph decomposition methods. For example, instead of activating all matchings independently, one can choose to activate only one matching at each iteration; instead of assuming all links cost the same amount of time, one can model the communication time of each link as a random variable and modify (3) accordingly. Moreover, rather than matching decomposition, it is also possible to decompose the base topology into subgraphs of other types. For instance, each subgraph can be a single edge of the base graph G.
Among all possible variants, we would like to highlight one special case: Periodic DecenSGD (P-DecenSGD), which has appeared in previous works [31, 35]. In P-DecenSGD, all links in the base topology are activated together (i.e., all matchings share a common activation schedule) once every few iterations. In this case, the communication budget is equivalent to the communication frequency. In Sections 4 and 5, we use P-DecenSGD as another benchmark for comparison.
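Both the Matcha schedule and the periodic special case can be pre-computed before training starts, which is what makes the approach overhead-free at runtime. A minimal sketch using the standard library (helper names and parameters are our own, not from the paper):

```python
import random

def matcha_schedule(probs, num_iters, seed=0):
    """Pre-compute which matchings are active at each iteration:
    matching j is activated independently with probability probs[j].
    Returns a list of 'active matching index' lists, one per iteration."""
    rng = random.Random(seed)          # fixed seed: schedule is reproducible
    return [[j for j, p in enumerate(probs) if rng.random() < p]
            for _ in range(num_iters)]

def periodic_schedule(num_matchings, period, num_iters):
    """P-DecenSGD: the whole base graph (all matchings) every `period`
    iterations, and no communication otherwise."""
    return [list(range(num_matchings)) if k % period == 0 else []
            for k in range(num_iters)]

# Two cheap matchings activated often, one expensive matching rarely:
sched = matcha_schedule([0.9, 0.9, 0.2], num_iters=1000)
avg_cost = sum(len(s) for s in sched) / 1000   # ~= sum of probabilities = 2.0
```

The average of `len(s)` over iterations approximates the expected communication time in (3), so checking a candidate probability vector against a budget is a one-liner.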
4 Theoretical Analyses
In this section, we provide convergence guarantees for Matcha. Specifically, we first provide a convergence guarantee that explicitly quantifies the dependence of the mean square error on an arbitrary random topology sequence. Then, in Section 4.2, we analyze the spectral norm of the random topology sequence generated by Matcha. All proofs are provided in the Appendix.
In order to facilitate the analysis, we define the averaged iterate as x̄_k = (1/m) Σ_{i=1}^m x_k^{(i)} and the lower bound of the objective function as F_inf. Since we focus on general non-convex loss functions, the quantity of interest is the averaged gradient norm (1/K) Σ_{k=1}^K E‖∇F(x̄_k)‖²; when it approaches zero, the algorithm converges to a stationary point. The convergence analysis is centered around the following assumptions, which are common in the distributed optimization literature [3, 22, 17].

Assumption 1 (Smoothness). Each local objective function is L-Lipschitz smooth: ‖∇F_i(x) − ∇F_i(y)‖ ≤ L‖x − y‖.

Assumption 2 (Unbiased gradients). The stochastic gradient at each worker node is an unbiased estimator of the true gradient of the local objective: E[g(x_k^{(i)}; ξ_k^{(i)}) | F_k] = ∇F_i(x_k^{(i)}), where F_k denotes the sources of randomness up to iteration k, i.e., the sigma-algebra generated by the noise of the stochastic gradients and the graph activations before iteration k.

Assumption 3 (Bounded variance). The variance of the stochastic gradient at each worker node is uniformly bounded: E[‖g(x_k^{(i)}; ξ_k^{(i)}) − ∇F_i(x_k^{(i)})‖² | F_k] ≤ σ².

4.1 Convergence Analysis for Arbitrary Random Topology
Theorem 1 (Basic Convergence Result). Suppose that all local models are initialized at the same iterate and that {W^{(k)}} is an i.i.d. random matrix sequence. Then, under Assumptions 1 to 3, if the learning rate η is sufficiently small, after K total iterations we have

(6)

where ρ := ‖E[W^{(k)}⊤ W^{(k)}] − J‖₂ is the spectral norm (i.e., largest singular value) of the matrix E[W^{(k)}⊤ W^{(k)}] − J, with J = (1/m)𝟙𝟙⊤.

The result in Theorem 1 can be further refined by introducing assumptions on the dissimilarities among the local objectives. For brevity, we simply assume that the local gradients are uniformly bounded, as in [7, 39, 14], and derive the following corollary. In the Appendix, we provide another version of the corollary under a weaker assumption, as in [17].

Corollary 1. Suppose that the gradient of each local objective is uniformly bounded and the learning rate is chosen on the order of √(m/K). Then after K total iterations,

(7)

where all other constants are subsumed in the O(·) notation.

Dependence on the Random Topology. Theorem 1 together with Corollary 1 shows that when the other algorithm parameters are fixed, the mean square error monotonically increases with the spectral norm ρ. Typically, the value of ρ reflects the connectivity of the random topology. If the activated topology is always fully connected, i.e., W^{(k)} = J, then ρ = 0 and Theorem 1 recovers the convergence result of centralized SGD. However, if there are two groups of nodes that are never connected during the whole training procedure, then ρ = 1; local models cannot reach consensus and the iterates diverge. Since Matcha optimizes the connectivity of the average activated topology, it is important to guarantee that ρ < 1. We prove this statement in Section 4.2.
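For a small number of matchings, the spectral norm that controls the bound can be computed exactly by enumerating all activation patterns. The NumPy sketch below follows the definition ρ = ‖E[W⊤W] − J‖₂ under our notation (the path topology, probabilities, and α value are illustrative choices, not the paper's):

```python
from itertools import product
import numpy as np

def spectral_norm_rho(matchings, probs, m, alpha):
    """rho = || E[W^T W] - J ||_2 with W = I - alpha * sum_j B_j L_j
    and independent activations B_j ~ Bernoulli(p_j). The expectation
    is computed exactly by enumerating all 2^M activation patterns."""
    Ls = []
    for edges in matchings:
        L = np.zeros((m, m))
        for u, v in edges:
            L[u, u] += 1; L[v, v] += 1
            L[u, v] -= 1; L[v, u] -= 1
        Ls.append(L)
    J = np.ones((m, m)) / m
    EWW = np.zeros((m, m))
    for bits in product([0, 1], repeat=len(matchings)):
        weight = np.prod([p if b else 1 - p for b, p in zip(bits, probs)])
        W = np.eye(m) - alpha * sum(b * L for b, L in zip(bits, Ls))
        EWW += weight * (W.T @ W)
    return np.linalg.norm(EWW - J, 2)   # largest singular value

# Path 0-1-2-3 split into two matchings, both always active:
# the base graph is connected, so rho < 1 and convergence is supported.
matchings = [[(0, 1), (2, 3)], [(1, 2)]]
rho = spectral_norm_rho(matchings, [1.0, 1.0], m=4, alpha=0.4)
```

Lowering an activation probability or mis-tuning α pushes ρ toward 1, which is the quantitative version of the connectivity discussion above.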
4.2 Analysis for Random Topology Sequence Generated by Matcha
Theorem (Existence). Suppose the base graph G is connected, and let L^{(k)} denote the Laplacian matrix of the activated topology at the k-th iteration in Matcha. If the mixing matrix is defined as W^{(k)} = I − α L^{(k)}, then there exists a value of α such that ρ = ‖E[W^{(k)}⊤ W^{(k)}] − J‖₂ < 1.

This existence result together with Theorem 1 guarantees the convergence of Matcha. When the communication budget (or the activation probabilities) varies, the value of α should be changed accordingly. However, finding the optimal value of α, i.e., the one that minimizes the spectral norm, is non-trivial, as it is hard to obtain in closed form. Nevertheless, we show that optimizing α can be formulated as a semidefinite program and thus solved efficiently by numerical methods.

Theorem (Optimizing α). Given the subgraphs and their corresponding activation probabilities, optimizing the mixing matrix can be formulated as the following semidefinite programming problem:
(8) 
where s is an auxiliary variable. Dependence on Communication Budget. In Figure 3, we present simulation results on how the minimal spectral norm (the solution of (8)) changes with the communication budget. Recall that a lower spectral norm means better error convergence in terms of iterations. It can be observed that Matcha can reduce the communication time while preserving the same spectral norm as vanilla DecenSGD. By setting a proper communication budget, Matcha can even achieve a lower spectral norm than vanilla DecenSGD (see Figure 2(b)). Besides, to achieve the same spectral norm, Matcha always requires a much smaller communication budget than periodic DecenSGD. Moreover, even if one sets a very low communication budget, since the spectral norm only influences the higher-order terms in (7), Matcha still achieves an O(1/√(mK)) rate after a sufficiently large number of iterations. These theoretical findings are corroborated by extensive experiments in Section 5.
5 Experimental Results
Experimental Setting. We evaluate the performance of the proposed algorithm on multiple deep learning tasks: (1) image classification on CIFAR-10 and CIFAR-100 [15]; (2) language modeling on the Penn Treebank corpus (PTB) dataset [19]. All training datasets are evenly partitioned over a network of workers. All algorithms are trained sufficiently long, until convergence or overfitting. Moreover, to guarantee a fair comparison on each task, the learning rate is fine-tuned for vanilla DecenSGD and then kept the same for all other algorithms. More detailed descriptions of the datasets and training configurations are provided in Appendix A.1.
Effectiveness of Matcha. We compare the performance of Matcha under various communication budgets with vanilla DecenSGD in Figure 4. The base communication topology is shown in Figure 1. From Figures 3(d), 3(e), and 3(f), one can observe that at a moderate communication budget, Matcha has nearly identical training loss to vanilla DecenSGD at every epoch, while requiring at most half the communication time per iteration. This empirical finding reinforces the claim in Section 4 regarding the similarity of the algorithms' performance in terms of epochs (see Figure 2(a)). When we continue to decrease the communication budget, Matcha attains significantly faster convergence with respect to wall-clock time on communication-intensive tasks. In particular, the proposed algorithm reaches the same training loss in a fraction of the time taken by vanilla DecenSGD on CIFAR-100 (see Figure 3(a)).
Effects of the Base Communication Topology. In order to further verify the generality of Matcha, we evaluate it on additional base topologies with varying connectivity. In Figure 5, we present experimental results on three different base topologies, which are random geometric graphs with different maximal degrees. In particular, on the topology of Figure 4(b), Matcha with a reduced communication budget not only lowers the communication time per iteration but also achieves lower error than vanilla DecenSGD. This result corroborates the corresponding spectral norm versus communication budget curve shown in Figure 2(b). When we further increase the density of the base topology (see Figure 4(c)), Matcha reduces the communication time per iteration without hurting the error convergence.
Another interesting observation is that Matcha yields greater communication reduction for denser base graphs. As shown in Figure 5, as the density of the base graph increases, the training time of vanilla DecenSGD to finish the same number of epochs also increases. In Matcha, by contrast, since the effective maximal degree is kept roughly constant across all cases by controlling the communication budget, the total training time remains nearly the same. Moreover, Matcha takes less and less time to reach a given training loss, in contrast to vanilla DecenSGD and P-DecenSGD.
Comparison to Periodic DecenSGD. As discussed in Sections 3 and 4, a naive way to reduce the communication time per iteration is to introduce a communication frequency for the whole base graph [35, 31]. Instead, in Matcha, we allow matchings to have different communication frequencies. Similar to the theoretical simulations in Figure 3, the results in Figure 5 show that given a fixed communication budget, Matcha consistently outperforms periodic DecenSGD. More results are presented in the Appendix.
6 Concluding Remarks
In this paper, we have proposed Matcha to reduce and control the communication delay of decentralized SGD over arbitrary worker network topologies. The key idea in Matcha is that workers communicate over connectivity-critical links with high priority, which we achieve via matching decomposition sampling. Rigorous theoretical analysis and experimental results show that Matcha reduces the communication delay while maintaining the same error convergence rate in terms of epochs. Future directions include adaptively changing the communication time per iteration, as in [34], and extending Matcha to directed communication graphs.
References
 [1] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.
 [2] Béla Bollobás. Modern graph theory, volume 184. Springer Science & Business Media, 2013.

 [3] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
 [4] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking (TON), 14(SI):2508–2530, 2006.
 [5] Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, pages 5050–5060, 2018.
 [6] Yat-Tin Chow, Wei Shi, Tianyu Wu, and Wotao Yin. Expander graph and communication-efficient decentralized optimization. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 1715–1720. IEEE, 2016.
 [7] John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic control, 57(3):592–606, 2012.

 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [9] Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592–2600, 2016.
 [10] Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu, and Soummya Kar. Convergence rates for distributed stochastic optimization over random networks. In 2018 IEEE Conference on Decision and Control (CDC), pages 4238–4245. IEEE, 2018.
 [11] Zhanhong Jiang, Aditya Balu, Chinmay Hegde, and Soumik Sarkar. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pages 5906–5916, 2017.
 [12] Soummya Kar and José MF Moura. Sensor networks with random links: Topology design for distributed consensus. IEEE Transactions on Signal Processing, 56(7):3315–3326, 2008.
 [13] Soummya Kar, José MF Moura, and Kavita Ramanan. Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication. IEEE Transactions on Information Theory, 58(6):3575–3605, 2012.
 [14] Anastasia Koloskova, Sebastian U Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv preprint arXiv:1902.00340, 2019.
 [15] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [16] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and BorYiing Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
 [17] Xiangru Lian, Ce Zhang, Huan Zhang, ChoJui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5336–5346, 2017.
 [18] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
 [19] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. 1993.
 [20] Jayadev Misra and David Gries. A constructive proof of vizing’s theorem. In Information Processing Letters. Citeseer, 1992.
 [21] Angelia Nedić, Alex Olshevsky, and Michael G Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
 [22] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
 [23] Reza OlfatiSaber. Algebraic connectivity ratio of Ramanujan graphs. In 2007 American Control Conference, pages 4619–4624. IEEE, 2007.
 [24] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
 [25] Anit Kumar Sahu, Dusan Jakovetic, Dragana Bajovic, and Soummya Kar. Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates. arXiv preprint arXiv:1809.02920, 2018.
 [26] Felix Sattler, Simon Wiedemann, KlausRobert Müller, and Wojciech Samek. Sparse binary compression: Towards distributed deep learning with minimal communication. arXiv preprint arXiv:1805.08768, 2018.
 [27] Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2740–2749, 2018.
 [28] Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, and Hui Qian. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. arXiv preprint arXiv:1805.09969, 2018.
 [29] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pages 7652–7662, 2018.
 [30] Zaid J Towfic, Jianshu Chen, and Ali H Sayed. Excess-risk of distributed stochastic learners. IEEE Transactions on Information Theory, 62(10):5753–5785, 2016.
 [31] Konstantinos Tsianos, Sean Lawlor, and Michael G Rabbat. Communication/computation tradeoffs in consensus-based distributed optimization. In Advances in Neural Information Processing Systems, pages 1943–1951, 2012.
 [32] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
 [33] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. Atomo: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pages 9850–9861, 2018.
 [34] Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. CoRR, abs/1810.08313, 2018.
 [35] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.
 [36] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854, 2017.
 [37] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. arXiv preprint arXiv:1705.07878, 2017.
 [38] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
 [39] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD for non-convex optimization with faster convergence and less communication. arXiv preprint arXiv:1807.06629, 2018.
 [40] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
 [41] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [42] Jinshan Zeng and Wotao Yin. On non-convex decentralized gradient descent. arXiv preprint arXiv:1608.05766, 2016.
 [43] Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIXATC 17), pages 181–193, 2017.
Appendix A More Experimental Results
A.1 Detailed Experimental Setting
Image Classification Tasks. CIFAR-10 and CIFAR-100 consist of 32×32 color images in 10 and 100 classes, respectively. We train ResNet-50 [8] and WideResNet-28-10 [41] for the image classification tasks. The learning rate decays by a fixed factor at scheduled epochs. We train vanilla DecenSGD until convergence and all other algorithms for the same wall-clock time as vanilla DecenSGD.
Language Modeling Task. For language modeling on the PTB dataset, a two-layer LSTM [24] is adopted. The learning rate decays when the training procedure saturates. All algorithms are trained for the same number of epochs.
Machines. Unless otherwise stated, training is performed on a network of nodes, each of which is equipped with one NVIDIA TitanX Maxwell GPU and an Ethernet interface. Matcha is implemented with PyTorch and MPI4Py.
A.2 More Results
Appendix B Proofs of Theorem 1 and Corollary 1
B.1 Preliminaries
In the proof, we will use the following matrix forms:
(9)  
(10)  
(11) 
Recall the assumptions we make:
(12)  
(13)  
(14) 
B.2 Lemmas
Lemma 1. Let {W^{(k)}} be an i.i.d. sequence of symmetric and doubly stochastic matrices of size m × m. Then, for any matrix,
(15)
where ρ = ‖E[W^{(k)}⊤ W^{(k)}] − J‖₂.
Proof.
For the ease of writing, let us define and use to denote the
th row vector of
. Since for all , we have and . Thus, one can obtain(16) 
Then, taking the expectation with respect to ,
(17)  
(18)  
(19) 
Let and , then
(20)  
(21)  
(22) 
Repeating this procedure and using the fact that the W^{(k)}'s are i.i.d. matrices, we have
(23) 
This completes the proof. ∎
B.3 Proof of Theorem 1
Since the objective function is Lipschitz smooth, we have
(24) 
Plugging into the update rule , we have
(25) 
Then, taking the expectation with respect to the random mini-batches at the k-th iteration,
(26) 
For the first term in (26), since , we have
(27)  
(28) 
Recall that ,
(29)  
(30)  
(31) 
where the last inequality follows from the Lipschitz smoothness assumption. Then, plugging (31) into (28), we obtain
(32) 
Next, for the second part in (26),
(33)  
(34)  
(35) 
where the last inequality follows from the bounded variance assumption. Then, combining (32) and (35) and taking the total expectation over all random variables, one obtains:
(36) 
Summing over all iterates and taking the average,
(37) 
By minor rearranging, we get
(38)  
(39) 
This completes the first part of the proof. Next, we show that the discrepancy among the local models is upper bounded. According to the update rule of decentralized SGD and the doubly stochastic property of the gossip matrix, we have
(40)  
(41)  
(42)  
(43) 
Since all local models are initiated at the same point, . Thus, we can obtain
(44)  
(45)  
(46) 
For the first term in (46), we have
(47)  
(48)  
(49)  
(50) 
where (48) follows from the lemma in Section B.2. For the second term in (46), define