Stochastic Gradient Push for Distributed Deep Learning

by Mahmoud Assran, et al.

Large mini-batch parallel SGD is commonly used for distributed training of deep networks. Approaches that use tightly-coupled exact distributed averaging based on AllReduce are sensitive to slow nodes and high-latency communication. In this work we show the applicability of Stochastic Gradient Push (SGP) for distributed training. SGP uses a gossip algorithm called PushSum for approximate distributed averaging, allowing for much more loosely coupled communications, which can be beneficial in high-latency or high-variability scenarios. The tradeoff is that approximate distributed averaging injects additional noise into the gradient, which can affect the training and test accuracies. We prove that SGP converges to a stationary point of smooth, non-convex objective functions. Furthermore, we validate empirically the potential of SGP. For example, using 32 nodes with 8 GPUs per node to train ResNet-50 on ImageNet, where nodes communicate over 10Gbps Ethernet, SGP completes 90 epochs in around 1.6 hours while AllReduce SGD takes over 5 hours, and the top-1 validation accuracy of SGP remains within 1.2% of that obtained when using AllReduce SGD.



1 Introduction

Deep Neural Networks (DNNs) are the state-of-the-art machine learning approach in many application areas, including image recognition and natural language processing. Stochastic Gradient Descent (SGD) is the current workhorse for training neural networks. The algorithm optimizes the network parameters, $x$, to minimize a loss function, $f$, through gradient descent, where the loss function's gradients are approximated using a subset of training examples (a mini-batch). DNNs often require large amounts of training data and trainable parameters, necessitating non-trivial computational requirements wu2016google ; mahajan2018exploring . There is a need for efficient methods to train DNNs in large-scale computing environments.

A parallel version of SGD is usually adopted for large-scale, distributed training goyal2017accurate ; li2014scaling . Worker nodes compute local mini-batch gradients of the loss function on different subsets of the data, and then calculate an exact inter-node average gradient using either the AllReduce communication primitive, in synchronous implementations goyal2017accurate , or using a central parameter server, in asynchronous implementations dean2012large . Using a parameter server to aggregate gradients introduces a potential bottleneck and a central point of failure lian2017can . The AllReduce primitive computes the exact average gradient at all workers in a decentralized manner, avoiding issues associated with centralized communication and computation.

However, exact averaging algorithms like AllReduce are not robust in high-latency or high-variability platforms, e.g., where the network bandwidth may be a significant bottleneck, because they involve tightly-coupled, blocking communication (i.e., the call does not return until all nodes have finished aggregating). Moreover, aggregating gradients across all the nodes in the network can introduce non-trivial computational overhead when there are many nodes, or when the gradients themselves are large. This issue motivates the investigation of a decentralized and inexact version of SGD to reduce the overhead associated with distributed training.

There have been numerous decentralized optimization algorithms proposed and studied in the control-systems literature that leverage consensus-based approaches to aggregate information; see the recent survey Nedic2018network and references therein. Rather than exactly aggregating gradients (as with AllReduce), this line of work uses less-coupled message passing algorithms which compute inexact distributed averages.

Most previous work in this area has focused on theoretical convergence analysis assuming convex objectives. Recent work has begun to investigate the applicability of these methods to large-scale training of DNNs lian2017can ; Jiang2017collaborative . However, these papers study methods based on communication patterns which are static (the same at every iteration) and symmetric (if $i$ sends to $j$, then $i$ must also receive from $j$ before proceeding). Such methods inherently require blocking and communication overhead. State-of-the-art consensus optimization methods build on the PushSum algorithm for approximate distributed averaging kempe2003gossip ; Nedic2018network , which allows for non-blocking, time-varying, and directed (asymmetric) communication. Since SGD already uses stochastic mini-batches, the hope is that an inexact average mini-batch gradient will be as useful as the exact one if the averaging error is sufficiently small relative to the variability in the stochastic gradient.

This paper studies the use of Stochastic Gradient Push (SGP), an algorithm blending SGD and PushSum, for distributed training of deep neural networks. We provide a theoretical analysis of SGP, showing it converges for smooth non-convex objectives. We also evaluate SGP experimentally, training ResNets on ImageNet using up to 32 nodes, each with 8 GPUs (i.e., 256 GPUs in total). Our main contributions are summarized as follows:

  • We provide the first convergence analysis for Stochastic Gradient Push when the objective function is smooth and non-convex. We show that, for an appropriate choice of the step size, SGP converges to a stationary point at a rate of $\mathcal{O}(1/\sqrt{nK})$, where $n$ is the number of nodes and $K$ is the number of iterations.

  • In a high-latency scenario, where nodes communicate over 10Gbps Ethernet, SGP runs up to $3\times$ faster than AllReduce SGD and exhibits 88.6% scaling efficiency over the range from 4 to 32 nodes.

  • The top-1 validation accuracy of SGP matches that of AllReduce SGD for up to 8 nodes (64 GPUs), and remains within 1.2% of AllReduce SGD for larger networks.

  • In a low-latency scenario, where nodes communicate over a 100Gbps InfiniBand network supporting GPUDirect, SGP is on par with AllReduce SGD in terms of running time, and SGP exhibits 92.4% scaling efficiency.

  • In comparison to other synchronous decentralized consensus-based approaches that require symmetric messaging, SGP runs faster and it produces models with better validation accuracy.

2 Preliminaries

Problem formulation.

We consider the setting where a network of $n$ nodes cooperates to solve the stochastic consensus optimization problem

$$\min_{x_i \in \mathbb{R}^d,\; i = 1, \dots, n} \; \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\xi_i \sim D_i} F_i(x_i; \xi_i) \quad \text{subject to} \quad x_i = x_j, \;\; \forall i, j = 1, \dots, n. \tag{1}$$

Each node has local data following a distribution $D_i$, and the nodes wish to cooperate to find the parameters $x$ of a DNN that minimizes the average loss with respect to their data, where $F_i$ is the loss function at node $i$. Moreover, the goal codified in the constraints is for the nodes to reach agreement (i.e., consensus) on the solution they report. We assume that nodes can locally evaluate stochastic gradients $\nabla F_i(x_i; \xi_i)$, $\xi_i \sim D_i$, but they must communicate to access information about the objective functions at other nodes.

Distributed averaging.

The problem described above encompasses distributed training based on data parallelism. There, a canonical approach is large mini-batch parallel stochastic gradient descent: for an overall mini-batch of size $nb$, each node computes a local stochastic mini-batch gradient using $b$ samples, and then the nodes use the AllReduce communication primitive to compute the average gradient at every node. Let $f_i(x_i) = \mathbb{E}_{\xi_i \sim D_i} F_i(x_i; \xi_i)$ denote the objective at node $i$, and let $f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ denote the overall objective. Since $\nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x)$, averaging gradients via AllReduce provides an exact stochastic gradient of $f$. Typical implementations of AllReduce have each node send and receive $2\frac{n-1}{n}B$ bytes, where $B$ is the size (in bytes) of the tensor being reduced, and involve $2\log_2(n)$ communication steps Rabenseifner2004optimization . Moreover, AllReduce is a blocking primitive, meaning that no node will proceed with local computations until the primitive returns.
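The two claims in this paragraph can be checked on a toy problem. The sketch below is illustrative only: the least-squares objective, the helper names, and the cost model in `allreduce_cost` (per-node traffic of 2(n-1)/n times the tensor size and 2 log2(n) steps, as for a reduce-scatter followed by an all-gather) are assumptions made for the example, not code from the paper.

```python
import math
import numpy as np

def allreduce_cost(num_nodes, tensor_bytes):
    """Per-node bytes sent/received and step count under the assumed cost model."""
    bytes_per_node = 2 * (num_nodes - 1) / num_nodes * tensor_bytes
    steps = 2 * math.ceil(math.log2(num_nodes))
    return bytes_per_node, steps

def local_gradient(x, A, b):
    """Gradient of the toy local objective f_i(x) = ||A x - b||^2 / (2 m)."""
    return A.T @ (A @ x - b) / len(b)

rng = np.random.default_rng(0)
n, m, d = 4, 8, 3  # nodes, samples per node, parameters
data = [(rng.normal(size=(m, d)), rng.normal(size=m)) for _ in range(n)]
x = rng.normal(size=d)

# What AllReduce hands back to every node: the average of the local gradients.
avg_grad = np.mean([local_gradient(x, A, b) for A, b in data], axis=0)

# Gradient of the overall objective, computed directly on the pooled data.
A_all = np.vstack([A for A, _ in data])
b_all = np.concatenate([b for _, b in data])
pooled_grad = local_gradient(x, A_all, b_all)

print(np.allclose(avg_grad, pooled_grad))
print(allreduce_cost(32, 100e6))
```

With equal local batch sizes, the averaged local gradients coincide with the gradient on the pooled batch, which is why exact averaging yields an exact stochastic gradient of the overall objective.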

Approximate distributed averaging.

In this work we explore the alternative approach of using a gossip algorithm for approximate distributed averaging, specifically the PushSum algorithm. Gossip algorithms typically use linear iterations for averaging. For example, let $y_i^{(0)} \in \mathbb{R}^d$ be a vector at node $i$, and consider the goal of computing the average vector $\frac{1}{n} \sum_{i=1}^{n} y_i^{(0)}$ at all nodes. Stack the initial vectors into a matrix $Y^{(0)} \in \mathbb{R}^{n \times d}$ with one row per node. Typical gossip iterations have the form $Y^{(k+1)} = P^{(k)} Y^{(k)}$, where $P^{(k)} \in \mathbb{R}^{n \times n}$ is referred to as the mixing matrix. This corresponds to the update $y_i^{(k+1)} = \sum_{j=1}^{n} p_{i,j}^{(k)} y_j^{(k)}$ at node $i$. To implement this update, node $i$ only needs to receive messages from other nodes $j$ for which $p_{i,j}^{(k)} \neq 0$, so it will be appealing to use sparse $P^{(k)}$ to reduce communications.
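The linear gossip iteration can be simulated in a few lines. The sketch below assumes a fixed, symmetric, doubly stochastic mixing matrix on a ring (each node averages itself and its two neighbours); this particular matrix is an illustrative choice, not the one used in the experiments.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic matrix: each node averages itself and its two ring neighbours."""
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i, j % n] = 1.0 / 3.0
    return P

rng = np.random.default_rng(0)
n, d = 8, 4
Y0 = rng.normal(size=(n, d))  # one row per node
target = Y0.mean(axis=0)      # the average every node should reach

P = ring_mixing_matrix(n)
Y = Y0.copy()
for _ in range(200):
    # Row i of the product is the weighted sum of the rows j with P[i, j] != 0,
    # so node i only needs messages from its two ring neighbours.
    Y = P @ Y

max_error = np.abs(Y - target).max()
print(max_error)
```

Because the matrix is sparse, each node exchanges data with only two peers per iteration, at the price of needing many iterations to approach the exact average.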

Drawing inspiration from the theory of Markov chains Seneta1981 , the mixing matrices are designed to be column stochastic. Then, under mild conditions (e.g., ensuring that information from every node eventually reaches all other nodes) one can show that $\lim_{K \to \infty} \prod_{k=0}^{K} P^{(k)} = \pi \mathbf{1}^\top$, where $\pi$ is the ergodic limit of the chain and $\mathbf{1}$ is a vector with all entries equal to $1$. Consequently, the gossip iterations converge to a limit $Y^{(\infty)} = \pi (\mathbf{1}^\top Y^{(0)})$; i.e., the value at node $i$ converges to $y_i^{(\infty)} = \pi_i \sum_{j=1}^{n} y_j^{(0)}$. When the matrices $P^{(k)}$ are symmetric, it is straightforward to design the algorithm so that $\pi_i = 1/n$ for all $i$ by making $P^{(k)}$ doubly stochastic. However, requiring symmetric $P^{(k)}$ has strong practical ramifications, such as requiring care in the implementation to avoid deadlocks.

The PushSum algorithm only requires that $P^{(k)}$ be column-stochastic, and not necessarily symmetric (so node $i$ may send to node $j$, but not necessarily vice versa). Instead, one additional scalar parameter $w_i^{(k)}$ is maintained at each node. The parameter is initialized to $w_i^{(0)} = 1$ for all $i$, and updated using the same linear iteration, $w^{(k+1)} = P^{(k)} w^{(k)}$. Consequently, the parameter converges to $w^{(\infty)} = \pi (\mathbf{1}^\top w^{(0)})$, or $w_i^{(\infty)} = \pi_i n$ at node $i$. Thus each node can recover the average of the initial vectors by computing the de-biased ratio $y_i^{(\infty)} / w_i^{(\infty)}$. In practice, we stop after a finite number of gossip iterations $K$ and compute $y_i^{(K)} / w_i^{(K)}$. The distance of the de-biased ratio to the exact average can be quantified in terms of properties of the matrices $\{P^{(k)}\}_{k=0}^{K}$. Let $N_i^{\text{out}(k)}$ and $N_i^{\text{in}(k)}$ denote the sets of nodes that $i$ transmits to and receives from, respectively, at iteration $k$. If we use $B$ bytes to represent the vector $y_i^{(k)}$, then node $i$ sends and receives $|N_i^{\text{out}(k)}| B$ and $|N_i^{\text{in}(k)}| B$ bytes, respectively, per iteration. In our experiments we use graph sequences with $|N_i^{\text{out}(k)}| = |N_i^{\text{in}(k)}| = 1$ or $2$, and find that approximate averaging is both fast and still facilitates training.
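A small simulation makes the role of the PushSum weight concrete. The sketch below assumes a fixed directed graph (a ring in which node 0 has one extra out-edge) and uniform column weights; both are hypothetical choices for illustration. The resulting matrix is column stochastic but not row stochastic, so the raw iterates converge to a biased limit while the de-biased ratio recovers the average.

```python
import numpy as np

def column_stochastic_matrix(out_neighbors):
    """Column i splits node i's mass equally among itself and its out-neighbours."""
    n = len(out_neighbors)
    P = np.zeros((n, n))
    for i, outs in enumerate(out_neighbors):
        targets = [i] + list(outs)
        for j in targets:
            P[j, i] = 1.0 / len(targets)
    return P

def push_sum(P, Y0, num_iters):
    """Run the same linear iteration on the values Y and on the scalar weights w."""
    Y = Y0.copy()
    w = np.ones(len(Y0))
    for _ in range(num_iters):
        Y = P @ Y
        w = P @ w
    return Y, w

n, d = 8, 3
out_neighbors = [[(i + 1) % n] for i in range(n)]  # directed ring
out_neighbors[0].append(n // 2)                    # node 0 also pushes to node n/2
P = column_stochastic_matrix(out_neighbors)

rng = np.random.default_rng(0)
Y0 = rng.normal(loc=1.0, size=(n, d))
target = Y0.mean(axis=0)

Y, w = push_sum(P, Y0, num_iters=1000)
Z = Y / w[:, None]  # de-biased ratio at every node

print(np.abs(Y - target).max())  # biased: raw values do not agree with the average
print(np.abs(Z - target).max())  # de-biased: close to the average
```

Note that no node needs to know the limit vector of the chain: the weight picks up exactly the same bias as the values, and the bias cancels in the ratio.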

3 Stochastic Gradient Push

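Require: Initialize $\gamma > 0$, $x_i^{(0)} = z_i^{(0)} \in \mathbb{R}^d$, and $w_i^{(0)} = 1$ for all nodes $i \in \{1, \dots, n\}$
1: for $k = 0, 1, 2, \dots, K$ do at node $i$
2:     Sample new mini-batch $\xi_i^{(k)} \sim D_i$ from local distribution
3:     Compute a local stochastic mini-batch gradient at $z_i^{(k)}$: $\nabla F_i(z_i^{(k)}; \xi_i^{(k)})$
4:     $x_i^{(k+1/2)} = x_i^{(k)} - \gamma \nabla F_i(z_i^{(k)}; \xi_i^{(k)})$
5:     Send $(p_{j,i}^{(k)} x_i^{(k+1/2)}, \; p_{j,i}^{(k)} w_i^{(k)})$ to out-neighbors $j \in N_i^{\text{out}(k)}$; receive $(p_{i,j}^{(k)} x_j^{(k+1/2)}, \; p_{i,j}^{(k)} w_j^{(k)})$ from in-neighbors $j \in N_i^{\text{in}(k)}$
6:     $x_i^{(k+1)} = \sum_{j} p_{i,j}^{(k)} x_j^{(k+1/2)}$
7:     $w_i^{(k+1)} = \sum_{j} p_{i,j}^{(k)} w_j^{(k)}$
8:     $z_i^{(k+1)} = x_i^{(k+1)} / w_i^{(k+1)}$
9: end for
Algorithm 1 Stochastic Gradient Push (SGP)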
Algorithm description.

The stochastic gradient push (SGP) method for solving equation 1 is obtained by interleaving one local stochastic gradient descent update at each node with one iteration of PushSum. Each node maintains three variables: the model parameters $x_i^{(k)}$ at node $i$, the scalar PushSum weight $w_i^{(k)}$, and the de-biased parameters $z_i^{(k)} = x_i^{(k)} / w_i^{(k)}$. The initial $x_i^{(0)}$ and $z_i^{(0)}$ can be set to any arbitrary value as long as $x_i^{(0)} = z_i^{(0)}$. Pseudocode is shown in Alg. 1. Each node performs a local SGD step (lines 2–4) followed by one step of PushSum for approximate distributed averaging (lines 5–8).

Note that the gradients are evaluated at the de-biased parameters $z_i^{(k)}$ in line 3, and they are then used to update $x_i^{(k)}$, the PushSum numerator, in line 4. All communication takes place in line 5, and each message contains two parts, the PushSum numerator and denominator. In particular, node $i$ controls the values $p_{j,i}^{(k)}$ used to weight the values in messages it sends.

We are mainly interested in the case where the mixing matrices $P^{(k)}$ are sparse in order to have low communication overhead. However, we point out that when the nodes' initial values are identical, $z_i^{(0)} = z_j^{(0)}$ for all $i, j$, and every entry of $P^{(k)}$ is equal to $1/n$, then SGP is mathematically equivalent to parallel SGD using AllReduce. Please refer to appendix A for practical implementation details, including how we design mixing matrices $P^{(k)}$.
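The interleaving of a local gradient step with one PushSum step can be sketched as a single-process simulation. Everything below is an illustrative assumption rather than the authors' implementation: the toy local objectives f_i(z) = 0.5 * ||z - c_i||^2, the Gaussian gradient noise, the fixed mixing matrices, and the helper names. The second run uses a mixing matrix with every entry equal to 1/n and identical initial values, and compares the result with plain gradient descent on the averaged gradient.

```python
import numpy as np

def column_stochastic_matrix(out_neighbors):
    """Column i splits node i's mass equally among itself and its out-neighbours."""
    n = len(out_neighbors)
    P = np.zeros((n, n))
    for i, outs in enumerate(out_neighbors):
        targets = [i] + list(outs)
        for j in targets:
            P[j, i] = 1.0 / len(targets)
    return P

def sgp(P, centers, gamma, num_iters, noise_std=0.0, seed=0):
    """Simulate SGP on f_i(z) = 0.5 * ||z - c_i||^2 with a fixed mixing matrix P."""
    rng = np.random.default_rng(seed)
    n, d = centers.shape
    x = np.zeros((n, d))  # PushSum numerators (model parameters)
    w = np.ones(n)        # PushSum weights
    z = x / w[:, None]    # de-biased parameters
    for _ in range(num_iters):
        # Local stochastic gradient, evaluated at the de-biased parameters.
        grad = (z - centers) + noise_std * rng.normal(size=(n, d))
        x_half = x - gamma * grad  # local SGD step on the numerator
        x = P @ x_half             # one PushSum step on the numerators
        w = P @ w                  # the same step on the weights
        z = x / w[:, None]         # de-bias
    return z

n, d = 8, 2
centers = np.random.default_rng(1).normal(loc=3.0, size=(n, d))
optimum = centers.mean(axis=0)  # minimizer of the average objective

# Sparse, directed, column-stochastic communication.
out_neighbors = [[(i + 1) % n] for i in range(n)]
out_neighbors[0].append(n // 2)
P_sparse = column_stochastic_matrix(out_neighbors)
z_sparse = sgp(P_sparse, centers, gamma=0.001, num_iters=20000, noise_std=0.1)

# Dense averaging: every entry of P equals 1/n, identical initial values, no noise.
P_full = np.full((n, n), 1.0 / n)
z_full = sgp(P_full, centers, gamma=0.1, num_iters=50)

# Reference: gradient descent on the averaged gradient (what AllReduce SGD would do).
x_ref = np.zeros(d)
for _ in range(50):
    x_ref = x_ref - 0.1 * (x_ref - optimum)

print(np.abs(z_sparse - optimum).max())
print(np.allclose(z_full, x_ref))
```

In the sparse run every node ends close to the common minimizer even though each node only ever sees its own objective; with a constant step size a small residual disagreement between nodes is expected.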

Theoretical guarantees.

SGP was first proposed and analyzed in Nedic2016stochastic assuming the local objectives are strongly convex. Here we provide convergence results in the more general setting of smooth, non-convex objectives. We make the following three assumptions:

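1. ($L$-smooth) There exists a constant $L > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$, or equivalently

$$f_i(x) \le f_i(y) + \nabla f_i(y)^\top (x - y) + \frac{L}{2} \|x - y\|^2.$$

Note that this assumption implies that function $f(x)$ is also $L$-smooth.

2. (Bounded variance) There exist finite positive constants $\sigma^2$ and $\zeta^2$ such that

$$\mathbb{E}_{\xi \sim D_i} \|\nabla F_i(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2 \quad \forall i, \forall x, \qquad \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2 \quad \forall x.$$

Thus $\sigma^2$ bounds the variance of stochastic gradients at each node, and $\zeta^2$ quantifies the similarity of data distributions at different nodes.

3. (Mixing connectivity) To each mixing matrix $P^{(k)}$ we can associate a graph with vertex set $\{1, \dots, n\}$ and edge set $E^{(k)} = \{(i, j) : p_{i,j}^{(k)} > 0\}$; i.e., with edges $(i, j)$ from $j$ to $i$ if $i$ receives a message from $j$ at iteration $k$. Assume that the graph with edge set $\bigcup_{k = lB}^{(l+1)B - 1} E^{(k)}$ is strongly connected and has diameter at most $\Delta$ for every $l \ge 0$. To simplify the discussion, we assume that every column of the mixing matrices $P^{(k)}$ has at most $D$ non-zero entries.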
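The mixing-connectivity assumption can be checked numerically for a given graph sequence. The sketch below uses a hypothetical time-varying one-peer graph in which, at iteration k, node i sends to node (i + 2^(k mod log2 n)) mod n; this sequence is an assumption chosen for illustration and is not taken from the text above. A single graph in the sequence need not be strongly connected, but the union over a window of log2(n) consecutive iterations is.

```python
import numpy as np

def one_peer_matrix(n, k):
    """Hypothetical one-peer graph: node i keeps half and sends half to (i + 2**(k % log2 n)) % n."""
    num_hops = n.bit_length() - 1  # log2(n) for n a power of two
    hop = 2 ** (k % num_hops)
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5
        P[(i + hop) % n, i] = 0.5
    return P

def union_edges(matrices):
    """adj[i, j] is True if node i receives from node j in at least one of the matrices."""
    return np.any([P > 0 for P in matrices], axis=0)

def is_strongly_connected(adj):
    """Check that every node can reach every other node along directed edges."""
    n = adj.shape[0]
    reach = np.eye(n, dtype=bool) | adj
    for _ in range(n):  # each round at least doubles the path length covered
        step = reach.astype(int) @ reach.astype(int)
        reach = reach | (step > 0)
    return bool(reach.all())

n, window = 8, 3
single = one_peer_matrix(n, 1) > 0  # hop of 2: even and odd nodes never mix
windows_ok = all(
    is_strongly_connected(union_edges([one_peer_matrix(n, k) for k in range(s, s + window)]))
    for s in range(6)
)
max_column_nonzeros = int((one_peer_matrix(n, 0) > 0).sum(axis=0).max())

print(is_strongly_connected(single), windows_ok, max_column_nonzeros)
```

Here each column has two non-zero entries, so every node sends a single message per iteration while the window-level graph still lets information from any node reach all others.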

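Let $\overline{x}^{(k)} = \frac{1}{n} \sum_{i=1}^{n} x_i^{(k)}$. Under similar assumptions, lian2017can define that a decentralized algorithm for solving equation 1 converges if, for any $\epsilon > 0$, it eventually satisfies

$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E} \left\| \nabla f(\overline{x}^{(k)}) \right\|^2 \le \epsilon. \tag{5}$$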

Our first result shows that SGP converges in this sense.

Theorem 1.

Suppose that Assumptions 1–3 hold, and run SGP for iterations with step-size . Let and assume that . There exist constants and which depend on , , and such that if the total number of iterations satisfies


where and , then

The proof is given in Appendix C, where we also provide precise expressions for the constants and . The proof of Theorem 1 builds on an approach developed in lian2017can . Theorem 1 shows that, for a given number of nodes , by running a sufficiently large number of iterations (roughly speaking, , which is reasonable for distributed training of DNNs) and choosing the step-size as prescribed, then the criterion equation 5 is satisfied with a number of iterations . That is, we achieve a linear speedup in the number of nodes.
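As a rough sanity check on the linear-speedup claim, suppose (as an assumption for illustration, not the precise statement of Theorem 1) that the left-hand side of the convergence criterion is bounded by $c/\sqrt{nK}$ for some constant $c$ that does not depend on the number of nodes $n$ or the number of iterations $K$. Solving for the number of iterations needed to reach accuracy $\epsilon$ gives:

```latex
\frac{c}{\sqrt{nK}} \le \epsilon
\quad\Longleftrightarrow\quad
K \ge \frac{c^{2}}{n\,\epsilon^{2}}
```

Under this assumed form, doubling the number of nodes halves the number of iterations required for a fixed target accuracy, which is what a linear speedup in the number of nodes means.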

Theorem 1 shows that the average of the nodes' parameters, $\overline{x}^{(k)}$, converges, but it does not directly say anything about the parameters at each node. In fact, we can show a stronger result.

Theorem 2.

Under the same assumptions as in Theorem 1,


The proof is also given in Appendix C. This result shows that as grows, the de-biased variables converge to the node-wise average , and hence the de-biased variables at each node also converge to a stationary point. Note that for fixed and large , the term will dominate the other factors.

4 Related Work

A variety of approaches have been proposed to accelerate distributed training of DNNs, including quantizing gradients Alistarh2017qsgd ; Wen2017terngrad and performing multiple local SGD steps at each node before averaging McMahan2017federated . These approaches are complementary to the tradeoff we consider in this paper, between exact and approximate distributed averaging. Similar to using PushSum for averaging, both quantizing gradients and performing multiple local SGD steps before averaging can also be seen as injecting additional noise into SGD, leading to a trade off between training faster (by reducing communication overhead) and potentially obtaining a less accurate result. Combining these approaches (quantized, inexact, and infrequent averaging) is an interesting direction for future work.

For the remainder of this section we review related work applying consensus-based approaches to large-scale training of DNNs. Blot2016gossip report initial experimental results on small-scale experiments with an SGP-like algorithm. Jin2016how make a theoretical connection between PushSum-based methods and Elastic Averaging SGD Zhang2015elasticsgd . Relative to those previous works, we provide the first convergence analysis for a PushSum-based method in the smooth non-convex case. lian2017can and Jiang2017collaborative study synchronous consensus-based versions of SGD. However, unlike PushSum, those methods involve symmetric message passing (if $i$ sends to $j$ at iteration $k$, then $j$ also sends to $i$ before both nodes update), which is inherently blocking. Consequently, these methods are more sensitive to high-latency communication settings, and each node generally must communicate more per iteration, in comparison to PushSum-based SGP where communication may be directed ($i$ can send to $j$ without needing a response from $j$). The decentralized parallel SGD (D-PSGD) method proposed in lian2017can produces iterates whose node-wise average, $\overline{x}^{(k)}$, is shown to converge in the sense of equation 5. Our proof of Theorem 1, showing the convergence of SGP in the same sense, adapts some ideas from their analysis and also goes beyond to show that, since the values at each node converge to the average, the individual values at each node also converge to a stationary point. We compare SGP with D-PSGD experimentally in Section 5 below and find that although the two methods find solutions of comparable accuracy, SGP is consistently faster.

Jin2016how and Lian2018asynchronous study asynchronous consensus-based methods for training DNNs. Lian2018asynchronous analyzes an asynchronous version of D-PSGD and proves that its node-wise averages also converge to a stationary point. In general, these contributions focusing on asynchrony can be seen as orthogonal to the use of a PushSum based protocol for consensus averaging.

5 Experiments

Next, we compare SGP with AllReduce SGD, and D-PSGD lian2017can , an approximate distributed averaging baseline relying on doubly-stochastic gossip. We run experiments on a large-scale distributed computing environment using up to 256 GPUs. Our results show that when communication is the bottleneck, SGP is faster than both SGD and D-PSGD. SGP also outperforms D-PSGD in terms of validation accuracy, while achieving a slightly worse accuracy compared to SGD when using a large number of compute nodes. Our results also highlight that, in a setting where communication is efficient (e.g., over InfiniBand), doing exact averaging through AllReduce SGD remains a competitive approach.

We run experiments on 32 DGX-1 GPU servers in a high-performance computing cluster. Each server contains 8 NVIDIA Volta-V100 GPUs. We consider two communication scenarios: in the high-latency scenario the nodes communicate over a 10 Gbit/s Ethernet network, and in the low-latency scenario the nodes communicate over 100 Gbit/s InfiniBand, which supports GPUDirect RDMA communications. To investigate how each algorithm scales, we run experiments with 4, 8, 16, and 32 nodes (i.e., 32, 64, 128, and 256 GPUs).

We adopt the 1000-way ImageNet classification task russakovsky2015imagenet as our experimental benchmark. We train a ResNet-50 he2016deep following the experimental protocol of goyal2017accurate , using the same hyperparameters with the exception of the learning rate schedule in the node experiment for SGP and D-PSGD. In the experiments, we also modify SGP to use Nesterov momentum. In our default implementation of SGP, each node sends to and receives from one other node at each iteration, and this destination changes from one iteration to the next. Please refer to appendix A for more information about our implementation, including how we design and implement the sequence of mixing matrices $P^{(k)}$.

All algorithms are implemented in PyTorch v0.5 paszkepytorch . To leverage the highly efficient NVLink interconnect within each server, we treat each DGX-1 as one node in all of our experiments. In our implementation of SGP, each node computes a local mini-batch gradient in parallel on all eight GPUs and aggregates it with a local AllReduce, which is efficiently implemented via the NVIDIA Collective Communications Library. Then inter-node averaging is accomplished using PushSum either over Ethernet or InfiniBand. In the low-latency experiments, we leverage GPUDirect to directly send/receive messages between GPUs on different nodes and avoid transferring the model back to host memory. In the high-latency experiments this is not possible, so the model is transferred to host memory after the local AllReduce, and then PushSum messages are sent over Ethernet.

5.1 Evaluation on High-Latency Interconnect

Figure 4: Results on 10 Gbit/s Ethernet. (a) Validation curve: validation performance w.r.t. training time (in seconds) for models trained on 4 and 32 nodes. (b) Time per iteration: average time per training iteration (in seconds). (c) Validation accuracy: best validation accuracy. Stochastic Gradient Push (SGP) is faster than both Decentralized-Parallel SGD (D-PSGD) and AllReduce SGD while decreasing validation accuracy by 1.2%.

We consider the high-latency scenario where nodes communicate over 10Gbit/s Ethernet. With a local mini-batch size of 256 samples per node (32 samples per GPU), a single Volta DGX-1 server can perform roughly mini-batches per second. Since the ResNet-50 model size is roughly 100MBytes, transmitting one copy of the model per iteration requires  Gbit/s. Thus, in the high-latency scenario, if a single 10 Gbit/s link must carry the traffic between more than two pairs of nodes, communication clearly becomes a bottleneck.
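The back-of-the-envelope argument above can be written out as a short calculation. The model size (100 MB) and link speed (10 Gbit/s) come from the text; the iteration rate `HYPOTHETICAL_ITERATIONS_PER_SECOND` is a placeholder chosen for illustration and is not a measured value.

```python
def required_gbit_per_s(model_megabytes, iterations_per_second):
    """Bandwidth needed to ship one copy of the model per iteration."""
    bits_per_iteration = model_megabytes * 1e6 * 8
    return bits_per_iteration * iterations_per_second / 1e9

def max_flows_on_link(link_gbit_per_s, flow_gbit_per_s):
    """Number of such model streams a single link can carry without saturating."""
    return int(link_gbit_per_s // flow_gbit_per_s)

HYPOTHETICAL_ITERATIONS_PER_SECOND = 4.5  # placeholder, not a measurement

flow = required_gbit_per_s(100, HYPOTHETICAL_ITERATIONS_PER_SECOND)
flows = max_flows_on_link(10, flow)
print(flow, flows)
```

With this placeholder rate, one model stream needs several Gbit/s, so a 10 Gbit/s link saturates as soon as it has to carry more than a couple of node pairs at once.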

Comparison with synchronous approaches.

We first compare SGP with other synchronous and decentralized approaches. Figure 4 (a) shows the validation curves when training on 4 and 32 nodes (additional training and validation curves for all the training runs can be found in B.1). Note that when we increase the number of nodes , we also decrease the total number of iterations to following Theorem 1 (see Figure B.9). For any number of nodes used in our experiments, we observe that SGP consistently outperforms D-PSGD and AllReduce SGD in terms of total training time in this scenario. In particular for 32 nodes, SGP training time takes less than hours while D-PSGD and AllReduce SGD require roughly and hours. Appendix B.2 provides experimental evidence that all nodes converge to models with a similar training and validation accuracy when using SGP.

Figure 4 (b) shows the average time per iteration for the different training runs. As we increase the number of nodes, the average iteration time stays almost constant for SGP and D-PSGD, while we observe a significant time-increase in the case of AllReduce SGD, resulting in an overall slower training time. Moreover, although D-PSGD and SGP both exhibit strong scaling, SGP is roughly 200ms faster per iteration, supporting the claim that it involves less communication overhead.

Figure 4 (c) reports the best validation accuracy for the different training runs. While they all start around the same value, the accuracy of D-PSGD and SGP decreases as we increase the number of nodes. In the case of SGP, we see its performance decrease by 1.2% relative to SGD on 32 nodes. We hypothesize that this decrease is due to the noise introduced by approximate distributed averaging. We will see below that changing the connectivity between the nodes can ameliorate this issue. We also note that the SGP validation accuracy is better than that of D-PSGD for larger networks.

Comparison with asynchronous approach.

The results in Tables 1 and 2 provide a comparison between the aforementioned synchronous methods and AD-PSGD Lian2018asynchronous , a state-of-the-art asynchronous method. AD-PSGD is an asynchronous implementation of D-PSGD, which relies on doubly-stochastic averaging. All methods are trained for exactly 90 epochs; therefore, the time per iteration is a direct reflection of the total training time. Training using AD-PSGD does not degrade the accuracy (relative to D-PSGD), and provides substantial speedups in training time. Relative to SGP, the AD-PSGD method runs slightly faster at the expense of lower validation accuracy (except in the 32-node case). In general, we emphasize that this asynchronous line of work is orthogonal, and that by combining the two approaches (leveraging the PushSum protocol in an asynchronous manner), one can expect to further speed up SGP. We leave this as a promising line of investigation for future work.

4 nodes 8 nodes 16 nodes 32 nodes
AllReduce SGD
Table 1: Top-1 Validation accuracy (%) over 10Gbps Ethernet showcasing an additional comparison with the AD-PSGD asynchronous doubly-stochastic approach.
4 nodes 8 nodes 16 nodes 32 nodes
AllReduce SGD
Table 2: Average time per iteration (seconds) over 10Gbps Ethernet showcasing an additional comparison with the AD-PSGD asynchronous doubly-stochastic approach. The average time per iteration for the asynchronous method is calculated by dividing the average time per epoch by the total number of iterations per epoch.
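The relationship between time per iteration, time per epoch, and total training time used in these tables can be sketched as follows. The 256-sample local batch and the 90 epochs come from the text; the ImageNet training-set size is that of ILSVRC-2012, and the 0.4 s per iteration and 62.8 s per epoch used in the usage lines are hypothetical values for illustration only. Rounding the last partial batch up is also an assumption.

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size

def iterations_per_epoch(num_nodes, batch_per_node=256):
    """Iterations needed for one pass over the data with a global batch of num_nodes * batch_per_node."""
    return math.ceil(IMAGENET_TRAIN_IMAGES / (num_nodes * batch_per_node))

def total_hours(num_nodes, seconds_per_iteration, epochs=90):
    """Total training time implied by a fixed number of epochs and an average time per iteration."""
    return epochs * iterations_per_epoch(num_nodes) * seconds_per_iteration / 3600

def async_time_per_iteration(seconds_per_epoch, num_nodes):
    """Average time per iteration for an asynchronous run, derived from the time per epoch."""
    return seconds_per_epoch / iterations_per_epoch(num_nodes)

print(iterations_per_epoch(4), iterations_per_epoch(32))
print(total_hours(32, seconds_per_iteration=0.4))  # hypothetical 0.4 s per iteration
print(async_time_per_iteration(62.8, 32))          # hypothetical 62.8 s per epoch
```

Because the number of epochs is fixed, the number of iterations shrinks as nodes are added, so a method whose time per iteration stays flat finishes proportionally sooner.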

5.2 Evaluation on a “Low Latency” Interconnect

Figure 8: Results on 100 Gbit/s InfiniBand. (a) Validation curve: validation performance w.r.t. training time (in seconds) for models trained on 32 nodes. (b) Time per iteration: average time per training iteration (in seconds). (c) Validation accuracy: best validation accuracy. Stochastic Gradient Push (SGP) is on par with, and sometimes even slightly faster than, AllReduce SGD on a “low latency” network, while slightly degrading the accuracy.

We now investigate the behavior of SGP and AllReduce SGD over 100 Gbit/s InfiniBand, following the same experimental protocol as in the 10 Gbit/s Ethernet case. In this scenario, which is not communication-bound for a ResNet-50 model, we do not expect SGP to outperform AllReduce SGD. Our goal is to illustrate that SGP is not significantly slower than AllReduce SGD.

On this low-latency interconnect, SGD and SGP obtain similar timings and differ by at most ms per iteration (Figure 8 (b) for nodes). In particular, using 32 nodes, SGP trains a ResNet-50 on ImageNet in hours and SGD in hours. SGD, however, exhibits better validation accuracy for large networks. Communication on InfiniBand is not a bottleneck for models the size of ResNet-50. These results therefore confirm that the benefits of SGP are more prominent in high-latency/low-bandwidth, communication-bound scenarios. Although the timings are similar, SGP still shows better scaling in terms of sample throughput than AllReduce SGD (see Figure B.17).

For experiments running at this speed (less than 0.31 seconds per iteration), timing could be impacted by other factors such as data loading. To better isolate the effects of data loading, we run additional experiments on 32, 64, and 128 GPUs where we first copy the data locally onto every node; see Appendix B.3 for more details. In that setting, the time per iteration of SGP remains approximately constant as we increase the number of nodes in the network, while the time for AllReduce SGD increases with more nodes.

5.3 Impact of Graph Topology

Figure 11: (a) Train, (b) Valid. Comparison of SGP using a communication graph with 1-neighbor, SGP using a graph with 2-neighbors, D-PSGD and SGD on 32 nodes communicating over 10 Gbit/s Ethernet. Using one additional neighbor improves the validation performance of SGP (from to ) while retaining most of the computational benefits.

Next we investigate the impact of the communication graph topology on the SGP validation performance using Ethernet 10Gbit/s. In the limit of a fully-connected communication graph, SGD and SGP are strictly equivalent (see section 3). By increasing the number of neighbors in the graph, we expect the accuracy of SGP to improve (approximate averages are more accurate) but the communication time required for training will increase.
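The trade-off between the number of neighbors and the quality of the approximate average can be illustrated with a small PushSum simulation. The graph family below (node i sends equal shares to itself and to its next `peers` successors on a ring) is a hypothetical choice for illustration and is not the topology used in the experiments.

```python
import numpy as np

def k_peer_matrix(n, peers):
    """Column-stochastic matrix: node i sends equal shares to itself and to `peers` ring successors."""
    P = np.zeros((n, n))
    for i in range(n):
        for hop in range(peers + 1):
            P[(i + hop) % n, i] = 1.0 / (peers + 1)
    return P

def averaging_error(P, Y0, num_iters):
    """Distance between the de-biased PushSum values and the exact average after num_iters steps."""
    Y, w = Y0.copy(), np.ones(len(Y0))
    for _ in range(num_iters):
        Y, w = P @ Y, P @ w
    Z = Y / w[:, None]
    return np.linalg.norm(Z - Y0.mean(axis=0))

n = 32
Y0 = np.random.default_rng(0).normal(size=(n, 4))

err_one_peer = averaging_error(k_peer_matrix(n, 1), Y0, num_iters=50)
err_two_peers = averaging_error(k_peer_matrix(n, 2), Y0, num_iters=50)
err_full = averaging_error(k_peer_matrix(n, n - 1), Y0, num_iters=1)

print(err_one_peer, err_two_peers, err_full)
```

In this toy setting, two peers give a noticeably better average than one peer after the same number of iterations, and the fully-connected graph recovers the exact average in a single step, at the cost of every node sending to every other node.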

In Figure 11, we compare the training and validation accuracies of SGP using a communication graph with 1-neighbor and 2-neighbors with D-PSGD and SGD on 32 nodes. By increasing the number of neighbors to two, SGP achieves better training/validation accuracy (from / to /) and gets closer to the final validation accuracy achieved by SGD (/). Increasing the number of neighbors also increases the communication, and hence the overall training time. SGP with 2 neighbors completes training in hours and its average time per iteration increases by relative to SGP with one neighbor. Nevertheless, SGP with 2 neighbors is still faster than SGD and D-PSGD, while achieving better accuracy than SGP with 1 neighbor.

6 Conclusion

DNN training often necessitates non-trivial computational requirements, leveraging distributed computing resources. Traditional parallel versions of SGD use exact averaging algorithms to parallelize the computation between nodes, and induce additional parallelization overhead as the model and network sizes grow. This paper proposes the use of Stochastic Gradient Push for distributed deep learning. The proposed method computes inexact averages at each iteration in order to improve scaling efficiency and reduce the dependency on the underlying network topology. SGP converges to a stationary point at an $\mathcal{O}(1/\sqrt{nK})$ rate in the smooth and non-convex case, and provably achieves a linear speedup (in iterations) with respect to the number of nodes. Empirical results show that SGP can be up to $3\times$ faster than traditional AllReduce SGD over a high-latency interconnect, matches the top-1 validation accuracy up to 8 nodes (64 GPUs), and remains within 1.2% of the top-1 validation accuracy for larger networks.


We thank Shubho Sengupta, Teng Li, and Ailing Zhang for useful discussions and technical support, as well as for maintaining the computing infrastructure used to conduct these experiments.


  • (1) D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
  • (2) M. Assran and M. Rabbat. Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950, 2018.
  • (3) M. Blot, D. Picard, M. Cord, and N. Thome. Gossip training for deep learning. In NIPS Workshop on Optimization for Machine Learning, 2016.
  • (4) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
  • (5) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • (6) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • (7) Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pages 5904–5914, 2017.
  • (8) P. H. Jin, Q. Yuan, F. Iandola, and K. Keutzer. How to scale distributed deep learning? In NIPS ML Systems Workshop, 2016.
  • (9) D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 482–491. IEEE, 2003.
  • (10) M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
  • (11) X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  • (12) X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pages 3049–3058, 2018.
  • (13) D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
  • (14) H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.
  • (15) A. Nedić and A. Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Trans. Automatic Control, 61(12):3936–3947, 2016.
  • (16) A. Nedić, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
  • (17) A. Paszke, S. Chintala, R. Collobert, K. Kavukcuoglu, C. Farabet, S. Bengio, I. Melvin, J. Weston, and J. Mariethoz. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, May 2017.
  • (18) R. Rabenseifner. Optimization of collective reduction operations. In Proc. Intl. Conf. Computational Science, Krakow, Poland, Jun. 2004.
  • (19) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • (20) E. Seneta. Non-negative Matrices and Markov Chains. Springer, 1981.
  • (21) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • (22) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
  • (23) J. Wolfowitz. Products of indecomposable, aperiodic, stochastic matrices. Proceedings of the American Mathematical Society, 14(5):733–737, 1963.
  • (24) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • (25) S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaged SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.

Appendix A Implementation Details

(a) Directed exponential graph highlighting node 0’s out-neighbours
Figure A.2: Example of an 8-node exponential graph used in experiments

A.1 Communication Topology

Directed exponential graph.

For the SGP experiments we use a time-varying directed graph to represent the inter-node connectivity. Thinking of the nodes as being ordered sequentially according to their rank, $0, \ldots, n-1$ (we use indices $0, \ldots, n-1$ rather than $1, \ldots, n$ only in this section, to simplify the discussion), each node periodically communicates with peers that are $2^0, 2^1, \ldots, 2^{\lfloor \log_2(n-1) \rfloor}$ hops away. Fig. A.2 shows an example of a directed 8-node exponential graph. Node $0$'s $2^0$-hop neighbour is node $1$, node $0$'s $2^1$-hop neighbour is node $2$, and node $0$'s $2^2$-hop neighbour is node $4$.

In the one-peer-per-node experiments, each node cycles through these peers, transmitting to only a single peer from this list at each iteration. E.g., at iteration $k$, all nodes transmit messages to their $2^0$-hop neighbours, at iteration $k+1$ all nodes transmit messages to their $2^1$-hop neighbours, and so on, eventually returning to the beginning of the list before cycling through the peers again. This procedure ensures that each node only sends and receives a single message at each iteration. By using full-duplex communication, sending and receiving can happen in parallel.

In the two-peer-per-node experiments, each node cycles through the same set of peers, transmitting to two peers from the list at each iteration. E.g., at iteration $k$, all nodes transmit messages to their $2^0$-hop and $2^1$-hop neighbours, at iteration $k+1$ all nodes transmit messages to their $2^1$-hop and $2^2$-hop neighbours, and so on, eventually returning to the beginning of the list before cycling through the peers again. Similarly, at each iteration, each node also receives, in a full-duplex manner, two messages from some peers that are unknown to the receiving node ahead of time, thereby performing the send and receive operations in parallel.
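The peer-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `hop_distances`, `out_peers`, and `in_degree` are hypothetical, and the sliding-window order used in the two-peer case is an assumption about how the list of peers is traversed.

```python
def hop_distances(n):
    """Hop distances 2^0, 2^1, ..., 2^floor(log2(n-1)) for an n-node graph."""
    num_hops = (n - 1).bit_length()  # equals floor(log2(n-1)) + 1 for n >= 2
    return [2 ** e for e in range(num_hops)]


def out_peers(rank, iteration, n, peers_per_node=1):
    """Ranks that `rank` sends to at `iteration` (assumed cycling order)."""
    hops = hop_distances(n)
    return [(rank + hops[(iteration + p) % len(hops)]) % n
            for p in range(peers_per_node)]


def in_degree(iteration, n, peers_per_node=1):
    """Number of messages each node receives at `iteration`."""
    counts = [0] * n
    for rank in range(n):
        for peer in out_peers(rank, iteration, n, peers_per_node):
            counts[peer] += 1
    return counts


# Node 0 in an 8-node graph cycles through peers 1, 2, 4 and then wraps around.
print([out_peers(0, k, 8) for k in range(4)])
```

Because every node applies the same hop offset at a given iteration, `in_degree` returns the same count for every rank, which is the load-balancing property the text relies on.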

Definition of $P^{(k)}$.

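Based on the description above, in the one-peer-per-node experiments, each node sends to one neighbor at every iteration, and so each column of $P^{(k)}$ has exactly two non-zero entries, both of which are equal to $1/2$. The diagonal entries are $p_{i,i}^{(k)} = 1/2$ for all $i$ and $k$. At time step $k$, each node sends to a neighbor that is $2^{k \bmod (\lfloor \log_2(n-1) \rfloor + 1)}$ hops away. Thus, with $h_k = 2^{k \bmod (\lfloor \log_2(n-1) \rfloor + 1)}$, we get that

$$p_{i,j}^{(k)} = \begin{cases} 1/2, & \text{if } i = j \text{ or } i = (j + h_k) \bmod n, \\ 0, & \text{otherwise.} \end{cases}$$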

Note that, with this design, in fact each node sends to one peer and receives from one peer at every iteration, so the communication load is balanced across the network.

In the two-peer-per-node experiments, the definition is similar, but now there will be three non-zero entries in each column of $P^{(k)}$, all of which will be equal to $1/3$; these are the diagonal, and the entries corresponding to the two neighbors to which the node sends at that iteration. In addition, each node will send two messages and receive two messages at every iteration, so the communication load is again balanced across the network.
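As a rough illustration of the mixing matrices described above, the following sketch builds one such matrix with exact rational arithmetic. The function names and the hop schedule (hop $2^{(k+p) \bmod \text{num\_hops}}$ for the $p$-th peer) are assumptions made for this example, not details confirmed by the paper.

```python
from fractions import Fraction


def mixing_matrix(n, k, peers_per_node=1):
    """Column-stochastic mixing matrix for iteration k, as a list of rows.

    Entry P[i][j] is the weight node j applies to the message it sends to
    node i. The hop schedule below is an assumed cycling order.
    """
    num_hops = (n - 1).bit_length()           # floor(log2(n-1)) + 1
    weight = Fraction(1, peers_per_node + 1)  # 1/2 for one peer, 1/3 for two
    P = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):                        # column j describes sender j
        P[j][j] += weight                     # weight kept by the sender
        for p in range(peers_per_node):
            hop = 2 ** ((k + p) % num_hops)
            P[(j + hop) % n][j] += weight     # weight sent to the peer
    return P


def column_sums(P):
    return [sum(row[j] for row in P) for j in range(len(P))]


def row_sums(P):
    return [sum(row) for row in P]


P = mixing_matrix(8, 0)
print(column_sums(P) == [1] * 8, row_sums(P) == [1] * 8)
```

Under this schedule the columns sum to one (column-stochasticity) and so do the rows, reflecting that every node also receives the same number of messages.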

Undirected exponential graph.

For the D-PSGD experiments we use a time-varying undirected bipartite exponential graph to represent the inter-node connectivity. Odd-numbered nodes send messages to peers that are $2^1 - 1, 2^2 - 1, \ldots, 2^{\lfloor \log_2(n-1) \rfloor} - 1$ hops away (even-numbered nodes), and wait to receive a message back in return. Each odd-numbered node cycles through the peers in the list in a similar fashion to the one-peer-per-node SGP experiments. Even-numbered nodes wait to receive a message from some peer (unknown to the receiving node ahead of time), and send a message back in return.

We adopt these graphs to be consistent with the experimental setup used in [11] and [12].

Note also that these graphs are all regular, in that all nodes have the same number of in-coming and out-going connections.

Decentralized averaging errors.

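To further motivate our choice of using the directed exponential graph with SGP, let us forget about optimization for a moment and focus on the problem of distributed averaging, described in Section 2, using the PushSum algorithm. Recall that each node $i$ starts with a vector $y_i^{(0)}$, and the goal of the agents is to compute the average $\bar{y} = \frac{1}{n} \sum_{i} y_i^{(0)}$. Then, since $y_i^{(k+1)} = \sum_{j} p_{i,j}^{(k)} y_j^{(k)}$, after $k$ steps we have

$$Y^{(k)} = P^{(k-1)} P^{(k-2)} \cdots P^{(1)} P^{(0)} Y^{(0)},$$

where $Y^{(k)}$ is an $n \times d$ matrix with $y_i^{(k)}$ as its $i$th row.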


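Let $P^{(k-1:0)} = P^{(k-1)} P^{(k-2)} \cdots P^{(1)} P^{(0)}$. The worst-case rate of convergence can be related to the second-largest singular value of $P^{(k-1:0)}$ [16]. In particular, after $k$ iterations we have

$$\sum_{i} \left\| y_i^{(k)} - \bar{y} \right\|_2^2 \le \lambda_2\left(P^{(k-1:0)}\right) \sum_{i} \left\| y_i^{(0)} - \bar{y} \right\|_2^2,$$

where $\lambda_2(P^{(k-1:0)})$ denotes the second largest singular value of $P^{(k-1:0)}$.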

For the scheme proposed above, cycling deterministically through neighbors in the directed exponential graph, one can verify that when $n$ is a power of two, after $k = \log_2 n$ iterations the second-largest singular value of the product $P^{(k-1)} \cdots P^{(0)}$ is zero, so all nodes exactly have the average. Intuitively, this happens because the directed exponential graph has excellent mixing properties: from any starting node in the network, one can get to any other node in at most $\log_2 n$ hops. For $n = 32$ nodes, after $5$ iterations averaging has converged using this strategy. In comparison, if one were to cycle through edges of the complete graph (where every node is connected to every other node), then for $n = 32$, after 5 consecutive iterations one would still have a strictly positive second-largest singular value; i.e., nodes could be much further from the average (and hence, much less well-synchronized).
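The exact-averaging claim for 32 nodes can be checked numerically with a small sketch. It assumes each node keeps weight 1/2 and sends weight 1/2 to a single peer per iteration; the hop sequence `[1, 2, 3, 4, 5]` used for the complete-graph comparison is one hypothetical cycling order chosen for illustration, and all function names are made up for this example.

```python
from fractions import Fraction


def shift_mixing(n, hop):
    """Matrix in which each node keeps half and sends half `hop` ranks ahead."""
    half = Fraction(1, 2)
    P = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        P[j][j] += half
        P[(j + hop) % n][j] += half
    return P


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def product_of_schedule(n, hops):
    """Product of the mixing matrices, with the first hop applied first."""
    prod = shift_mixing(n, hops[0])
    for hop in hops[1:]:
        prod = matmul(shift_mixing(n, hop), prod)
    return prod


def is_exact_average(M):
    """True when every entry equals 1/n, i.e. M maps any input to its average."""
    n = len(M)
    return all(entry == Fraction(1, n) for row in M for entry in row)


n = 32
exponential = product_of_schedule(n, [1, 2, 4, 8, 16])
complete = product_of_schedule(n, [1, 2, 3, 4, 5])
print(is_exact_average(exponential))  # True
print(is_exact_average(complete))     # False
```

With hops 1, 2, 4, 8, 16 every node's value reaches all 32 ranks with equal weight, whereas hops 1 through 5 can only reach ranks at offsets 0 to 15, so the product cannot be the averaging matrix.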

Similarly, one could consider designing the matrices in a stochastic manner, where each node randomly samples one neighbor to send to at every iteration. If each node samples a destination uniformly from its set of neighbors in the directed exponential graph, then , and if each node randomly selected a destination uniformly among all other nodes in the network (i.e., randomly from neighbors in the complete graph), then . Thus, random schemes are still not as effective at quickly averaging as deterministically cycling through neighbors in the directed exponential graph. Moreover, with randomized schemes, we are no longer guaranteed that each node receives the same number of messages at every iteration, so the communication load will not be balanced as in the deterministic scheme.

The above discussion focused only on approximate distributed averaging, which is a key step within decentralized optimization. When averaging occurs less quickly, this also impacts optimization. Specifically, since nodes are less well-synchronized (i.e., further from a consensus), each node will be evaluating its local mini-batch gradient at a different point in parameter space. Averaging these points (rather than updates based on mini-batch gradients evaluated at the same point) can be seen as injecting additional noise into the optimization process, and in our experience this can lead to worse performance in terms of train and generalization errors.

A.2 Stochastic Gradient Push

In all of our experiments, we minimize the number of floating-point operations performed in each iteration, $k$, by using the mixing weights

$$p_{j,i}^{(k)} = \frac{1}{\left| \mathcal{N}_i^{\text{out}}(k) \right|}$$

for all $j \in \mathcal{N}_i^{\text{out}}(k)$, where $\mathcal{N}_i^{\text{out}}(k)$ denotes the set of out-neighbors of node $i$ at iteration $k$. In words, each node assigns mixing weights uniformly to all of its out-neighbors in each iteration. Recalling our convention that each node is an in- and out-neighbor of itself, it is easy to see that this choice of mixing weights satisfies the column-stochasticity property. It may very well be that a different choice of mixing weights leads to better spectral properties of the gossip algorithm; however, we leave this exploration for future work. We denote node $i$'s uniform mixing weights at time $k$ by $p_i^{(k)}$, dropping the other subscript, which identifies the receiving node.

To maximize the utility of the resources available on each server, each node (occupying a single server exclusively) runs two threads: a communication (gossip) thread and a computation thread. The computation thread executes the main logic used to train the local model on the GPUs available to the node, while the communication thread is used for inter-node network I/O. In particular, the communication thread is used to gossip messages between nodes. When using Ethernet-based communication, the nodes communicate their parameter tensors over CPUs. When using InfiniBand-based communication, the nodes communicate their parameter tensors using GPUDirect RDMA, thereby avoiding superfluous device to pinned-memory transfers of the model parameters.

Each node initializes its model on one of its GPUs, and initializes its scalar push-sum weight to $1$. At the start of training, each node also allocates a send- and a receive- communication-buffer in pinned memory on the CPU (or equivalently on a GPU in the case of GPUDirect RDMA communication).

In each iteration, the communication thread waits for the send-buffer to be filled by the computation thread; transmits the message in the send-buffer to its out-neighbours; and then aggregates any newly-received messages into the receive-buffer.

In each iteration, the computation thread blocks to retrieve the aggregated messages in the receive-buffer; directly adds the received parameters to its own model parameters; and directly adds the received push-sum weights to its own push-sum weight. The computation thread then converts the model parameters to the de-biased estimate by dividing by the push-sum weight; executes a forward-backward pass of the de-biased model in order to compute a stochastic mini-batch gradient; converts the model parameters back to the biased estimate by multiplying by the push-sum weight; and applies the newly-computed stochastic gradients to the biased model. The updated model parameters are then multiplied by the node's uniform mixing weight and asynchronously copied back into the send-buffer for use by the communication thread. The push-sum weight is also multiplied by the same mixing weight and concatenated into the send-buffer.

In short, gossip is performed on the biased model parameters (push-sum numerators); stochastic gradients are computed using the de-biased model parameters; stochastic gradients are applied back to the biased model parameters; and then the biased-model and the push-sum weight are multiplied by the same uniform mixing-weight and copied back into the send-buffer.
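The bookkeeping summarized above can be illustrated with a single-process, synchronous simulation. This is only a sketch under stated assumptions: it uses scalar parameters, exact gradients of the hypothetical local objectives $f_i(z) = \frac{1}{2}(z - a_i)^2$, one peer per node with mixing weight 1/2, and no threads or buffers. The name `simulate_sgp` is made up for this example.

```python
def simulate_sgp(targets, x0, lr, num_iters):
    """Synchronous one-peer SGP on f_i(z) = 0.5 * (z - targets[i]) ** 2."""
    n = len(targets)
    num_hops = (n - 1).bit_length()   # floor(log2(n-1)) + 1
    x = list(x0)                      # biased parameters (push-sum numerators)
    w = [1.0] * n                     # push-sum weights
    for k in range(num_iters):
        # Local computation at every node.
        for i in range(n):
            z = x[i] / w[i]           # de-biased estimate
            grad = z - targets[i]     # gradient of the local objective at z
            x[i] -= lr * grad         # apply the gradient to the biased model
        # Gossip: keep half, send half to the peer `hop` ranks ahead.
        hop = 2 ** (k % num_hops)
        new_x = [0.5 * v for v in x]
        new_w = [0.5 * v for v in w]
        for i in range(n):
            j = (i + hop) % n
            new_x[j] += 0.5 * x[i]
            new_w[j] += 0.5 * w[i]
        x, w = new_x, new_w
    z = [xi / wi for xi, wi in zip(x, w)]
    return z, w


targets = [float(i) for i in range(8)]
z, w = simulate_sgp(targets, [0.0] * 8, lr=0.1, num_iters=200)
print(sum(z) / len(z))  # close to 3.5, the minimizer of the average objective
```

With a constant step size the individual nodes stay in a neighborhood of the minimizer rather than agreeing exactly, while their average converges to it; setting `lr=0.0` reduces the loop to plain PushSum averaging.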

A.3 Hyperparameters

When we “apply the stochastic gradients” to the biased model parameters, we actually carry out an SGD step with nesterov momentum. For the , and GPU experiments we use the same exact learning-rate, schedule, momentum, and weight decay as those suggested in [5] for SGD. In particular, we use a reference learning-rate of with respect to a sample batch, and scale this linearly with the batch-size; we decay the learning-rate by a factor of at epochs ; we use a nesterov momentum parameter of , and we use weight decay . For the GPU experiments, we decay the learning-rate by a factor of at epochs , and we use a reference learning-rate of . In the GPU experiment with two peers-per-node, we revert to the original learning-rate and schedule.

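1: Initialize $\gamma > 0$, $m \in (0, 1)$, $x_i^{(0)} = z_i^{(0)} \in \mathbb{R}^d$, $u_i^{(0)} = 0$, and $w_i^{(0)} = 1$ for all nodes $i$
2: for $k = 0, 1, 2, \ldots, K$ do at node $i$
3:     Sample new mini-batch $\xi_i^{(k)} \sim \mathcal{D}_i$ from local distribution
4:     Compute a local stochastic mini-batch gradient at $z_i^{(k)}$: $\nabla F_i(z_i^{(k)}; \xi_i^{(k)})$
5:     $u_i^{(k+1)} = m u_i^{(k)} + \nabla F_i(z_i^{(k)}; \xi_i^{(k)})$
6:     $x_i^{(k+1/2)} = x_i^{(k)} - \gamma \left( m u_i^{(k+1)} + \nabla F_i(z_i^{(k)}; \xi_i^{(k)}) \right)$
7:     Send $\left( p_{j,i}^{(k)} x_i^{(k+1/2)}, \; p_{j,i}^{(k)} w_i^{(k)} \right)$ to out-neighbors $j$; receive $\left( p_{i,j}^{(k)} x_j^{(k+1/2)}, \; p_{i,j}^{(k)} w_j^{(k)} \right)$ from in-neighbors $j$
8:     $x_i^{(k+1)} = \sum_{j} p_{i,j}^{(k)} x_j^{(k+1/2)}$
9:     $w_i^{(k+1)} = \sum_{j} p_{i,j}^{(k)} w_j^{(k)}$
10:     $z_i^{(k+1)} = x_i^{(k+1)} / w_i^{(k+1)}$
11: end for
Algorithm 2 Stochastic Gradient Push with Momentum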
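The local part of the momentum variant can be sketched as follows. The update form is an assumption: it follows the common Nesterov formulation used by deep learning frameworks (buffer update, then a step along the gradient plus the scaled buffer), applied to the biased parameters with the gradient evaluated at the de-biased ones. The function name and scalar parameters are hypothetical.

```python
def momentum_local_step(x, u, w, grad_fn, lr, momentum):
    """One local Nesterov-momentum update on the biased parameters.

    x: biased parameter (push-sum numerator), u: momentum buffer,
    w: push-sum weight, grad_fn: gradient oracle evaluated at the
    de-biased parameter z = x / w.
    """
    z = x / w                                  # de-biased parameters
    g = grad_fn(z)                             # (stochastic) gradient at z
    u_new = momentum * u + g                   # update the momentum buffer
    x_half = x - lr * (momentum * u_new + g)   # Nesterov step on biased x
    return x_half, u_new


# Toy usage with the gradient of 0.5 * (z - 1)^2.
x_half, u_new = momentum_local_step(2.0, 0.0, 1.0, lambda z: z - 1.0,
                                    lr=0.1, momentum=0.9)
print(x_half, u_new)
```

The returned `x_half` is what would then be scaled by the mixing weight and gossiped, together with the push-sum weight.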

Appendix B Extra Experiments

B.1 Additional Training Curves

(a) Train
(b) Validation
Figure B.3: Training on Ethernet 10Gbit/s
(a) Train
(b) Validation
Figure B.6: Training on InfiniBand 100Gbit/s

Figure B.3 shows the training and validation curves for the different runs performed on Ethernet 10Gbit/s. Figure B.6 shows the training and validation curves for the different runs performed on InfiniBand 100Gbit/s.

(a) Train
(b) Validation
Figure B.9: Training/Validation accuracy per iteration for SGP (Ethernet 10Gbit/s). Each time we double the number of nodes in the network, we halve the total number of iterations.

Figure B.9 reports the training and validation accuracy of SGP when using a high-latency interconnect. As we scale up the number of nodes $n$, we scale down the total number of iterations proportionally to $1/n$, following Theorem 1. In particular, the 32-node runs involve $8$ times fewer global iterations than the 4-node runs. We additionally report the total number of iterations and the final performances in Table 3. While we reduce the total number of iterations by a factor of $8$ when going from 4 to 32 nodes, the validation accuracy and training accuracy of the 32-node runs remain within and , respectively, of the validation and training accuracy achieved by the 4-node runs (and remain within the of AllReduce SGD accuracies).

Nodes 4 8 16 32
Training (%)
Validation (%)
Table 3: Total number of iterations and final training and validation performances when training a Resnet50 on ImageNet using SGP over Ethernet 10Gbit/s.

B.2 Discrepancy across different nodes

(a) Discrepancy on 4 nodes
(b) Discrepancy on 32 nodes
Figure B.12: Resnet50, trained with SGP, training and validation errors for 4 and 32 nodes experiments. The solid and dashed lines in each figure show the mean training and validation error, respectively, over all nodes. The shaded region shows the maximum and minimum error attained at different nodes in the same experiment. Although there is non-trivial variability across nodes early in training, all nodes eventually converge to similar validation errors, achieving consensus in the sense that they represent the same function.

Here, we investigate the performance variability across nodes during training for SGP. In Figure B.12, we report the minimum, maximum and mean error across the different nodes for training and validation. In an initial training phase, we observe that nodes have different validation errors; their local copies of the Resnet-50 model diverge. As we decrease the learning rate, the variability between the different nodes diminishes and the nodes eventually converge to similar errors. This suggests that all models ultimately represent the same function, achieving consensus.

B.3 Timing on InfiniBand with local data copy

Figure B.13: Average time per training iteration for models trained on 4, 8, and 16 nodes using data copied to the local nodes and an InfiniBand interconnect. SGP time per training iteration remains approximately constant as we increase the number of nodes, while SGD shows a slight increase.

To better isolate the effects of data-loading, we ran experiments on 32, 64, and 128 GPUs, where we first copied the data locally on every node. In that setting, we observe in Figure B.13 that the time-per-iteration of SGP remains approximately constant as we increase the number of nodes in the network, while the time for AllReduce SGD increases.

B.4 SGP Scaling Analysis

(a) ETH 10Gbit/s
(b) InfiniBand 100Gbit/s
(c) Scaling of SGD and SGP
Figure B.17: SGP throughput on Ethernet (a) and InfiniBand (b). SGP exhibits 88.6% scaling efficiency on Ethernet 10Gbit/s and 92.4% on InfiniBand. The comparison of SGD vs. SGP throughput in Figure (c) shows that SGP exhibits better scaling and is more robust to high-latency interconnect.

Figure B.17 highlights the input-image throughput of SGP as we scale up the number of cluster nodes on both Ethernet 10Gbit/s and InfiniBand 100Gbit/s. SGP exhibits 88.6% scaling efficiency on Ethernet 10Gbit/s and 92.4% on InfiniBand, staying close to the ideal scaling in both cases. In addition, Figure (c) shows that SGP exhibits better scaling as we increase the network size and is more robust to high-latency interconnect.

Appendix C Proofs of Theoretical Guarantees

Our convergence rate analysis is divided into three main parts. In the first part (subsection C.1) we present upper bounds for three important expressions that appear in our computations. In the second part (subsection C.2) we prove Lemma 8, which is central to our analysis and on which we later build the proofs of our main theorems. Finally, in the third part (subsection C.3) we provide the proofs of Theorems 1 and 2.

Preliminary results.

In our analysis two preliminary results are extensively used. We state them here for future reference.

  • Let . Since , it holds that


    Thus, .

  • Let then from the summation of geometric sequence and for any it holds that

Matrix Representation.

The presentation of stochastic gradient push (Algorithm 1) was done from node i’s perspective for all . Note however, that the update rule of SGP at the iteration can be viewed from a global viewpoint. To see this let us define the following matrices (concatenation of the values of all nodes at the iteration):

Using the above matrices, the step of SGP (Algorithm 1) can be expressed as follows (note that in a similar way we can obtain matrix expressions for steps 7 and 8 of Algorithm 1):


where is the transpose of matrix with entries:


Recall that we also have .

Bound for the mixing matrices.

Next we state a known result from the control literature studying consensus-based optimization which allows us to bound the distance between the de-biased parameters at each node and the node-wise average.

Recall that we have assumed that the sequence of mixing matrices are -strongly connected. A directed graph is called strongly connected if every pair of vertices is connected with a directed path (i.e., following the direction of edges), and the -strongly connected assumption is that the graph with edge set is strongly connected, for every .

We have also assumed that for all , each column of has non-zero entries, and that the graph with edge set has diameter at most . Based on these assumptions, after consecutive iterations, the product

has no zero entries. Moreover, every entry of is at least .

Lemma 3.

Suppose that Assumption 3 (mixing connectivity) holds. Let and let . Then there exists a constant

where is the dimension of , , and , such that, for all and ,

This particular lemma follows after a small adaptation to Theorem 1 in [2] and its proof is based on [23]. Similar bounds appear in a variety of other papers, including [15].

C.1 Important Upper Bounds

Lemma 4 (Bound of stochastic gradient).

We have the following inequality under Assumptions 1 and 2:


Lemma 5.

Let Assumptions 1-3 hold. Then,