1 Introduction
Deep Neural Networks (DNNs) are the state-of-the-art machine learning approach in many application areas, including image recognition he2016deep and natural language processing vaswani2017attention. Stochastic Gradient Descent (SGD) is the current workhorse for training neural networks. The algorithm optimizes the network parameters, $x$, to minimize a loss function, $f(\cdot)$, through gradient descent, where the loss function’s gradients are approximated using a subset of training examples (a minibatch). DNNs often require large amounts of training data and trainable parameters, which makes training computationally demanding wu2016google ; mahajan2018exploring . There is a need for efficient methods to train DNNs in large-scale computing environments.
A parallel version of SGD is usually adopted for large-scale, distributed training goyal2017accurate ; li2014scaling . Worker nodes compute local minibatch gradients of the loss function on different subsets of the data, and then calculate an exact inter-node average gradient using either the AllReduce communication primitive, in synchronous implementations goyal2017accurate , or using a central parameter server, in asynchronous implementations dean2012large . Using a parameter server to aggregate gradients introduces a potential bottleneck and a central point of failure lian2017can . The AllReduce primitive computes the exact average gradient at all workers in a decentralized manner, avoiding issues associated with centralized communication and computation.
However, exact averaging algorithms like AllReduce are not robust in high-latency or high-variability platforms, e.g., where the network bandwidth may be a significant bottleneck, because they involve tightly-coupled, blocking communication (i.e., the call does not return until all nodes have finished aggregating). Moreover, aggregating gradients across all the nodes in the network can introduce non-trivial computational overhead when there are many nodes, or when the gradients themselves are large. This issue motivates the investigation of a decentralized and inexact version of SGD to reduce the overhead associated with distributed training.
There have been numerous decentralized optimization algorithms proposed and studied in the control-systems literature that leverage consensus-based approaches to aggregate information; see the recent survey Nedic2018network and references therein. Rather than exactly aggregating gradients (as with AllReduce), this line of work uses less-coupled message passing algorithms which compute inexact distributed averages.
Most previous work in this area has focused on theoretical convergence analysis assuming convex objectives. Recent work has begun to investigate the applicability of these methods to large-scale training of DNNs lian2017can ; Jiang2017collaborative . However, these papers study methods based on communication patterns which are static (the same at every iteration) and symmetric (if node $i$ sends to node $j$, then $i$ must also receive from $j$ before proceeding). Such methods inherently require blocking and incur communication overhead. State-of-the-art consensus optimization methods build on the PushSum algorithm for approximate distributed averaging kempe2003gossip ; Nedic2018network , which allows for non-blocking, time-varying, and directed (asymmetric) communication. Since SGD already uses stochastic minibatches, the hope is that an inexact average minibatch will be as useful as the exact one if the averaging error is sufficiently small relative to the variability in the stochastic gradient.
This paper studies the use of Stochastic Gradient Push (SGP), an algorithm blending SGD and PushSum, for distributed training of deep neural networks. We provide a theoretical analysis of SGP, showing it converges for smooth nonconvex objectives. We also evaluate SGP experimentally, training ResNets on ImageNet using up to 32 nodes, each with 8 GPUs (i.e., 256 GPUs in total). Our main contributions are summarized as follows:

We provide the first convergence analysis for Stochastic Gradient Push when the objective function is smooth and non-convex. We show that, for an appropriate choice of the step size, SGP converges to a stationary point at a rate of $\mathcal{O}(1/\sqrt{nK})$, where $n$ is the number of nodes and $K$ is the number of iterations.

In a high-latency scenario, where nodes communicate over 10 Gbps Ethernet, SGP runs up to faster than AllReduce SGD and exhibits 88.6% scaling efficiency over the range from 4–32 nodes.

The top-1 validation accuracy of SGP matches that of AllReduce SGD for up to 8 nodes (64 GPUs), and remains within 1.2% of AllReduce SGD for larger networks.

In a low-latency scenario, where nodes communicate over a 100 Gbps InfiniBand network supporting GPUDirect, SGP is on par with AllReduce SGD in terms of running time, and SGP exhibits 92.4% scaling efficiency.

In comparison to other synchronous decentralized consensus-based approaches that require symmetric messaging, SGP runs faster and produces models with better validation accuracy.
2 Preliminaries
Problem formulation.
We consider the setting where a network of $n$ nodes cooperates to solve the stochastic consensus optimization problem
$$\min_{x_i \in \mathbb{R}^d,\; i=1,\ldots,n} \; \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\xi_i \sim D_i} F_i(x_i; \xi_i) \quad \text{subject to } x_i = x_j, \;\; \forall i, j = 1, \ldots, n. \tag{1}$$
Each node has local data following a distribution $D_i$, and the nodes wish to cooperate to find the parameters $x$ of a DNN that minimizes the average loss with respect to their data, where $F_i$ is the loss function at node $i$. Moreover, the goal codified in the constraints is for the nodes to reach agreement (i.e., consensus) on the solution they report. We assume that nodes can locally evaluate stochastic gradients $\nabla F_i(x_i; \xi_i)$, $\xi_i \sim D_i$, but they must communicate to access information about the objective functions at other nodes.
Distributed averaging.
The problem described above encompasses distributed training based on data parallelism. There, a canonical approach is large-minibatch parallel stochastic gradient descent: for an overall minibatch of size $nb$, each node computes a local stochastic minibatch gradient using $b$ samples, and then the nodes use the AllReduce communication primitive to compute the average gradient at every node. Let $f_i(x) = \mathbb{E}_{\xi_i \sim D_i} F_i(x; \xi_i)$ denote the objective at node $i$, and let $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ denote the overall objective. Since $\nabla f = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i$, averaging gradients via AllReduce provides an exact stochastic gradient of $f$. Typical implementations of AllReduce have each node send and receive $2\frac{n-1}{n}M$ bytes, where $M$ is the size (in bytes) of the tensor being reduced, and involve $2\log_2(n)$ communication steps Rabenseifner2004optimization . Moreover, AllReduce is a blocking primitive, meaning that no node will proceed with local computations until the primitive returns.
Approximate distributed averaging.
In this work we explore the alternative approach of using a gossip algorithm for approximate distributed averaging, specifically the PushSum algorithm. Gossip algorithms typically use linear iterations for averaging. For example, let $y_i^{(0)} \in \mathbb{R}^d$ be a vector at node $i$, and consider the goal of computing the average vector $\frac{1}{n}\sum_{i=1}^{n} y_i^{(0)}$ at all nodes. Stack the initial vectors into a matrix $Y^{(0)} \in \mathbb{R}^{n \times d}$ with one row per node. Typical gossip iterations have the form $Y^{(k+1)} = P^{(k)} Y^{(k)}$, where $P^{(k)} \in \mathbb{R}^{n \times n}$ is referred to as the mixing matrix. This corresponds to the update $y_i^{(k+1)} = \sum_{j=1}^{n} p_{i,j}^{(k)} y_j^{(k)}$ at node $i$. To implement this update, node $i$ only needs to receive messages from other nodes $j$ for which $p_{i,j}^{(k)} \neq 0$, so it will be appealing to use sparse $P^{(k)}$ to reduce communications.
Drawing inspiration from the theory of Markov chains Seneta1981 , the mixing matrices $P^{(k)}$ are designed to be column stochastic. Then, under mild conditions (e.g., ensuring that information from every node eventually reaches all other nodes) one can show that $\lim_{K \to \infty} \prod_{k=0}^{K} P^{(k)} = \pi \mathbf{1}^\top$, where $\pi$ is the ergodic limit of the chain and $\mathbf{1}$ is a vector with all entries equal to $1$. Consequently, the gossip iterations converge to a limit $Y^{(\infty)} = \pi (\mathbf{1}^\top Y^{(0)})$; i.e., the value at node $i$ converges to $y_i^{(\infty)} = \pi_i \sum_{j=1}^{n} y_j^{(0)}$. When the matrices $P^{(k)}$ are symmetric, it is straightforward to design the algorithm so that $\pi_i = 1/n$ for all $i$ by making $P^{(k)}$ doubly stochastic. However, symmetric $P^{(k)}$ has strong practical ramifications, such as requiring care in the implementation to avoid deadlocks.
The PushSum algorithm only requires that $P^{(k)}$ be column-stochastic, and not necessarily symmetric (so node $i$ may send to node $j$, but not necessarily vice versa). Instead, one additional scalar parameter $w_i^{(k)}$ is maintained at each node. The parameter is initialized to $w_i^{(0)} = 1$ for all $i$, and updated using the same linear iteration, $w^{(k+1)} = P^{(k)} w^{(k)}$. Consequently, the parameter converges to $w^{(\infty)} = \pi (\mathbf{1}^\top w^{(0)})$, or $w_i^{(\infty)} = \pi_i n$ at node $i$. Thus each node can recover the average of the initial vectors by computing the debiased ratio $y_i^{(\infty)} / w_i^{(\infty)}$. In practice, we stop after a finite number of gossip iterations $K$ and compute $y_i^{(K)} / w_i^{(K)}$. The distance of the debiased ratio to the exact average can be quantified in terms of properties of the matrices $\{P^{(k)}\}_{k=0}^{K-1}$. Let $N_i^{\text{out}(k)} = \{j : p_{j,i}^{(k)} > 0\}$ and $N_i^{\text{in}(k)} = \{j : p_{i,j}^{(k)} > 0\}$ denote the sets of nodes that $i$ transmits to and receives from, respectively, at iteration $k$. If we use $M$ bytes to represent the vector $y_i^{(k)}$, then node $i$ sends and receives $|N_i^{\text{out}(k)}| M$ and $|N_i^{\text{in}(k)}| M$ bytes, respectively, per iteration. In our experiments we use graph sequences with $|N_i^{\text{out}(k)}| = |N_i^{\text{in}(k)}| = 1$ or $2$, and find that approximate averaging is both fast and still facilitates training.
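As a rough illustration of the PushSum recursion described above, the following NumPy sketch runs the numerator and weight iterations with a column-stochastic matrix that is deliberately not row-stochastic, so the raw values are biased while the debiased ratio recovers the average. The helper names (`mixing_matrix`, `push_sum`), the particular graph (a directed ring with self-loops plus one extra edge), and the use of a single time-invariant matrix are assumptions made for this example; they are not the graphs used in the experiments.

```python
import numpy as np

def mixing_matrix(n):
    """Column-stochastic matrix for a hypothetical directed graph (n >= 4).

    Node i keeps a share for itself and pushes to node (i + 1) % n;
    node 0 additionally pushes to node n // 2. Each node splits its
    mass uniformly over its out-neighbours, so every column sums to 1.
    """
    P = np.zeros((n, n))
    for i in range(n):
        out = [i, (i + 1) % n]
        if i == 0:
            out.append(n // 2)
        for j in out:
            P[j, i] = 1.0 / len(out)
    return P

def push_sum(Y0, P, num_iters):
    """Run PushSum; returns numerators Y, weights w, and debiased ratios Y / w."""
    Y = np.asarray(Y0, dtype=float).copy()  # one row per node
    w = np.ones(Y.shape[0])                 # PushSum weights, initialised to 1
    for _ in range(num_iters):
        Y = P @ Y   # numerator update
        w = P @ w   # weight update uses the same linear iteration
    return Y, w, Y / w[:, None]

n, d = 6, 3
Y0 = np.arange(n * d, dtype=float).reshape(n, d)
P = mixing_matrix(n)
Y, w, Z = push_sum(Y0, P, num_iters=300)
```

In this sketch the rows of `Y` converge to different multiples of the column sums of `Y0` (the entries of the limiting vector are not uniform), whereas every row of `Z` approaches the exact average `Y0.mean(axis=0)`.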
3 Stochastic Gradient Push
Algorithm description.
The stochastic gradient push (SGP) method for solving equation 1 is obtained by interleaving one local stochastic gradient descent update at each node with one iteration of PushSum. Each node maintains three variables: the model parameters $x_i^{(k)}$ at node $i$, the scalar PushSum weight $w_i^{(k)}$, and the debiased parameters $z_i^{(k)} = x_i^{(k)} / w_i^{(k)}$. The initial and can be initialized to any arbitrary value as long as . Pseudocode is shown in Alg. 1. Each node performs a local SGD step (lines 2–4) followed by one step of PushSum for approximate distributed averaging (lines 5–8).
Note that the gradients are evaluated at the debiased parameters $z_i^{(k)}$ in line 3, and they are then used to update $x_i^{(k)}$, the PushSum numerator, in line 4. All communication takes place in line 5, and each message contains two parts, the PushSum numerator and denominator. In particular, node $i$ controls the values $p_{j,i}^{(k)}$ used to weight the values in messages it sends.
We are mainly interested in the case where the mixing matrices $P^{(k)}$ are sparse in order to have low communication overhead. However, we point out that when the nodes’ initial values are identical, $z_i^{(0)} = z_j^{(0)}$ for all $i, j$, and every entry of $P^{(k)}$ is equal to $1/n$, then SGP is mathematically equivalent to parallel SGD using AllReduce. Please refer to Appendix A for practical implementation details, including how we design the mixing matrices $P^{(k)}$.
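The interleaving of a local gradient step with one PushSum step can be sketched in a few lines of NumPy. This is a minimal single-process simulation under stated assumptions, not the authors' PyTorch implementation: the function names `sgp` and `quadratic_grads`, the toy quadratic losses $\frac{1}{2}\|z - c_i\|^2$, the noise-free gradients, and the omission of momentum are all choices made for illustration.

```python
import numpy as np

def sgp(grad_fns, x0, mixing, step_size, num_iters):
    """Simulate SGP for n nodes; returns (X, w, Z) with one row per node.

    grad_fns[i](z) returns a (possibly stochastic) gradient of node i's loss at z.
    mixing(k) returns the column-stochastic n x n mixing matrix for iteration k.
    """
    n = len(grad_fns)
    X = np.tile(np.asarray(x0, dtype=float), (n, 1))  # PushSum numerators
    w = np.ones(n)                                     # PushSum weights
    Z = X / w[:, None]                                 # debiased parameters
    for k in range(num_iters):
        # Local SGD step: gradients are evaluated at Z but applied to X.
        G = np.stack([grad_fns[i](Z[i]) for i in range(n)])
        X_half = X - step_size * G
        # One PushSum step on both the numerator and the weight.
        P = mixing(k)
        X = P @ X_half
        w = P @ w
        Z = X / w[:, None]
    return X, w, Z

def quadratic_grads(centers):
    """Hypothetical local losses 0.5 * ||z - c_i||^2 with exact gradients z - c_i."""
    return [lambda z, c=c: z - c for c in centers]

centers = np.array([[0.0, 1.0], [2.0, -1.0], [4.0, 3.0], [-2.0, 5.0]])
grads = quadratic_grads(centers)
n = len(centers)

# Dense mixing with every entry 1/n: each step is an exact average.
P_full = np.full((n, n), 1.0 / n)
X_full, w_full, Z_full = sgp(grads, np.zeros(2), lambda k: P_full, 0.1, 50)
```

With `P_full` and identical initial values, every row of `Z_full` follows the same trajectory as plain gradient descent on the average loss, which is the equivalence with AllReduce-based parallel SGD noted above.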
Theoretical guarantees.
SGP was first proposed and analyzed in Nedic2016stochastic assuming the local objectives are strongly convex. Here we provide convergence results in the more general setting of smooth, nonconvex objectives. We make the following three assumptions:
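1. ($L$-smooth) There exists a constant $L > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$, or equivalently
$$f_i(x) \le f_i(y) + \nabla f_i(y)^\top (x - y) + \frac{L}{2}\|x - y\|^2. \tag{2}$$
Note that this assumption implies that function $f$ is also $L$-smooth.
2. (Bounded variance) There exist finite positive constants $\sigma^2$ and $\zeta^2$ such that
$$\sup_{x} \; \mathbb{E}_{\xi \sim D_i} \|\nabla F_i(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2 \quad \forall i, \tag{3}$$
$$\sup_{x} \; \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2. \tag{4}$$
Thus $\sigma^2$ bounds the variance of stochastic gradients at each node, and $\zeta^2$ quantifies the similarity of data distributions at different nodes.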
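3. (Mixing connectivity) To each mixing matrix $P^{(k)}$ we can associate a graph with vertex set $\{1, \ldots, n\}$ and edge set $E^{(k)} = \{(i, j) : p_{i,j}^{(k)} > 0\}$; i.e., with edges $(i, j)$ from $j$ to $i$ if $i$ receives a message from $j$ at iteration $k$. Assume that the graph with edge set $\bigcup_{k=lB}^{(l+1)B-1} E^{(k)}$ is strongly connected and has diameter at most $\Delta$ for every $l \ge 0$. To simplify the discussion, we assume that every column of the mixing matrices $P^{(k)}$ has at most $D$ non-zero entries.
Let $\bar{x}^{(k)} = \frac{1}{n}\sum_{i=1}^{n} x_i^{(k)}$. Under similar assumptions, lian2017can define that a decentralized algorithm for solving equation 1 converges if, for any $\epsilon > 0$, it eventually satisfies
$$\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\,\|\nabla f(\bar{x}^{(k)})\|^2 \le \epsilon. \tag{5}$$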
Our first result shows that SGP converges in this sense.
Theorem 1.
Suppose that Assumptions 1–3 hold, and run SGP for iterations with stepsize . Let and assume that . There exist constants and which depend on , , and such that if the total number of iterations satisfies
(6) 
where and , then
The proof is given in Appendix C, where we also provide precise expressions for the constants and . The proof of Theorem 1 builds on an approach developed in lian2017can . Theorem 1 shows that, for a given number of nodes , by running a sufficiently large number of iterations (roughly speaking, , which is reasonable for distributed training of DNNs) and choosing the stepsize as prescribed, then the criterion equation 5 is satisfied with a number of iterations . That is, we achieve a linear speedup in the number of nodes.
Theorem 1 shows that the average of the nodes' parameters, $\bar{x}^{(k)}$, converges, but it does not directly say anything about the parameters at each node. In fact, we can show a stronger result.
Theorem 2.
The proof is also given in Appendix C. This result shows that as $K$ grows, the debiased variables $z_i^{(k)}$ converge to the node-wise average $\bar{x}^{(k)}$, and hence the debiased variables at each node also converge to a stationary point. Note that for fixed $n$ and large $K$, the $1/\sqrt{nK}$ term will dominate the other factors.
4 Related Work
A variety of approaches have been proposed to accelerate distributed training of DNNs, including quantizing gradients Alistarh2017qsgd ; Wen2017terngrad and performing multiple local SGD steps at each node before averaging McMahan2017federated . These approaches are complementary to the trade-off we consider in this paper, between exact and approximate distributed averaging. Similar to using PushSum for averaging, both quantizing gradients and performing multiple local SGD steps before averaging can also be seen as injecting additional noise into SGD, leading to a trade-off between training faster (by reducing communication overhead) and potentially obtaining a less accurate result. Combining these approaches (quantized, inexact, and infrequent averaging) is an interesting direction for future work.
For the remainder of this section we review related work applying consensus-based approaches to large-scale training of DNNs. Blot2016gossip report initial experimental results on small-scale experiments with an SGP-like algorithm. Jin2016how make a theoretical connection between PushSum-based methods and Elastic Averaging SGD Zhang2015elasticsgd . Relative to those previous works, we provide the first convergence analysis for a PushSum-based method in the smooth non-convex case. lian2017can and Jiang2017collaborative study synchronous consensus-based versions of SGD. However, unlike PushSum, those methods involve symmetric message passing (if $i$ sends to $j$ at iteration $k$, then $j$ also sends to $i$ before both nodes update), which is inherently blocking. Consequently, these methods are more sensitive to high-latency communication settings, and each node generally must communicate more per iteration, in comparison to PushSum-based SGP where communication may be directed ($i$ can send to $j$ without needing a response from $j$). The decentralized parallel SGD (DPSGD) method proposed in lian2017can produces iterates whose node-wise average, $\bar{x}^{(k)}$, is shown to converge in the sense of equation 5. Our proof of Theorem 1, showing the convergence of SGP in the same sense, adapts some ideas from their analysis and also goes beyond to show that, since the values at each node converge to the average, the individual values at each node also converge to a stationary point. We compare SGP with DPSGD experimentally in Section 5 below and find that although the two methods find solutions of comparable accuracy, SGP is consistently faster.
Jin2016how and Lian2018asynchronous study asynchronous consensus-based methods for training DNNs. Lian2018asynchronous analyzes an asynchronous version of DPSGD and proves that its node-wise averages also converge to a stationary point. In general, these contributions focusing on asynchrony can be seen as orthogonal to the use of a PushSum-based protocol for consensus averaging.
5 Experiments
Next, we compare SGP with AllReduce SGD, and DPSGD lian2017can , an approximate distributed averaging baseline relying on doubly-stochastic gossip. We run experiments on a large-scale distributed computing environment using up to 256 GPUs. Our results show that when communication is the bottleneck, SGP is faster than both SGD and DPSGD. SGP also outperforms DPSGD in terms of validation accuracy, while achieving a slightly worse accuracy compared to SGD when using a large number of compute nodes. Our results also highlight that, in a setting where communication is efficient (e.g., over InfiniBand), doing exact averaging through AllReduce SGD remains a competitive approach.
We run experiments on 32 DGX1 GPU servers in a high-performance computing cluster. Each server contains 8 NVIDIA Volta V100 GPUs. We consider two communication scenarios: in the high-latency scenario the nodes communicate over a 10 Gbit/s Ethernet network, and in the low-latency scenario the nodes communicate over 100 Gbit/s InfiniBand, which supports GPUDirect RDMA communications. To investigate how each algorithm scales, we run experiments with 4, 8, 16, and 32 nodes (i.e., 32, 64, 128, and 256 GPUs).
We adopt the 1000-way ImageNet classification task russakovsky2015imagenet as our experimental benchmark. We train a ResNet50 he2016deep following the experimental protocol of goyal2017accurate , using the same hyperparameters with the exception of the learning rate schedule in the node experiment for SGP and DPSGD. In the experiments, we also modify SGP to use Nesterov momentum. In our default implementation of SGP, each node sends to and receives from one other node at each iteration, and this destination changes from one iteration to the next. Please refer to Appendix A for more information about our implementation, including how we design and implement the sequence of mixing matrices $P^{(k)}$.
All algorithms are implemented in PyTorch v0.5 paszkepytorch . To leverage the highly efficient NVLink interconnect within each server, we treat each DGX1 as one node in all of our experiments. In our implementation of SGP, each node computes a local minibatch gradient in parallel across all eight GPUs using a local AllReduce, which is efficiently implemented via the NVIDIA Collective Communications Library. Then inter-node averaging is accomplished using PushSum either over Ethernet or InfiniBand. In the low-latency experiments, we leverage GPUDirect to directly send/receive messages between GPUs on different nodes and avoid transferring the model back to host memory. In the high-latency experiments this is not possible, so the model is transferred to host memory after the local AllReduce, and then PushSum messages are sent over Ethernet.
5.1 Evaluation on High-Latency Interconnect
We consider the high-latency scenario where nodes communicate over 10 Gbit/s Ethernet. With a local minibatch size of 256 samples per node (32 samples per GPU), a single Volta DGX1 server can perform roughly minibatches per second. Since the ResNet50 model size is roughly 100 MBytes, transmitting one copy of the model per iteration requires Gbit/s. Thus in the high-latency scenario, if a single 10 Gbit/s link must carry the traffic between more than two pairs of nodes, then communication clearly becomes a bottleneck.
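The bandwidth argument above is simple arithmetic: bits per model copy times copies sent per second. The sketch below makes that calculation explicit. The function names are hypothetical, a megabyte is assumed to be $10^6$ bytes, and the iteration rate passed in the usage line is a placeholder rather than the measured throughput of the servers.

```python
def required_gbits_per_sec(model_mbytes, minibatches_per_sec, copies_per_iter=1):
    """Bandwidth (Gbit/s) needed to send `copies_per_iter` model copies every iteration."""
    bits_per_copy = model_mbytes * 1e6 * 8  # assumes 1 MByte = 10^6 bytes
    return bits_per_copy * copies_per_iter * minibatches_per_sec / 1e9

def max_pairs_per_link(link_gbits, per_pair_gbits):
    """Largest number of node pairs one shared link can carry without saturating."""
    return int(link_gbits // per_pair_gbits)

# Placeholder rate of one minibatch per second for a 100 MByte model.
per_pair = required_gbits_per_sec(model_mbytes=100, minibatches_per_sec=1.0)
```

At the placeholder rate of one iteration per second, a 100 MByte model needs 0.8 Gbit/s per sender, and the requirement grows linearly with the iteration rate, so a few iterations per second is already a sizeable fraction of a 10 Gbit/s link.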
Comparison with synchronous approaches.
We first compare SGP with other synchronous and decentralized approaches. Figure 4 (a) shows the validation curves when training on 4 and 32 nodes (additional training and validation curves for all the training runs can be found in B.1). Note that when we increase the number of nodes , we also decrease the total number of iterations to following Theorem 1 (see Figure B.9). For any number of nodes used in our experiments, we observe that SGP consistently outperforms DPSGD and AllReduce SGD in terms of total training time in this scenario. In particular for 32 nodes, SGP training time takes less than hours while DPSGD and AllReduce SGD require roughly and hours. Appendix B.2 provides experimental evidence that all nodes converge to models with a similar training and validation accuracy when using SGP.
Figure 4 (b) shows the average time per iteration for the different training runs. As we increase the number of nodes, the average iteration time stays almost constant for SGP and DPSGD, while we observe a significant increase in time for AllReduce SGD, resulting in an overall slower training time. Moreover, although DPSGD and SGP both exhibit strong scaling, SGP is roughly 200 ms faster per iteration, supporting the claim that it involves less communication overhead.
Figure 4 (c) reports the best validation accuracy for the different training runs. While they all start around the same value, the accuracy of DPSGD and SGP decreases as we increase the number of nodes. In the case of SGP, we see its performance decrease by relative to SGD on 32 nodes. We hypothesize that this decrease is due to the noise introduced by approximate distributed averaging. We will see below that changing the connectivity between the nodes can ameliorate this issue. We also note that the SGP validation accuracy is better than DPSGD for larger networks.
Comparison with asynchronous approach.
The results in Tables 1 and 2 provide a comparison between the aforementioned synchronous methods and ADPSGD Lian2018asynchronous , a state-of-the-art asynchronous method. ADPSGD is an asynchronous implementation of DPSGD, which relies on doubly-stochastic averaging. All methods are trained for exactly 90 epochs; therefore, the time per iteration directly reflects the total training time. Training using ADPSGD does not degrade the accuracy (relative to DPSGD), and provides substantial speedups in training time. Relative to SGP, the ADPSGD method runs slightly faster at the expense of lower validation accuracy (except in the 32-node case). In general, we emphasize that this asynchronous line of work is orthogonal, and that by combining the two approaches (leveraging the PushSum protocol in an asynchronous manner), one can expect to further speed up SGP. We leave this as a promising line of investigation for future work.
4 nodes  8 nodes  16 nodes  32 nodes  

AllReduce SGD  
DPSGD  
ADPSGD  
SGP 
4 nodes  8 nodes  16 nodes  32 nodes  

AllReduce SGD  
DPSGD  
ADPSGD  
SGP 
5.2 Evaluation on a “Low Latency” Interconnect
We now investigate the behavior of SGP and AllReduce SGD over 100 Gbit/s InfiniBand, following the same experimental protocol as in the 10 Gbit/s Ethernet case. In this scenario, which is not communication-bound for a ResNet50 model, we do not expect SGP to outperform AllReduce SGD. Our goal is to illustrate that SGP is not significantly slower than AllReduce SGD.
On this low-latency interconnect, SGD and SGP obtain similar timing and differ at most by ms per iteration (Figure 8 (b) for nodes). In particular, using 32 nodes, SGP trains a ResNet50 on ImageNet in hours and SGD in hours. SGD, however, exhibits better validation accuracy for large networks. Communication on InfiniBand is not a bottleneck for models the size of ResNet50. These results therefore confirm that the benefits of SGP are more prominent in high-latency/low-bandwidth, communication-bound scenarios. Although timings are similar, SGP still shows better scaling in terms of sample throughput than AllReduce SGD (see Figure B.17).
For experiments running at this speed (less than 0.31 seconds per iteration), timing could be impacted by other factors such as data loading. To better isolate the effects of data loading, we run additional experiments on 32, 64, and 128 GPUs where we first copied the data locally on every node; see Appendix B.3 for more details. In that setting, the time per iteration of SGP remains approximately constant as we increase the number of nodes in the network, while the time for AllReduce SGD increases with more nodes.
5.3 Impact of Graph Topology
Next we investigate the impact of the communication graph topology on the SGP validation performance using 10 Gbit/s Ethernet. In the limit of a fully-connected communication graph, SGD and SGP are strictly equivalent (see Section 3). By increasing the number of neighbors in the graph, we expect the accuracy of SGP to improve (approximate averages are more accurate) but the communication time required for training will increase.
In Figure 11, we compare the training and validation accuracies of SGP using a communication graph with 1-neighbor and 2-neighbors with DPSGD and SGD on 32 nodes. By increasing the number of neighbors to two, SGP achieves better training/validation accuracy (from / to /) and gets closer to the final validation accuracy achieved by SGD (/). Increasing the number of neighbors also increases the communication, hence the overall training time. SGP with 2 neighbors completes training in hours and its average time per iteration increases by relative to SGP with one neighbor. Nevertheless, SGP 2-neighbors is still faster than SGD and DPSGD, while achieving better accuracy than SGP 1-neighbor.
6 Conclusion
DNN training often requires substantial computation and therefore relies on distributed computing resources. Traditional parallel versions of SGD use exact averaging algorithms to parallelize the computation between nodes, and incur additional parallelization overhead as the model and network sizes grow. This paper proposes the use of Stochastic Gradient Push for distributed deep learning. The proposed method computes inexact averages at each iteration in order to improve scaling efficiency and reduce the dependency on the underlying network topology. SGP converges to a stationary point at an $\mathcal{O}(1/\sqrt{nK})$ rate ($n$ nodes, $K$ iterations) in the smooth and non-convex case, and provably achieves a linear speedup (in iterations) with respect to the number of nodes. Empirical results show that SGP can be up to times faster than traditional AllReduce SGD over a high-latency interconnect, matches the top-1 validation accuracy up to 8 nodes (64 GPUs), and remains within 1.2% of the top-1 validation accuracy for larger networks.
Acknowledgments
We thank Shubho Sengupta, Teng Li, and Ailing Zhang for useful discussions and technical support, as well as for maintaining the computing infrastructure used to conduct these experiments.
References
 (1) D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
 (2) M. Assran and M. Rabbat. Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950, 2018.
 (3) M. Blot, D. Picard, M. Cord, and N. Thome. Gossip training for deep learning. In NIPS Workshop on Optimization for Machine Learning, 2016.
 (4) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
 (5) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 (6) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 (7) Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pages 5904–5914, 2017.
 (8) P. H. Jin, Q. Yuan, F. Iandola, and K. Keutzer. How to scale distributed deep learning? In NIPS ML Systems Workshop, 2016.
 (9) D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 482–491. IEEE, 2003.
 (10) M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
 (11) X. Lian, C. Zhang, H. Zhang, C.J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
 (12) X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pages 3049–3058, 2018.
 (13) D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
 (14) H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.
 (15) A. Nedić and A. Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Trans. Automatic Control, 61(12):3936–3947, 2016.
 (16) A. Nedić, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
 (17) A. Paszke, S. Chintala, R. Collobert, K. Kavukcuoglu, C. Farabet, S. Bengio, I. Melvin, J. Weston, and J. Mariethoz. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration, may 2017.
 (18) R. Rabenseifner. Optimization of collective reduction operations. In Proc. Intl. Conf. Computational Science, Krakow, Poland, Jun. 2004.

 (19) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 (20) E. Seneta. Non-negative Matrices and Markov Chains. Springer, 1981.
 (21) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 (22) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
 (23) J. Wolfowitz. Products of indecomposable, aperiodic, stochastic matrices. Proceedings of the American Mathematical Society, 14(5):733–737, 1963.
 (24) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 (25) S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaged SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
Appendix A Implementation Details
A.1 Communication Topology
Directed exponential graph.
For the SGP experiments we use a time-varying directed graph to represent the inter-node connectivity. Thinking of the nodes as being ordered sequentially, according to their rank, $0, \ldots, n-1$ (we use indices $0, \ldots, n-1$ rather than $1, \ldots, n$ only in this section, to simplify the discussion), each node periodically communicates with peers that are $2^0, 2^1, \ldots, 2^{\lfloor \log_2(n-1) \rfloor}$ hops away. Fig. A.2 shows an example of a directed 8-node exponential graph. Node $0$'s $2^0$-hop neighbour is node $1$, node $0$'s $2^1$-hop neighbour is node $2$, and node $0$'s $2^2$-hop neighbour is node $4$.
In the one-peer-per-node experiments, each node cycles through these peers, transmitting only to a single peer from this list at each iteration. E.g., at iteration $k$, all nodes transmit messages to their $2^0$-hop neighbours, at iteration $k+1$ all nodes transmit messages to their $2^1$-hop neighbours, and so on, eventually returning to the beginning of the list before cycling through the peers again. This procedure ensures that each node only sends and receives a single message at each iteration. By using full-duplex communication, sending and receiving can happen in parallel.
In the two-peer-per-node experiments, each node cycles through the same set of peers, transmitting to two peers from the list at each iteration. E.g., at iteration $k$, all nodes transmit messages to their $2^0$-hop and $2^1$-hop neighbours, at iteration $k+1$ all nodes transmit messages to their $2^1$-hop and $2^2$-hop neighbours, and so on, eventually returning to the beginning of the list before cycling through the peers again. Similarly, at each iteration, each node also receives, in a full-duplex manner, two messages from some peers that are unknown to the receiving node ahead of time, so the send and receive operations are performed in parallel.
Definition of $P^{(k)}$.
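Based on the description above, in the one-peer-per-node experiments, each node sends to one neighbor at every iteration, and so each column of $P^{(k)}$ has exactly two nonzero entries, both of which are equal to $1/2$. The diagonal entries are $p^{(k)}_{i,i} = 1/2$ for all $i$ and $k$. At time step $k$, each node sends to a neighbor that is $h_k$ hops away, where $h_k$ cycles through the list of hop distances. Thus, with $h_k = 2^{k \bmod (\lfloor \log_2(n-1) \rfloor + 1)}$, we get that
$$p^{(k)}_{i,j} = \begin{cases} 1/2, & \text{if } i = j, \\ 1/2, & \text{if } i = (j + h_k) \bmod n, \\ 0, & \text{otherwise.} \end{cases}$$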
Note that, with this design, in fact each node sends to one peer and receives from one peer at every iteration, so the communication load is balanced across the network.
In the two-peer-per-node experiments, the definition is similar, but now there will be three nonzero entries in each column of $P^{(k)}$, all of which will be equal to $1/3$; these are the diagonal, and the entries corresponding to the two neighbors to which the node sends at that iteration. In addition, each node will send two messages and receive two messages at every iteration, so the communication load is again balanced across the network.
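As a sanity check on the construction above, the following sketch builds the mixing matrix for a given iteration and confirms that it is column-stochastic. It is an illustration under stated assumptions: column $j$ holds the weights that sender $j$ assigns, hop distances are powers of two, and the helper name `mixing_matrix` is hypothetical. Exact rational arithmetic is used so that the checks are not affected by rounding.

```python
from fractions import Fraction


def mixing_matrix(k, n, peers_per_node=1):
    """Mixing matrix at iteration k as a list of rows (assumed construction).

    Entry P[i][j] is the weight that sender j assigns to receiver i.
    Each sender splits its mass uniformly between itself and its out-peers.
    """
    hops = [2 ** e for e in range((n - 1).bit_length())]
    weight = Fraction(1, peers_per_node + 1)
    P = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        P[j][j] += weight                      # self-loop on the diagonal
        for s in range(peers_per_node):
            i = (j + hops[(k + s) % len(hops)]) % n
            P[i][j] += weight                  # message from j to i
    return P


P = mixing_matrix(k=0, n=8)
column_sums = [sum(P[i][j] for i in range(8)) for j in range(8)]
row_sums = [sum(row) for row in P]
print(column_sums == [1] * 8, row_sums == [1] * 8)   # True True
```

In this deterministic scheme the row sums also equal one, which reflects the balanced communication load: every node receives as many messages as it sends.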
Undirected exponential graph.
For the D-PSGD experiments we use a time-varying undirected bipartite exponential graph to represent the inter-node connectivity. Odd-numbered nodes send messages to peers that are $2^1 - 1, 2^2 - 1, \ldots, 2^{\lfloor \log_2(n-1) \rfloor} - 1$ hops away (even-numbered nodes), and wait to receive a message back in return. Each odd-numbered node cycles through the peers in the list in a similar fashion to the one-peer-per-node SGP experiments. Even-numbered nodes wait to receive a message from some peer (unknown to the receiving node ahead of time), and send a message back in return.
Note also that these graphs are all regular, in that all nodes have the same number of incoming and outgoing connections.
Decentralized averaging errors.
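To further motivate our choice of using the directed exponential graph with SGP, let us forget about optimization for a moment and focus on the problem of distributed averaging, described in Section 2, using the PushSum algorithm. Recall that each node $i$ starts with a vector $y_i^{(0)}$, and the goal of the agents is to compute the average $\bar{y} = \frac{1}{n} \sum_i y_i^{(0)}$. Then, since $y_i^{(k+1)} = \sum_j p_{i,j}^{(k)} y_j^{(k)}$, after $k$ steps we have
$$Y^{(k)} = P^{(k-1)} P^{(k-2)} \cdots P^{(1)} P^{(0)} Y^{(0)},$$
where $Y^{(k)}$ is an $n \times d$ matrix with $y_i^{(k)}$ as its $i$th row.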
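Let $P^{(k-1:0)} = P^{(k-1)} P^{(k-2)} \cdots P^{(0)}$. The worst-case rate of convergence can be related to the second-largest singular value of $P^{(k-1:0)}$ [16]. In particular, after $k$ iterations we have
$$\sum_i \| y_i^{(k)} - \bar{y} \|_2^2 \le \lambda_2(P^{(k-1:0)}) \sum_i \| y_i^{(0)} - \bar{y} \|_2^2,$$
where $\lambda_2(P^{(k-1:0)})$ denotes the second-largest singular value of $P^{(k-1:0)}$.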
For the scheme proposed above, cycling deterministically through neighbors in the directed exponential graph, one can verify that after $k = \log_2 n$ iterations, we have $\lambda_2(P^{(k-1:0)}) = 0$, so all nodes have exactly the average. Intuitively, this happens because the directed exponential graph has excellent mixing properties: from any starting node in the network, one can get to any other node in at most $\log_2 n$ hops. For $n = 32$ nodes, after $5$ iterations averaging has converged using this strategy. In comparison, if one were to cycle through edges of the complete graph (where every node is connected to every other node), then for $n = 32$, after 5 consecutive iterations one would still have $\lambda_2(P^{(4:0)}) \approx 0.6$; i.e., nodes could be much further from the average (and hence, much less well-synchronized).
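The exact-averaging claim for the deterministic scheme can be checked numerically. The sketch below is an illustration under assumptions, not the authors' code: it uses scalar values, one peer per node with weights $1/2$, and hop distances that are powers of two; the function name `gossip_average` is hypothetical. Since these mixing matrices are doubly stochastic, the push-sum weights stay equal to one and are omitted.

```python
from fractions import Fraction


def gossip_average(values, num_iters):
    """Run `num_iters` rounds of one-peer-per-node gossip on scalar values."""
    n = len(values)
    hops = [2 ** e for e in range((n - 1).bit_length())]
    x = [Fraction(v) for v in values]
    for k in range(num_iters):
        h = hops[k % len(hops)]
        # Node i keeps half of its own value and receives half of the value
        # held by the node that is h hops behind it.
        x = [(x[i] + x[(i - h) % n]) / 2 for i in range(n)]
    return x


after_four = gossip_average(range(32), 4)
after_five = gossip_average(range(32), 5)
print(len(set(after_four)) > 1)   # True: nodes still disagree
print(set(after_five))            # {Fraction(31, 2)}: every node holds the mean
```

With 32 nodes, each round doubles the number of initial values mixed into every node's estimate, so five rounds cover all 32 values with equal weight.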
Similarly, one could consider designing the matrices in a stochastic manner, where each node randomly samples one neighbor to send to at every iteration. If each node samples a destination uniformly from its set of neighbors in the directed exponential graph, then , and if each node randomly selected a destination uniformly among all other nodes in the network (i.e., randomly from neighbors in the complete graph), then . Thus, random schemes are still not as effective at quickly averaging as deterministically cycling through neighbors in the directed exponential graph. Moreover, with randomized schemes, we are no longer guaranteed that each node receives the same number of messages at every iteration, so the communication load will not be balanced as in the deterministic scheme.
The above discussion focused only on approximate distributed averaging, which is a key step within decentralized optimization. When averaging occurs less quickly, this also impacts optimization. Specifically, since nodes are less well-synchronized (i.e., further from a consensus), each node will be evaluating its local mini-batch gradient at a different point in parameter space. Averaging these points (rather than updates based on mini-batch gradients evaluated at the same point) can be seen as injecting additional noise into the optimization process, and in our experience this can lead to worse performance in terms of training and generalization errors.
a.2 Stochastic Gradient Push
In all of our experiments, we minimize the number of floating-point operations performed in each iteration, $k$, by using the mixing weights
$$p_{j,i}^{(k)} = \frac{1}{|\mathcal{N}_i^{\text{out}}(k)|}$$
for all out-neighbors $j$ of node $i$, where $\mathcal{N}_i^{\text{out}}(k)$ denotes the set of out-neighbors of node $i$ at iteration $k$. In words, each node assigns mixing weights uniformly to all of its out-neighbors in each iteration. Recalling our convention that each node is an in- and out-neighbor of itself, it is easy to see that this choice of mixing weights satisfies the column-stochasticity property. It may very well be that there is a different choice of mixing weights that leads to better spectral properties of the gossip algorithm; however, we leave this exploration for future work. We denote node $i$'s uniform mixing weights at time $k$ by $p_i^{(k)}$, dropping the other subscript, which identifies the receiving node.
To maximize the utility of the resources available on each server, each node (occupying a single server exclusively) runs two threads: a communication (gossip) thread and a computation thread. The computation thread executes the main logic used to train the local model on the GPUs available to the node, while the communication thread is used for inter-node network I/O. In particular, the communication thread is used to gossip messages between nodes. When using Ethernet-based communication, the nodes communicate their parameter tensors over CPUs. When using InfiniBand-based communication, the nodes communicate their parameter tensors using GPUDirect RDMA, thereby avoiding superfluous device-to-pinned-memory transfers of the model parameters.
Each node initializes its model on one of its GPUs, and initializes its scalar push-sum weight to $1$. At the start of training, each node also allocates a send and a receive communication buffer in pinned memory on the CPU (or equivalently on a GPU in the case of GPUDirect RDMA communication).
In each iteration, the communication thread waits for the send-buffer to be filled by the computation thread; transmits the message in the send-buffer to its out-neighbours; and then aggregates any newly received messages into the receive-buffer.
In each iteration, the computation thread blocks to retrieve the aggregated messages in the receive-buffer; directly adds the received parameters to its own model parameters; and directly adds the received push-sum weights to its own push-sum weight. The computation thread then converts the model parameters to the de-biased estimate by dividing by the push-sum weight; executes a forward-backward pass of the de-biased model in order to compute a stochastic mini-batch gradient; converts the model parameters back to the biased estimate by multiplying by the push-sum weight; and applies the newly computed stochastic gradients to the biased model. The updated model parameters are then multiplied by the mixing weight, $p_i^{(k)}$, and asynchronously copied back into the send-buffer for use by the communication thread. The push-sum weight is also multiplied by the same mixing weight and concatenated into the send-buffer.
In short, gossip is performed on the biased model parameters (push-sum numerators); stochastic gradients are computed using the de-biased model parameters; stochastic gradients are applied back to the biased model parameters; and then the biased model and the push-sum weight are multiplied by the same uniform mixing weight and copied back into the send-buffer.
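The order of operations summarized above can be illustrated with a single-process simulation. This is a simplified sketch under several assumptions that are not part of the original setup: scalar parameters, a toy quadratic loss $\frac{1}{2}(z - c_i)^2$ at node $i$ with exact rather than stochastic gradients, plain gradient steps without Nesterov momentum, one peer per node, and no threading or real communication. The name `sgp_simulation` is hypothetical.

```python
def sgp_simulation(targets, lr=0.1, num_iters=200):
    """Simulate SGP on toy losses 0.5 * (z - targets[i])**2, one per node."""
    n = len(targets)
    hops = [2 ** e for e in range((n - 1).bit_length())]
    x = [0.0] * n   # biased parameters (push-sum numerators)
    w = [1.0] * n   # push-sum weights
    for k in range(num_iters):
        for i in range(n):
            z = x[i] / w[i]            # de-biased parameters
            grad = z - targets[i]      # gradient evaluated at the de-biased model
            x[i] -= lr * grad          # gradient applied to the biased model
        p = 0.5                        # uniform mixing weight: self plus one peer
        h = hops[k % len(hops)]
        send_x = [p * xi for xi in x]  # contents of each node's send-buffer
        send_w = [p * wi for wi in w]
        # Each node keeps its own scaled copy and adds the message received
        # from the node that is h hops behind it.
        x = [send_x[i] + send_x[(i - h) % n] for i in range(n)]
        w = [send_w[i] + send_w[(i - h) % n] for i in range(n)]
    return [xi / wi for xi, wi in zip(x, w)]


estimates = sgp_simulation([float(c) for c in range(8)])
print(sum(estimates) / len(estimates))   # close to 3.5, the minimizer of the average loss
```

In this balanced schedule the push-sum weights remain equal to one, so the de-biasing step has no effect; it is kept in the sketch because it matters when nodes receive unequal numbers of messages. With a constant step size, individual nodes settle near, but not exactly at, the common minimizer.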
a.3 Hyperparameters
When we “apply the stochastic gradients” to the biased model parameters, we actually carry out an SGD step with nesterov momentum. For the , and GPU experiments we use the same exact learningrate, schedule, momentum, and weight decay as those suggested in [5] for SGD. In particular, we use a reference learningrate of with respect to a sample batch, and scale this linearly with the batchsize; we decay the learningrate by a factor of at epochs ; we use a nesterov momentum parameter of , and we use weight decay . For the GPU experiments, we decay the learningrate by a factor of at epochs , and we use a reference learningrate of . In the GPU experiment with two peerspernode, we revert to the original learningrate and schedule.
Appendix B Extra Experiments
b.1 Additional Training Curves




Figure B.3 shows the training and validation curves for the different runs performed on 10 Gbit/s Ethernet. Figure B.6 shows the training and validation curves for the different runs performed on 100 Gbit/s InfiniBand.


Figure B.9 reports the training and validation accuracy of SGP when using a high-latency interconnect. As we scale up the number of nodes $n$, we scale down the total number of iterations in proportion to $1/n$, following Theorem 1. In particular, 32-node runs involve $8$ times fewer global iterations than 4-node runs. We additionally report the total number of iterations and the final performance in Table 3. While we reduce the total number of iterations by a factor of $8$ when going from 4 to 32 nodes, the validation accuracy and training accuracy of the 32-node runs remain within a small margin of the validation and training accuracy achieved by the 4-node runs (and of the AllReduce SGD accuracies).
Nodes  4  8  16  32 

Iterations  
Training (%)  
Validation (%) 
b.2 Discrepancy across different nodes


Here, we investigate the performance variability across nodes during training for SGP. In Figure B.12, we report the minimum, maximum, and mean error across the different nodes for training and validation. In an initial training phase, we observe that nodes have different validation errors; their local copies of the ResNet-50 model diverge. As we decrease the learning rate, the variability between the different nodes diminishes and the nodes eventually converge to similar errors. This suggests that all models ultimately represent the same function, achieving consensus.
b.3 Timing on InfiniBand with local data copy
To better isolate the effects of data loading, we ran experiments on 32, 64, and 128 GPUs, where we first copied the data locally on every node. In that setting, we observe in Figure B.13 that the time per iteration of SGP remains approximately constant as we increase the number of nodes in the network, while the time for AllReduce SGD increases.
b.4 SGP Scaling Analysis



Figure B.17 highlights the input-image throughput of SGP as we scale up the number of cluster nodes on both 10 Gbit/s Ethernet and 100 Gbit/s InfiniBand. SGP exhibits 88.6% scaling efficiency on 10 Gbit/s Ethernet and 92.4% on InfiniBand, staying close to ideal scaling in both cases. In addition, Figure (c) shows that SGP exhibits better scaling as we increase the network size and is more robust to a high-latency interconnect.
Appendix C Proofs of Theoretical Guarantees
Our convergence rate analysis is divided into three main parts. In the first (subsection C.1), we present upper bounds for three important expressions that appear in our computations. In subsection C.2, we focus on proving Lemma 8, which is central to our analysis and on which we later build the proofs of our main theorems. Finally, in the third part (subsection C.3), we provide the proofs of Theorems 1 and 2.
Preliminary results.
In our analysis two preliminary results are extensively used. We state them here for future reference.

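Let $a, b \in \mathbb{R}$. Since $(a - b)^2 \ge 0$, it holds that
$$2ab \le a^2 + b^2. \tag{7}$$
Thus, $\|x\| \|y\| \le \frac{1}{2} \left( \|x\|^2 + \|y\|^2 \right)$.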

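Let $r \in (0, 1)$. Then, from the summation of a geometric sequence, for any $K \ge 0$ it holds that
$$\sum_{k=0}^{K} r^k \le \sum_{k=0}^{\infty} r^k = \frac{1}{1 - r}. \tag{8}$$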
Matrix Representation.
The presentation of stochastic gradient push (Algorithm 1) was done from node i’s perspective for all . Note however, that the update rule of SGP at the iteration can be viewed from a global viewpoint. To see this let us define the following matrices (concatenation of the values of all nodes at the iteration):
Using the above matrices, the step of SGP (Algorithm 1) can be expressed as follows (in a similar way, we can obtain matrix expressions for steps 7 and 8 of Algorithm 1):
(9) 
where is the transpose of matrix with entries:
(10) 
Recall that we also have .
Bound for the mixing matrices.
Next we state a known result from the control literature studying consensusbased optimization which allows us to bound the distance between the debiased parameters at each node and the nodewise average.
Recall that we have assumed that the sequence of mixing matrices are strongly connected. A directed graph is called strongly connected if every pair of vertices is connected with a directed path (i.e., following the direction of edges), and the strongly connected assumption is that the graph with edge set is strongly connected, for every .
We have also assumed that for all , each column of has nonzero entries, and that the graph with edge set has diameter at most . Based on these assumptions, after consecutive iterations, the product
has no zero entries. Moreover, every entry of is at least .
Lemma 3.
Suppose that Assumption 3 (mixing connectivity) holds. Let and let . Then there exists a constant
where is the dimension of , , and , such that, for all and ,
c.1 Important Upper Bounds
Lemma 4 (Bound of stochastic gradient).
We have the following inequality under Assumptions 1 and 2:
Proof.
(11)  
∎
Lemma 5.
Let Assumptions 13 hold. Then,
(12)  