1 Introduction
One key enabler of the extremely rapid recent progress of machine learning has been
distribution: the ability to efficiently split computation among multiple nodes or devices, in order to share the high computational loads of training large models, and therefore to reduce end-to-end training time. Distributed machine learning has become commonplace, and it is not unusual to encounter systems which distribute model training among tens or even hundreds of nodes. In this paper, we take this trend to the extreme, and ask: would it be possible to distribute basic optimization procedures such as stochastic gradient descent (SGD) to thousands of agents? How could the dynamics be implemented in such a large-scale setting, and what would the resulting convergence and speedup behavior be?

To get some intuition, let us consider the classical data-parallel distribution strategy for SGD [15]. We are in the classical empirical risk minimization setting, where we have a set of samples from a distribution $\mathcal{D}$, and wish to minimize the function $f$, which is the average of the losses over samples from $\mathcal{D}$, by finding a minimizer $x^\star$. Assume that we have $n$ compute nodes which can process $n$ samples in parallel. Data-parallel SGD consists of parallel iterations, in which each node computes the gradient for one sample, followed by a gradient exchange. Globally, this leads to the iteration:
$$x_{t+1} = x_t - \eta \, \frac{1}{n} \sum_{i=1}^{n} \tilde g_i(x_t),$$
where $\eta$ is the learning rate, $x_t$ is the value of the global parameter, initially $x_0$, and $\tilde g_i(x_t)$ is the stochastic gradient with respect to the parameter $x_t$ obtained by node $i$ at time $t$.
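For concreteness, the data-parallel iteration can be simulated as follows. This is a minimal sketch: the quadratic objective, node count, and learning rate are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 8, 4, 0.1          # nodes, dimension, learning rate (illustrative)
x = np.zeros(d)                # global parameter, initially x_0 = 0
target = np.ones(d)            # toy objective f(x) = ||x - target||^2 / 2

def stochastic_grad(x):
    # true gradient (x - target) plus zero-mean sampling noise
    return (x - target) + 0.1 * rng.standard_normal(d)

for t in range(200):
    # each of the n nodes computes a stochastic gradient in parallel ...
    grads = [stochastic_grad(x) for _ in range(n)]
    # ... then all gradients are exchanged and averaged into one global step
    x = x - eta * np.mean(grads, axis=0)

assert np.linalg.norm(x - target) < 0.1
```

Note that every iteration requires all $n$ nodes to exchange gradients and synchronize, which is exactly the bottleneck discussed next.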
When extending this strategy to high node counts, two major bottlenecks are communication and synchronization. In particular, to maintain a consistent view of the parameter $x_t$, the nodes would need to broadcast and receive all gradients, and would need to synchronize with all other nodes, at the end of every iteration. Recently, a tremendous amount of work has been dedicated to addressing these two barriers. In particular, there has been significant progress on communication-reduced variants of SGD (e.g. [43, 46, 7, 51, 3, 22, 25]), asynchronous variants (e.g. [41, 42, 23, 6]), large-batch or periodic model averaging methods, which aim to reduce the frequency of communication (e.g. [24, 53] and [19, 45]), and decentralized synchronous variants (e.g. [32, 48, 31]). Using such techniques, it is possible to scale SGD to hundreds of nodes, even for complex objectives such as the training of deep neural networks. However, in systems with node counts in the thousands or larger, it is infeasible to assume that all nodes can efficiently synchronize into global iterations, or that they can directly broadcast messages to all other nodes.
Instead, in this paper we will consider the classic population model of distributed computing [8], which is defined as follows. We are given a population of $n$ compute agents, each with its own input, which cooperate to perform some globally meaningful computation with respect to their inputs. Interactions occur pairwise, where the two interaction partners are randomly chosen in every step. Thus, algorithms are specified in terms of the agents' state transitions upon an interaction. The basic unit of time is a single pairwise interaction between two nodes, whereas global (parallel) time is measured as the total number of interactions divided by $n$, the number of nodes. Parallel time corresponds intuitively to the average number of interactions per node to reach convergence. Population protocols have a rich history in distributed computing (e.g. [8, 11, 10, 9, 12, 4, 5]), and are standard in modelling distributed systems with millions or billions of nodes, such as Chemical Reaction Networks (CRNs) [16, 18] and synthetic DNA strand displacement cascades [20]. The key difference between population protocols and the synchronous gossip models (e.g. [52, 32, 31]) previously used to analyze decentralized SGD is that nodes are not synchronized: since pairwise interactions are uniform random, there are no global rounds, and nodes lack a common notion of time.
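The interaction model and the notion of parallel time can be illustrated with a short simulation; the agent count and interaction count below are arbitrary choices of ours.

```python
import random

random.seed(1)
n = 100                      # number of agents
interactions = 0
counts = [0] * n             # interactions each agent participates in

for _ in range(10 * n):      # run 10n pairwise interactions
    # the scheduler picks a uniform random pair (with replacement)
    i, j = random.randrange(n), random.randrange(n)
    counts[i] += 1
    counts[j] += 1
    interactions += 1

parallel_time = interactions / n     # global (parallel) time
avg_interactions = sum(counts) / n   # average interactions per agent
assert parallel_time == 10
assert avg_interactions == 2 * parallel_time  # two agents per interaction
```

The final assertion reflects the definition in the text: parallel time tracks (up to the factor of two) the average number of interactions per agent.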
While the population model is a theoretical construct, we show that it can be efficiently mapped to large-scale supercomputing scenarios, with large numbers of compute nodes connected by a fast point-to-point interconnect, where we can avoid the high costs of global synchronization.
An immediate instantiation of SGD in the population model would be to initially assign one sample from the distribution $\mathcal{D}$ to each node $i$, and have each node maintain its own parameter estimate $X_i$. Whenever two nodes interact, they exchange samples, and each performs a gradient update with respect to the other's sample. If we assume interaction pairs are uniform random (with replacement), each node would obtain a fresh stochastic gradient upon each interaction, and therefore each model would converge locally. However, this instance would not have any parallel speedup, since the SGD instances at each node are essentially independent.

In this context, we propose a natural change to the above procedure, by which interacting nodes $i$ and $j$ first perform a gradient step, and then also average their resulting models upon every interaction. Effectively, if node $i$ interacts with node $j$, node $i$'s updated model becomes
$$X_i \leftarrow \frac{\big(X_i - \eta \, \tilde g_i\big) + \big(X_j - \eta \, \tilde g_j\big)}{2}, \qquad (1.1)$$
where $j$ is the interaction partner, and the stochastic gradients $\tilde g_i$ and $\tilde g_j$ are taken with respect to each other's samples. The update for node $j$ is symmetric. In this paper, we analyze a variant of the above protocol, which we call PopSGD, in the population protocol model.
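A minimal simulation of this pairwise step-then-average dynamic follows; the toy quadratic objective, node count, and interaction horizon are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta, T = 16, 4, 0.05, 20000
target = np.ones(d)
X = rng.standard_normal((n, d))     # one local model per agent

def stochastic_grad(x):
    # toy objective f(x) = ||x - target||^2 / 2, plus sampling noise
    return (x - target) + 0.1 * rng.standard_normal(d)

for _ in range(T):
    i, j = rng.integers(n), rng.integers(n)   # uniform random interaction pair
    # each agent takes one SGD step (w.r.t. the partner's sample) ...
    xi = X[i] - eta * stochastic_grad(X[i])
    xj = X[j] - eta * stochastic_grad(X[j])
    # ... then both adopt the average of the two updated models
    X[i] = X[j] = (xi + xj) / 2

mu = X.mean(axis=0)                 # the average model converges to the optimum
assert np.linalg.norm(mu - target) < 0.1
```

Removing the averaging line turns this into $n$ independent SGD runs, which is exactly the no-speedup baseline described above.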
We show that, perhaps surprisingly, this simple decentralized SGD averaging dynamic provides strong convergence guarantees for both convex and nonconvex objectives. First, we prove that, under standard convexity and smoothness assumptions, PopSGD has a convergence speedup that is linear in the number of nodes $n$. Specifically, if $\mu_t$ is the parameter average over all node models at parallel time $t$, and $\tilde\mu_T$ is the average over the sequence of the $\mu_t$, then, for large enough parallel time $T$, our main result is that
$$\mathbb{E}\big[f(\tilde\mu_T)\big] - f(x^\star) = \tilde O\!\left(\frac{1}{nT}\right), \qquad (1.2)$$
which is $n$ times faster than the sequential variant, given the same number of SGD steps per node. (Please see Theorem 4.1 for the exact formulation, including bias and variance terms, and for an extended discussion.) This result suggests that, even though interactions occur only pairwise, uniformly at random, and in an uncoordinated manner, as long as the convergence time is large enough to amortize the information propagation, the protocol enjoys the full parallel speedup of minibatch SGD with a batch size proportional to the number of nodes. While speedup behaviour has been observed in various synchronous models (e.g. [32, 45, 31]), or for complex accelerated algorithms [27], we are the first to show that SGD does not require the existence of globally synchronized rounds or global communication.

Central to our analytic approach is a new technical connection between averaging decentralized SGD and the line of research studying load-balancing processes in theoretical computer science (e.g. [14, 35, 47, 39]). Intuitively, we show that PopSGD can be viewed as a composition of $n$ instances of SGD, each corresponding to one of the local parameters $X_i$, which are loosely coupled via pairwise averaging, whose role is to "balance" the models by keeping them well concentrated around their mean, despite the random nature of the pairwise interactions. Our analysis characterizes this concentration, showing that, in essence, the averaging process propagates enough information to globally "simulate" SGD with a batch of size $n$, even though communication is only performed pairwise. We emphasize that the convexity of the objective function in isolation would not be sufficient to prove this fact; see e.g. [45]. Along the way, we overcome nontrivial technical difficulties, such as the lack of a common notion of time among nodes, or the fact that, due to the structure of SGD, this novel load-balancing process exhibits nontrivial correlations within the same round.
On the practical side, we provide convergence and speedup results using an efficient implementation of PopSGD in PyTorch/MPI, applied to regression tasks, but also to the standard CIFAR/ImageNet classification tasks, for deployments on multi-GPU nodes and on the CSCS Piz Daint supercomputer [1]. Experiments confirm the scalability of PopSGD to large node counts. More surprisingly, we also observe an improvement in convergence versus the number of SGD iterations per model at higher node counts, in both convex and nonconvex settings. In particular, using PopSGD, we are able to train the ResNet18 and ResNet50 [26] models to full accuracy using only a fraction of the number of SGD updates per model relative to the sequential baseline, resulting in fast convergence with nearly linear scalability.

Related Work.
The study of decentralized optimization algorithms dates back to [49], and is related to the study of gossip algorithms for information dissemination [30, 52]. The distinguishing feature of this setting is that information is shared between nodes and their neighbors, in the absence of a coordinator. Several classic algorithms have been ported to and analyzed in the gossip setting, such as subgradient methods for convex objectives [37, 29, 44] or ADMM [50, 28]. References [32, 33, 13] consider SGD-type algorithms in the nonconvex setting, while references [48, 31] analyze the use of quantization in the gossip setting.
The key difference between the gossip model discussed above and the population model we analyze is that, in the gossip model, time is divided into global rounds, which are assumed to be consistent across nodes. In each round, each node broadcasts to and receives from all its neighbors: analytically, this synchrony assumption allows the global evolution of the system to be represented in terms of the "gossip" (communication/contact) matrix (see e.g. [52, 31, 33]). This matrix characterization is not possible in the population model: nodes do not share a notion of time or rounds, as communication steps correspond to individual interactions. If we consider sequences of consecutive interactions, due to scheduler randomness, some nodes will interact several times, while others may not interact at all during such an interval. For this reason, our analysis makes use of fine-grained potential arguments, rather than a global matrix iteration. There do exist instances in the literature which consider dynamic interaction models. First, Nedic et al. [36] present a gradient tracking algorithm in a different dynamic graph model; however, their results do not translate to the population protocol model, as they assume a dynamically changing but simple graph in each iteration: by contrast, merging together multiple interaction rounds from the population protocol model could result in a multigraph. Further, Hendrickx et al. [27] achieve exponential convergence rates in a gossip model where transmissions are synchronized across edges; however, the algorithm they consider is a more complex instance of accelerated coordinate descent, and is therefore quite different from the simple dynamics we consider. Importantly, neither reference considers large-scale deployments for nonconvex objectives (in particular, neural networks).
It is interesting to contrast our work with that of [33], who assume a synchronized gossip model, but allow for asynchrony, in the sense that nodes can see stale variants of their neighbors’ messages. To our understanding, the models are not directly comparable, and in particular their results cannot be applied to our setting. This is because they rely on a variant of the global matrix iteration (albeit based on delayed views). They consider challenging smooth nonconvex objectives, but do not show any speedup due to parallelization. We provide similar guarantees for nonconvex PopSGD, but are able to show linear speedup in the convex case. In our setting, since interaction pairs are chosen randomly, there can be significant local variability between the interaction rates of the nodes, and so the matrix iteration would not be applicable.
The population model, introduced by [8], is one of the standard models of distributed computing, and has proved a useful abstraction for modeling settings ranging from wireless sensor networks [40, 21], to gene regulatory networks [16], and chemical reaction networks [18]. While there has been significant work on algorithms for specific tasks, such as majority (consensus), leader election, and approximate counting, we are the first to consider optimization tasks in the population model. Potential analysis is a common tool in load balancing (e.g. [47, 39]), which we adapt for our setting. The three key departures from the load-balancing literature are that 1) in the SGD setting, the weights (gradients) are correlated with the loads of the bins (the models); 2) in the SGD setting, the magnitude of the weights is diminishing (due to the learning rate), which requires a continuous re-evaluation of the balancing objective; and 3) the models are multidimensional, whereas in the classical formulation the balanced items are single-dimensional.
2 Preliminaries
The Population Protocol Model.
We consider a variant of the population protocol model, which consists of a set of $n$ anonymous agents, or nodes, each executing a local state machine. (Our analysis will make use of node identifiers only for exposition purposes.) Since our application is continuous optimization, we will assume that the agents' states may store real numbers. The execution proceeds in discrete steps, where in each step a new pair of agents is selected uniformly at random to interact, from the set of all possible pairs. (To preserve the symmetry of the protocols, we will assume that a process may interact with a copy of itself, with low probability.) Each of the two chosen agents updates its state according to a state update function, specified by the algorithm. The basic unit of time is a single pairwise interaction between two nodes. Notice however that in a real system many of these interactions could occur in parallel. Thus, a standard global measure is parallel time, defined as the total number of interactions divided by $n$, the number of nodes. Parallel time intuitively corresponds to the average number of interactions per node to convergence.

Stochastic Optimization.
We assume that the agents wish to minimize a $d$-dimensional, differentiable and strongly convex function $f$ with strong convexity parameter $\ell > 0$, that is:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\ell}{2} \|y - x\|^2, \quad \text{for all } x, y. \qquad (2.1)$$
Specifically, we will assume the empirical risk minimization setting, in which the agents are given access to a set $S$ of data samples coming from some underlying distribution $\mathcal{D}$, and to functions $f_s$ which encode the loss of the argument at the sample $s$. The goal of the agents is to converge on a model $x^\star$ which minimizes the empirical loss, that is
$$x^\star = \operatorname*{argmin}_x f(x) := \operatorname*{argmin}_x \frac{1}{|S|} \sum_{s \in S} f_s(x). \qquad (2.2)$$
In this paper, we assume that the agents employ these samples to run a decentralized variant of SGD, described in detail in the next section. For this, we will assume that agents have access to stochastic gradients $\tilde g$ of the function $f$, which are functions such that $\mathbb{E}[\tilde g(x)] = \nabla f(x)$. Stochastic gradients can be computed by each agent by sampling i.i.d. from the distribution $\mathcal{D}$, and computing the gradient of $f$ at $x$ with respect to that sample. In the population model, we could implement this procedure either by allowing agents to sample in each step, or by assigning a sample to each agent $i$, and having agents compute gradients of their local models with respect to each other's samples. We will assume the following about the gradients:
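As a quick sanity check of the unbiasedness property $\mathbb{E}[\tilde g(x)] = \nabla f(x)$, the following sketch compares the empirical mean of single-sample gradients against the full gradient on a toy least-squares instance (the data, dimensions, and sample count are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 3
A = rng.standard_normal((m, d))          # toy least-squares data
b = rng.standard_normal(m)

def full_grad(x):
    # gradient of f(x) = (1/2m) * ||Ax - b||^2
    return A.T @ (A @ x - b) / m

x = rng.standard_normal(d)

# draw 100,000 single-sample stochastic gradients and average them
ks = rng.integers(m, size=100_000)
est = np.mean(A[ks] * (A[ks] @ x - b[ks])[:, None], axis=0)

# the empirical mean of stochastic gradients approaches the full gradient
assert np.linalg.norm(est - full_grad(x)) < 0.05
```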

- Smooth Gradients: The gradient $\nabla f(\cdot)$ is Lipschitz continuous for some $L > 0$, i.e. for all $x, y$:
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|. \qquad (2.3)$$
- Bounded Variance: The variance of the stochastic gradients is bounded by some $\sigma^2 > 0$, i.e. for all $x$:
$$\mathbb{E}\big[\|\tilde g(x) - \nabla f(x)\|^2\big] \le \sigma^2. \qquad (2.4)$$
- Bounded Second Moment: The second moment of the stochastic gradients is bounded by some $M^2 > 0$, i.e. for all $x$:
$$\mathbb{E}\big[\|\tilde g(x)\|^2\big] \le M^2. \qquad (2.5)$$
3 The Population SGD Algorithm
Algorithm Description.
We now describe a decentralized variant of SGD, designed to be executed by a population of $n$ nodes, interacting in uniform random pairs as per the population protocol model. We assume that each node $i$ has access to local stochastic gradients $\tilde g_i$, and maintains a model estimate $X_i$, as well as a local learning rate $\eta_i$. For simplicity, we will assume that the initial model estimate is the same at each agent, although its value may be arbitrary. We detail the way in which the learning rates are updated below. Specifically, upon every interaction, the interacting agents $i$ and $j$ perform the following steps:
We are interested in the convergence of the local models $X_i$ after $T$ interactions occur in total. For technical reasons, in the case when $f$ is convex, we derive convergence rates for $\tilde\mu_T$, which is a weighted average of the average values of the local models per step (see Theorem 4.1). At the beginning of Section 5, we show that, by performing a single global averaging step at a time step chosen carefully from a specified distribution, we can ensure that, in expectation, the local models converge at the same rate as $\tilde\mu_T$.
Estimating Time and the Learning Rate.
In parallel with the above algorithm, each agent maintains a local time value $T_i$, which is estimated using a local "phase clock" protocol. These local times are defined and updated as follows. The initial value at each agent is $0$. Upon each interaction, the interacting agents $i$ and $j$ exchange their time values. The agent with the lower time value, say $i$, increments its value by $1$. The other agent keeps its local value unchanged. (We break ties arbitrarily.) Although intuitively simple, the above procedure provides strong probabilistic guarantees on how far individual values may stray from the mean: with high probability, all the time estimates stay within a logarithmic additive interval around the mean number of interactions per node. (An event holds with high probability (w.h.p.) if it occurs with probability at least $1 - 1/T^{c}$, for a constant $c$ and the total number of interactions $T$.)
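A sketch of this phase-clock dynamic follows (agent count and horizon are illustrative choices of ours); note how tightly the local estimates concentrate around the mean number of interactions per node:

```python
import random

random.seed(0)
n = 200
T = [0] * n                      # local time estimates, initially 0

total = 50 * n                   # run 50n interactions (parallel time 50)
for _ in range(total):
    i, j = random.randrange(n), random.randrange(n)
    # the agent with the lower time value increments it; ties broken arbitrarily
    if T[i] <= T[j]:
        T[i] += 1
    else:
        T[j] += 1

mean = sum(T) / n                # one increment per interaction => mean = total / n
assert mean == total / n
assert max(T) - min(T) < 30      # estimates stay concentrated around the mean
```

The constant 30 in the final check is a loose empirical margin for this run, not the theoretical bound from the text.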
Given the current value of $T_i$ at agent $i$, the value of the learning rate at $i$ is simply $\eta_i = b/(T_i + a)$, where $a$ and $b$ are constant parameters which we will fix later. This will ensure that the gap between two agents' learning rates stays within a small interval, w.h.p. (See Lemma 4.2.)
4 The Convergence of PopSGD in the Convex Case
This section is dedicated to proving that the following result holds with high probability:
Theorem 4.1.
Let $f$ be an $L$-smooth, $\ell$-strongly convex function satisfying conditions (2.3)–(2.5), whose minimum $x^\star$ we are trying to find via the PopSGD procedure given in Algorithm 1. Let the learning rate for process $i$ at local time $T_i$ be of the form $\eta = b/(T_i + a)$, where $a$ and $b$ are fixed constants. Let the sequence of weights $w_t$ be given accordingly, and define $\tilde\mu_T$ as the corresponding weighted average of the per-step model averages $\mu_t$. Then, for any time $T$, we have with high probability that
(4.1) 
Discussion.
We first emphasize that, in the above bound, the time $t$ refers to the number of interactions (as opposed to parallel time). With this in mind, we focus on the bound in the case where the parameters are assumed to be well-behaved. In this case, the first and third terms vanish as $t$ grows, and we get that convergence is dominated by the second term. It is tempting to think that this is roughly the same rate as sequential SGD; however, our notion of time is different, as we are counting the total number of SGD steps executed across all the models. (In fact, the total number of SGD steps up to time $t$ is $2t$, since each interaction performs two SGD steps.)
It is interesting to interpret this from the perspective of an arbitrary local model. For this, notice that the parallel time corresponding to $t$ total interactions, which is by definition $t/n$, corresponds (up to constants) to the average number of interactions and SGD steps performed by each node up to time $t$. Thus, for any single model, convergence with respect to its own number of performed SGD steps matches that of running SGD with a batch size of $n$. Notice that this reduction in convergence time is solely thanks to the averaging step: in the absence of averaging, each local model would converge independently at the sequential rate. We note that our discussion assumes a batch size of one, but it would generalize to an arbitrary batch size $B$, replacing $n$ with $nB$. We note that, due to the concentration properties of the averaging process, the claim above can be extended to show the same convergence behavior for arbitrary individual models (instead of the average of models $\mu_t$).
Proof Overview.
The argument, given in full in the Additional Material, can be split into two steps. The first step aims to bound the variance of the local models at each time $t$ and node $i$ with respect to the mean $\mu_t$. It views this quantity as a potential $\Gamma_t$, which we show has supermartingale-like behavior, which enables us to bound its expected value over time. This shows that the variance of the parameters is always bounded with respect to the number of nodes, but also, importantly, that it can be controlled via the learning rate. The key technical step here is Lemma 4.3, which provides a careful bound for the evolution of the potential at each step, by modelling SGD as a dynamic load-balancing process: each interaction corresponds to a weight generation step (in which gradients are generated) and a load-balancing step, in which the "loads" of the two nodes (corresponding to their model values) are balanced through averaging.
In the second step of the proof, we first bound the rate at which the mean $\mu_t$ converges towards the optimum $x^\star$, where we crucially (and carefully) leverage the variance bound obtained above. This is our second key technical lemma. Next, with this in hand, we can apply a standard argument to characterize the rate at which the weighted average $\tilde\mu_T$ converges towards the optimum.
Notation and Preliminaries.
In this section, we overview the analysis of the PopSGD protocol. We begin with some notation. Recall that $n$ is the number of nodes. We will analyze a sequence of time steps $t = 1, 2, \ldots, T$, each corresponding to an individual interaction between two nodes, which are usually denoted by $i$ and $j$. We will consider a sufficiently long execution, and therefore w.h.p. results are assumed to hold throughout the execution. Recall the definition of parallel time $t/n$, where $t$ counts the number of pairwise interactions. For any time $t$, define the "true" learning rate at time $t$ as $\eta_t$, where the constants involved are to be fixed later. We denote by $x^\star$ the optimum of the function $f$.
Learning Rate Estimates.
Our first technical result characterizes the gap between the "global" learning rate $\eta_t$ (in terms of the true time $t$), and the individual learning rate at an arbitrary agent $i$ at the same time, denoted by $\eta_t^i$.
Lemma 4.2.
Let $\eta_t^i$ be the learning rate estimate of agent $i$ at time step $t$, in terms of its time estimate $T_i$. Then, there exists a constant $c$ such that, with probability at least $1 - 1/T^{c}$ (here, $T$ is the total number of steps our algorithm takes), the following holds for every time step $t$ and agent $i$:
(4.2) 
Step 1: Parameter Concentration. Next, let $X_t = (X_t^1, X_t^2, \ldots, X_t^n)$ be the vector of model estimates at time step $t$. Also, let $\mu_t = \frac{1}{n} \sum_{i=1}^{n} X_t^i$ be the average estimate at time step $t$. The following potential function measures the concentration of the models around the average:
$$\Gamma_t = \sum_{i=1}^{n} \|X_t^i - \mu_t\|^2.$$
With this in place, one of our key technical results is to provide a supermartingale-type bound on the evolution of the potential $\Gamma_t$, in terms of the learning rate, the second-moment bound, and the number of nodes $n$.
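Taking the potential to be the total squared deviation of the models from their mean (as defined above), a single pairwise averaging step leaves the mean unchanged and decreases the potential by exactly half the squared distance between the two interacting models, which a short check confirms (dimensions and data are arbitrary):

```python
import numpy as np

def potential(X):
    # Gamma = sum_i ||X_i - mu||^2, where mu is the average model
    mu = X.mean(axis=0)
    return float(np.sum((X - mu) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))                 # 8 local models in dimension 3

before = potential(X)
drop = float(np.sum((X[0] - X[1]) ** 2)) / 2    # ||X_i - X_j||^2 / 2
X[0] = X[1] = (X[0] + X[1]) / 2                 # one pairwise averaging step
after = potential(X)

assert after <= before                          # averaging never increases Gamma
assert np.isclose(before - after, drop)         # exact decrease (parallelogram law)
```

This deterministic decrease is what the averaging ("load-balancing") steps contribute; the gradient steps, by contrast, can increase the potential, which is the tension Lemma 4.3 controls.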
Lemma 4.3.
For any time step $t$ and fixed learning rate $\eta$ used at $t$, we have the bound
Next, we unroll this recurrence to upper bound in expectation for any time step , by choosing an appropriate series of nonconstant learning rates.
Lemma 4.4.
If , then the potential is bounded as follows
Step 2: Convergence of the Mean and Risk Bound.
The above result allows us to characterize how well the individual parameters are concentrated around their mean, in terms of the second moment of the gradients, the number of nodes, and the learning rate. In turn, this will allow us to provide a recurrence for how fast the parameter average is moving towards the optimum, in terms of the variance and secondmoment bounds of the gradients:
Lemma 4.5.
For , we have that
Finally, we wish to phrase this bound as a recurrence which will allow us to bound the expected risk of the weighted sum average. We aim to use the following standard result (see e.g. [45]):
Lemma 4.6.
Let , , , be sequences satisfying
for , , , then
(4.3) 
for and .
To use the above lemma, we set , and the parameter . We also use , , and . Let . Also, let and . By the convexity of we have that
(4.4) 
Using this fact and Lemma 4.6 above we obtain the following final bound:
(4.5) 
To complete the proof of the theorem, we only need to find the appropriate value of the parameter $a$. For that, we list all the constraints on this parameter; the resulting inequalities can be satisfied by setting it appropriately. This concludes our proof.
5 Extensions
Convergence of local models and alternative to computing .
Notice that Theorem 4.1 measures the convergence of $\tilde\mu_T$, which is a weighted average of the per-step averages $\mu_t$. Notice that actually computing $\tilde\mu_T$ can be expensive, since we would need the values of all local models over $T$ steps, and it does not necessarily guarantee convergence of each individual model. In order to circumvent this issue, we can look at the following inequality, which in combination with Jensen's inequality gives us the proof of Theorem 4.1 (please see the Appendix for details):
(5.1) 
What we can do is, instead of computing $\tilde\mu_T$, to simply sample a time step $t$ with probability proportional to its weight, and compute $\mu_t$ by using a single global averaging procedure. Observe that the resulting expectation is exactly the left-hand side of the above inequality.
Hence, we get convergence identical to that in Theorem 4.1, and additionally, since we are using global averaging, we also guarantee the same convergence for each local model. Finally, we would like to emphasize that in practice there is no need to compute $\tilde\mu_T$ or to use global averaging, since the local models have already converged after $T$ interactions.
General Interaction Graphs.
Our analysis can be extended to more general interaction graphs by bounding the evolution of the potential $\Gamma_t$ in this case. In the following, we present the results for a cycle, leaving the exact derivations for more general classes of expander graphs to the full version. In particular, we assume that each agent is a node on a cycle, and that it is allowed to interact only with its neighbouring nodes. Again, the scheduler chooses interaction edges uniformly at random. In this setting, we can show the following result, which is similar to Theorem 4.1:
Theorem 5.1.
Let $f$ be an $L$-smooth, $\ell$-strongly convex function satisfying conditions (2.3)–(2.5), whose minimum $x^\star$ we are trying to find via the PopSGD procedure on a cycle. Let the learning rate for process $i$ at local time $T_i$ be of the form $b/(T_i + a)$, where $a$ and $b$ are fixed constants. Let the sequence of weights and the averages $\mu_t$, $\tilde\mu_T$ be defined as in Theorem 4.1. Then, for any time $T$, we have with high probability that
Notice that for large enough $T$, the second term dominates convergence, and we can repeat the same argument as for Theorem 4.1 to show convergence (where $T$ is the total number of interactions). Next, we provide a sketch of the proof for PopSGD on a cycle. The crucial part of the proof is to show a per-step bound for the potential. The resulting inequality is similar to Lemma 4.3 (except that the factor in front of the potential is larger for a cycle, since "information" propagates more slowly on a cycle), and it allows us to bound the potential at each step. This, in turn, allows us to prove the above theorem by carefully following the steps in the proof of Lemma 4.5, and then by using Lemma 4.6 to finish the proof.
The NonConvex Case.
Next, we show convergence for nonconvex, but smooth functions:
Theorem 5.2.
Let $f$ be a nonconvex, $L$-smooth function satisfying conditions (2.3) and (2.5), whose minimum we are trying to find via the PopSGD procedure given in Algorithm 1. Let $T$ be the total number of interactions. For every time step $t$ and process $i$, let the learning rate be a fixed constant (for any process, the learning rate does not depend on the current local or global time). Then, for any $T$, we have that:
(5.2) 
The proof follows from a more general version of the theorem, which is proved in the appendix; see Theorem 9.2. Observe that, since $T$ is the total number of interactions, and is equal to $n$ times the parallel time, the convergence rate we get matches that of the sequential version in terms of parallel time. (Note that in the sequential case, parallel time and the total number of interactions are the same.)
6 Experimental Results
In this section, we validate our results numerically by implementing PopSGD in PyTorch, using MPI for inter-node communication [2]. We are interested in the convergence behavior of the algorithm, and in its scalability with respect to the number of nodes. Our study is split into simulated experiments for convex objectives (to examine the validity of our analysis as $n$ increases) and large-scale real-world experiments for nonconvex objectives (training neural networks), aimed to examine whether PopSGD can provide scalability and convergence for such objectives.
Convex Objectives.
To validate our analysis in the convex case, we evaluated the performance of PopSGD on three datasets: (1) a real-world linear regression problem (the Year Prediction dataset [17]); (2) a real-world classification problem (Gisette [17]); (3) a synthetic least-squares problem of the form (2.2), with variable dimension. As a baseline, we employ vanilla SGD with manual learning rate tuning. The learning rate is adjusted in terms of the number of local steps each node has taken, similar to our analysis.

[Figure 1: Convergence of PopSGD on a real linear regression (left) and logistic regression (right) dataset. The baseline is sequential SGD, which is identical to PopSGD with node count one.]

Our first set of experiments examines train and test loss for PopSGD on the real-world tasks specified above. We examine the test loss behavior with respect to the number of nodes $n$, executing for increasing powers of two. Each node obtains a stochastic gradient by sampling elements from the training set in a batch. We tuned the learning rate parameter for each instance independently, through line search, for both Gisette and Year Prediction.
Please see Figure 1(b) for the results. (The number of epochs is cropped to maintain visibility, but the trends hold in general.) The results confirm our analysis; notice in particular the clear separation between instances for different $n$, which follows exactly the increase in the number of nodes, although the X-axis values correspond to the same number of gradient steps for the local model. In Appendix 8, we present additional experiments which precisely examine the reduction in variance versus the number of nodes on the synthetic regression task, confirming our analysis.

Training Neural Networks.
Our second set of experiments tests PopSGD in a realistic distributed environment. For this, we implemented PopSGD in PyTorch using MPI one-sided primitives [2], which allow nodes to read each other's models for averaging without explicit synchronization. We used PopSGD to train ResNets on the classic CIFAR-10 and ImageNet datasets, and deployed our code on the CSCS Piz Daint supercomputer, which is composed of Cray XC50 nodes, each with a Xeon E5-2690 v3 CPU and an NVIDIA Tesla P100 GPU, using a state-of-the-art Aries interconnect.
Training proceeds in epochs, each of which is structured as follows. At the beginning of each epoch, we shuffle the dataset and split it into partitions, ensuring that each partition will be assigned to exactly two processes. We define a fixed constant mult, which counts the number of times each process will iterate through its partition in an epoch. In our experiments, mult takes small constant values. Intuitively, mult follows the intuition given by Theorems 4.1 and 5.2, which suggest that PopSGD needs additional iterations for the information in each partition to propagate to all nodes. Given this setup, PopSGD may appear wasteful, since each sample is processed mult times in each epoch. We compensate for this by compressing the standard training schedules for the networks we examine, dividing the total number of epochs by mult, and scaling the learning rate updates accordingly. We keep local batch sizes constant with respect to the sequential baseline. That is, in an experiment with $n$ nodes and multiplier mult, PopSGD processes each sample fewer times than standard sequential or data-parallel SGD, and performs fewer gradient updates per model. Surprisingly, we found this to be sufficient to preserve both train and test accuracy. Figure 2 shows the test and train accuracies for the ResNet18 model trained on the ImageNet dataset with 32 Piz Daint nodes, as well as scalability versus the number of nodes.
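The epoch setup described above can be sketched as follows. This is a minimal sketch: function and variable names are our own, and it assumes an even number of processes so that every partition is shared by exactly two of them.

```python
import random
from collections import Counter

def popsgd_schedule(num_samples, n_procs, mult, base_epochs):
    """Sketch of one epoch's partition assignment plus the compressed schedule."""
    assert n_procs % 2 == 0
    idx = list(range(num_samples))
    random.shuffle(idx)                       # reshuffle at each epoch start
    n_parts = n_procs // 2                    # each partition serves two processes
    parts = [idx[k::n_parts] for k in range(n_parts)]
    # process p iterates `mult` times over partition p % n_parts in this epoch
    assignment = {p: parts[p % n_parts] for p in range(n_procs)}
    # the training schedule is compressed by the same factor
    total_epochs = base_epochs // mult
    return assignment, total_epochs

random.seed(0)
assignment, epochs = popsgd_schedule(num_samples=1000, n_procs=8,
                                     mult=3, base_epochs=90)
assert epochs == 30
assert len(assignment) == 8
# every sample appears in exactly two processes' assignments
c = Counter(s for part in assignment.values() for s in part)
assert all(v == 2 for v in c.values())
```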
The results suggest that PopSGD can indeed preserve convergence while ensuring scalability for this complex task. We note that the hyperparameters used for model training are identical to the standard sequential recipe (batch size 128 per node), with the sole exception of the
mult parameter, for which we found low constant values (–) to be sufficient. Appendix 8 presents additional experiments for ResNet-50/ImageNet and ResNet-20/CIFAR-10, which further substantiate this claim.
7 Discussion and Future Work
We have analyzed for the first time the convergence of decentralized SGD in the population model of distributed computing. We have shown that, despite the extremely weak synchronization characteristics of this model, SGD is still able to converge in this setting, and moreover, under parameter and objective assumptions, can even achieve linear speedup in the number of agents in terms of parallel time. The empirical results confirmed our analytical findings. The main surprising result is that PopSGD exhibits speedup behavior roughly similar to mini-batch SGD, even though a node only sees one gradient update and a single model at a time. This asymptotic speedup behavior is obviously optimal (assuming all other parameters are constant), since we cannot expect superlinear speedup in . Previously, similar speedup behavior required either the existence of synchronized rounds (e.g. [32]), or global averaging steps [45], or both. Our work opens several avenues for future work. One natural extension is to study PopSGD with quantized communication, or to allow interactions to present inconsistent (stale) model views to the two agents. Another avenue is to tighten the bounds in terms of their dependence on the problem conditioning and on the objective assumptions.
References
 [1] The CSCS Piz Daint supercomputer. http://www.cscs.ch/computers/piz_daint. Accessed: 2018125.
 [2] MPICH: High-performance and widely portable implementation of the Message Passing Interface (MPI) standard. http://www.mpich.org/. Accessed: 2018125.
 [3] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
 [4] Dan Alistarh, James Aspnes, David Eisenstat, Rati Gelashvili, and Ronald L Rivest. Time-space trade-offs in population protocols. In Proceedings of the 28th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2560–2579, 2017.
 [5] Dan Alistarh, James Aspnes, and Rati Gelashvili. Space-optimal majority in population protocols. In Proceedings of the 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2221–2239, 2018.
 [6] Dan Alistarh, Christopher De Sa, and Nikola Konstantinov. The convergence of stochastic gradient descent in asynchronous shared memory. In PODC, pages 169–178, 2018.
 [7] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Randomized quantization for communication-efficient stochastic gradient descent. In Proceedings of NIPS 2017, 2017.
 [8] Dana Angluin, James Aspnes, Zoë Diamadi, Michael J Fischer, and René Peralta. Computation in networks of passively mobile finite-state sensors. Distributed Computing, 18(4):235–253, 2006.
 [9] Dana Angluin, James Aspnes, and David Eisenstat. Fast computation by population protocols with a leader. Distributed Computing, 21(3):183–199, 2008.
 [10] Dana Angluin, James Aspnes, and David Eisenstat. A simple population protocol for fast robust approximate majority. Distributed Computing, 21(2):87–102, 2008.
 [11] Dana Angluin, James Aspnes, David Eisenstat, and Eric Ruppert. The computational power of population protocols. Distributed Computing, 20(4):279–304, 2007.
 [12] Dana Angluin, James Aspnes, Michael J Fischer, and Hong Jiang. Self-stabilizing population protocols. ACM Transactions on Autonomous and Adaptive Systems, 3(4):13:1–13:28, 2008.
 [13] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.
 [14] Yossi Azar, Andrei Z Broder, Anna R Karlin, and Eli Upfal. Balanced allocations. SIAM Journal on Computing, 29(1):180–200, 1999.
 [15] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 [16] James M Bower and Hamid Bolouri. Computational modeling of genetic and biochemical networks. MIT press, 2004.

 [17] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.
 [18] Ho-Lin Chen, Rachel Cummings, David Doty, and David Soloveichik. Speed faults in computation by chemical reaction networks. Distributed Computing, 30(5):373–390, 2017.

 [19] Kai Chen and Qiang Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5880–5884. IEEE, 2016.
 [20] Yuan-Jyue Chen, Neil Dalchau, Niranjan Srinivas, Andrew Phillips, Luca Cardelli, David Soloveichik, and Georg Seelig. Programmable chemical controllers made from DNA. Nature Nanotechnology, 8(10):755–762, 2013.
 [21] Moez Draief and Milan Vojnovic. Convergence speed of binary interval consensus. SIAM Journal on Control and Optimization, 50(3):1087–1109, 2012.
 [22] Nikoli Dryden, Sam Ade Jacobs, Tim Moon, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, pages 1–8. IEEE Press, 2016.
 [23] John C Duchi, Sorathan Chaturapruek, and Christopher Ré. Asynchronous stochastic convex optimization. arXiv preprint arXiv:1508.00882, 2015.
 [24] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [25] Demjan Grubic, Leo Tam, Dan Alistarh, and Ce Zhang. Synchronous multi-GPU deep learning with low-precision communication: An experimental study. In EDBT, 2018.

 [26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [27] Hadrien Hendrikx, Laurent Massoulié, and Francis Bach. Accelerated decentralized optimization with local updates for smooth and strongly convex objectives. arXiv preprint arXiv:1810.02660, 2018.
 [28] Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Conference on Decision and Control, pages 3671–3676. IEEE, 2013.
 [29] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.
 [30] David Kempe, Alin Dobra, and Johannes Gehrke. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 482–491. IEEE, 2003.
 [31] Anastasia Koloskova, Sebastian U Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv preprint arXiv:1902.00340, 2019.
 [32] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1705.09056, 2017.
 [33] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.

 [34] Sébastien Marcel and Yann Rodriguez. Torchvision: The machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1485–1488. ACM, 2010.
 [35] Michael Mitzenmacher. How useful is old information? IEEE Transactions on Parallel and Distributed Systems, 11(1):6–20, 2000.
 [36] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
 [37] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48, 2009.
 [38] Yuval Peres, Kunal Talwar, and Udi Wieder. Graphical balanced allocations and the (1 + β)-choice process. Random Struct. Algorithms, 47(4):760–775, December 2015.
 [39] Yuval Peres, Kunal Talwar, and Udi Wieder. Graphical balanced allocations and the (1 + β)-choice process. Random Struct. Algorithms, 47(4):760–775, 2015.
 [40] Etienne Perron, Dinkar Vasudevan, and Milan Vojnovic. Using three states for binary consensus on complete graphs. In Proceedings of the 28th IEEE Conference on Computer Communications, INFOCOM ’09, pages 2527–2535, 2009.
 [41] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
 [42] C. M. De Sa, C. Zhang, K. Olukotun, and C. Re. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, 2015.
 [43] F. Seide, H. Fu, L. G. Jasha, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. Interspeech, 2014.
 [44] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.
 [45] Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
 [46] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

 [47] Kunal Talwar and Udi Wieder. Balanced allocations: The weighted case. In David S. Johnson and Uriel Feige, editors, Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11–13, 2007, pages 256–265. ACM, 2007.
 [48] Hanlin Tang, Ce Zhang, Shaoduo Gan, Tong Zhang, and Ji Liu. Decentralization meets quantization. CoRR, abs/1803.06443, 2018.
 [49] John Nikolas Tsitsiklis. Problems in decentralized decision making and computation. Technical report, Massachusetts Inst of Tech Cambridge Lab for Information and Decision Systems, 1984.
 [50] Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
 [51] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.
 [52] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
 [53] Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.
8 Additional Experiments
Convex Losses.
In these experiments, we examine the convergence of PopSGD versus parallel time for different node counts, and compare it with the sequential baseline. More precisely, for PopSGD, we execute the protocol by simulating the entire sequence of interactions sequentially, and track the evolution of train and test loss at an arbitrary fixed model with respect to the number of SGD steps it performs. Notice that this is practically equivalent to tracking with respect to parallel time. In this case, the theory suggests that loss convergence and variance should both improve when increasing the number of nodes. Figure 3(a) presents the results for the synthetic linear regression example with , for various values of , with a constant learning rate across all models, and batch size for each local gradient. Figure 3(b) compares PopSGD convergence (with local batch size ) against sequential mini-batch SGD with batch size equal to the number of nodes .
Examining Figure 3(a), we observe that both the convergence and the loss variance improve as we increase the number of nodes , even though the target model executes exactly the same number of gradient steps at the same point on the x-axis. Of note, variance decreases proportionally with the number of nodes, with having the smallest variance. Compared to mini-batch SGD with batch size = (Figure 3(b)), PopSGD with has similar convergence, but noticeably higher variance, which follows the analytical bound in Theorem 4.1.
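The variance trend can be reproduced on a toy problem. Below is a small self-contained simulation (our own construction, not the paper's experimental code): PopSGD-style pairwise averaging on a one-dimensional quadratic with noisy gradients grad(x) = (x − 1) + noise, measuring the tail mean squared error of one fixed tracked model for two population sizes.

```python
import random

def tail_mse(n, steps, lr, seed):
    # Simulate the full interaction sequence sequentially, as in the
    # experiments above, and track model 0 over the second half of the run.
    rng = random.Random(seed)
    models = [0.0] * n
    tail, count = 0.0, 0
    for t in range(steps):
        i, j = rng.sample(range(n), 2)          # random interacting pair
        avg = 0.5 * (models[i] + models[j])     # pairwise averaging
        for k in (i, j):                        # local noisy gradient step
            noise = rng.uniform(-1.0, 1.0)
            models[k] = avg - lr * ((avg - 1.0) + noise)
        if t >= steps // 2:                     # measure the fixed model
            tail += (models[0] - 1.0) ** 2
            count += 1
    return tail / count

mse_small = tail_mse(n=4, steps=40000, lr=0.01, seed=0)
mse_large = tail_mse(n=32, steps=40000, lr=0.01, seed=0)
```

With these (illustrative) settings, the tail error of the tracked model at n = 32 comes out below that at n = 4, consistent with the variance reduction observed above.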
CIFAR-10 Experiments.
We illustrate convergence and scaling results for non-convex objectives by using PopSGD to train a standard ResNet-20 DNN model on CIFAR-10 in PyTorch, using 8 GPU nodes, comparing against vanilla SGD and local SGD performing global averaging every 100 batches (we found this value necessary for the model to converge). We measure the error/loss at an arbitrary process for PopSGD. We run the parallel versions at and nodes.
The results in Figure 4 show that (a, b) PopSGD does indeed converge faster as we increase the population size, tracking the trend from the convex case; and (c) PopSGD can provide non-trivial scalability, comparable to or better than data-parallel and local SGD.
Training ResNet-50 on ImageNet. Figure 4 shows the test and train accuracies for the ResNet-50 model trained on the ImageNet dataset, with 32 Piz Daint nodes and . PopSGD achieves test accuracy within relative to the Torchvision baseline, despite the much smaller number of iterations, in a total of 29 hours. By way of comparison, end-to-end training using standard data-parallel SGD takes approximately 48 hours on the same setup (using 8 GPUs instead of 32 to avoid large-batch effects).
9 Complete Correctness Argument
Lemma 4.2.
Let , be the learning rate estimate of agent at time step , in terms of its time estimate . Then, there exists a constant such that, with probability at least (here, is the total number of steps our algorithm takes), the following holds for every and agent :
(9.1) 
Proof.
Let for some fixed constant . The following lemma is proved as Theorem 2.10 in [38]:
Lemma 9.1.
For any , and some fixed constants and ,
Subsequently, we can show that for any and agent :
(9.2) 
Hence, for a large enough constant , using a union bound over steps, we can show that there exists a constant such that for every and agent , , with probability at least .
Let be , thus . This allows us to finish the proof of the lemma:
(9.3) 
∎
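The concentration phenomenon behind Lemma 4.2 can be illustrated with a quick simulation (our notation and parameters, chosen for illustration): each agent's local clock is the number of interactions it has participated in. Since the scheduler picks a uniformly random pair per step, after T steps the clocks sum to exactly 2T, and balls-into-bins-style concentration [38] keeps every clock close to the average 2T/n.

```python
import random

def run_scheduler(n, T, seed):
    # Uniform pair scheduler: each step, two distinct agents interact
    # and each increments its local clock.
    rng = random.Random(seed)
    clocks = [0] * n
    for _ in range(T):
        i, j = rng.sample(range(n), 2)
        clocks[i] += 1
        clocks[j] += 1
    return clocks

n, T = 50, 50000
clocks = run_scheduler(n, T, seed=0)
# Each clock is Binomial(T, 2/n) with mean 2T/n = 2000 and standard
# deviation ~44, so all clocks stay well within a +/- 400 band.
```

The exact invariant sum(clocks) = 2T holds deterministically; the concentration of individual clocks is what the lemma makes quantitative.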
This allows us to bound the per-step change of the potential , in terms of the global learning rate .
Lemma 4.3.
For any time step and fixed learning rate used at , we have the bound
Proof.
First, we bound the change in potential for some time step . Let be the change in potential when we choose two different agents and at random, and let be the change in potential when we select the same agent . We get that
(9.4) 
We proceed by bounding the change in potential for fixed .
Observe that in this case
and
.
Hence,
For , since , we get that
This gives us that
Observe that
and
Thus, we have that
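The contraction driving this bound can be checked numerically. Assuming the standard variance potential Γ = Σ_k ||X_k − μ||² (our notation), replacing the two interacting models X_i, X_j by their average (X_i + X_j)/2 leaves the network mean μ unchanged and decreases Γ by exactly ||X_i − X_j||² / 2:

```python
def potential(models):
    # Gamma = sum over models of squared distance to the network mean.
    d = len(models[0])
    mu = [sum(m[c] for m in models) / len(models) for c in range(d)]
    return sum(sum((m[c] - mu[c]) ** 2 for c in range(d)) for m in models)

models = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]
before = potential(models)
xi, xj = models[0], models[1]
dist_sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
avg = [(a + b) / 2 for a, b in zip(xi, xj)]
models[0], models[1] = avg, list(avg)       # the pair averages its models
after = potential(models)
drop = before - after                        # equals dist_sq / 2 exactly
```

This is the parallel-axis identity ||X_i − μ||² + ||X_j − μ||² = 2||(X_i + X_j)/2 − μ||² + ||X_i − X_j||²/2, which holds for any μ.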