PopSGD: Decentralized Stochastic Gradient Descent in the Population Model

The population model is a standard way to represent large-scale decentralized distributed systems, in which agents with limited computational power interact in randomly chosen pairs, in order to collectively solve global computational tasks. In contrast with synchronous gossip models, nodes are anonymous, lack a common notion of time, and have no control over their scheduling. In this paper, we examine whether large-scale distributed optimization can be performed in this extremely restrictive setting. We introduce and analyze a natural decentralized variant of stochastic gradient descent (SGD), called PopSGD, in which every node maintains a local parameter, and is able to compute stochastic gradients with respect to this parameter. Every pair-wise node interaction performs a stochastic gradient step at each agent, followed by averaging of the two models. We prove that, under standard assumptions, SGD can converge even in this extremely loose, decentralized setting, for both convex and non-convex objectives. Moreover, surprisingly, in the former case, the algorithm can achieve linear speedup in the number of nodes n. Our analysis leverages a new technical connection between decentralized SGD and randomized load-balancing, which enables us to tightly bound the concentration of node parameters. We validate our analysis through experiments, showing that PopSGD can achieve convergence and speedup for large-scale distributed learning tasks in a supercomputing environment.

1 Introduction

One key enabler of the extremely rapid recent progress of machine learning has been distribution: the ability to efficiently split computation among multiple nodes or devices, in order to share the high computational loads of training large models, and therefore reduce end-to-end training time. Distributed machine learning has become commonplace, and it is not unusual to encounter systems which distribute model training among tens or even hundreds of nodes. In this paper, we take this trend to the extreme, and ask: would it be possible to distribute basic optimization procedures such as stochastic gradient descent (SGD) to thousands of agents? How could the dynamics be implemented in such a large-scale setting, and what would be the resulting convergence and speedup behavior?

To get some intuition, let us consider the classical data-parallel distribution strategy for SGD [15]. We are in the classical empirical risk minimization setting, where we have a set of $m$ samples from a distribution, and wish to minimize the function $f$, which is the average of the losses over these samples, by finding $x^* = \arg\min_x f(x)$. Assume that we have $P$ compute nodes which can process samples in parallel. Data-parallel SGD consists of parallel iterations, in which each node computes the gradient for one sample, followed by a gradient exchange. Globally, this leads to the iteration:

 $$x_{t+1} = x_t - \eta_t \sum_{i=1}^{P} \tilde g^i_t(x_t),$$

where $\eta_t$ is the learning rate, $x_t$ is the value of the global parameter, initially $x_0$, and $\tilde g^i_t$ is the stochastic gradient with respect to the parameter obtained by node $i$ at time $t$.

When extending this strategy to high node counts, two major bottlenecks are communication and synchronization. In particular, to maintain a consistent view of the parameter $x_t$, the nodes would need to broadcast and receive all gradients, and would need to synchronize with all other nodes, at the end of every iteration. Recently, a tremendous amount of work has been dedicated to addressing these two barriers. In particular, there has been significant progress on communication-reduced variants of SGD (e.g. [43, 46, 7, 51, 3, 22, 25]), asynchronous variants (e.g. [41, 42, 23, 6]), as well as large-batch or periodic model averaging methods, which aim to reduce the frequency of communication (e.g. [24, 53] and [19, 45]), or even decentralized synchronous variants (e.g. [32, 48, 31]). Using such techniques, it is possible to scale SGD to hundreds of nodes, even for complex objectives such as the training of deep neural networks. However, in systems with node counts in the thousands or larger, it is infeasible to assume that all nodes can efficiently synchronize into global iterations, or that they can directly broadcast messages to all other nodes.

Instead, in this paper we will consider the classic population model of distributed computing [8], which is defined as follows. We are given a population of $n$ compute agents, each with its own input, which cooperate to perform some globally meaningful computation with respect to their inputs. Interactions occur pairwise, where the two interaction partners are randomly chosen in every step. Thus, algorithms are specified in terms of the agents’ state transitions upon an interaction. The basic unit of time is a single pairwise interaction between two nodes, whereas global (parallel) time is measured as the total number of interactions divided by $n$, the number of nodes. Parallel time corresponds intuitively to the average number of interactions per node to reach convergence. Population protocols have a rich history in distributed computing (e.g. [8, 11, 10, 9, 12, 4, 5]), and are standard in modelling distributed systems with millions or billions of nodes, such as Chemical Reaction Networks (CRNs) [16, 18] and synthetic DNA strand displacement cascades [20]. The key difference between population protocols and the synchronous gossip models (e.g. [52, 32, 31]) previously used to analyze decentralized SGD is that nodes are not synchronized: since pairwise interactions are uniform random, there are no global rounds, and nodes lack a common notion of time.
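To make the scheduling model concrete, the following is a minimal sketch of a uniform random pairwise scheduler that counts how often each node is selected. The function name `run_scheduler` and the choice to sample the two partners independently (so that a node may occasionally be paired with itself) are assumptions made for illustration.

```python
import random

def run_scheduler(n, num_interactions, seed=0):
    """Sample uniform random pairs and count how often each node is scheduled."""
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(num_interactions):
        i = rng.randrange(n)
        j = rng.randrange(n)  # assumption: i == j is allowed, with probability 1/n
        counts[i] += 1
        counts[j] += 1
    return counts

n, num_interactions = 100, 5_000
counts = run_scheduler(n, num_interactions)
parallel_time = num_interactions / n  # total interactions divided by n
print("parallel time:", parallel_time)
print("average interactions per node:", sum(counts) / n)
print("min / max interactions per node:", min(counts), max(counts))
```

Each interaction involves two nodes, so the average number of interactions per node is twice the parallel time; the minimum and maximum show that individual nodes can deviate from this average, which is why there is no common notion of rounds.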

While the population model is a theoretical construct, we show that it can be efficiently mapped to large-scale super-computing scenarios, with large numbers of compute nodes connected by a fast point-to-point interconnect, where we can avoid the high costs of global synchronization.

An immediate instantiation of SGD in the population model would be to initially assign one sample from the distribution to each node $i$, and have each node maintain its own parameter estimate $x^i$. Whenever two nodes interact, they exchange samples, and each performs a gradient update with respect to the other’s sample. If we assume interaction pairs are uniform random (with replacement), each node would obtain a stochastic gradient upon each interaction, and therefore each model would converge locally. However, this instance would not have any parallel speedup, since the SGD instances at each node are essentially independent.

In this context, we propose a natural change to the above procedure, by which interacting nodes $i$ and $j$ first perform a gradient step, and then also average their resulting models upon every interaction. Effectively, if node $i$ interacts with node $j$, node $i$’s updated model becomes

 $$x^i \leftarrow \frac{x^i + x^j}{2} - \eta^i\,\frac{\tilde g^i(x^i) + \tilde g^j(x^j)}{2}, \tag{1.1}$$

where $j$ is the interaction partner, and the stochastic gradients $\tilde g^i$ and $\tilde g^j$ are taken with respect to each other’s samples. The update for node $j$ is symmetric. In this paper, we analyze a variant of the above protocol, which we call PopSGD, in the population protocol model.
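One consequence of update (1.1) is worth spelling out. As a simplifying assumption for this sketch, suppose both nodes use the same learning rate $\eta$ (the protocol itself lets each node keep its own rate) and that $i \neq j$. Then both nodes end the interaction with the same model, and the population mean $\mu = \frac{1}{n}\sum_k x^k$ moves by a gradient step scaled by $\eta/n$:

```latex
\begin{align*}
x^i_{\mathrm{new}} = x^j_{\mathrm{new}}
  &= \frac{x^i + x^j}{2} - \eta\,\frac{\tilde g^i(x^i) + \tilde g^j(x^j)}{2}, \\
\mu_{\mathrm{new}} - \mu
  &= \frac{1}{n}\left(x^i_{\mathrm{new}} + x^j_{\mathrm{new}} - x^i - x^j\right)
   = -\frac{\eta}{n}\left(\tilde g^i(x^i) + \tilde g^j(x^j)\right).
\end{align*}
```

Averaging therefore leaves the mean untouched apart from the two gradient contributions, which is why the mean can be analyzed as an SGD-like process with effective step size $\eta/n$.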

We show that, perhaps surprisingly, this simple decentralized SGD averaging dynamic provides strong convergence guarantees for both convex and non-convex objectives. First, we prove that, under standard convexity and smoothness assumptions, PopSGD has a convergence speedup that is linear in the number of nodes $n$. Specifically, if $\mu_t$ is the parameter average over all node models at parallel time $t$, and $y_T$ is the average over the sequence of $\mu_t$ up to time $T$, then, for large enough parallel time $T$, our main result is that

 $$\mathbb{E}[f(y_T) - f(x^*)] = O(\sigma^2/(nT)), \tag{1.2}$$

which is $n$ times faster than the sequential variant given the same number of SGD steps per node. (Please see Theorem 4.1 for the exact formulation, including bias and variance terms, and for an extended discussion.) This result suggests that, even though interactions occur only pairwise, uniformly at random, and in an uncoordinated manner, as long as the convergence time is large enough to amortize the information propagation, the protocol enjoys the full parallel speedup of mini-batch SGD with a batch size proportional to the number of nodes. While speedup behaviour has been observed in various synchronous models (e.g. [32, 45, 31]) or for complex accelerated algorithms [27], we are the first to show that SGD does not require the existence of globally synchronized rounds or global communication.

Central to our analytic approach is a new technical connection between averaging decentralized SGD and the line of research studying load-balancing processes in theoretical computer science (e.g. [14, 35, 47, 39]). Intuitively, we show PopSGD can be viewed as a composition between a set of $n$ instances of SGD, each corresponding to one of the local parameters $x^i$, which are loosely coupled via pairwise averaging, whose role is to “balance” the models by keeping them well concentrated around their mean, despite the random nature of the pairwise interactions. Our analysis characterizes this concentration, showing that, in essence, the averaging process propagates enough information to globally “simulate” SGD with a batch of size $\Theta(n)$, even though communication is only performed pairwise. We emphasize that the convexity of the objective function in isolation would not be sufficient to prove this fact, see e.g. [45]. Along the way, we overcome non-trivial technical difficulties, such as the lack of a common notion of time among nodes, or the fact that, due to the structure of SGD, this novel load-balancing process exhibits non-trivial correlations within the same round.

On the practical side, we provide convergence and speedup results using an efficient implementation of PopSGD in Pytorch/MPI, applied to regression tasks as well as to the standard CIFAR/ImageNet classification tasks, for deployments on multi-GPU nodes and on the CSCS Piz Daint supercomputer [1]. Experiments predictably confirm the scalability of PopSGD to large node counts. More surprisingly, we also observe an improvement in convergence versus number of SGD iterations per model at higher node counts, in both convex and non-convex settings. In particular, using PopSGD, we are able to train the ResNet18 and ResNet50 [26] models to full accuracy using only a fraction of the number of SGD updates per model, compared to the sequential baseline, resulting in fast convergence with nearly linear scalability.

Related Work.

The study of decentralized optimization algorithms dates back to [49], and is related to the study of gossip algorithms for information dissemination [30, 52]. The distinguishing feature of this setting is that information is shared in the absence of a coordinator, between nodes and their neighbors. Several classic algorithms have been ported and analyzed in the gossip setting, such as subgradient methods for convex objectives [37, 29, 44] or ADMM [50, 28]. References [32, 33, 13] consider SGD-type algorithms in the non-convex setting, while references [48, 31] analyze the use of quantization in the gossip setting.

The key difference between the gossip model discussed above and the population model we analyze is that, in the gossip model, time is divided into global rounds, which are assumed to be consistent across nodes. In each round, each node broadcasts and receives with all neighbors: analytically, this synchrony assumption allows the global evolution of the system to be represented in terms of the “gossip” (communication/contact) matrix (see e.g. [52, 31, 33]). This matrix characterization is not possible in the population model: nodes do not share a notion of time or rounds, as communication steps correspond to individual interactions. If we consider sequences of consecutive interactions, due to scheduler randomness, some nodes will interact several times, while others may not interact at all during such an interval. For this reason, our analysis makes use of fine-grained potential arguments, rather than a global matrix iteration. There do exist instances in the literature which consider dynamic interaction models. First, Nedic et al. [36] present a gradient tracking algorithm in a different dynamic graph model; however, their results would not translate to the PP model, as they assume a dynamically changing but simple graph in each iteration: by contrast, merging together multiple interaction rounds from the PP model could result in a multi-graph. Further, Hendrickx et al. [27] achieve exponential convergence rates in a gossip model where transmissions are synchronized across edges; however, the algorithm they consider is a more complex instance of accelerated coordinate descent, and is therefore quite different from the simple dynamics we consider. Importantly, neither reference considers large-scale deployments for non-convex objectives (in particular, neural networks).

It is interesting to contrast our work with that of  [33], who assume a synchronized gossip model, but allow for asynchrony, in the sense that nodes can see stale variants of their neighbors’ messages. To our understanding, the models are not directly comparable, and in particular their results cannot be applied to our setting. This is because they rely on a variant of the global matrix iteration (albeit based on delayed views). They consider challenging smooth non-convex objectives, but do not show any speedup due to parallelization. We provide similar guarantees for non-convex PopSGD, but are able to show linear speedup in the convex case. In our setting, since interaction pairs are chosen randomly, there can be significant local variability between the interaction rates of the nodes, and so the matrix iteration would not be applicable.

The population model, introduced by [8], is one of the standard models of distributed computing, and has proved a useful abstraction for modeling settings from wireless sensor networks [40, 21], to gene regulatory networks [16], and chemical reaction networks [18]. While there has been significant work on algorithms for specific tasks, such as majority (consensus), leader election, and approximate counting, we are the first to consider optimization tasks in the population model. Potential analysis is a common tool in load balancing (e.g. [47, 39]), which we adapt for our setting. The three key departures from the load balancing literature are that 1) in the SGD setting, the weights (gradients) are correlated with the loads of the bins (the models); 2) in the SGD setting, the magnitude of the weights is diminishing (due to the learning rate), which requires a continuous re-evaluation of the balancing objective; 3) models are multi-dimensional, whereas in the classical formulation the balanced items are single-dimensional.

2 Preliminaries

The Population Protocol Model.

We consider a variant of the population protocol model which consists of a set of $n$ anonymous agents, or nodes, each executing a local state machine. (Our analysis will make use of node identifiers only for exposition purposes.) Since our application is continuous optimization, we will assume that the agents’ states may store real numbers. The execution proceeds in discrete steps, where in each step a new pair of agents is selected uniformly at random to interact from the set of all possible pairs. (To preserve symmetry of the protocols, we will assume that a process may interact with a copy of itself, with low probability.) Each of the two chosen agents updates its state according to a state update function, specified by the algorithm. The basic unit of time is a single pairwise interaction between two nodes. Notice however that in a real system $\Theta(n)$ of these interactions could occur in parallel. Thus, a standard global measure is parallel time, defined as the total number of interactions divided by $n$, the number of nodes. Parallel time intuitively corresponds to the average number of interactions per node to convergence.

Stochastic Optimization.

We assume that the agents wish to minimize a $d$-dimensional, differentiable and strongly convex function $f: \mathbb{R}^d \to \mathbb{R}$ with parameter $\ell > 0$, that is:

 $$(x - y)^T(\nabla f(x) - \nabla f(y)) \ge \ell\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d. \tag{2.1}$$

Specifically, we will assume the empirical risk minimization setting, in which agents are given access to a set of $m$ data samples coming from some underlying distribution, and to functions $f_i$ which encode the loss of the argument at the $i$-th sample. The goal of the agents is to converge on a model $x^*$ which minimizes the empirical loss, that is

 $$x^* = \arg\min_x f(x) = \arg\min_x \frac{1}{m}\sum_{i=1}^{m} f_i(x). \tag{2.2}$$

In this paper, we assume that the agents employ these samples to run a decentralized variant of SGD, described in detail in the next section. For this, we will assume that agents have access to stochastic gradients $\tilde g$ of the function $f$, which are functions such that $\mathbb{E}[\tilde g(x)] = \nabla f(x)$. Stochastic gradients can be computed by each agent by sampling i.i.d. from the distribution, and computing the gradient of $f$ at $x$ with respect to that sample. In the population model, we could implement this procedure either by allowing agents to sample in each step, or by assigning a sample to each agent $i$, and having agents compute gradients of their local models with respect to each others’ samples. We will assume the following about the gradients:

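• Smooth Gradients: The gradient $\nabla f$ is $L$-Lipschitz continuous for some $L > 0$, i.e. for all $x, y \in \mathbb{R}^d$:

 $$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|. \tag{2.3}$$
• Bounded Variance: The variance of the stochastic gradients is bounded by some $\sigma^2 > 0$, i.e. for all $x \in \mathbb{R}^d$:

 $$\mathbb{E}\|\tilde g(x) - \nabla f(x)\|^2 \le \sigma^2. \tag{2.4}$$
• Bounded Second Moment: The second moment of the stochastic gradients is bounded by some $M^2 > 0$, i.e. for all $x \in \mathbb{R}^d$:

 $$\mathbb{E}\|\tilde g(x)\|^2 \le M^2. \tag{2.5}$$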

3 The Population SGD Algorithm

Algorithm Description.

We now describe a decentralized variant of SGD, designed to be executed by a population of $n$ nodes, interacting in uniform random pairs as per the population protocol model. We assume that each node $i$ has access to local stochastic gradients $\tilde g^i$, and maintains a model estimate $x^i$, as well as a local learning rate $\eta^i$. For simplicity, we will assume that this initial estimate is the zero vector $0^d$ at each agent, although its value may be arbitrary. We detail the way in which the learning rates are updated below. Specifically, upon every interaction, the interacting agents $i$ and $j$ perform the following steps:
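The interaction step, following update (1.1), can be sketched on a toy one-dimensional problem. This is a simulation under stated assumptions rather than the authors' implementation: the function name `popsgd`, the quadratic loss, the constants `a = 20n` and `b = 4n`, and the use of a single shared learning rate `b / (t + a)` in place of per-node rate estimates are all choices made for illustration.

```python
import random

def popsgd(samples, n, num_interactions, a, b, seed=0):
    """Toy PopSGD on f(x) = (1/m) * sum_k 0.5 * (x - s_k)^2, minimized at mean(samples)."""
    rng = random.Random(seed)
    models = [0.0] * n  # every node starts from the same estimate
    for t in range(num_interactions):
        # Scheduler: a uniformly random pair (a node may meet a copy of itself).
        i, j = rng.randrange(n), rng.randrange(n)
        eta = b / (t + a)  # simplification: shared rate instead of per-node estimates
        # Each node takes a stochastic gradient step on its own model;
        # the gradient of 0.5 * (x - s)^2 at x is (x - s) for a freshly drawn sample s.
        x_i = models[i] - eta * (models[i] - rng.choice(samples))
        x_j = models[j] - eta * (models[j] - rng.choice(samples))
        # Both nodes then adopt the average of the two updated models.
        models[i] = models[j] = (x_i + x_j) / 2
    return models

data_rng = random.Random(1)
samples = [data_rng.gauss(3.0, 1.0) for _ in range(200)]
n = 20
models = popsgd(samples, n, num_interactions=20_000, a=20 * n, b=4 * n)
target = sum(samples) / len(samples)
print("optimum:", target)
print("mean of local models:", sum(models) / n)
print("spread of local models:", max(models) - min(models))
```

Averaging the two post-step models is algebraically the same as (1.1) when both nodes use the same rate. The printed spread gives a rough sense of how tightly the local models stay around their mean.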

We are interested in the convergence of the local models $x^i$ after $T$ interactions occur in total. For theoretical reasons, in the case when $f$ is convex, we derive convergence for $y_T$, which is a weighted average of the per-step averages of the local models (see Theorem 4.1). At the beginning of Section 5 we show that by performing a single global averaging step at a time step $t$, carefully chosen from a specified distribution, we can ensure that, in expectation, the local models converge at the same rate as $y_T$.

Estimating Time and the Learning Rate.

In parallel with the above algorithm, each agent $i$ maintains a local time value , which is estimated using a local “phase clock” protocol. These local times are defined and updated as follows. The initial value at each agent is . Upon each interaction, the interacting agents $i$ and $j$ exchange their time values. The agent with the lower time value, say $i$, will increment its value by . The other agent keeps its local value unchanged. (We break ties arbitrarily.) Although intuitively simple, the above procedure provides strong probabilistic guarantees on how far individual values may stray from the mean: with high probability (an event holds with high probability, w.h.p., if it occurs with probability , for constant and the total number of interactions $T$), all the estimates are in the interval , where is a constant.
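The clock rule above can be sketched in a few lines. The increment size and initial value are not recoverable from the text here, so this sketch assumes clocks start at 0 and the lower clock is incremented by 1; the function name `run_phase_clock` is likewise hypothetical.

```python
import random

def run_phase_clock(n, num_interactions, seed=0):
    """On each interaction, only the agent holding the lower clock value advances."""
    rng = random.Random(seed)
    clock = [0] * n  # assumption: all clocks start at 0
    for _ in range(num_interactions):
        i, j = rng.randrange(n), rng.randrange(n)
        lower = i if clock[i] <= clock[j] else j  # ties broken arbitrarily (here: i)
        clock[lower] += 1  # assumption: unit increment
    return clock

n, num_interactions = 100, 10_000
clock = run_phase_clock(n, num_interactions)
print("mean clock value:", sum(clock) / n)
print("min / max clock value:", min(clock), max(clock))
```

Under these assumptions the clock values always sum to the number of interactions, so their mean tracks the parallel time, and always advancing the lagging agent keeps the values close to that mean.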

Given the current value of at the agent, the value of the learning rate at is simply , where and are constant parameters which we will fix later. This will ensure that the gap between two agents’ learning rates will be in the interval , w.h.p. (See Lemma 4.2.)

4 The Convergence of PopSGD in the Convex Case

This section is dedicated to proving that the following result holds with high probability:

Theorem 4.1.

Let $f$ be an $L$-smooth, $\ell$-strongly convex function satisfying conditions (2.3)-(2.5), whose minimum $x^*$ we are trying to find via the PopSGD procedure given in Algorithm 1. Let the learning rate for process $i$ at local time be where $a$ and $b$ are fixed (for some constant ). Let the sequence of weights be given by $w_t = (a+t)^2$. Define $\mu_t = \frac{1}{n}\sum_{i=1}^{n} X^i_t$, $S_T = \sum_{t=0}^{T-1} w_t$ and $y_T = \frac{1}{S_T}\sum_{t=0}^{T-1} w_t\mu_t$. Then, for any time $T$, we have with probability that

 $$\mathbb{E}[f(y_T) - f(x^*)] \le \frac{a^3\ell}{2S_T}\|\mu_0 - x^*\|^2 + \frac{64\,T(T+2a)}{\ell S_T}\sigma^2 + \frac{9216\,Tn^2}{\ell^2 S_T}M^2L. \tag{4.1}$$

Discussion.

We first emphasize that, in the above bound, the time $T$ refers to the number of interactions (as opposed to parallel time). With this in mind, we focus on the bound in the case where $T$ is large and the remaining parameters are assumed to be well-behaved. In this case, since $S_T \ge T^3/3$, the first and third terms are vanishing as $T$ grows, and we get that convergence is dominated by the second term, which can be bounded as $O(\sigma^2/T)$. It is tempting to think that this is roughly the same rate as sequential SGD; however, our notion of time is different, as we are counting the total number of SGD steps executed at all the models. (In fact, the total number of SGD steps up to $T$ is $2T$, since each interaction performs two SGD steps.)

It is interesting to interpret this from the perspective of an arbitrary local model. For this, notice that the parallel time corresponding to the number of total interactions $T$, which is by definition $T_p = T/n$, corresponds (up to constants) to the average number of interactions and SGD steps performed by each node up to time $T$. Thus, for any single model, convergence with respect to its number of performed SGD steps would be $O(\sigma^2/(nT_p))$, which would correspond to running SGD with a batch size of $n$. Notice that this reduction in convergence time is solely thanks to the averaging step: in the absence of averaging, each local model would converge independently at a rate of $O(\sigma^2/T_p)$. We note that our discussion assumes a batch size of $1$, but it would generalize to an arbitrary batch size, with the variance bound $\sigma^2$ divided by the batch size. We note that, due to the concentration properties of the averaging process, the claim above can be extended to show convergence behavior for arbitrary individual models (instead of the average of models $\mu_t$).
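To see which term of bound (4.1) dominates for concrete values, the three terms can be evaluated numerically. This sketch assumes the weights $w_t = (a+t)^2$ (so that $S_T$ has a closed form for integer $a$); the parameter values and the function names are arbitrary choices for illustration, not values from the experiments.

```python
def sum_of_weights(a, T):
    """S_T = sum_{t=0}^{T-1} (a + t)^2 for integer a >= 1, via the sum-of-squares formula."""
    def squares_up_to(m):
        return m * (m + 1) * (2 * m + 1) // 6
    return squares_up_to(a + T - 1) - squares_up_to(a - 1)

def bound_terms(T, n, a, ell, L, M, sigma, dist0_sq):
    """Return the three right-hand-side terms of (4.1): bias, variance, concentration."""
    S_T = sum_of_weights(a, T)
    bias = a**3 * ell / (2 * S_T) * dist0_sq
    variance = 64 * T * (T + 2 * a) / (ell * S_T) * sigma**2
    concentration = 9216 * T * n**2 / (ell**2 * S_T) * M**2 * L
    return bias, variance, concentration

T = 10**6
bias, variance, concentration = bound_terms(
    T=T, n=10, a=200, ell=1.0, L=1.0, M=1.0, sigma=1.0, dist0_sq=1.0
)
print("bias term:         ", bias)
print("variance term:     ", variance)
print("concentration term:", concentration)
print("variance term times T:", variance * T)
```

For these values the variance term is the largest of the three, and multiplying it by $T$ gives a number of constant order, consistent with the $O(\sigma^2/T)$ behavior discussed above.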

Proof Overview.

The argument, given in full in the Additional Material, can be split into two steps. The first step aims to bound the variance of the local models $X^i_t$ at each time $t$ and node $i$ with respect to the mean $\mu_t$. It views this quantity as a potential $\Gamma_t$, which we show has supermartingale-like behavior, which enables us to bound its expected value as $O(n\eta_t^2M^2)$. This shows that the variance of the parameters is always bounded with respect to the number of nodes, but also, importantly, that it can be controlled via the learning rate. The key technical step here is Lemma 4.3, which provides a careful bound for the evolution of the potential at a step, by modelling SGD as a dynamic load balancing process: each interaction corresponds to a weight generation step (in which gradients are generated) and a load balancing step, in which the “loads” of the two nodes (corresponding to their model values) are balanced through averaging.

In the second step of the proof, we first bound the rate at which the mean $\mu_t$ converges towards $x^*$, where we crucially (and carefully) leverage the variance bound obtained above. This is our second key technical lemma. Next, with this in hand, we can apply a standard argument to characterize the rate at which the quantity $\mathbb{E}[f(y_T)]$ converges towards $f(x^*)$.

Notation and Preliminaries.

In this section, we overview the analysis of the PopSGD protocol. We begin with some notation. Recall that is the number of nodes. We will analyze a sequence of time steps , each corresponding to an individual interaction between two nodes, which are usually denoted by and . We will consider that , and therefore w.h.p. results are assumed to hold throughout the execution. Recall the definition of parallel time , where counts the number of pairwise interactions. For any time , define by the “true” learning rate at time , where and are constants to be fixed later, such that for some constant . We denote by the optimum of the function .

Learning Rate Estimates.

Our first technical result characterizes the gap between the “global” learning rate $\eta_t$ (in terms of the true time $t$), and the individual learning rates at an arbitrary agent $i$ at the same time, denoted by $\eta^i_t$.

Lemma 4.2.

Let $\eta^i_t$ be the learning rate estimate of agent $i$ at time step $t$, in terms of its time estimate. Then, there exists a constant such that, with probability at least (here, $T$ is the total number of steps our algorithm takes), the following holds for every $t \le T$ and agent $i$:

 $$\frac{1}{2} \le \frac{\eta_t}{\eta^i_t} \le 2. \tag{4.2}$$

Step 1: Parameter Concentration. Next, let $X_t$ be the vector of model estimates at time step $t$, that is $X_t = (X^1_t, X^2_t, \ldots, X^n_t)$. Also, let $\mu_t = \frac{1}{n}\sum_{i=1}^{n} X^i_t$ be the average estimate at time step $t$. The following potential function measures the concentration of the models around the average:

 $$\Gamma_t = \sum_{i=1}^{n}\|X^i_t - \mu_t\|^2.$$

With this in place, one of our key technical results is to provide a supermartingale-type bound on the evolution of the potential $\Gamma_t$, in terms of $\eta_t$, $M$, and the number of nodes $n$.

Lemma 4.3.

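For any time step $t$ and fixed learning rate $\eta_t$ used at $t$, we have the bound

 $$\mathbb{E}[\Gamma_{t+1} \mid \Gamma_t] \le \left(1 - \frac{1}{n}\right)\Gamma_t + 4\eta_t M\left(\frac{\Gamma_t}{n}\right)^{1/2} + 8\eta_t^2M^2.$$

Next, we unroll this recurrence to upper bound $\Gamma_t$ in expectation for any time step $t$, by choosing an appropriate series of non-constant learning rates.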

Lemma 4.4.

If , then the potential is bounded as follows

 $$\mathbb{E}[\Gamma_t] \le \frac{36\,n\,b^2}{(t+a)^2}M^2 = 36\,n\,\eta_t^2M^2.$$
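The way Lemma 4.4 follows from Lemma 4.3 can be checked numerically by iterating the recurrence on expectations, which is a valid upper bound since $\mathbb{E}[\Gamma_t^{1/2}] \le (\mathbb{E}[\Gamma_t])^{1/2}$ by Jensen's inequality. The sketch below assumes $\eta_t = b/(t+a)$ and picks $a = 20n$, which satisfies $a + 1 \ge 18n$; that threshold is derived for this sketch only and is not taken from the lemma's own condition. The values of `n`, `M` and `b` are arbitrary.

```python
import math

def potential_recurrence(n, M, a, b, num_steps):
    """Iterate the Lemma 4.3 bound and track its ratio to 36 * n * eta^2 * M^2."""
    gamma = 0.0  # all models start equal, so the potential starts at 0
    worst_ratio = 0.0
    for t in range(num_steps):
        eta = b / (t + a)
        gamma = (1 - 1 / n) * gamma + 4 * eta * M * math.sqrt(gamma / n) + 8 * eta**2 * M**2
        eta_next = b / (t + 1 + a)  # gamma now bounds the potential at step t + 1
        worst_ratio = max(worst_ratio, gamma / (36 * n * eta_next**2 * M**2))
    return worst_ratio

n = 50
worst_ratio = potential_recurrence(n=n, M=1.0, a=20 * n, b=4 * n, num_steps=200_000)
print("largest ratio to 36 n eta_t^2 M^2:", worst_ratio)
```

A ratio that never exceeds 1 means the iterated bound stays below $36 n \eta_t^2 M^2$ at every step. The induction behind it is that plugging $\Gamma_t = 36n\eta_t^2M^2$ into Lemma 4.3 gives $(36n - 4)\eta_t^2M^2$, and the slack of $4\eta_t^2M^2$ absorbs the decrease from $\eta_t$ to $\eta_{t+1}$ once $a$ is large enough relative to $n$.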

Step 2: Convergence of the Mean and Risk Bound.

The above result allows us to characterize how well the individual parameters are concentrated around their mean, in terms of the second moment of the gradients, the number of nodes, and the learning rate. In turn, this will allow us to provide a recurrence for how fast the parameter average is moving towards the optimum, in terms of the variance and second-moment bounds of the gradients:

Lemma 4.5.

For , we have that

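 $$\mathbb{E}\|\mu_{t+1} - x^*\|^2 \le \left(1 - \frac{\eta_t\ell}{n}\right)\mathbb{E}\|\mu_t - x^*\|^2 - \frac{\eta_t}{2n}\mathbb{E}[f(\mu_t) - f(x^*)] + \frac{16\sigma^2\eta_t^2}{n^2} + \frac{288\,\eta_t^3M^2L}{n}.$$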

Finally, we wish to phrase this bound as a recurrence which will allow us to bound the expected risk of the weighted sum average. We aim to use the following standard result (see e.g. [45]):

Lemma 4.6.

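Let $\{a_t\}_{t \ge 0}$, $a_t \ge 0$, $\{e_t\}_{t \ge 0}$, $e_t \ge 0$ be sequences satisfying

 $$a_{t+1} \le (1 - \ell\alpha_t)a_t - \alpha_t e_t A + \alpha_t^2 B + \alpha_t^3 C,$$

for $\alpha_t = \frac{4}{\ell(a+t)}$ and constants $A > 0$, $B, C \ge 0$, $\ell > 0$, $a > 1$, then

 $$\frac{A}{S_T}\sum_{t=0}^{T-1} w_t e_t \le \frac{\ell a^3}{4S_T}a_0 + \frac{2T(T+2a)}{\ell S_T}B + \frac{16T}{\ell^2 S_T}C, \tag{4.3}$$

for $w_t = (a+t)^2$ and $S_T = \sum_{t=0}^{T-1} w_t \ge \frac{1}{3}T^3$.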

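To use the above lemma, we set $\alpha_t = \eta_t/n$, which fixes the parameter $b = 4n/\ell$ in the learning rate $\eta_t = b/(t+a)$. We also use $a_t = \mathbb{E}\|\mu_t - x^*\|^2$, $e_t = \mathbb{E}[f(\mu_t) - f(x^*)]$, and $w_t = (a+t)^2$. Matching the terms of Lemma 4.5, let $A = 1/2$, $B = 16\sigma^2$ and $C = 288\,n^2M^2L$. Also, let $S_T = \sum_{t=0}^{T-1} w_t$ and $y_T = \frac{1}{S_T}\sum_{t=0}^{T-1} w_t\mu_t$. By the convexity of $f$ we have that

 $$\mathbb{E}[f(y_T) - f(x^*)] \le \frac{1}{S_T}\sum_{t=0}^{T-1} w_t\,\mathbb{E}[f(\mu_t) - f(x^*)]. \tag{4.4}$$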

Using this fact and Lemma 4.6 above we obtain the following final bound:

 $$\mathbb{E}[f(y_T) - f(x^*)] \le \frac{a^3\ell}{2S_T}\|\mu_0 - x^*\|^2 + \frac{64\,T(T+2a)}{\ell S_T}\sigma^2 + \frac{9216\,Tn^2}{\ell^2 S_T}M^2L. \tag{4.5}$$
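The constants in (4.5) can be verified by dividing (4.3) by $A$. The identification used below, $\alpha_t = \eta_t/n$ with $A = 1/2$, $B = 16\sigma^2$ and $C = 288\,n^2M^2L$, is read off by comparing the recurrence of Lemma 4.5 with that of Lemma 4.6 term by term, and $a_0 = \|\mu_0 - x^*\|^2$ because the initial models are deterministic:

```latex
\begin{align*}
\frac{1}{A}\cdot\frac{\ell a^3}{4 S_T}\,a_0
  &= 2\cdot\frac{\ell a^3}{4 S_T}\,\|\mu_0 - x^*\|^2
   = \frac{a^3 \ell}{2 S_T}\,\|\mu_0 - x^*\|^2, \\
\frac{1}{A}\cdot\frac{2T(T+2a)}{\ell S_T}\,B
  &= 2\cdot\frac{2T(T+2a)}{\ell S_T}\cdot 16\sigma^2
   = \frac{64\,T(T+2a)}{\ell S_T}\,\sigma^2, \\
\frac{1}{A}\cdot\frac{16T}{\ell^2 S_T}\,C
  &= 2\cdot\frac{16T}{\ell^2 S_T}\cdot 288\,n^2 M^2 L
   = \frac{9216\,T n^2}{\ell^2 S_T}\,M^2 L.
\end{align*}
```

The $n^2$ factor in the last term comes from rewriting $288\,\eta_t^3M^2L/n$ as $\alpha_t^3 \cdot 288\,n^2M^2L$, since $\eta_t^3 = n^3\alpha_t^3$.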

To complete the proof of the Theorem, we only need to find the appropriate value of the parameter . For that, we list all the constraints on : and . These inequalities can be satisfied by setting ). This concludes our proof.

5 Extensions

Convergence of local models and alternative to computing $y_T$.

Notice that Theorem 4.1 measures the convergence of $f(y_T)$, where $y_T = \frac{1}{S_T}\sum_{t=0}^{T-1} w_t\mu_t$ is a weighted average of the $\mu_t$ per step. Notice that actually computing $y_T$ can be expensive, since we need the values of the local models over $T$ steps, and it does not necessarily guarantee convergence of each individual model. In order to circumvent this issue, we can look at the following inequality, which in combination with Jensen’s inequality gives us the proof of Theorem 4.1 (please see the Appendix for details):

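 $$\frac{1}{S_T}\sum_{t=0}^{T-1} w_t\,\mathbb{E}[f(\mu_t) - f(x^*)] \le \frac{a^3\ell}{2S_T}\|\mu_0 - x^*\|^2 + \frac{64\,T(T+2a)}{\ell S_T}\sigma^2 + \frac{9216\,Tn^2}{\ell^2 S_T}M^2L. \tag{5.1}$$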

What we can do is, instead of computing $y_T$, sample a time step $t \in \{0, \ldots, T-1\}$ with probability $w_t/S_T$ and compute $\mu_t$ using a single global averaging procedure. Observe that $\mathbb{E}[f(\mu_t) - f(x^*)]$, with the expectation also taken over the choice of $t$, is exactly the left-hand side of the above inequality.

Hence, we get convergence identical to that in Theorem 4.1 and additionally, since we are using global averaging, we also guarantee the same convergence for each local model. Finally, we would like to emphasize that in practice there is no need to compute $y_T$ or to use global averaging, since the local models have already converged after $T$ interactions.
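Choosing the single averaging step can be sketched as a weighted draw. This assumes the weights $w_t = (a+t)^2$, so that step $t$ is selected with probability $w_t/S_T$; the function name `sample_averaging_step` and the parameter values are hypothetical.

```python
import random

def sample_averaging_step(a, T, seed=0):
    """Draw the time step for the global averaging, with probability w_t / S_T."""
    rng = random.Random(seed)
    weights = [(a + t) ** 2 for t in range(T)]  # assumed weights w_t = (a + t)^2
    (t,) = rng.choices(range(T), weights=weights, k=1)
    return t

a, T = 200, 10_000
t_avg = sample_averaging_step(a, T)
weights = [(a + t) ** 2 for t in range(T)]
print("global averaging at step:", t_avg)
print("probability of the last step:", weights[-1] / sum(weights))
```

Because the weights grow quadratically in $t$, the draw favors late time steps, where the models are closest to the optimum.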

General Interaction Graphs.

Our analysis can be extended to more general interaction graphs by tying the evolution of the potential in this case. In the following, we present the results for a cycle, leaving the exact derivations for more general classes of expander graphs for the full version. In particular, we assume that each agent is a node on a cycle, and that it is allowed to interact only with its neighbouring nodes. Again, the scheduler chooses interaction edges uniformly at random. In this setting, we can show the following result, which is similar to Theorem 4.1:

Theorem 5.1.

Let be an -smooth, -strongly convex function satisfying conditions (2.3)—(2.5), whose minimum we are trying to find via the PopSGD procedure on a cycle. Let the learning rate for process at local time be where and are fixed(for some constant ). Let the sequence of weights be given by . Define , and . Then, for any time , we have with probability that

 $$\mathbb{E}[f(y_T) - f(x^*)] \le \frac{a^3\ell}{2S_T}\|\mu_0 - x^*\|^2 + \frac{64\,T(T+2a)}{\ell S_T}\sigma^2 + \frac{25600\,Tn^6}{\ell^3 S_T}M^2L^2.$$

Notice that for , the second term dominates convergence and we can repeat the same argument as for Theorem 4.1 to show convergence (where is the total number of interactions). Next we provide the sketch of a proof for the PopSGD on a cycle case. The crucial part of the proof is to show the following bound for the potential per step:

 $$\mathbb{E}[\Gamma_{t+1} \mid \Gamma_t] \le \left(1 - \frac{1}{O(n^3)}\right)\Gamma_t + 4\eta_t M\left(\frac{\Gamma_t}{n}\right)^{1/2} + 8\eta_t^2M^2.$$

Notice that the above inequality is similar to Lemma 4.3 (except that the factor in front of $\Gamma_t$ is larger for a cycle, since "information" propagates slower on a cycle) and it allows us to show that:

 $$\mathbb{E}[\Gamma_t] \le O(n^5\eta_t^2M^2).$$

This, in turn, allows us to prove the above theorem, by carefully following the steps in the proof of Lemma 4.5 and then by using Lemma 4.6 to finish the proof.

The Non-Convex Case.

Next, we show convergence for non-convex, but smooth functions:

Theorem 5.2.

Let be an non-convex, -smooth, function satisfying conditions (2.3) and (2.5), whose minimum we are trying to find via the PopSGD procedure given in Algorithm 1. Define . For the total number of interactions - , time step and process , let (for any process, learning rate does not depend on current local or global time). Then, for any , we have that:

 1TT−1∑t=0E∥∇f(μt)∥2≤2√n(f(μ0)−f(x∗))√T+144LM2nT+16LM2√n√T (5.2)

The proof follows from the more general version of the theorem, which is proved in the appendix, see Theorem 9.2. Observe that, since $T$ is the total number of interactions and is equal to $nT_p$, where $T_p$ is the parallel time, we get convergence $O(1/\sqrt{T_p})$. This matches the convergence of the sequential version. (Note that in the sequential case parallel time and the total number of interactions are the same.)

6 Experimental Results

In this section, we validate our results numerically by implementing PopSGD in PyTorch, using MPI for inter-node communication [2]. We are interested in the convergence behavior of the algorithm and in its scalability with respect to the number of nodes. Our study is split into simulated experiments for convex objectives (to examine the validity of our analysis as increases) and large-scale real-world experiments for non-convex objectives (training neural networks), which examine whether PopSGD can provide scalability and convergence for such objectives.

Convex Objectives.

To validate our analysis in the convex case, we evaluated the performance of PopSGD on three datasets: (1) a real-world linear regression problem (the Year Prediction dataset [17]) with a test/train split, and ; (2) a real-world classification problem (Gisette [17]) with test/train split, and ; (3) a synthetic least-squares problem of the form (2.2) with where and , with and variable . As a baseline, we employ vanilla SGD with manual learning rate tuning. The learning rate is adjusted in terms of the number of local steps each node has taken, similar to our analysis.

Our first set of experiments examines train and test loss for PopSGD on the real-world tasks specified above. We examine the test loss behavior with respect to the number of nodes , and execute for powers of between and . Each node obtains a stochastic gradient by sampling elements from the training set in a batch. We tuned the learning rate parameter for each instance independently, through line search, and obtained learning rates in the interval for Gisette, and for Year Prediction.

for the results. (The number of epochs is cropped to maintain visibility, but the trends are maintained in general.) The results confirm our analysis; notice in particular the clear separation between instances for different , which follows exactly the increase in the number of nodes, although the X axis values correspond to the same number of gradient steps for the local model. In Appendix 8, we present additional experiments which precisely examine the reduction in variance versus the number of nodes on the synthetic regression task, confirming our analysis.

Training Neural Networks.

Our second set of experiments tests PopSGD in a realistic distributed environment. For this, we implemented PopSGD in PyTorch using MPI one-sided primitives [2], which allow nodes to read each other's models for averaging without explicit synchronization. We used PopSGD to train ResNets on the classic CIFAR-10 and ImageNet datasets, and deployed our code on the CSCS Piz Daint supercomputer, which is composed of Cray XC50 nodes, each with a Xeon E5-2690v3 CPU and an NVIDIA Tesla P100 GPU, using a state-of-the-art Aries interconnect.

Training proceeds in epochs, each of which is structured as follows. At the beginning of each epoch, we shuffle the dataset and split it into partitions, ensuring that each partition will be assigned to exactly two processes. We define a fixed constant , which counts the number of times each process will iterate through its partition in an epoch. In our experiments, takes values between and . This constant follows the intuition given by Theorems 4.1 and 5.2, which suggest that PopSGD needs additional iterations for the information in each partition to propagate to all nodes. Given this setup, PopSGD may appear wasteful, since each sample is processed times in each epoch. We compensate for this by compressing the standard training schedules for the networks we examine, dividing the total number of epochs by , and scaling the learning rate updates accordingly. We keep local batch sizes constant with respect to the sequential baseline. That is, in an experiment with nodes and multiplier , PopSGD processes each sample fewer times than standard sequential or data-parallel SGD, and performs fewer gradient updates per model. Surprisingly, we found this to be sufficient to preserve both train and test accuracy. Figure 2 shows the test and train accuracies for the ResNet18 model trained on the ImageNet dataset, with 32 Piz Daint nodes and , as well as scalability versus number of nodes.
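The epoch structure described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the experiments: the function names `assign_partitions` and `epoch_schedule` are hypothetical, the number of partitions is taken to be half the number of processes (which is what "each partition is assigned to exactly two processes" implies for an even process count), and pairing processes 2p and 2p+1 is an arbitrary choice made for the example.

```python
import random


def assign_partitions(num_samples, num_procs, seed=0):
    """Shuffle sample indices and split them into num_procs // 2 partitions,
    so that every partition is shared by exactly two processes."""
    if num_procs % 2 != 0:
        raise ValueError("this sketch assumes an even number of processes")
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    num_parts = num_procs // 2
    # A strided split keeps partition sizes within one sample of each other.
    partitions = [indices[p::num_parts] for p in range(num_parts)]
    # Assumption: processes 2p and 2p+1 share partition p.
    return {proc: partitions[proc // 2] for proc in range(num_procs)}


def epoch_schedule(partition, mult, batch_size):
    """Yield the batches one process sees in an epoch:
    `mult` full passes over its own partition."""
    for _ in range(mult):
        for start in range(0, len(partition), batch_size):
            yield partition[start:start + batch_size]


assignment = assign_partitions(num_samples=1000, num_procs=8)
batches = list(epoch_schedule(assignment[0], mult=2, batch_size=128))
```

In a real run the shuffle would use a fresh seed each epoch, and each batch would feed a local gradient step followed by a pairwise model average.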

The results suggest that PopSGD can indeed preserve convergence while ensuring scalability for this complex task. We note that the hyperparameters used for model training are identical to the standard sequential recipe (batch size 128 per node), with the sole exception of the mult parameter, for which we found low constant values () to be sufficient. Appendix 8 presents additional experiments for ResNet50/ImageNet and ResNet20/CIFAR-10, which further substantiate this claim.

7 Discussion and Future Work

We have analyzed, for the first time, the convergence of decentralized SGD in the population model of distributed computing. We have shown that, despite the extremely weak synchronization characteristics of this model, SGD can still converge in this setting and, moreover, under parameter and objective assumptions, can even achieve linear speedup in the number of agents in terms of parallel time. The empirical results confirmed our analytical findings. The main surprising result is that PopSGD presents speedup behavior roughly similar to mini-batch SGD, even though a node only sees one gradient update and a single model at a time. This asymptotic speedup behavior is optimal (assuming all other parameters are constant), since we cannot expect super-linear speedup in . Similar speedup behavior required either the existence of synchronized rounds (e.g. [32]), or global averaging steps [45], or both. Our work opens several avenues for future work. One natural extension is to study PopSGD with quantized communication, or to allow the interactions to present inconsistent (stale) model views to the two agents. Another avenue is to tighten the bounds in terms of their dependence on the problem conditioning and on the objective assumptions.

References

Convex Losses.

In these experiments, we examine the convergence of PopSGD versus parallel time for different node counts, and compare it with the sequential baseline. More precisely, for PopSGD, we execute the protocol by simulating the entire sequence of interactions sequentially, and track the evolution of train and test loss at an arbitrary fixed model with respect to the number of SGD steps it performs. Notice that this is practically equivalent to tracking with respect to parallel time. In this case, the theory suggests that loss convergence and variance should both improve when increasing the number of nodes. Figure 3(a) presents the results for the synthetic linear regression example with , for various values of , for constant learning rate across all models, and batch size for each local gradient. Figure 3(b) compares PopSGD convergence (with local batch size ) against sequential mini-batch SGD with batch size equal to the number of nodes .

Examining Figure 3(a), we observe that both the convergence and loss variance improve as we increase the number of nodes , even though the target model executes exactly the same number of gradient steps at the same point on the axis. Of note, variance decreases proportionally with the number of nodes, with having the smallest variance. Compared to mini-batch SGD with batch size = (Figure 3(b)), PopSGD with has similar, but notably higher variance, which follows the analytical bound in Theorem 4.1.
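A sequential simulation of this kind can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the experimental code: the helper names (`make_data`, `stochastic_grad`, `loss`, `simulate_popsgd`) are hypothetical, the learning rate is constant, each local gradient uses a single sample, and interacting pairs are always two distinct nodes. Each interaction follows the PopSGD rule of one stochastic gradient step at each of the two agents, followed by averaging of the two models.

```python
import random


def make_data(num_samples, dim, noise, rng):
    """Synthetic least squares: b = <a, w_true> + Gaussian noise."""
    w_true = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(num_samples):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        b = sum(ai * wi for ai, wi in zip(a, w_true)) + rng.gauss(0, noise)
        data.append((a, b))
    return data


def stochastic_grad(x, sample):
    """Gradient of 0.5 * (<a, x> - b)^2 with respect to x, for one sample."""
    a, b = sample
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    return [residual * ai for ai in a]


def loss(x, data):
    """Average of 0.5 * (<a, x> - b)^2 over the data set."""
    total = 0.0
    for a, b in data:
        total += 0.5 * (sum(ai * xi for ai, xi in zip(a, x)) - b) ** 2
    return total / len(data)


def simulate_popsgd(data, num_nodes, num_interactions, lr, seed=0):
    """Simulate pairwise interactions one after another."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    models = [[0.0] * dim for _ in range(num_nodes)]
    for _ in range(num_interactions):
        i, j = rng.sample(range(num_nodes), 2)  # random pair of distinct nodes
        stepped = []
        for node in (i, j):
            g = stochastic_grad(models[node], rng.choice(data))
            stepped.append([x - lr * gx for x, gx in zip(models[node], g)])
        averaged = [(u + v) / 2 for u, v in zip(stepped[0], stepped[1])]
        models[i] = list(averaged)
        models[j] = list(averaged)
    return models


data = make_data(num_samples=200, dim=5, noise=0.1, rng=random.Random(42))
initial_loss = loss([0.0] * 5, data)
models = simulate_popsgd(data, num_nodes=8, num_interactions=4000, lr=0.02)
final_loss = loss(models[0], data)  # loss tracked at one arbitrary fixed node
```

Recording `loss(models[0], data)` every time node 0 takes part in an interaction gives a curve indexed by that node's local step count, which is the quantity plotted against in the experiments above.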

CIFAR-10 Experiments.

We illustrate convergence and scaling results for non-convex objectives by using PopSGD to train a standard ResNet20 DNN model on CIFAR-10 in PyTorch, using 8 GPU nodes, comparing against vanilla and local SGD performing global averaging every 100 batches (we found this value necessary for the model to converge). We measure the error/loss at an arbitrary process for PopSGD. We run the parallel versions at and nodes.

The results in Figure 4(c) show that (a,b) PopSGD does indeed converge faster as we increase population size, tracking the trend from the convex case; and (c) PopSGD can provide non-trivial scalability, comparable to or better than data-parallel and local SGD.

Training ResNet50 on ImageNet. Figure 4 shows the test and train accuracies for the ResNet50 model trained on the ImageNet dataset, with 32 Piz Daint nodes and . PopSGD achieves test accuracy within relative to the Torchvision baseline, despite the far smaller number of iterations, in a total of 29 hours. By way of comparison, end-to-end training using standard data-parallel SGD takes approximately 48h on the same setup (using 8 GPUs instead of 32 to avoid large-batch effects).

9 Complete Correctness Argument

Lemma 4.2.

Let , be the learning rate estimate of agent at time step , in terms of its time estimate . Then, there exists a constant such that, with probability at least (here, is the total number of steps our algorithm takes), the following holds for every and agent :

$$\frac{1}{2} \le \frac{\eta_t}{\eta_t^i} \le 2. \tag{9.1}$$
Proof.

Let for some fixed constant . The following lemma is proved as Theorem 2.10 in [38]:

Lemma 9.1.

For any , and some fixed constants and ,

Subsequently, we can show that for any and agent :

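$$\Pr\left[\left|\frac{t}{n} - V_t^i\right| \ge q \zeta \log T\right] \le \Pr[G_t \ge T^q] \overset{\text{Markov}}{\le} \frac{4 \theta}{\zeta \epsilon} \cdot \frac{n}{T^q}. \tag{9.2}$$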

Hence, for large enough constant , using union bound over steps, we can show that there exists a constant such that for every and agent , , with probability at least .

Let be , thus . This allows us to finish the proof of the lemma:

$$\frac{1}{2} \le \frac{a + t - q \zeta n \log T}{a + t} \le \frac{\eta_t}{\eta_t^i} \le \frac{a + t + q \zeta n \log T}{a + t} \le 2. \tag{9.3}$$

This allows us to bound the per step change of potential , in terms of global learning rate .

Lemma 4.3.

For any time step and fixed learning rate used at , we have the bound

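$$\mathbb{E}[\Gamma_{t+1} \mid \Gamma_t] \le \left(1 - \frac{1}{n}\right) \Gamma_t + 4 \eta_t M \left(\frac{\Gamma_t}{n}\right)^{1/2} + 8 \eta_t^2 M^2.$$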
Proof.

First we bound change in potential for some time step . Let be a change in potential when we choose different agents and at random and let be a change in potential when we select the same node . We get that

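$$\mathbb{E}[\Delta_t \mid X_t] = \sum_{i} \sum_{j \ne i} \frac{1}{n^2} \mathbb{E}[\Delta_t^{i,j} \mid X_t] + \sum_{i=1}^{n} \frac{1}{n^2} \mathbb{E}[\Delta_t^{i} \mid X_t]. \tag{9.4}$$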

We proceed by bounding a change in potential for fixed . Observe, that in this case
and .
Hence,

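$$X_{t+1}^i - \mu_{t+1} = X_{t+1}^j - \mu_{t+1} = \frac{X_t^i + X_t^j}{2} - \frac{n-2}{2n} \left(\eta_t \tilde{g}_i(X_t^i) + \eta_t \tilde{g}_j(X_t^j)\right) - \mu_t.$$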

For , since we get that

$$X_{t+1}^k - \mu_{t+1} = X_t^k + \frac{1}{n} \left(\eta_t \tilde{g}_i(X_t^i) + \eta_t \tilde{g}_j(X_t^j)\right) - \mu_t.$$
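Both displacement identities can be checked numerically on a small scalar instance. The sketch below assumes the update rule stated in the abstract (a local gradient step at each of the two interacting agents, followed by averaging of the two models); the values used for the stochastic gradients are arbitrary stand-ins, and all variable names are chosen here for illustration.

```python
import random

rng = random.Random(1)
n, eta = 6, 0.1
X = [rng.uniform(-1, 1) for _ in range(n)]  # scalar models X_t^1, ..., X_t^n
mu = sum(X) / n                             # mean model mu_t
i, j, k = 0, 1, 2                           # interacting pair (i, j) and a bystander k
gi, gj = rng.uniform(-1, 1), rng.uniform(-1, 1)  # stand-ins for stochastic gradients

# One interaction: local SGD steps at i and j, then pairwise averaging.
averaged = ((X[i] - eta * gi) + (X[j] - eta * gj)) / 2
X_next = list(X)
X_next[i] = X_next[j] = averaged
mu_next = sum(X_next) / n

# Displacement of an interacting node from the new mean.
lhs_pair = X_next[i] - mu_next
rhs_pair = (X[i] + X[j]) / 2 - (n - 2) / (2 * n) * (eta * gi + eta * gj) - mu

# Displacement of a node that did not take part in the interaction.
lhs_other = X_next[k] - mu_next
rhs_other = X[k] + (eta * gi + eta * gj) / n - mu
```

Both pairs agree up to floating-point error, because the interaction shifts the mean by exactly `-(eta * gi + eta * gj) / n`.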

This gives us that

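$$
\begin{aligned}
\mathbb{E}[\Delta_t^{i,j} \mid X_t]
&= \mathbb{E}\Bigl\| \frac{X_t^i + X_t^j}{2} - \frac{n-2}{2n}\bigl(\eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j)\bigr) - \mu_t \Bigr\|^2 - \|X_t^i - \mu_t\|^2 \\
&\quad + \mathbb{E}\Bigl\| \frac{X_t^i + X_t^j}{2} - \frac{n-2}{2n}\bigl(\eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j)\bigr) - \mu_t \Bigr\|^2 - \|X_t^j - \mu_t\|^2 \\
&\quad + \sum_{k \notin \{i,j\}} \Bigl( \mathbb{E}\Bigl\| X_t^k + \frac{1}{n}\bigl(\eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j)\bigr) - \mu_t \Bigr\|^2 - \|X_t^k - \mu_t\|^2 \Bigr) \\
&= 2 \Bigl\| \frac{X_t^i - \mu_t}{2} + \frac{X_t^j - \mu_t}{2} \Bigr\|^2 - \|X_t^i - \mu_t\|^2 - \|X_t^j - \mu_t\|^2 \\
&\quad - \frac{n-2}{n}\, \mathbb{E}\bigl\langle \eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j),\ (X_t^i - \mu_t) + (X_t^j - \mu_t) \bigr\rangle \\
&\quad + 2 \Bigl(\frac{n-2}{2n}\Bigr)^2 \mathbb{E}\bigl\| \eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j) \bigr\|^2 \\
&\quad + \sum_{k \notin \{i,j\}} \Bigl( \frac{2}{n}\, \mathbb{E}\bigl\langle \eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j),\ X_t^k - \mu_t \bigr\rangle + \frac{1}{n^2}\, \mathbb{E}\bigl\| \eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j) \bigr\|^2 \Bigr).
\end{aligned}
$$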

Observe that

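$$
\mathbb{E}\bigl\| \eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j) \bigr\|^2
\le 2 (\eta_t^i)^2\, \mathbb{E}\|\tilde{g}_i(X_t^i)\|^2 + 2 (\eta_t^j)^2\, \mathbb{E}\|\tilde{g}_j(X_t^j)\|^2
\overset{\text{Fact (???)}}{\le} 2 M^2 \bigl( (\eta_t^i)^2 + (\eta_t^j)^2 \bigr)
\overset{\text{Lemma ???}}{\le} 16 \eta_t^2 M^2
$$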

and

$$\sum_{k=1}^{n} \mathbb{E}\bigl\langle \eta_t^i \tilde{g}_i(X_t^i) + \eta_t^j \tilde{g}_j(X_t^j),\ X_t^k - \mu_t \bigr\rangle = 0.$$

Thus, we have that

 E[Δi,jt|Xt] ≤ 2∥(Xit−μt)/2+(Xjt−μt)/2∥2−∥Xit−μt∥2−∥Xjt−μt∥2 −E⟨ηit˜g