# Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

We consider decentralized stochastic optimization with the objective function (e.g. data samples for a machine learning task) being distributed over n machines that can only communicate with their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω ≤ 1 (ω = 1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate O(1/(nT) + 1/(T δ^2 ω)^2) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Although compression quality and network connectivity affect the higher order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time O(1/(δ^2 ω) log(1/ϵ)) for accuracy ϵ > 0. This is (to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω > 0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms outperform the respective state-of-the-art baselines and that CHOCO-SGD can reduce communication by at least two orders of magnitude.

## 1 Introduction

Decentralized machine learning methods are becoming core aspects of many important applications, both in view of scalability to larger datasets and systems, and from the perspective of data locality, ownership and privacy. In this work we address the general data-parallel setting where the data is distributed across different compute devices, and consider decentralized optimization methods that do not rely on a central coordinator (e.g. parameter server) but instead only require on-device computation and local communication with neighboring devices. This covers for instance the classic setting of training machine learning models in large data-centers, but also emerging applications where the computations are executed directly on the consumer devices, which keep their part of the data private at all times. (Note that the optimization process itself, as for instance the computed result, might leak information about the data of other nodes. We do not focus on quantifying notions of privacy in this work.)

Formally, we consider optimization problems distributed across $n$ devices or nodes of the form

$$f^\star := \min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) \right], \tag{1}$$

where $f_i \colon \mathbb{R}^d \to \mathbb{R}$ for $i \in [n] := \{1, \dots, n\}$ are the objectives defined by the local data available on each node. We also allow each local objective $f_i$ to have stochastic optimization (or sum) structure, covering the important case of empirical risk minimization in distributed machine learning and deep learning applications.

#### Decentralized Communication.

We model the network topology as a graph $G = ([n], E)$ with edges $\{i, j\} \in E$ if and only if nodes $i$ and $j$ are connected by a communication link, meaning that these nodes can directly exchange messages (for instance computed model updates). The decentralized setting is motivated by the fact that centralized topologies (corresponding to a star graph) are often not possible, and otherwise often pose a significant bottleneck on the central node in terms of communication latency, bandwidth and fault tolerance. Decentralized topologies avoid these bottlenecks and thereby offer hugely improved potential in scalability. For example, while the master node in the centralized setting receives (and sends) in each round messages from all workers, $\Theta(n)$ in total (for better connected topologies, more efficient all-reduce and broadcast implementations are sometimes available), in decentralized topologies the maximal degree of the network is often constant (e.g. ring or torus) or a slowly growing function in $n$ (e.g. scale-free networks).

#### Decentralized Optimization.

For the case of deterministic (full-gradient) optimization, recent seminal theoretical advances show that the network topology only affects higher-order terms of the convergence rate of decentralized optimization algorithms on convex problems (Scaman et al., 2017, 2018). We prove the first analogous result for the important case of decentralized stochastic gradient descent (SGD), proving convergence at rate $O(1/(nT))$ (ignoring for now higher order terms) on strongly convex functions, where $T$ denotes the number of iterations.

This result is significant since stochastic methods are highly preferred for their efficiency over deterministic gradient methods in machine learning applications. Our algorithm, Choco-SGD, is as efficient in terms of iterations as centralized mini-batch SGD (and consequently also achieves a speedup of factor $n$ compared to the serial setting on a single node) but avoids the communication bottleneck that centralized algorithms suffer from.

#### Communication Compression.

In distributed training, model updates (or gradient vectors) have to be exchanged between the worker nodes. To reduce the amount of data that has to be sent, gradient compression has become a popular strategy, for instance by quantization (Alistarh et al., 2017; Wen et al., 2017; Lin et al., 2018) or sparsification (Wangni et al., 2018; Stich et al., 2018).

These ideas have recently been introduced also to the decentralized setting by Tang et al. (2018a). However, their analysis only covers unbiased compression operators with very (unreasonably) high accuracy constraints. Here we propose the first method that supports arbitrarily low accuracy and even biased compression operators, such as in (Alistarh et al., 2018; Lin et al., 2018; Stich et al., 2018).

#### Contributions.

Our contributions can be summarized as follows:

• We show that the proposed Choco-SGD converges at rate $O\big(1/(nT) + 1/(T\delta^2\omega)^2\big)$, where $T$ denotes the number of iterations, $n$ the number of workers, $\delta$ the eigengap of the gossip (connectivity) matrix and $\omega \le 1$ the compression quality factor ($\omega = 1$ meaning no compression). We show that the decentralized method achieves the same speedup as centralized mini-batch SGD when the number $n$ of workers grows. The network topology and the compression only mildly affect the convergence rate. This is verified experimentally on the ring topology and by reducing the communication by a factor of 100 ($\omega = \frac{1}{100}$).

• We present the first provably-converging gossip algorithm with communication compression, for the distributed average consensus problem. Our algorithm, Choco-Gossip, converges linearly at rate $O\big(\frac{1}{\delta^2\omega}\log\frac{1}{\epsilon}\big)$ for accuracy $\epsilon > 0$, and allows arbitrary communication compression operators (including biased and unbiased ones). In contrast, previous work required very high-precision quantization and could only show convergence towards a neighborhood of the optimal solution.

• Choco-SGD significantly outperforms state-of-the-art methods for decentralized optimization with gradient compression, such as ECD-SGD and DCD-SGD introduced in (Tang et al., 2018a), in all our experiments.

## 2 Related Work

Stochastic gradient descent (SGD) (Robbins & Monro, 1951; Bottou, 2010) and variants thereof are the standard algorithms for machine learning problems of the form (1), though SGD is an inherently serial algorithm that does not take the distributed setting into account. Mini-batch SGD (Dekel et al., 2012) is the natural parallelization of SGD for (1) in the centralized setting, i.e. when a master node collects the updates from all worker nodes, and serves as a baseline here.

#### Decentralized Optimization.

The study of decentralized optimization algorithms can be traced back at least to the 1980s (Tsitsiklis, 1984). Decentralized algorithms are sometimes referred to as gossip algorithms (Kempe et al., 2003; Xiao & Boyd, 2004; Boyd et al., 2006) as the information is not broadcast by a central entity, but spreads, similar to gossip, along the edges specified by the communication graph. The most popular algorithms are based on (sub)gradient descent (Nedić & Ozdaglar, 2009; Johansson et al., 2010), alternating direction method of multipliers (ADMM) (Wei & Ozdaglar, 2012; Iutzeler et al., 2013) or dual averaging (Duchi et al., 2012; Nedić et al., 2015). He et al. (2018) address the more specific problem class of generalized linear models.
For the deterministic (non-stochastic) convex version of (1) a recent line of work developed optimal algorithms based on acceleration (Jakovetić et al., 2014; Scaman et al., 2017, 2018; Uribe et al., 2018). Rates for the stochastic setting are derived in (Shamir & Srebro, 2014; Rabbat, 2015), under the assumption that the distributions on all nodes are equal. This is a strong restriction which prohibits most distributed machine learning applications. Our algorithm Choco-SGD avoids any such assumption. Also, (Rabbat, 2015) requires multiple communication rounds per stochastic gradient computation and so is not suited for sparse communication, as the required number of communication rounds would increase proportionally to the sparsity. Lan et al. (2018) applied gradient sliding techniques that allow skipping some of the communication rounds.
Lian et al. (2017); Tang et al. (2018b, a); Assran et al. (2018) consider the non-convex setting, with Tang et al. (2018a) also applying gradient quantization techniques to reduce the communication cost. However, their algorithms require very high precision quantization, a constraint we can overcome here.

Instead of transmitting a full dimensional (gradient) vector $g \in \mathbb{R}^d$, methods with gradient compression transmit a compressed vector $Q(g)$ instead, where $Q \colon \mathbb{R}^d \to \mathbb{R}^d$ is a (random) operator chosen such that $Q(g)$ can be more efficiently represented, for instance by using limited bit representation (quantization) or enforcing sparsity. A class of very common quantization operators is based on random dithering (Goodall, 1951; Roberts, 1962) that is in addition also unbiased, $\mathbb{E}_Q Q(x) = x$, $\forall x \in \mathbb{R}^d$, see (Alistarh et al., 2017; Wen et al., 2017; Zhang et al., 2017). Much sparser vectors can be obtained by random sparsification techniques that randomly mask the input vectors and only preserve a constant number of coordinates (Wangni et al., 2018; Konecny & Richtárik, 2018; Stich et al., 2018). Techniques that do not directly quantize gradients, but instead maintain additional states are known to perform better in theory and practice (Seide et al., 2014; Lin et al., 2018; Stich et al., 2018), an approach that we pick up here. Our analysis also covers deterministic and biased compression operators, such as in (Alistarh et al., 2018; Stich et al., 2018). We will not further distinguish between sparsification and quantization approaches, and refer to both of them as compression operators in the following.

#### Distributed Average Consensus.

In the decentralized setting, the average consensus problem consists in finding the average vector of $n$ local vectors (see (2) below for a formal definition). The problem is an important sub-routine of many decentralized algorithms. It is well known that gossip-type algorithms converge linearly for average consensus (Kempe et al., 2003; Xiao & Boyd, 2004; Olfati-Saber & Murray, 2004; Boyd et al., 2006). However, for consensus algorithms with compressed communication it has been remarked that the standard gossip algorithm does not converge to the correct solution (Xiao et al., 2005). The proposed schemes in (Carli et al., 2007; Nedić et al., 2008; Aysal et al., 2008; Carli et al., 2010b; Yuan et al., 2012) converge only to a neighborhood of the solution (whose size depends on the compression accuracy).
In order to converge, adaptive schemes (with varying compression accuracy) have been proposed (Carli et al., 2010a; Fang & Li, 2010; Li et al., 2011; Thanou et al., 2013). However, these approaches fall back to full (uncompressed) communication to reach high accuracy. In contrast, our method converges linearly to the true solution, even for arbitrary compressed communication, without requiring adaptive accuracy. We are not aware of a method in the literature with similar guarantees.

## 3 Average Consensus with Communication Compression

In this section we present Choco-Gossip, a novel gossip algorithm for distributed average consensus with compressed communication. As mentioned, the average consensus problem is an important special case of type (1), and formalized as

$$\bar{x} := \frac{1}{n} \sum_{i=1}^n x_i, \tag{2}$$

for vectors $x_i \in \mathbb{R}^d$ distributed on $n$ nodes (consider $f_i(x) = \frac{1}{2}\|x - x_i\|^2$ in (1)). Our proposed algorithm will later serve as a crucial primitive in our optimization algorithm for the general optimization problem (1), but is of independent interest for any average consensus problem with communication constraints.

In Sections 3.1 to 3.3 below we first review existing schemes that we later consider as baselines for the numerical comparison. The novel algorithm follows in Section 3.4.

### 3.1 Gossip algorithms

The classic decentralized algorithms for the average consensus problem are gossip type algorithms (see e.g. (Xiao & Boyd, 2004)) that generate sequences $\{x_i^{(t)}\}_{t \ge 0}$ on every node $i \in [n]$ by iterations of the form

$$x_i^{(t+1)} := x_i^{(t)} + \gamma \sum_{j=1}^n w_{ij} \Delta_{ij}^{(t)}. \tag{3}$$

Here $\gamma \in (0, 1]$ denotes a stepsize parameter, $w_{ij} \in [0, 1]$ averaging weights and $\Delta_{ij}^{(t)} \in \mathbb{R}^d$ denotes a vector that is sent from node $j$ to node $i$ in iteration $t$. Note that no communication is required if $w_{ij} = 0$. If we assume symmetry, $w_{ij} = w_{ji}$, the weights naturally define the communication graph $G = ([n], E)$ with edges $\{i, j\} \in E$ if $w_{ij} > 0$ and self-loops $\{i\} \in E$ for $i \in [n]$. The convergence rate of scheme (3) crucially depends on the connectivity matrix $W \in \mathbb{R}^{n \times n}$ of the network defined as $(W)_{ij} = w_{ij}$, also called the interaction or gossip matrix.

###### Definition 1 (Gossip matrix).

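We assume that $W \in [0, 1]^{n \times n}$ is a symmetric ($W = W^\top$) doubly stochastic ($W\mathbf{1} = \mathbf{1}$, $\mathbf{1}^\top W = \mathbf{1}^\top$) matrix with eigenvalues $1 = |\lambda_1(W)| > |\lambda_2(W)| \ge \dots \ge |\lambda_n(W)|$ and spectral gap

$$\delta := 1 - |\lambda_2(W)| \in (0, 1]. \tag{4}$$

It will also be convenient to define

$$\rho := 1 - \delta, \quad \text{and} \quad \beta := \|I - W\|_2 \in [0, 2]. \tag{5}$$

Table 1 gives a few values of the spectral gap for commonly used network topologies (with uniform averaging between the nodes). It is well known that simple matrices $W$ with $\delta > 0$ do exist for every connected graph.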
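To make Definition 1 concrete, the following is a minimal NumPy sketch that builds a uniformly averaging gossip matrix for a ring and evaluates the quantities in (4) and (5). The function names and the choice of weight 1/3 on a node and each of its two neighbours are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def ring_gossip_matrix(n: int) -> np.ndarray:
    """Uniform averaging on a ring: weight 1/3 on a node and on each of its two neighbours."""
    if n < 3:
        raise ValueError("the ring construction below needs n >= 3")
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

def spectral_gap_and_beta(W: np.ndarray) -> tuple[float, float]:
    """Return (delta, beta) as in Eq. (4) and Eq. (5)."""
    eigvals = np.linalg.eigvalsh(W)               # real eigenvalues, since W is symmetric
    magnitudes = np.sort(np.abs(eigvals))[::-1]   # |lambda_1| >= |lambda_2| >= ...
    delta = 1.0 - magnitudes[1]
    beta = np.linalg.norm(np.eye(W.shape[0]) - W, ord=2)
    return float(delta), float(beta)

W = ring_gossip_matrix(9)
delta, beta = spectral_gap_and_beta(W)
print(f"ring, n=9: delta={delta:.4f}, rho={1 - delta:.4f}, beta={beta:.4f}")
```

For the ring the eigenvalues are available in closed form, $(1 + 2\cos(2\pi k/n))/3$, which gives a quick sanity check of the numerical result.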

### 3.2 Gossip with Exact Communication

For a fixed gossip matrix $W$, the classical algorithm analyzed in (Xiao & Boyd, 2004) corresponds to the choice

$$\gamma := 1, \qquad \Delta_{ij}^{(t)} := x_j^{(t)} - x_i^{(t)}, \tag{E-G}$$

in (3), with (E-G) standing for exact gossip. This scheme can also conveniently be written in matrix notation as

$$X^{(t+1)} := X^{(t)} + \gamma X^{(t)} (W - I), \tag{6}$$

for iterates $X^{(t)} := [x_1^{(t)}, \dots, x_n^{(t)}] \in \mathbb{R}^{d \times n}$.

###### Theorem 1.

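Let $\gamma \in (0, 1]$ and $\delta$ be the spectral gap of $W$. Then the iterates of (E-G) converge linearly to the average $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i^{(0)}$ with the rate

$$\sum_{i=1}^n \big\|x_i^{(t)} - \bar{x}\big\|^2 \le (1 - \gamma\delta)^{2t} \sum_{i=1}^n \big\|x_i^{(0)} - \bar{x}\big\|^2.$$

For $\gamma = 1$ this corresponds to the classic result in e.g. (Xiao & Boyd, 2004); here we slightly extend the analysis to arbitrary stepsizes. The short proof shows the elegance of the matrix notation (that we will later also adapt for the proofs that will follow).

###### Proof for γ=1.

Let $\bar{X} := [\bar{x}, \dots, \bar{x}] \in \mathbb{R}^{d \times n}$. Then for $\gamma = 1$ the theorem follows from the observation

$$\big\|X^{(t+1)} - \bar{X}\big\|_F^2 \overset{(6)}{=} \big\|(X^{(t)} - \bar{X}) W\big\|_F^2 = \big\|(X^{(t)} - \bar{X})\big(W - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big)\big\|_F^2 \le \big\|W - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big\|_2^2 \, \big\|X^{(t)} - \bar{X}\big\|_F^2 = \rho^2 \big\|X^{(t)} - \bar{X}\big\|_F^2.$$

In the second equality we used the crucial identity $X^{(t)} \big(\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big) = \bar{X}$, i.e. the algorithm preserves the average over all iterations. This can be seen from (6):

$$X^{(t+1)} \big(\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big) = X^{(t)} W \big(\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big) = X^{(t)} \big(\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big) = \bar{X},$$

by Definition 1. The proof for arbitrary $\gamma$ follows the same lines and is given in the appendix. ∎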
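The statement of Theorem 1 can be checked numerically. The sketch below, assuming a ring of 9 nodes with uniform weights and random initial vectors (both illustrative choices), iterates (6) and compares the consensus error with the bound $(1-\gamma\delta)^{2t}$ for two stepsizes.

```python
import numpy as np

n, d, T = 9, 5, 60
rng = np.random.default_rng(0)

# Ring with uniform averaging (1/3 on a node and on both of its neighbours).
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0
delta = 1.0 - np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]   # 1 - |lambda_2(W)|

def exact_gossip(X0, W, gamma, steps):
    """Iterate X <- X + gamma * X (W - I) and record sum_i ||x_i - xbar||^2."""
    X = X0.copy()
    xbar = X0.mean(axis=1, keepdims=True)
    errors = [np.sum((X - xbar) ** 2)]
    for _ in range(steps):
        X = X + gamma * X @ (W - np.eye(W.shape[0]))
        errors.append(np.sum((X - xbar) ** 2))
    return X, np.array(errors)

X0 = rng.standard_normal((d, n))   # column i plays the role of x_i^(0)
for gamma in (1.0, 0.5):
    X, errors = exact_gossip(X0, W, gamma, T)
    bounds = (1.0 - gamma * delta) ** (2 * np.arange(T + 1)) * errors[0]
    print(f"gamma={gamma}: final error {errors[-1]:.3e}, bound {bounds[-1]:.3e}")
```

The recorded errors should stay below the bound at every iteration, and the column average of `X` should remain equal to that of `X0`.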

### 3.3 Gossip with Quantized Communication

In every round of scheme (E-G) a full dimensional vector $g \in \mathbb{R}^d$ is exchanged between two neighboring nodes for every link on the communication graph (node $j$ sends $g = x_j^{(t)}$ to all its neighbors $i$, $\{i, j\} \in E$). A natural way to reduce the communication is to compress $g$ before sending it, denoted as $Q(g)$, for a (potentially random) compression $Q \colon \mathbb{R}^d \to \mathbb{R}^d$. Informally, we can think of $Q$ as either a sparsification operator (that enforces sparsity of $Q(g)$) or a quantization operator that reduces the number of bits required to represent $Q(g)$, for instance random rounding to less precise floating point numbers or to integers.

Aysal et al. (2008) propose the quantized gossip (Q1-G),

$$\gamma := 1, \qquad \Delta_{ij}^{(t)} := Q(x_j^{(t)}) - x_i^{(t)}, \tag{Q1-G}$$

in scheme (3), i.e. to apply the compression operator directly on the message $x_j^{(t)}$ that is sent out from node $j$ to node $i$. However, this algorithm does not preserve the average of the iterates over the iterations, $\frac{1}{n}\sum_{i=1}^n x_i^{(0)} \ne \frac{1}{n}\sum_{i=1}^n x_i^{(t)}$ for $t \ge 1$, and as a consequence does not converge to the optimal solution of (2) (though in practice it often reaches a close neighborhood).

An alternative proposal by Carli et al. (2007) alleviates this drawback. The scheme

$$\gamma := 1, \qquad \Delta_{ij}^{(t)} := Q(x_j^{(t)}) - Q(x_i^{(t)}), \tag{Q2-G}$$

preserves the average of the iterates over the iterations. However, the scheme also fails to converge for arbitrary precision. If $x_j^{(t)} \to \bar{x} \ne 0$, the noise introduced by the compression, $\|Q(x_j^{(t)}) - x_j^{(t)}\|$, does not vanish for $t \to \infty$. As a consequence, the iterates oscillate around $\bar{x}$ when the compression error becomes larger than the suboptimality $\|x_i^{(t)} - \bar{x}\|$.

Both these schemes have been theoretically studied in (Carli et al., 2010b) under the assumption of unbiasedness, i.e. assuming $\mathbb{E}_Q Q(x) = x$ for all $x \in \mathbb{R}^d$ (and we will later also adopt this theoretically understood setting in our experiments).
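The difference between the two schemes with respect to the average can be seen in a few lines. The sketch below uses unbiased stochastic rounding to integers as the compression operator, which is an illustrative assumption (any unbiased quantizer would do), and tracks how far the average of the iterates drifts from the initial average under (Q1-G) and (Q2-G).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 9, 4, 50

# Ring with uniform averaging (1/3 on a node and on both of its neighbours).
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

def stochastic_round(X, rng):
    """Unbiased rounding to integers: E[Q(x)] = x in every coordinate."""
    return np.floor(X + rng.random(X.shape))

def q1_step(X, W, rng):
    # Delta_ij = Q(x_j) - x_i; since sum_j w_ij = 1 this is x_i <- sum_j w_ij Q(x_j).
    return stochastic_round(X, rng) @ W

def q2_step(X, W, rng):
    # Delta_ij = Q(x_j) - Q(x_i), i.e. X <- X + Q(X) (W - I).
    Q = stochastic_round(X, rng)
    return X + Q @ (W - np.eye(W.shape[0]))

X0 = 10.0 * rng.standard_normal((d, n))
X1, X2 = X0.copy(), X0.copy()
for _ in range(T):
    X1, X2 = q1_step(X1, W, rng), q2_step(X2, W, rng)

xbar = X0.mean(axis=1)
drift_q1 = np.max(np.abs(X1.mean(axis=1) - xbar))
drift_q2 = np.max(np.abs(X2.mean(axis=1) - xbar))
print(f"average drift: Q1-G {drift_q1:.3e}, Q2-G {drift_q2:.3e}")
```

Under (Q2-G) the drift stays at the level of floating point error, because $(W - I)\mathbf{1} = 0$ cancels the quantized terms in the average, whereas under (Q1-G) the quantization error enters the average directly.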

### 3.4 Proposed Method for Compressed Communication

We propose the novel compressed gossip scheme Choco-Gossip that supports a much larger class of compression operators, beyond unbiased quantization as for the schemes above. The algorithm can be summarized as

$$\hat{x}_j^{(t+1)} := \hat{x}_j^{(t)} + Q\big(x_j^{(t)} - \hat{x}_j^{(t)}\big), \qquad \Delta_{ij}^{(t)} := \hat{x}_j^{(t+1)} - \hat{x}_i^{(t+1)}, \tag{Choco-G}$$

for a stepsize $\gamma < 1$ depending on the specific compression operator $Q$ (this will be detailed below). Here $\hat{x}_j^{(t)} \in \mathbb{R}^d$ denote additional variables that are stored by all neighbors $i$ of node $j$, $\{i, j\} \in E$, as well as on node $j$ itself. (A closer look reveals that actually only 2 additional vectors have to be stored per node; refer to Appendix E.)

We will show in Theorem 2 below that this scheme (i) preserves the averages of the iterates $x_i^{(t)}$, $i \in [n]$, over the iterations $t \ge 0$. Moreover, (ii) the noise introduced by the compression operator vanishes as $t \to \infty$. Precisely, we will show that $(x_i^{(t)}, \hat{x}_i^{(t)}) \to (\bar{x}, \bar{x})$ for $t \to \infty$ for every $i \in [n]$. Consequently, the argument for $Q$ in (Choco-G) goes to zero, and the noise introduced by $Q$ can be controlled.

The proposed scheme is summarized in Algorithm 1. Every worker $i$ stores and updates its own local variable $x_i$ as well as the variables $\hat{x}_j$ for all neighbors (including itself) $j \colon \{i, j\} \in E$.

Algorithm 1 seems to require each machine to store $\deg(i) + 2$ vectors. This is not necessary and the algorithm could be re-written in a way that every node stores only three vectors: $x_i$, $\hat{x}_i$ and the weighted sum $s_i = \sum_{j \colon \{i,j\} \in E} w_{ij} \hat{x}_j$. For simplicity, we omit this technical modification here and refer to Appendix E for the exact form of the memory-efficient algorithm.
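Since Algorithm 1 is not reproduced in this text, the following is a minimal matrix-form sketch of the (Choco-G) update, written as a simulation of all nodes at once rather than as a distributed implementation. The zero initialisation of the $\hat{x}$ variables, the top-k compressor, the ring topology and the hand-picked conservative stepsize `gamma=0.02` are assumptions made for illustration; the stepsize in particular is not the value prescribed by the analysis.

```python
import numpy as np

def top_k(Z, k):
    """Keep the k largest-magnitude entries of every column of Z and zero the rest."""
    d = Z.shape[0]
    idx = np.argpartition(np.abs(Z), d - k, axis=0)[d - k:, :]
    out = np.zeros_like(Z)
    np.put_along_axis(out, idx, np.take_along_axis(Z, idx, axis=0), axis=0)
    return out

def choco_gossip(X0, W, gamma, compress, steps):
    """Matrix form of (Choco-G): column i of X is x_i, column i of X_hat is x_hat_i."""
    X = X0.copy()
    X_hat = np.zeros_like(X0)            # assumed initialisation x_hat_i = 0
    I = np.eye(W.shape[0])
    for _ in range(steps):
        X_hat = X_hat + compress(X - X_hat)   # only Q(x_j - x_hat_j) is communicated
        X = X + gamma * X_hat @ (W - I)       # x_i += gamma * sum_j w_ij (x_hat_j - x_hat_i)
    return X

n, d = 9, 20
rng = np.random.default_rng(0)
W = np.zeros((n, n))
for i in range(n):                       # ring with uniform averaging
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

X0 = rng.standard_normal((d, n))
xbar = X0.mean(axis=1, keepdims=True)
X = choco_gossip(X0, W, gamma=0.02, compress=lambda Z: top_k(Z, 10), steps=3000)

err0 = np.sum((X0 - xbar) ** 2)
errT = np.sum((X - xbar) ** 2)
print(f"consensus error: {err0:.3e} -> {errT:.3e}")
```

Two properties are easy to verify with this sketch: the column average of `X` is preserved for any compressor, and with the identity as compressor and `gamma=1` the iteration coincides with exact gossip (E-G).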

### 3.5 Convergence Analysis for Choco-Gossip

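We analyze Algorithm 1 under the following general quality notion for the compression operator $Q$.

###### Assumption 1 (Compression operator).

We assume that the compression operator $Q \colon \mathbb{R}^d \to \mathbb{R}^d$ satisfies

$$\mathbb{E}_Q \|Q(x) - x\|^2 \le (1 - \omega)\|x\|^2, \qquad \forall x \in \mathbb{R}^d, \tag{7}$$

for a parameter $\omega > 0$. Here $\mathbb{E}_Q$ denotes the expectation over the internal randomness of operator $Q$.

Example operators that satisfy (7) include

• sparsification: Randomly selecting $k$ out of $d$ coordinates ($\operatorname{rand}_k$), or the $k$ coordinates with highest magnitude values ($\operatorname{top}_k$) give $\omega = \frac{k}{d}$ (Stich et al., 2018, Lemma A.1).

• randomized gossip: Setting $Q(x) = x$ with probability $p \in (0, 1]$ and $Q(x) = 0$ otherwise, gives $\omega = p$.

• rescaled unbiased estimators: suppose $\mathbb{E}_Q Q(x) = x$, $\forall x \in \mathbb{R}^d$, and $\mathbb{E}_Q \|Q(x)\|^2 \le \tau \|x\|^2$, then $Q'(x) := \frac{1}{\tau} Q(x)$ satisfies (7) with $\omega = \frac{1}{\tau}$.

• random quantization: For precision (levels) $s \in \mathbb{N}_+$, and $\tau = 1 + \min\{d/s^2, \sqrt{d}/s\}$ the quantization operator

$$\operatorname{qsgd}_s(x) = \frac{\operatorname{sign}(x) \cdot \|x\|}{s\tau} \cdot \left\lfloor s \frac{|x|}{\|x\|} + \xi \right\rfloor,$$

for a random variable $\xi \sim_{\text{u.a.r.}} [0, 1]^d$, satisfies (7) with $\omega = \frac{1}{\tau}$ (Alistarh et al., 2017, Lemma 3.1).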
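The operators listed above can be sketched in NumPy and checked against (7) by Monte Carlo. The implementations below are illustrative assumptions rather than the authors' code; in particular the `qsgd` function follows the rescaled random dithering form of Alistarh et al. (2017) with $\tau = 1 + \min\{d/s^2, \sqrt{d}/s\}$, and the helper `empirical_ratio` is a hypothetical name.

```python
import numpy as np

def rand_k(x, k, rng):
    """Keep k coordinates chosen uniformly at random (omega = k/d)."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

def top_k(x, k):
    """Keep the k coordinates of largest magnitude (omega = k/d)."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), x.size - k)[x.size - k:]
    out[idx] = x[idx]
    return out

def qsgd(x, s, rng):
    """Rescaled random dithering with s levels (assumed form, omega = 1/tau)."""
    d = x.size
    tau = 1.0 + min(d / s**2, np.sqrt(d) / s)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    levels = np.floor(s * np.abs(x) / norm + rng.random(d))
    return np.sign(x) * norm / (s * tau) * levels

def empirical_ratio(op, x, trials=2000):
    """Monte Carlo estimate of E||Q(x) - x||^2 / ||x||^2, to compare with 1 - omega."""
    errs = [np.sum((op(x) - x) ** 2) for _ in range(trials)]
    return float(np.mean(errs) / np.sum(x ** 2))

rng = np.random.default_rng(0)
d, k, s = 100, 10, 16
x = rng.standard_normal(d)
tau = 1.0 + min(d / s**2, np.sqrt(d) / s)

ratio_rand = empirical_ratio(lambda v: rand_k(v, k, rng), x)
ratio_top = empirical_ratio(lambda v: top_k(v, k), x, trials=1)
ratio_qsgd = empirical_ratio(lambda v: qsgd(v, s, rng), x)
print(f"rand_k: {ratio_rand:.3f} vs 1 - k/d = {1 - k / d:.3f}")
print(f"top_k:  {ratio_top:.3f} vs 1 - k/d = {1 - k / d:.3f}")
print(f"qsgd:   {ratio_qsgd:.3f} vs 1 - 1/tau = {1 - 1 / tau:.3f}")
```

For $\operatorname{rand}_k$ the bound in (7) holds with equality in expectation, while $\operatorname{top}_k$ and the rescaled quantizer typically do better than the worst case bound.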

###### Theorem 2.

Choco-Gossip (Algorithm 1) converges linearly for average consensus:

$$e_t \le \left(1 - \frac{\delta^2 \omega}{82}\right)^t e_0,$$

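when using the stepsize $\gamma := \frac{\delta^2 \omega}{16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega}$, where $\omega$ is the compression factor as in Assumption 1, and $e_t = \mathbb{E}_Q \sum_{i=1}^n \big( \|x_i^{(t)} - \bar{x}\|^2 + \|x_i^{(t)} - \hat{x}_i^{(t+1)}\|^2 \big)$.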

For the proof we refer to the appendix, where we used matrix notation to simplify derivations. For the exact communication case $\omega = 1$ we recover the rate from Theorem 1 for stepsize $\gamma < 1$ up to constant factors (which seems to be a small artifact of our proof technique). The theorem shows convergence for arbitrary $\omega > 0$, showing the superiority of scheme (Choco-G) over (Q1-G) and (Q2-G).

## 4 Decentralized Stochastic Optimization

In this section we leverage our proposed average consensus Algorithm 1 to achieve consensus among the compute nodes in a decentralized optimization setting with communication restrictions.

In the decentralized optimization setting (1), not only does every node have a different local objective $f_i \colon \mathbb{R}^d \to \mathbb{R}$, but we also allow each $f_i$ to have stochastic optimization (or sum) structure, that is

$$f_i(x) := \mathbb{E}_{\xi_i \sim D_i} F_i(x, \xi_i), \tag{8}$$

for a loss function $F_i \colon \mathbb{R}^d \times \Omega \to \mathbb{R}$ and distributions $D_1, \dots, D_n$ which can be different on every node. Our framework therefore covers both stochastic optimization (e.g. when all $D_i$ are identical) and empirical risk minimization (as in machine learning and deep learning applications) when the $D_i$'s are discrete with disjoint support.

### 4.1 Proposed Scheme for Decentralized Optimization

Our proposed method Choco-SGD—Communication-Compressed Decentralized SGD—is stated in Algorithm 2.

The algorithm consists of four parts: the stochastic gradient step in line 3, the application of the compression operator in line 4, the (Choco-G) local communication in lines 5–8, and the final iterate update in line 9.
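Because Algorithm 2 itself is not reproduced in this text, the four parts described above can be sketched as a matrix-form simulation. Everything concrete below is an assumption for illustration: the quadratic local objectives $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, the noisy gradient oracle `make_quadratic_oracle`, the zero initialisation of the $\hat{x}$ variables, the constant SGD stepsize and the hand-picked consensus stepsize.

```python
import numpy as np

def top_k(Z, k):
    """Keep the k largest-magnitude entries of every column of Z and zero the rest."""
    d = Z.shape[0]
    idx = np.argpartition(np.abs(Z), d - k, axis=0)[d - k:, :]
    out = np.zeros_like(Z)
    np.put_along_axis(out, idx, np.take_along_axis(Z, idx, axis=0), axis=0)
    return out

def choco_sgd(grad_oracle, X0, W, gamma, compress, stepsize, T):
    """Sketch of the four parts: SGD step, compression, exchange, iterate update."""
    X = X0.copy()
    X_hat = np.zeros_like(X0)                      # assumed initialisation
    I = np.eye(W.shape[0])
    for t in range(T):
        G = grad_oracle(X)                         # column i: stochastic gradient of f_i at x_i
        X_half = X - stepsize(t) * G               # local stochastic gradient step
        X_hat = X_hat + compress(X_half - X_hat)   # compress and exchange the differences
        X = X_half + gamma * X_hat @ (W - I)       # gossip update of the iterate
    return X

def make_quadratic_oracle(C, sigma, rng):
    """Gradients of f_i(x) = 0.5 * ||x - c_i||^2 plus Gaussian noise of scale sigma."""
    def oracle(X):
        return (X - C) + sigma * rng.standard_normal(X.shape)
    return oracle

n, d = 9, 20
rng = np.random.default_rng(0)
W = np.zeros((n, n))
for i in range(n):                                 # ring with uniform averaging
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

C = rng.standard_normal((d, n))                    # minimiser of f is the column mean of C
X0 = np.zeros((d, n))
X = choco_sgd(make_quadratic_oracle(C, 0.05, rng), X0, W, gamma=0.02,
              compress=lambda Z: top_k(Z, 10), stepsize=lambda t: 0.1, T=300)
gap = np.linalg.norm(X.mean(axis=1) - C.mean(axis=1))
print(f"distance of the averaged iterate to the optimum: {gap:.3e}")
```

Because the gossip update leaves the average of the iterates unchanged, the averaged iterate follows the same recursion as centralized SGD with the averaged gradients, which is the mechanism behind the leading term of the rate.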

###### Remark 3.

As a special case without any communication compression, and for consensus stepsize $\gamma = 1$ as in exact gossip (E-G), Choco-SGD (Algorithm 2) recovers the following standard variant of decentralized SGD with gossip (similar e.g. to (Sirb & Ye, 2016; Lian et al., 2017)), stated for illustration in Algorithm 3.

### 4.2 Convergence Analysis for Choco-SGD

###### Assumption 2.

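We assume that each function $f_i \colon \mathbb{R}^d \to \mathbb{R}$ for $i \in [n]$ is $L$-smooth and $\mu$-strongly convex and that the variance on each worker is bounded

$$\mathbb{E}_{\xi_i} \|\nabla F_i(x, \xi_i) - \nabla f_i(x)\|^2 \le \sigma_i^2, \qquad \mathbb{E}_{\xi_i} \|\nabla F_i(x, \xi_i)\|^2 \le G^2, \qquad \forall x \in \mathbb{R}^d, \; i \in [n],$$

where $\mathbb{E}_{\xi_i}$ denotes the expectation over $\xi_i \sim D_i$. It will also be convenient to denote

$$\bar{\sigma}^2 := \frac{1}{n} \sum_{i=1}^n \sigma_i^2.$$

For the (standard) definitions of smoothness and strong convexity we refer to Appendix A.1. These assumptions could be relaxed to only hold for $x \in \{x_i^{(t)}\}_{t=0}^{T}$, the set of iterates of Algorithm 2.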

###### Theorem 4.

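Under Assumption 2, Algorithm 2 with SGD stepsizes $\eta_t := \frac{4}{\mu(a + t)}$ for parameter $a \ge \max\big\{\frac{410}{\delta^2\omega}, 16\kappa\big\}$ for condition number $\kappa = \frac{L}{\mu}$ and consensus stepsize $\gamma$ chosen as in Theorem 2, converges with the rate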

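$$\mathbb{E}\,\Upsilon^{(T)} = O\left(\frac{\bar{\sigma}^2}{\mu n T}\right) + O\left(\frac{\kappa G^2}{\mu \omega^2 \delta^4 T^2}\right) + O\left(\frac{G^2}{\mu \omega^3 \delta^6 T^3}\right),$$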

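where $\Upsilon^{(T)} := f(x_{\mathrm{avg}}^{(T)}) - f^\star$ for an averaged iterate $x_{\mathrm{avg}}^{(T)} = \frac{1}{S_T} \sum_{t=0}^{T-1} w_t \bar{x}^{(t)}$ with weights $w_t = (a + t)^2$, and $S_T = \sum_{t=0}^{T-1} w_t$. As a reminder, $\delta$ denotes the eigengap of $W$, and $\omega$ the compression ratio.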

For the proof we refer to the appendix. When $T$ and $\bar{\sigma}$ are sufficiently large, the last two terms become negligible compared to $O\big(\frac{\bar{\sigma}^2}{\mu n T}\big)$, and we recover the convergence rate of mini-batch SGD in the centralized setting with exact communication. This is because topology (parameter $\delta$) and compression (parameter $\omega$) only affect the higher-order terms in the rate. We also see that we obtain in this setting an $n\times$ speedup compared to the serial implementation of SGD on only one worker.

## 5 Experiments

In this section we first compare Choco-Gossip to the gossip baselines in Section 5.2 and then compare Choco-SGD to state-of-the-art decentralized stochastic optimization schemes (that also support compressed communication) in Section 5.3.

### 5.1 Shared Experimental Setup

For our experiments we always report the number of iterations of the respective scheme, as well as the number of transmitted bits. These quantities are independent of systems architectures and network bandwidth.

#### Datasets.

In the experiments we rely on the epsilon (Sonnenburg et al., 2008) and rcv1 (Lewis et al., 2004) datasets (cf. Table 4).

#### Compression operators.

We use the ($\operatorname{rand}_{1\%}$), ($\operatorname{top}_{1\%}$) and ($\operatorname{qsgd}_s$) compression operators as described in Section 3.5, where we choose $k$ to be $1\%$ of all coordinates and $s \in \{2^4, 2^8\}$, only requiring $4$, respectively $8$ bits to represent a coordinate.

Note that in contrast to Choco-Gossip, the earlier schemes (Q1-G) and (Q2-G) were both analyzed in (Carli et al., 2010b) for unbiased compression operators. In order to reflect this theoretically understood setting we use the rescaled operators ($\frac{d}{k} \cdot \operatorname{rand}_k$) and ($\tau \cdot \operatorname{qsgd}_s$) in combination with those schemes.

### 5.2 Average Consensus

We compare the performance of the gossip schemes (E-G) (exact communication), (Q1-G), (Q2-G) (both with unbiased compression), and our scheme (Choco-G) in Figure 3 for the ($\operatorname{qsgd}_{256}$) compression scheme and in Figure 3 for the random ($\operatorname{rand}_{1\%}$) compression scheme. In addition, we also depict the performance of Choco-Gossip with biased ($\operatorname{top}_{1\%}$) compression. We use the ring topology with uniformly averaging mixing matrix $W$ as in Figure 1, left. The stepsizes $\gamma$ that were used for Choco-Gossip are listed in Table 4. We consider here the consensus problem (2), where the data $x_i^{(0)} \in \mathbb{R}^d$ on the $i$-th machine was chosen to be the $i$-th vector in the dataset. We depict the errors $\frac{1}{n}\sum_{i=1}^n \|x_i^{(t)} - \bar{x}\|^2$.

The proposed scheme (Choco-G) with 8 bit quantization converges with the same rate as (E-G) that uses exact communications (Fig. 3, left), while it requires much less data to be transmitted (Fig. 3, right). The schemes (Q1-G) and (Q2-G) do not converge and stall at a limited accuracy. The scheme (Q1-G) even starts to diverge, because the quantization error becomes larger than the optimization error.

With sparsified communication ($\operatorname{rand}_{1\%}$), i.e. transmitting only $1\%$ of all the coordinates, the scheme (Q1-G) quickly zeros out all the coordinates, and (Q2-G) diverges because the quantization error is too large already from the first step (Fig. 3). Choco-Gossip proves to be more robust and converges. The observed rate matches the theoretical findings, as we expect the scheme with factor $100\times$ compression to be $100\times$ slower than (E-G) without compression. In terms of total data transmitted, both schemes converge at the same speed (Fig. 3, right). We also see that ($\operatorname{top}_{1\%}$) sparsification can give additional gains and comes out as the most data-efficient method in these experiments.

### 5.3 Decentralized SGD

We assess the performance of Choco-SGD on logistic regression, defined as $\frac{1}{m}\sum_{j=1}^m \log\big(1 + \exp(-b_j a_j^\top x)\big) + \frac{1}{2m}\|x\|^2$, where $a_j \in \mathbb{R}^d$ and $b_j \in \{-1, +1\}$ are the data samples and $m$ denotes the number of samples in the dataset. We distribute the $m$ data samples evenly among the $n$ workers and consider two settings: (i) randomly shuffled, where datapoints are randomly assigned to workers, and the more difficult (ii) sorted setting, where each worker only gets data samples from one class (with the possible exception of one worker that gets two labels assigned). Moreover, we try to make the setting as difficult as possible, meaning that e.g. on the ring topology the machines with the same label form two connected clusters. We repeat each experiment three times and depict the mean curve and the area corresponding to one standard deviation. We plot suboptimality, i.e. $f(\bar{x}^{(t)}) - f^\star$ (with $f^\star$ obtained by an optimizer from scikit-learn (Pedregosa et al., 2011)), versus the number of iterations and the number of transmitted bits between workers, which is proportional to the actual running time if communication is a bottleneck.
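The two data distribution settings can be sketched as follows. The function name `partition_indices` and the synthetic labels are illustrative assumptions; the sketch only shows how an even split of label-sorted indices leaves at most one worker with samples from both classes.

```python
import numpy as np

def partition_indices(labels, n_workers, mode, rng):
    """Split sample indices evenly among workers, either shuffled or sorted by label."""
    if mode == "shuffled":
        order = rng.permutation(len(labels))
    elif mode == "sorted":
        order = np.argsort(labels, kind="stable")   # all samples of one class first
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.array_split(order, n_workers)         # contiguous, nearly equal chunks

rng = np.random.default_rng(0)
labels = rng.choice([-1, 1], size=900, p=[0.3, 0.7])   # synthetic labels, for illustration only

for mode in ("shuffled", "sorted"):
    parts = partition_indices(labels, 9, mode, rng)
    classes_per_worker = [len(np.unique(labels[p])) for p in parts]
    print(mode, classes_per_worker)
```

Assigning the chunks to consecutive nodes of a ring then places workers with the same label next to each other, which corresponds to the two connected clusters described above.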

#### Algorithms.

As baselines we consider Alg. 3 with exact communication (denoted as ‘plain’) and the communication efficient state-of-the-art optimization schemes DCD-SGD and ECD-SGD recently proposed in (Tang et al., 2018a) (for unbiased quantization operators) and compare them to Choco-SGD. We use a decaying stepsize $\eta_t$ whose parameters are individually tuned for each algorithm and compression scheme, with values given in Table 4.

#### Impact of Topology.

In Figure 4 we depict the performance of the baseline Algorithm 3 with exact communication on different topologies (ring, torus and fully-connected; Fig. 1) with uniformly averaging mixing matrix $W$. Note that Algorithm 3 for the fully-connected graph corresponds to mini-batch SGD. Increasing the number of workers shows only a mild effect of the network topology on the convergence. We observe that the sorted setting is more difficult than the randomly shuffled setting (see Fig. 7 in Appendix G), where the convergence behavior remains almost unaffected. In the following we focus on the hardest case, i.e. the ring topology.

#### Comparison to Baselines.

Figures 5 and 6 depict the performance of these algorithms on the ring topology for sorted data of the epsilon and rcv1 datasets. Choco-SGD performs almost as well as the exact Algorithm 3 in all situations, but uses $100\times$ less communication with ($\operatorname{rand}_{1\%}$) sparsification (Fig. 5, right) and also considerably less communication with quantization. The ($\operatorname{top}_{1\%}$) variant performs slightly better than ($\operatorname{rand}_{1\%}$) sparsification.

Choco-SGD consistently outperforms DCD-SGD in all settings. We also observed that DCD-SGD starts to perform better for a larger number of levels $s$ in the quantization operator (increasing communication cost). This is consistent with the reporting in (Tang et al., 2018a) that assumed high precision quantization. As a surprise to us, ECD-SGD, which was proposed in (Tang et al., 2018a) as the preferred alternative over DCD-SGD for less precise quantization operators, always performs worse than DCD-SGD, and often diverges.

Figures for randomly shuffled data can be found in Appendix G. In that case Choco-SGD performs exactly as well as the exact Algorithm 3 in all situations.

#### Conclusion.

The experiments verify our theoretical findings: Choco-Gossip is the first linearly convergent gossip algorithm with quantized communication and Choco-SGD consistently outperforms the baselines for decentralized optimization, reaching almost the same performance as the exact algorithm without communication restrictions while significantly reducing communication cost. In view of the striking popularity of SGD as opposed to full-gradient methods for deep learning, the application of Choco-SGD to decentralized deep learning, an instance of problem (1), is a promising direction.

#### Acknowledgments.

We acknowledge funding from SNSF grant 200021_175796, as well as a Google Focused Research Award.

## References

• Aldous & Fill (2002) Aldous, D. and Fill, J. A. Reversible Markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/aldous/RWG/book.html.
• Alistarh et al. (2017) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), NIPS - Advances in Neural Information Processing Systems 30, pp. 1709–1720. Curran Associates, Inc., 2017.
• Alistarh et al. (2018) Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., and Renggli, C. The convergence of sparsified gradient methods. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 5977–5987. Curran Associates, Inc., 2018.
• Assran et al. (2018) Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic Gradient Push for Distributed Deep Learning. arXiv, November 2018.
• Aysal et al. (2008) Aysal, T. C., Coates, M. J., and Rabbat, M. G. Distributed average consensus with dithered quantization. IEEE Transactions on Signal Processing, 56(10):4905–4918, Oct 2008. ISSN 1053-587X.
• Bottou (2010) Bottou, L. Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. and Saporta, G. (eds.), Proceedings of COMPSTAT’2010, pp. 177–186, Heidelberg, 2010. Physica-Verlag HD. ISBN 978-3-7908-2604-3.
• Boyd et al. (2006) Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. Randomized gossip algorithms. IEEE/ACM Trans. Netw., 14(SI):2508–2530, June 2006. ISSN 1063-6692.
• Carli et al. (2007) Carli, R., Fagnani, F., Frasca, P., Taylor, T., and Zampieri, S. Average consensus on networks with transmission noise or quantization. In 2007 European Control Conference (ECC), pp. 1852–1857, July 2007.
• Carli et al. (2010a) Carli, R., Bullo, F., and Zampieri, S. Quantized average consensus via dynamic coding/decoding schemes. International Journal of Robust and Nonlinear Control, 20:156–175, 2010a. ISSN 1049-8923.
• Carli et al. (2010b) Carli, R., Frasca, P., Fagnani, F., and Zampieri, S. Gossip consensus algorithms via quantized communication. Automatica, 46:70–80, 2010b. ISSN 0005-1098.
• Dekel et al. (2012) Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13(1):165–202, January 2012. ISSN 1532-4435.
• Duchi et al. (2012) Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, March 2012. ISSN 0018-9286.
• Fang & Li (2010) Fang, J. and Li, H. Distributed estimation of Gauss-Markov random fields with one-bit quantized data. IEEE Signal Processing Letters, 17(5):449–452, May 2010. ISSN 1070-9908.
• Goodall (1951) Goodall, W. M. Television by pulse code modulation. The Bell System Technical Journal, 30(1):33–49, Jan 1951. ISSN 0005-8580.
• He et al. (2018) He, L., Bian, A., and Jaggi, M. Cola: Decentralized linear learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4541–4551. Curran Associates, Inc., 2018.
• Iutzeler et al. (2013) Iutzeler, F., Bianchi, P., Ciblat, P., and Hachem, W. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Proceedings of the 52nd IEEE Conference on Decision and Control, CDC 2013, December 10-13, 2013, Firenze, Italy, pp. 3671–3676. IEEE, 2013.
• Jakovetić et al. (2014) Jakovetić, D., Xavier, J., and Moura, J. M. F. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, May 2014. ISSN 0018-9286.
• Johansson et al. (2010) Johansson, B., Rabi, M., and Johansson, M. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2010. doi: 10.1137/08073038X.
• Kempe et al. (2003) Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’03, pp. 482–, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-2040-5.
• Konecny & Richtárik (2018) Konecny, J. and Richtárik, P. Randomized Distributed Mean Estimation: Accuracy vs. Communication. Frontiers in Applied Mathematics and Statistics, 4:1502, December 2018.
• Lan et al. (2018) Lan, G., Lee, S., and Zhou, Y. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, Dec 2018. ISSN 1436-4646.
• Lewis et al. (2004) Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. Rcv1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397, December 2004. ISSN 1532-4435.
• Li et al. (2011) Li, T., Fu, M., Xie, L., and Zhang, J. Distributed consensus with limited communication data rate. IEEE Transactions on Automatic Control, 56(2):279–292, Feb 2011. ISSN 0018-9286.
• Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 5330–5340. Curran Associates, Inc., 2017.
• Lin et al. (2018) Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, B. Deep gradient compression: Reducing the communication bandwidth for distributed training. In ICLR 2018 - International Conference on Learning Representations, 2018.
• Nedić & Ozdaglar (2009) Nedić, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, Jan 2009. ISSN 0018-9286.
• Nedić et al. (2008) Nedić, A., Olshevsky, A., Ozdaglar, A., and Tsitsiklis, J. N. Distributed subgradient methods and quantization effects. In Proceedings of the 47th IEEE Conference on Decision and Control, CDC 2008, pp. 4177–4184, 2008. ISBN 9781424431243.
• Nedić et al. (2015) Nedić, A., Lee, S., and Raginsky, M. Decentralized online optimization with global objectives and local communication. In 2015 American Control Conference (ACC), pp. 4497–4503, July 2015.
• Olfati-Saber & Murray (2004) Olfati-Saber, R. and Murray, R. M. Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on Automatic Control, 49(9):1520–1533, Sep. 2004. ISSN 0018-9286.
• Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
• Rabbat (2015) Rabbat, M. Multi-agent mirror descent for decentralized stochastic optimization. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 517–520, Dec 2015.
• Rakhlin et al. (2012) Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pp. 1571–1578, USA, 2012. Omnipress. ISBN 978-1-4503-1285-1.
• Robbins & Monro (1951) Robbins, H. and Monro, S. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.
• Roberts (1962) Roberts, L. Picture coding using pseudo-random noise. IRE Transactions on Information Theory, 8(2):145–154, February 1962. ISSN 0096-1000.
• Scaman et al. (2017) Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulié, L. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3027–3036, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
• Scaman et al. (2018) Scaman, K., Bach, F., Bubeck, S., Massoulié, L., and Lee, Y. T. Optimal algorithms for non-smooth distributed optimization in networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 2745–2754. Curran Associates, Inc., 2018.
• Seide et al. (2014) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Li, H., Meng, H. M., Ma, B., Chng, E., and Xie, L. (eds.), INTERSPEECH, pp. 1058–1062. ISCA, 2014.
• Shamir & Srebro (2014) Shamir, O. and Srebro, N. Distributed stochastic optimization and learning. 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 850–857, 2014.
• Sirb & Ye (2016) Sirb, B. and Ye, X. Consensus optimization with delayed and stochastic gradients on decentralized networks. In 2016 IEEE International Conference on Big Data (Big Data), pp. 76–85, Dec 2016.
• Sonnenburg et al. (2008) Sonnenburg, S., Franc, V., Yom-Tov, E., and Sebag, M. Pascal large scale learning challenge. 25th International Conference on Machine Learning (ICML2008) Workshop. J. Mach. Learn. Res, 10:1937–1953, 01 2008.
• Stich (2018) Stich, S. U. Local SGD Converges Fast and Communicates Little. arXiv e-prints, art. arXiv:1805.09767, May 2018.
• Stich et al. (2018) Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4452–4463. Curran Associates, Inc., 2018.
• Tang et al. (2018a) Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. Communication compression for decentralized training. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7663–7673. Curran Associates, Inc., 2018a.
• Tang et al. (2018b) Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. D²: Decentralized training over decentralized data. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4848–4856, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018b. PMLR.
• Thanou et al. (2013) Thanou, D., Kokiopoulou, E., Pu, Y., and Frossard, P. Distributed average consensus with quantization refinement. IEEE Transactions on Signal Processing, 61(1):194–205, Jan 2013. ISSN 1053-587X.
• Tsitsiklis (1984) Tsitsiklis, J. N. Problems in decentralized decision making and computation. PhD thesis, Massachusetts Institute of Technology, 1984.
• Uribe et al. (2018) Uribe, C. A., Lee, S., and Gasnikov, A. A Dual Approach for Optimal Algorithms in Distributed Optimization over Networks. arXiv, September 2018.
• Wangni et al. (2018) Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 1306–1316. Curran Associates, Inc., 2018.
• Wei & Ozdaglar (2012) Wei, E. and Ozdaglar, A. Distributed alternating direction method of multipliers. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 5445–5450, Dec 2012.
• Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), NIPS - Advances in Neural Information Processing Systems 30, pp. 1509–1519. Curran Associates, Inc., 2017.
• Xiao & Boyd (2004) Xiao, L. and Boyd, S. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004. ISSN 0167-6911.
• Xiao et al. (2005) Xiao, L., Boyd, S., and Lall, S. A scheme for robust distributed sensor fusion based on average consensus. In IPSN 2005. Fourth International Symposium on Information Processing in Sensor Networks, 2005., pp. 63–70, April 2005. doi: 10.1109/IPSN.2005.1440896.
• Yuan et al. (2012) Yuan, D., Xu, S., Zhao, H., and Rong, L. Distributed dual averaging method for multi-agent optimization with quantized communication. Systems & Control Letters, 61(11):1053 – 1061, 2012. ISSN 0167-6911. doi: https://doi.org/10.1016/j.sysconle.2012.06.004. URL http://www.sciencedirect.com/science/article/pii/S0167691112001193.
• Zhang et al. (2017) Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., and Zhang, C. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4035–4043, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/zhang17e.html.

## Appendix A Basic Identities and Inequalities

### A.1 Smooth and Strongly Convex Functions

###### Definition 2.

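A differentiable function $f\colon \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex for parameter $\mu \geq 0$ if

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \qquad \forall x, y \in \mathbb{R}^d. \tag{9}$$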
###### Definition 3.

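A differentiable function $f\colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for parameter $L \geq 0$ if

$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \qquad \forall x, y \in \mathbb{R}^d. \tag{10}$$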
###### Remark 5.

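If $f$ is $L$-smooth with minimizer $x^\star$ s.t. $\nabla f(x^\star) = 0$, then

$$\|\nabla f(x)\|^2 = \|\nabla f(x) - \nabla f(x^\star)\|^2 \leq 2L\,\bigl(f(x) - f(x^\star)\bigr). \tag{11}$$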

### A.2 Vector and Matrix Inequalities

###### Remark 6.

For matrices $A \in \mathbb{R}^{d \times n}$ and $B \in \mathbb{R}^{n \times n}$,

$$\|AB\|_F \leq \|A\|_F \|B\|_2. \tag{12}$$
###### Remark 7.

For an arbitrary set of $n$ vectors $\{a_i\}_{i=1}^n$, $a_i \in \mathbb{R}^d$,

$$\Bigl\|\sum_{i=1}^n a_i\Bigr\|^2 \leq n \sum_{i=1}^n \|a_i\|^2. \tag{13}$$
###### Remark 8.

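For two given vectors $a, b \in \mathbb{R}^d$,

$$2\langle a, b \rangle \leq \gamma \|a\|^2 + \gamma^{-1} \|b\|^2, \qquad \forall \gamma > 0. \tag{14}$$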
###### Remark 9.

For two given vectors $a, b \in \mathbb{R}^d$,

$$\|a + b\|^2 \leq (1 + \alpha)\|a\|^2 + (1 + \alpha^{-1})\|b\|^2, \qquad \forall \alpha > 0. \tag{15}$$

This inequality also holds for the sum of two matrices in Frobenius norm.
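As a quick sanity check, inequalities (13) and (15) can be verified numerically. The sketch below uses randomly drawn vectors; the helper names (`check_sum_bound`, `check_young_bound`) and the chosen dimensions are illustrative assumptions, not part of the original analysis.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(c * c for c in v)


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def check_sum_bound(vectors):
    """Both sides of (13): ||sum_i a_i||^2 and n * sum_i ||a_i||^2."""
    total = [0.0] * len(vectors[0])
    for v in vectors:
        total = add(total, v)
    lhs = sq_norm(total)
    rhs = len(vectors) * sum(sq_norm(v) for v in vectors)
    return lhs, rhs


def check_young_bound(a, b, alpha):
    """Both sides of (15): ||a+b||^2 and (1+alpha)||a||^2 + (1+1/alpha)||b||^2."""
    lhs = sq_norm(add(a, b))
    rhs = (1 + alpha) * sq_norm(a) + (1 + 1 / alpha) * sq_norm(b)
    return lhs, rhs


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(4)]
print(check_sum_bound(vectors))
for alpha in (0.1, 1.0, 10.0):
    print(alpha, check_young_bound(vectors[0], vectors[1], alpha))
```

Inequality (15) is tight when $b = \alpha a$, which is why the free parameter $\alpha$ can be tuned in the proofs to balance the two terms.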

### A.3 Implications of the bounded gradient and bounded variance assumption

###### Remark 10.

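If $F_i(\cdot, \xi)$, $i \in [n]$, are convex functions with $\mathbb{E}_{\xi}\|\partial F_i(x, \xi)\|^2 \leq G^2$ for all $x \in \mathbb{R}^d$, then

$$\mathbb{E}_{\xi_1, \dots, \xi_n}\|\partial F(X, \xi)\|_F^2 \leq nG^2, \qquad \forall X.$$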
###### Remark 11 (Mini-batch variance).

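If for functions $f_i$, $F_i$ defined in (8) it holds that $\mathbb{E}_{\xi}\|\nabla F_i(x, \xi) - \nabla f_i(x)\|^2 \leq \sigma_i^2$ for all $x \in \mathbb{R}^d$ and $i \in [n]$, then

$$\mathbb{E}_{\xi_1^{(t)}, \dots, \xi_n^{(t)}}\Bigl\|\frac{1}{n}\sum_{j=1}^n \bigl(\nabla f_j(x_j^{(t)}) - \nabla F_j(x_j^{(t)}, \xi_j^{(t)})\bigr)\Bigr\|^2 \leq \frac{\bar{\sigma}^2}{n},$$

where $\bar{\sigma}^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2$.

###### Proof.

This follows from

$$\mathbb{E}\Bigl\|\frac{1}{n}\sum_{j=1}^n Y_j\Bigr\|^2 = \frac{1}{n^2}\Bigl(\sum_{j=1}^n \mathbb{E}\|Y_j\|^2 + \sum_{i \neq j} \mathbb{E}\langle Y_i, Y_j \rangle\Bigr) = \frac{1}{n^2}\sum_{j=1}^n \mathbb{E}\|Y_j\|^2 \leq \frac{1}{n^2}\sum_{j=1}^n \sigma_j^2 = \frac{\bar{\sigma}^2}{n},$$

for $Y_j = \nabla f_j(x_j^{(t)}) - \nabla F_j(x_j^{(t)}, \xi_j^{(t)})$. The expectation of each scalar product is equal to zero because $\xi_i^{(t)}$ is independent of $\xi_j^{(t)}$ since $i \neq j$, and each $Y_j$ has zero mean. ∎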
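The $1/n$ variance reduction in Remark 11 can also be observed in simulation. The sketch below is a Monte Carlo check under illustrative assumptions: the gradient noise of node $j$ is modeled as independent zero-mean Gaussian noise with second moment $\sigma_j^2$, and $\bar{\sigma}^2$ is taken to be the average of the $\sigma_j^2$ (our reading of the remark). The function name and all parameter values are hypothetical.

```python
import random


def averaged_noise_second_moment(sigmas, dim=3, trials=20000, seed=0):
    """Estimate E|| (1/n) sum_j noise_j ||^2 for independent zero-mean noise.

    noise_j stands in for grad f_j(x_j) - grad F_j(x_j, xi_j); here it is
    Gaussian with E||noise_j||^2 = sigmas[j]^2 (an illustrative assumption).
    """
    rng = random.Random(seed)
    n = len(sigmas)
    total = 0.0
    for _ in range(trials):
        avg = [0.0] * dim
        for sigma in sigmas:
            # Per-coordinate std sigma / sqrt(dim) gives E||noise||^2 = sigma^2.
            for k in range(dim):
                avg[k] += rng.gauss(0.0, sigma / dim ** 0.5) / n
        total += sum(c * c for c in avg)
    return total / trials


sigmas = [1.0, 2.0, 0.5, 1.5]
sigma_bar_sq = sum(s * s for s in sigmas) / len(sigmas)
estimate = averaged_noise_second_moment(sigmas)
# The estimate should be close to the bound sigma_bar^2 / n.
print(estimate, sigma_bar_sq / len(sigmas))
```

Because the simulated noise attains each per-node bound with equality, the estimate should sit close to $\bar{\sigma}^2/n$ rather than strictly below it; correlated noise across nodes would break the argument, since the cross terms would no longer vanish.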

## Appendix B Consensus in Matrix notation

In the proofs in the next section we will use the matrix notation, as already introduced in the main text. We define

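$$X^{(t)} := [x_1^{(t)}, \dots, x_n^{(t)}] \in \mathbb{R}^{d \times n}, \quad Q^{(t)} := [q_1^{(t)}, \dots, q_n^{(t)}] \in \mathbb{R}^{d \times n}, \quad \hat{X}^{(t)} := [\hat{x}_1^{(t)}, \dots, \hat{x}_n^{(t)}] \in \mathbb{R}^{d \times n}. \tag{16}$$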
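To make the column convention concrete, the following sketch stores the node vectors as the columns of a $d \times n$ matrix and applies one step of exact (uncompressed) gossip averaging by right-multiplication with a mixing matrix. The matrix `W` and the values in `X` are hypothetical examples, and the step shown is plain averaging, not Algorithm 1 itself.

```python
def column(X, i):
    """Return column i of a d x n matrix stored as a list of rows."""
    return [row[i] for row in X]


def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def average_column(X):
    """Average of the n node vectors, i.e. (1/n) * X * 1."""
    n = len(X[0])
    return [sum(row) / n for row in X]


# d = 2, n = 3: node i holds the vector in column i.
X = [[1.0, 4.0, 7.0],
     [2.0, 5.0, 8.0]]

# Hypothetical symmetric, doubly stochastic mixing matrix on a 3-node ring.
W = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]

X_next = matmul(X, W)
print(column(X_next, 0))                          # node 0 after one gossip step
print(average_column(X), average_column(X_next))  # the average is preserved
```

Right-multiplication by `W` replaces each column by a weighted average of the columns of its neighbors, and because the rows and columns of `W` each sum to one, the average over all nodes is left unchanged.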

Then using matrix notation we can rewrite Algorithm 1 as