# MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling

The trade-off between convergence error and communication delay in decentralized stochastic gradient descent (SGD) is dictated by the sparsity of the inter-worker communication graph. In this paper, we propose MATCHA, a decentralized SGD method that uses matching decomposition sampling of the base graph to parallelize inter-worker information exchange and thereby significantly reduce communication delay. At the same time, under standard assumptions and for any topology, MATCHA maintains the same epoch-wise convergence rate as the state-of-the-art despite the large reduction in communication delay. Experiments on a suite of datasets and deep neural networks validate the theoretical analysis and demonstrate the effectiveness of the proposed scheme in reducing communication delays.


## 1 Introduction

Due to the enormous size of the training data used today, distributing the data and computation over a network of worker nodes has attracted intensive research effort in recent years. In this paper, we focus on parallelizing synchronous SGD in a decentralized setting without a central coordinator (i.e., without a parameter server). Given an arbitrary network topology, each node can only exchange parameters or gradients with its local neighbors. This scenario is common and useful when training in large-scale sensor networks, multi-agent systems, and federated learning on edge devices.

Error-Runtime Trade-off in Decentralized SGD. In the context of decentralized optimization, previous works have studied the error convergence in terms of iterations or communication rounds for decentralized gradient descent [22, 7, 40, 42, 10, 27, 30, 13], mostly for (strongly) convex loss functions. Recent works have extended the analysis to decentralized SGD for non-convex loss functions and subsequently applied it to distributed deep learning in both synchronous [17, 11, 35] and asynchronous settings [1, 18]. However, most existing works do not explicitly consider how the topology affects the runtime, that is, the wall-clock time required to complete each SGD iteration. Well-connected networks encourage faster consensus and give better mean-square-error convergence rates, but they incur communication delays that grow with the node degree. To strike the best error-runtime trade-off, one can carefully design the network topology, for example using expander graphs that are sparse yet well connected [6, 23]. However, system constraints such as locality may preclude us from designing such arbitrary network topologies. Other approaches that optimize the per-epoch convergence rate of decentralized procedures through efficient link scheduling or by constraining the number of allowable links have also been proposed [4]. However, these design criteria do not account for the wall-clock time, which depends on the parallel versus sequential scheduling of the communication links, as we describe later. This raises a pertinent question: for a given topology of worker nodes, how can we achieve the fastest convergence in terms of mean square error versus wall-clock time for a synchronous decentralized SGD algorithm?

Related Works. There has been a massive amount of work on algorithms [37, 33, 36, 5, 26] and systems [16, 43, 9] that improve the communication efficiency of synchronous distributed SGD in a fully connected network. However, it is still unclear whether these strategies can be directly applied to a general decentralized setting. Given an arbitrary network topology, recent works [14, 29] propose to compress the transmitted messages to reduce the communication bandwidth. However, these methods may not help if the network latency (i.e., the time to establish handshakes) is high. Other communication-efficient schemes [25, 31], which reduce the number of communication rounds by sparsifying communication over time, have also been proposed, but they do not take communication delays into account. Instead, we focus on the complementary idea of reducing the effective node degree so as to reduce the communication delay, which is suitable for both high-latency and low-bandwidth settings and can easily be combined with existing compression schemes.

Our Proposed Method Matcha. In this paper, we propose Matcha, a decentralized SGD method based on matching decomposition sampling that drastically reduces the communication delay per iteration for any given node topology while maintaining the same error-convergence speed. The following key ideas allow us to achieve this: 1) we decompose the graph topology into matchings consisting of disjoint communication links that can operate in parallel, saving communication delay; 2) in each iteration, we carefully sample a subset of these matchings to construct a sparse subgraph of the base topology; and 3) this sequence of subgraphs yields more frequent communication over connectivity-critical links (ensuring fast error convergence) and less frequent communication over other links (saving communication delay).

An illustration of the advantages of using Matcha is presented in Figure 1. It shows that the reduction of communication time at different nodes is not uniform. In particular, when the communication budget is reduced relative to vanilla decentralized SGD, connectivity-critical links are still used for communication with high priority. As a result, the communication time at low-degree nodes barely changes. On the other hand, links incident to the busiest node are used for communication infrequently, so the communication time at the highest-degree node, which is the bottleneck of the runtime per iteration, is directly reduced. We further validate the effectiveness of Matcha through theoretical analysis and extensive experiments (see Sections 4 and 5).

Besides a win-win in the wall-clock time versus error trade-off, Matcha has several practical benefits. First, the proposed algorithm is simple, in the sense that the communication schedule (i.e., the sequence of sparse subgraphs) of Matcha can be obtained a priori; there is no additional runtime overhead during training. Furthermore, Matcha provides a highly flexible communication scheme: by setting the communication budget, one can easily tailor the communication time to various system and problem settings, allowing a better trade-off between communication and computation. In our experiments on CIFAR-100, Matcha achieves a severalfold reduction in communication delay per iteration, and a substantial reduction in the wall-clock time needed to reach the same training accuracy.

## 2 Problem Formulation and Preliminaries

Consider a network of $m$ worker nodes. The communication links connecting the workers are represented by an arbitrary (possibly sparse) undirected connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with vertex set $\mathcal{V}$ and edge set $\mathcal{E}$. Each node can only communicate with its neighbors; that is, node $i$ can communicate with node $j$ only if $(i, j) \in \mathcal{E}$.

Furthermore, each worker node $i$ only has access to its own local data distribution $\mathcal{D}_i$. Our objective is to use this network of nodes to train a model on the joint dataset. In other words, we seek to minimize the objective function $F(x)$, which is defined as follows:

$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{m}\sum_{i=1}^{m} F_i(x) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{s \sim \mathcal{D}_i}\left[\ell(x; s)\right] \qquad (1)$$

where $x \in \mathbb{R}^d$ denotes the model parameters (for instance, the weights and biases of a neural network), $F_i$ is the local objective function at node $i$, $s$ denotes a single data sample, and $\ell(x; s)$ is the loss function for sample $s$, defined by the learning model.

Decentralized SGD (DecenSGD). Decentralized SGD (or consensus-based distributed SGD) [32, 22, 40, 42, 11, 17, 10] is an effective way to optimize the empirical risk (1) in the considered setting. The algorithm alternates between consensus and local gradient steps as follows (one can also use an alternative update rule in which the gradient step is applied after the consensus step; all insights and conclusions in this paper remain the same):

$$x_i^{(k+1)} = \underbrace{\sum_{j=1}^{m} W_{ij}}_{\text{consensus step}} \underbrace{\left[x_j^{(k)} - \eta\, g\!\left(x_j^{(k)}; \xi_j^{(k)}\right)\right]}_{\text{local gradient step}} \qquad (2)$$

where $\xi_j^{(k)}$ denotes a mini-batch sampled uniformly at random from the local data distribution $\mathcal{D}_j$ at iteration $k$, $g(\cdot\,; \xi)$ denotes the stochastic gradient, and $W_{ij}$ is the $(i, j)$-th element of the mixing matrix $W$. In particular, $W_{ij} \neq 0$ only if nodes $i$ and $j$ are connected, i.e., $(i, j) \in \mathcal{E}$. In order to guarantee that all nodes reach consensus and converge to a common stationary point, the mixing matrix $W$ is taken to be symmetric and doubly stochastic. For instance, if a node is connected to only two other nodes, the corresponding row of $W$ can place a constant weight on the node itself and each of its two neighbors, and zeros elsewhere.
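The update (2) can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the ring topology, quadratic loss, and all constants below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3                       # number of workers, model dimension

# Ring topology: each worker averages itself and its two neighbors.
# W is symmetric and doubly stochastic by construction.
W = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i, i + 1):
        W[i, j % m] = 1.0 / 3.0

x = rng.standard_normal((m, d))   # row i holds worker i's local model
targets = rng.standard_normal((m, d))

def stoch_grad(xi, ti):
    # Gradient of the toy loss 0.5*||x - t||^2 plus noise,
    # standing in for a mini-batch gradient g(x; xi).
    return (xi - ti) + 0.01 * rng.standard_normal(d)

eta = 0.1
grads = np.stack([stoch_grad(x[i], targets[i]) for i in range(m)])
x = W @ (x - eta * grads)         # consensus step applied to local gradient steps
```

Each row of the product `W @ (...)` is exactly the weighted average in (2), restricted to the row's nonzero entries of `W`.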

Convergence in terms of Error Versus Wall-clock Time. The total training time of an optimization algorithm is the product of two factors: 1) the total number of iterations, and 2) the runtime per iteration. In a decentralized setup involving multiple worker nodes without a coordinating master node, both factors are closely related to the graph topology. While there is extensive literature studying the first factor [22, 7, 21], the second factor is less explored from a theoretical point of view.

In DecenSGD, each node needs to communicate with all of its neighbors at each iteration. The node with the highest degree in the graph (the busiest node) is the bottleneck in the time needed to finish one consensus step. Intuitively, the communication time per iteration increases monotonically with the maximal node degree. In general, the scaling is linear, as commonly assumed in previous works [9, 31, 6, 28, 18], since the bandwidth is limited and both the total transmitted message size and the number of handshakes are linear in the degree of the node. In this paper, we focus on this linear-scaling delay model, but the main idea can also be extended to other scaling rules. Without loss of generality, we assume the communication (sending and receiving model parameters) over one link costs one unit of time. Then, the communication per iteration takes at least as many units of time as the maximal degree. Although a denser base graph may require fewer iterations to converge, it consumes more communication time per iteration, resulting in a longer total training time.

Preliminaries on Graph Theory. The communication graph can be abstracted by an adjacency matrix $A$, where $A_{ij} = 1$ if $(i, j) \in \mathcal{E}$ and $A_{ij} = 0$ otherwise. The graph Laplacian is defined as $L = D - A$, where $D$ is the diagonal degree matrix and $D_{ii}$ denotes the $i$-th node's degree. When $\mathcal{G}$ is a connected graph, the second smallest eigenvalue $\lambda_2$ of the graph Laplacian is strictly greater than $0$ and is referred to as the algebraic connectivity [2]. A larger value of $\lambda_2$ implies a denser graph. Moreover, we will use the notion of a matching, defined as follows. [Matching] A matching in $\mathcal{G}$ is a subgraph of $\mathcal{G}$ in which each vertex is incident with at most one edge.
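The Laplacian and its algebraic connectivity are easy to check numerically. A minimal sketch with numpy on a small, arbitrary connected graph (the edge list here is illustrative, not a figure from the paper):

```python
import numpy as np

# A 4-cycle plus one pendant node: connected, 5 vertices.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4)]
n = 5

A = np.zeros((n, n))              # adjacency matrix
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

L = np.diag(A.sum(axis=1)) - A    # graph Laplacian L = D - A
eigvals = np.sort(np.linalg.eigvalsh(L))

lambda2 = eigvals[1]              # algebraic connectivity: > 0 iff connected
```

The smallest eigenvalue of any Laplacian is 0 (with eigenvector of all ones); connectivity shows up in the second smallest eigenvalue being strictly positive.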

## 3 Matcha: Proposed Matching Decomposition Sampling Strategy

Following the intuition that it is beneficial to communicate over critical links more frequently and less over other links, the algorithm consists of three key steps as follows. A brief illustration is also shown in Figure 2.

Step 1: Matching Decomposition. First, we decompose the base communication graph $\mathcal{G}$ into $M$ disjoint matchings. This decomposition can be achieved via the Misra & Gries edge coloring algorithm [20], which guarantees that the number of disjoint matchings $M$ equals either $\Delta$ or $\Delta + 1$, where $\Delta$ is the maximal degree of graph $\mathcal{G}$.

The main benefit of using matchings is that they allow parallel communication, since the links within a matching are vertex-disjoint. Recall that a matching is a set of edges without common vertices; within each matching, every node has at most one neighbor, so all links can be used to communicate in parallel. The communication time for each matching is thus exactly one unit. Inspired by this decomposition, communicating over all matchings sequentially is a simple and efficient way to implement the consensus step in decentralized training. The total communication time is then linear in the number of matchings and bounded by $\Delta + 1$ units, which matches the communication-time model discussed in Section 2 and previous works [31, 6, 28, 18].
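The decomposition step can be sketched with a simple greedy edge coloring. Note this is a simplification of the Misra & Gries algorithm the paper uses: greedy coloring may need up to $2\Delta - 1$ matchings rather than $\Delta + 1$, but each color class is still a valid matching, which is all the sketch aims to show.

```python
from collections import defaultdict

def greedy_matching_decomposition(edges):
    """Split an edge list into matchings by greedy edge coloring.

    Each edge gets the smallest color unused at both endpoints, so every
    color class is a set of vertex-disjoint edges, i.e., a matching.
    """
    edge_color = {}
    used = defaultdict(set)           # colors already used at each vertex
    for u, v in edges:
        c = 0
        while c in used[u] or c in used[v]:
            c += 1
        edge_color[(u, v)] = c
        used[u].add(c)
        used[v].add(c)
    matchings = defaultdict(list)
    for e, c in edge_color.items():
        matchings[c].append(e)
    return [matchings[c] for c in sorted(matchings)]

# Example: a 4-cycle with a chord (an arbitrary small base graph).
M = greedy_matching_decomposition([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])
```

Every edge of the base graph lands in exactly one matching, and all edges within one matching can be communicated over in parallel in one time unit.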

Step 2: Computing Matching Activation Probabilities.

In order to control the communication time, we assign each matching $\mathcal{G}_j$ a Bernoulli random variable $B_j$, which is $1$ with probability $p_j$ and $0$ otherwise, for $j \in \{1, \dots, M\}$. At each iteration, links in the $j$-th matching are used for information exchange between the corresponding worker nodes only when the realization of $B_j$ is $1$. As a result, when the $B_j$'s are independent of each other, the expected communication time per iteration can be written as

$$\text{Expected Communication Time} = \mathbb{E}\left[\sum_{j=1}^{M} B_j\right] = \sum_{j=1}^{M} p_j. \qquad (3)$$

We refer to $p_j$ as the activation probability of the $j$-th matching. By controlling the sum of all activation probabilities, one can easily change the expected communication time. When all $p_j$'s equal $1$, the algorithm reduces to vanilla DecenSGD and takes $M$ units of time to finish one consensus step. We further define the communication budget (CB) as the fraction of the communication time of vanilla DecenSGD allowed per iteration. Given a CB, there can be many feasible sets of activation probabilities. As mentioned before, a key contribution of this paper is that we give more importance to critical links, which is achieved by controlling the activation probabilities of the matchings. Formally, we choose the activation probabilities by solving the following optimization problem:

$$\max_{p_1, \dots, p_M}\ \lambda_2\!\left(\sum_{j=1}^{M} p_j L_j\right) \quad \text{subject to} \quad \sum_{j=1}^{M} p_j \le \text{CB} \cdot M, \quad 0 \le p_j \le 1,\ \forall j \in \{1, 2, \dots, M\}, \qquad (4)$$

where $L_j$ denotes the Laplacian matrix of the $j$-th subgraph, so that $\sum_{j=1}^{M} p_j L_j$ can be considered the Laplacian of the expected graph, and CB is the pre-determined communication budget. Moreover, recall that $\lambda_2$ is the algebraic connectivity and is a concave function of its matrix argument [12, 2]. Thus, it directly follows that (4) is a convex problem and can be solved efficiently.
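As a quick numerical sanity check of (3), with made-up activation probabilities (not ones obtained by solving (4)), the per-iteration communication time concentrates around the sum of the $p_j$'s:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([1.0, 0.6, 0.4])       # illustrative activation probabilities

# Each iteration activates matching j independently with probability p[j];
# under the unit-delay model, communication time that iteration equals the
# number of activated matchings, so its expectation is sum(p) as in Eq. (3).
iters = 200_000
B = rng.random((iters, len(p))) < p  # Bernoulli draws, one row per iteration
comm_time = B.sum(axis=1)
avg = comm_time.mean()               # concentrates around p.sum() = 2.0
```

A matching with $p_j = 1$ (a connectivity-critical one) is used every iteration, while one with $p_j = 0.4$ is used on roughly 40% of iterations.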

Step 3: Generating a Random Topology Sequence. At the $k$-th iteration, communication among nodes only happens over links in the activated topology $\mathcal{G}^{(k)}$, the union of the matchings whose Bernoulli variables are $1$, which is sparse or even disconnected. Given this activated topology, we need to further specify in what proportions the local models are averaged together in order to perform the consensus step in (2). A common practice is to use an equal-weight matrix [38, 12, 7] as follows:

$$W^{(k)} = I - \alpha L^{(k)} = I - \alpha \sum_{j=1}^{M} B_j^{(k)} L_j, \qquad (5)$$

where $L^{(k)}$ denotes the graph Laplacian of the activated topology at the $k$-th iteration. The matrix $W^{(k)}$ is symmetric and doubly stochastic by construction. The parameter $\alpha$ represents the weight given to the neighbors' information in the consensus step. By setting a proper value of $\alpha$, the convergence of Matcha to a stationary point can be guaranteed. In particular, we select the value of $\alpha$ that minimizes the optimization error upper bound. In Section 4.2, we show that optimizing $\alpha$ can be formulated as a semi-definite programming problem; it needs to be solved only once at the beginning of training.
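Construction (5) can be verified directly: for any Laplacian $L^{(k)}$ and scalar $\alpha$, the resulting $W^{(k)}$ is symmetric with rows and columns summing to one. A minimal sketch, where the activated matching and the hand-picked $\alpha$ are illustrative (in Matcha, $\alpha$ comes from the SDP of Section 4.2):

```python
import numpy as np

n, alpha = 4, 0.5

# Activated topology at some iteration k: the single matching {(0,1), (2,3)}.
A = np.zeros((n, n))
for u, v in [(0, 1), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

Lk = np.diag(A.sum(axis=1)) - A   # Laplacian of the activated topology
Wk = np.eye(n) - alpha * Lk       # mixing matrix per Eq. (5)
```

Row sums equal one because every Laplacian row sums to zero; symmetry is inherited from `Lk`. Non-negativity of the entries additionally requires $\alpha$ to be small enough relative to the node degrees.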

To sum up, the inputs of the proposed algorithm Matcha are a base communication topology and a target communication budget CB. Following steps 1 to 3, the algorithm outputs a random topology sequence and a value of $\alpha$ that together define the inter-node information exchange. All of this information can be obtained and assigned to worker nodes a priori, before the training procedure starts.

Extension to Other Design Choices. We note that the framework of randomly activating subgraphs is very general and can be extended to various other delay models and graph decomposition methods. For example, instead of activating all matchings independently, one can choose to activate only one matching at each iteration; instead of assuming all links cost the same amount of time, one can model the communication time of each link as a random variable and modify formula (3) accordingly. Moreover, rather than matching decomposition, it is also possible to decompose the base topology into subgraphs of other types; for instance, each subgraph can be a single edge of the base graph $\mathcal{G}$.

Among all possible variants, we would like to highlight one special case: periodic DecenSGD (P-DecenSGD), which has appeared in previous works [31, 35]. In P-DecenSGD, all links in the base topology are activated together after every few iterations; in this case, the communication budget is equivalent to the communication frequency. In Sections 4 and 5, we use P-DecenSGD as another benchmark for comparison.

## 4 Theoretical Analyses

In this section, we provide convergence guarantees for Matcha. Specifically, we first provide a convergence guarantee that explicitly quantifies the dependence of the mean square error on an arbitrary random topology sequence. Then, in Section 4.2, we analyze the spectral norm of the random topology sequence generated by Matcha. All proofs are provided in the Appendix.

In order to facilitate the analysis, we define the averaged iterate as $\bar{x}^{(k)} = \frac{1}{m}\sum_{i=1}^{m} x_i^{(k)}$ and the lower bound of the objective function as $F_{\mathrm{inf}}$. Since we focus on general non-convex loss functions, the quantity of interest is the averaged gradient norm $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla F(\bar{x}^{(k)})\|^2\right]$; when it approaches zero, the algorithm converges to a stationary point. The convergence analysis is centered around the following assumptions, which are common in the distributed optimization literature [3, 22, 17]:

1. (Smoothness) Each local objective function is $L$-smooth: $\|\nabla F_i(x) - \nabla F_i(y)\| \le L\|x - y\|$.

2. (Unbiased gradients) The stochastic gradient at each worker node is an unbiased estimator of the true gradient of the local objective, conditioned on the sources of randomness up to the current iteration, i.e., the sigma algebra generated by the stochastic gradient noise and the graph activation variables before iteration $k$.

3. (Bounded variance) The variance of the stochastic gradient at each worker node is uniformly bounded: $\mathbb{E}\left[\|g_i(x) - \nabla F_i(x)\|^2 \,\middle|\, x\right] \le \sigma^2$.

### 4.1 Convergence Analysis for Arbitrary Random Topology

Theorem 1 (Basic Convergence Result). Suppose that all local models are initialized at the same iterate and $\{W^{(k)}\}$ is an i.i.d. random matrix sequence. Then, under Assumptions 1 to 3, if the learning rate $\eta$ is sufficiently small, after $K$ total iterations we have

$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla F(\bar{x}^{(k)})\right\|^2\right] \le \underbrace{\frac{2\left[F(\bar{x}^{(1)})-F_{\mathrm{inf}}\right]}{\eta K}+\frac{\eta L\sigma^2}{m}}_{\text{centralized SGD}} + \eta^2 L^2\sigma^2\frac{2\rho}{1-\rho} + \eta^2 L^2\frac{2\rho}{(1-\sqrt{\rho})^2}\cdot\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\frac{1}{m}\sum_{i=1}^{m}\left\|\nabla F_i(x_i^{(k)})\right\|^2\right], \qquad (6)$$

where $\rho$ is the spectral norm (i.e., the largest singular value) of the matrix $\mathbb{E}\left[W^{(k)\top}W^{(k)}\right] - J$. The result in Theorem 1 can be further refined by introducing assumptions on the dissimilarities among local objectives. For brevity, we simply assume that the local gradients are uniformly bounded, $\mathbb{E}\|\nabla F_i(x)\|^2 \le D$, as in [7, 39, 14], and derive the following corollary; in the Appendix, we provide another version of the corollary under a weaker assumption, as in [17].

Corollary 1. Suppose that for each local objective we have $\mathbb{E}\|\nabla F_i(x)\|^2 \le D$ and the learning rate is chosen on the order of $\sqrt{m/K}$. Then after $K$ total iterations,

$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\nabla F(\bar{x}^{(k)})\right\|^2\right] \le \frac{2L\left[F(\bar{x}^{(1)})-F_{\mathrm{inf}}\right]+\sigma^2}{\sqrt{mK}} + \frac{2m\rho}{K}\left[\frac{\sigma^2}{1-\rho}+\frac{D}{(1-\sqrt{\rho})^2}\right] = \mathcal{O}\left(\frac{1}{\sqrt{mK}}+\frac{m\rho}{K(1-\sqrt{\rho})^2}\right), \qquad (7)$$

where all the other constants are subsumed in the $\mathcal{O}(\cdot)$ notation.

Dependence on the Random Topology. Theorem 1 together with Corollary 1 shows that when the other algorithm parameters are fixed, the mean square error increases monotonically with the spectral norm $\rho$. The value of $\rho$ reflects the connectivity of the random topology: if the activated topology is fully connected at every iteration, then $\rho = 0$ and Theorem 1 recovers the convergence rate of centralized SGD. However, if there are two groups of nodes that are never connected during the whole training procedure, then $\rho = 1$, local models cannot reach consensus, and the iterates diverge. Since Matcha optimizes the connectivity of the average activated topology, it is important to guarantee $\rho < 1$. We prove this statement in Section 4.2.
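For a small example, $\rho$ can be computed exactly by enumerating the activation patterns. The sketch below uses two hypothetical matchings covering a 4-cycle, a hand-picked $\alpha$, and a common activation probability $p$; none of these values come from the paper:

```python
import numpy as np
from itertools import product

n, alpha, p = 4, 0.4, 0.7
J = np.ones((n, n)) / n

def laplacian(edges):
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return np.diag(A.sum(axis=1)) - A

L1 = laplacian([(0, 1), (2, 3)])   # two matchings that together
L2 = laplacian([(1, 2), (3, 0)])   # cover a 4-cycle

# Exact E[W^T W] over the four activation patterns (B1, B2),
# with W = I - alpha * (B1*L1 + B2*L2) as in Eq. (5).
EWW = np.zeros((n, n))
for b1, b2 in product([0, 1], repeat=2):
    W = np.eye(n) - alpha * (b1 * L1 + b2 * L2)
    prob = (p if b1 else 1 - p) * (p if b2 else 1 - p)
    EWW += prob * (W.T @ W)

rho = np.linalg.norm(EWW - J, 2)   # spectral norm appearing in Theorem 1
```

Here `rho` lands strictly between 0 and 1, which is exactly the condition the error bound (6) needs; pushing `p` toward 0 drives `rho` toward 1.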

### 4.2 Analysis for Random Topology Sequence Generated by Matcha

Theorem 2 (Existence). Suppose the base graph $\mathcal{G}$ is connected. Let $L^{(k)}$ denote the Laplacian matrix of the activated topology at the $k$-th iteration of Matcha. If the mixing matrix is defined as $W^{(k)} = I - \alpha L^{(k)}$, then there exists a value of $\alpha$ such that $\rho < 1$.

Theorem 2 together with Theorem 1 guarantees the convergence of Matcha. When the communication budget (or the activation probabilities) varies, the value of $\alpha$ should change as well. However, finding the optimal value of $\alpha$, i.e., the one minimizing the spectral norm, is not trivial, since an analytic form of $\rho$ is hard to obtain. Nevertheless, we show that optimizing $\alpha$ can be formulated as a semi-definite program, so it can be solved efficiently via numerical methods.

Theorem 3 (Optimizing $\alpha$). Given the subgraphs and their corresponding activation probabilities, optimizing the mixing matrix can be formulated as the following semi-definite programming problem:

$$\min_{\rho,\alpha,\beta}\ \rho, \qquad \text{subject to} \qquad \alpha^2-\beta\le 0, \qquad I-2\alpha\bar{L}+\beta\left[\bar{L}^2+2\tilde{L}\right]-\frac{1}{m}\mathbf{1}\mathbf{1}^{\top}\preceq\rho I, \qquad (8)$$

where $\beta$ is an auxiliary variable, and $\bar{L}$ and $\tilde{L}$ are constant matrices determined by the subgraph Laplacians $\{L_j\}$ and the activation probabilities $\{p_j\}$.

Dependence on the Communication Budget. In Figure 3, we present simulation results on how the minimal spectral norm (the solution of (8)) changes with the communication budget. Recall that a lower spectral norm means better error convergence in terms of iterations. It can be observed that Matcha can reduce communication time while preserving the same spectral norm as vanilla DecenSGD. By setting a proper communication budget (for instance, in Figure 2(b)), Matcha can even attain a lower spectral norm than vanilla DecenSGD. Besides, to achieve the same spectral norm, Matcha always requires a much smaller communication budget than periodic DecenSGD. Moreover, even with a very low communication budget, since the spectral norm only influences the higher-order terms in (7), Matcha still achieves a rate of $\mathcal{O}(1/\sqrt{mK})$ after a sufficiently large number of iterations. These theoretical findings are corroborated by extensive experiments in Section 5.

## 5 Experimental Results

Experimental Setting. We evaluate the performance of the proposed algorithm on multiple deep learning tasks: (1) image classification on CIFAR-10 and CIFAR-100 [15]; and (2) language modeling on the Penn Treebank (PTB) corpus [19]. The training datasets are evenly partitioned over a network of workers. All algorithms are trained until convergence or overfitting. To guarantee a fair comparison on each task, the learning rate is fine-tuned for vanilla DecenSGD and then kept the same for all other algorithms. More detailed descriptions of the datasets and training configurations are provided in Appendix A.1.

Effectiveness of Matcha. We compare the performance of Matcha under various communication budgets with vanilla DecenSGD in Figure 4; the base communication topology is shown in Figure 1. From Figures 3(d), 3(e), and 3(f), one can observe that when the communication budget is set to 0.5, Matcha has nearly identical training loss to vanilla DecenSGD at every epoch, while requiring at most half of the communication time per iteration. This empirical finding reinforces the claim in Section 4 that the algorithms perform similarly in terms of epochs (see Figure 2(a)). When we decrease the communication budget further, Matcha attains significantly faster convergence with respect to wall-clock time on communication-intensive tasks. In particular, on CIFAR-100 the proposed algorithm reduces the communication time per iteration and reaches the same training loss in a fraction of the wall-clock time of vanilla DecenSGD (see Figure 3(a)).

Effects of the Base Communication Topology. In order to further verify the generality of Matcha, we evaluate it on base topologies with varying connectivity. In Figure 5, we present experimental results on three different base topologies, which are random geometric graphs with different maximal degrees. In particular, on the topology of Figure 4(b), Matcha with a reduced communication budget not only lowers the communication time per iteration severalfold but also achieves a lower error than vanilla DecenSGD. This result corroborates the corresponding spectral norm versus communication budget curve shown in Figure 2(b). When we further increase the density of the base topology (see Figure 4(c)), Matcha reduces the communication time per iteration without hurting error convergence.

Another interesting observation is that Matcha gives a larger communication reduction for denser base graphs. As shown in Figure 5, as the density of the base graph increases, the training time of vanilla DecenSGD for the same number of epochs also increases. In Matcha, by contrast, since the effective maximal degree is kept roughly constant in all cases by controlling the communication budget, the total training time for the same number of epochs remains nearly the same. Moreover, Matcha takes less and less time to reach the same training loss, in contrast to vanilla DecenSGD and P-DecenSGD.

Comparison to Periodic DecenSGD. As discussed in Sections 3 and 4, a naive way to reduce the communication time per iteration is to introduce a communication frequency for the whole base graph [35, 31]. Instead, in Matcha, we allow different matchings to have different communication frequencies. Consistent with the theoretical simulations in Figure 3, the results in Figure 5 show that for a fixed communication budget, Matcha consistently outperforms periodic DecenSGD. More results are presented in the Appendix.

## 6 Concluding Remarks

In this paper, we have proposed Matcha to reduce and control the communication delay of decentralized SGD on worker networks with arbitrary topology. The key idea in Matcha is that workers communicate over connectivity-critical links with high priority, which we achieve via matching decomposition sampling. Rigorous theoretical analysis and experimental results show that Matcha reduces communication delay while maintaining the same error-convergence rate in terms of epochs. Future directions include adaptively changing the communication time per iteration, as in [34], and extending Matcha to directed communication graphs.

## Appendix A More Experimental Results

### A.1 Detailed Experimental Setting

Image Classification Tasks. CIFAR-10 and CIFAR-100 consist of 32×32 color images in 10 and 100 classes, respectively. We train ResNet-50 [8] and WideResNet-28-10 [41] on the image classification tasks. For CIFAR-10 and CIFAR-100 training, the initial learning rate decays by a constant factor at two scheduled epochs. We train vanilla DecenSGD for a fixed number of epochs and all other algorithms for the same wall-clock time as vanilla DecenSGD.

Language Modeling Task. For language modeling on the PTB dataset, a two-layer LSTM [24] is adopted. The initial learning rate decays when the training procedure saturates. All algorithms are trained for the same number of epochs.

Machines. Unless otherwise stated, the training procedure is performed on a network of nodes, each equipped with one NVIDIA TitanX Maxwell GPU and connected via Ethernet. Matcha is implemented with PyTorch and MPI4Py.

## Appendix B Proofs of Theorem 1 and Corollary 1

### B.1 Preliminaries

In the proof, we will use the following matrix forms:

$$X^{(k)} = \left[x_1^{(k)}, x_2^{(k)}, \dots, x_m^{(k)}\right], \qquad (9)$$
$$G^{(k)} = \left[g_1(x_1^{(k)}), g_2(x_2^{(k)}), \dots, g_m(x_m^{(k)})\right], \qquad (10)$$
$$\nabla F^{(k)} = \left[\nabla F_1(x_1^{(k)}), \nabla F_2(x_2^{(k)}), \dots, \nabla F_m(x_m^{(k)})\right]. \qquad (11)$$

Recall the assumptions we make:

$$\left\|\nabla F_i(x) - \nabla F_i(y)\right\| \le L\left\|x - y\right\|, \qquad (12)$$
$$\mathbb{E}\left[g_i(x) \,\middle|\, x\right] = \nabla F_i(x), \qquad (13)$$
$$\mathbb{E}\left[\left\|g_i(x) - \nabla F_i(x)\right\|^2 \,\middle|\, x\right] \le \sigma^2. \qquad (14)$$

### B.2 Lemmas

Lemma 1. Let $\{W^{(l)}\}_{l=1}^{n}$ be an i.i.d. sequence of symmetric and doubly stochastic matrices, each of size $m \times m$. Then, for any matrix $B \in \mathbb{R}^{d \times m}$,

$$\mathbb{E}\left[\left\|B\left(\prod_{l=1}^{n}W^{(l)} - J\right)\right\|_F^2\right] \le \rho^n \left\|B\right\|_F^2, \qquad (15)$$

where $\rho$ is the spectral norm of $\mathbb{E}\left[W^{(k)\top}W^{(k)}\right] - J$.

###### Proof.

For ease of writing, let us define $A_{1,n} = \prod_{l=1}^{n}W^{(l)} - J$ and use $b_i^\top$ to denote the $i$-th row vector of $B$. Since $W^{(l)}J = JW^{(l)} = J$ for all $l$, we have $A_{1,n} = A_{1,n-1}W^{(n)}$ and $A_{1,n-1}J = 0$. Thus, one can obtain

$$\left\|BA_{1,n}\right\|_F^2 = \sum_{i=1}^{d}\left\|b_i^\top A_{1,n}\right\|^2. \qquad (16)$$

Then, taking the expectation with respect to ,

 \ExsW(n)\brackets\fronormBA1,n2= d∑i=1\ExsW(n)\brackets\vecnormbi\tpA1,n2 (17) = d∑i=1\ExsW(n)\bracketsbi\tpA1,n−1(W(n)⊤W(n)−J)A1,n−1\tpbi (18) = d∑i=1bi\tpA1,n−1\ExsW(n)\brackets(W(n)⊤W(n)−J)A1,n−1\tpbi. (19)

Let $C = \mathbb{E}_{W^{(n)}}\left[W^{(n)\top}W^{(n)} - J\right]$ and $v_i = A_{1,n-1}^\top b_i$; then

$$\mathbb{E}_{W^{(n)}}\left[\left\|BA_{1,n}\right\|_F^2\right] = \sum_{i=1}^{d} v_i^\top C v_i \qquad (20)$$
$$\le \sigma_{\max}(C)\sum_{i=1}^{d} v_i^\top v_i \qquad (21)$$
$$= \rho\left\|BA_{1,n-1}\right\|_F^2. \qquad (22)$$

Repeating this procedure and using that the $W^{(l)}$'s are i.i.d. matrices, we have

$$\mathbb{E}_{W^{(1)}}\cdots\mathbb{E}_{W^{(n-1)}}\mathbb{E}_{W^{(n)}}\left[\left\|BA_{1,n}\right\|_F^2\right] \le \rho^n\left\|B\right\|_F^2. \qquad (23)$$

This completes the proof. ∎

### B.3 Proof of Theorem 1

Since the objective function is Lipschitz smooth, we have

$$F(\bar{x}^{(k+1)}) - F(\bar{x}^{(k)}) \le \left\langle \nabla F(\bar{x}^{(k)}),\, \bar{x}^{(k+1)} - \bar{x}^{(k)} \right\rangle + \frac{L}{2}\left\|\bar{x}^{(k+1)} - \bar{x}^{(k)}\right\|^2. \qquad (24)$$

Plugging in the update rule for the averaged iterate, $\bar{x}^{(k+1)} = \bar{x}^{(k)} - \eta\, G^{(k)}\frac{\mathbf{1}}{m}$, we have

$$F(\bar{x}^{(k+1)}) - F(\bar{x}^{(k)}) \le -\eta\left\langle \nabla F(\bar{x}^{(k)}),\, G^{(k)}\frac{\mathbf{1}}{m}\right\rangle + \frac{\eta^2 L}{2}\left\|G^{(k)}\frac{\mathbf{1}}{m}\right\|^2. \qquad (25)$$

Then, taking the expectation with respect to the random mini-batches at the $k$-th iteration,

$$\mathbb{E}_k\left[F(\bar{x}^{(k+1)}) - F(\bar{x}^{(k)})\right] \le -\eta\left\langle \nabla F(\bar{x}^{(k)}),\, \nabla F^{(k)}\frac{\mathbf{1}}{m}\right\rangle + \frac{\eta^2 L}{2}\mathbb{E}_k\left[\left\|G^{(k)}\frac{\mathbf{1}}{m}\right\|^2\right]. \qquad (26)$$

For the first term in (26), using the identity $2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$, we have

$$\left\langle \nabla F(\bar{x}^{(k)}),\, \nabla F^{(k)}\frac{\mathbf{1}}{m}\right\rangle = \left\langle \nabla F(\bar{x}^{(k)}),\, \frac{1}{m}\sum_{i=1}^{m} \nabla F_i(x_i^{(k)})\right\rangle \qquad (27)$$
$$= \frac{1}{2}\left[\left\|\nabla F(\bar{x}^{(k)})\right\|^2 + \left\|\frac{1}{m}\sum_{i=1}^{m} \nabla F_i(x_i^{(k)})\right\|^2 - \left\|\nabla F(\bar{x}^{(k)}) - \frac{1}{m}\sum_{i=1}^{m} \nabla F_i(x_i^{(k)})\right\|^2\right]. \qquad (28)$$

Recall that $\nabla F(\bar{x}^{(k)}) = \frac{1}{m}\sum_{i=1}^{m}\nabla F_i(\bar{x}^{(k)})$,

$$\left\|\nabla F(\bar{x}^{(k)}) - \frac{1}{m}\sum_{i=1}^{m} \nabla F_i(x_i^{(k)})\right\|^2 = \left\|\frac{1}{m}\sum_{i=1}^{m}\left[\nabla F_i(\bar{x}^{(k)}) - \nabla F_i(x_i^{(k)})\right]\right\|^2 \qquad (29)$$
$$\le \frac{1}{m}\sum_{i=1}^{m} \left\|\nabla F_i(\bar{x}^{(k)}) - \nabla F_i(x_i^{(k)})\right\|^2 \quad \text{(Jensen's inequality)} \qquad (30)$$
$$\le \frac{L^2}{m}\sum_{i=1}^{m} \left\|\bar{x}^{(k)} - x_i^{(k)}\right\|^2, \qquad (31)$$

where the last inequality follows from the Lipschitz smoothness assumption. Then, plugging (31) into (28), we obtain

$$\left\langle \nabla F(\bar{x}^{(k)}),\, \nabla F^{(k)}\frac{\mathbf{1}}{m}\right\rangle \ge \frac{1}{2}\left\|\nabla F(\bar{x}^{(k)})\right\|^2 + \frac{1}{2}\left\|\nabla F^{(k)}\frac{\mathbf{1}}{m}\right\|^2 - \frac{L^2}{2m}\left\|X^{(k)}(I - J)\right\|_F^2. \qquad (32)$$

Next, for the second term in (26),

$$\mathbb{E}_k\left[\left\|G^{(k)}\frac{\mathbf{1}}{m}\right\|^2\right] = \mathbb{E}_k\left[\left\|\frac{1}{m}\sum_{i=1}^{m}\left[g_i(x_i^{(k)}) - \nabla F_i(x_i^{(k)}) + \nabla F_i(x_i^{(k)})\right]\right\|^2\right] \qquad (33)$$
$$= \frac{1}{m^2}\sum_{i=1}^{m} \mathbb{E}_k\left[\left\|g_i(x_i^{(k)}) - \nabla F_i(x_i^{(k)})\right\|^2\right] + \left\|\frac{1}{m}\sum_{i=1}^{m} \nabla F_i(x_i^{(k)})\right\|^2 \qquad (34)$$
$$\le \frac{\sigma^2}{m} + \left\|\nabla F^{(k)}\frac{\mathbf{1}}{m}\right\|^2, \qquad (35)$$

where the last inequality follows from the bounded-variance assumption. Then, combining (32) and (35) and taking the total expectation over all random variables, one can obtain:

$$\mathbb{E}\left[F(\bar{x}^{(k+1)}) - F(\bar{x}^{(k)})\right] \le -\frac{\eta}{2}\mathbb{E}\left[\left\|\nabla F(\bar{x}^{(k)})\right\|^2\right] - \frac{\eta}{2}(1-\eta L)\,\mathbb{E}\left[\left\|\nabla F^{(k)}\frac{\mathbf{1}}{m}\right\|^2\right] + \frac{\eta L^2}{2m}\mathbb{E}\left[\left\|X^{(k)}(I-J)\right\|_F^2\right] + \frac{\eta^2 L\sigma^2}{2m}. \qquad (36)$$

Summing over all iterates and taking the average,

$$\frac{\mathbb{E}\left[F(\bar{x}^{(K)})\right] - F(\bar{x}^{(1)})}{K} \le -\frac{\eta}{2}\cdot\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\left[\left\|\nabla F(\bar{x}^{(k)})\right\|^2\right] - \frac{\eta}{2}(1-\eta L)\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\left[\left\|\nabla F^{(k)}\frac{\mathbf{1}}{m}\right\|^2\right] + \frac{\eta L^2}{2mK}\sum_{k=1}^{K} \mathbb{E}\left[\left\|X^{(k)}(I-J)\right\|_F^2\right] + \frac{\eta^2 L\sigma^2}{2m}. \qquad (37)$$

By minor rearranging, we get

$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\left[\left\|\nabla F(\bar{x}^{(k)})\right\|^2\right] \le \frac{2\,\mathbb{E}\left[F(\bar{x}^{(1)}) - F(\bar{x}^{(K)})\right]}{\eta K} - (1-\eta L)\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\left[\left\|\nabla F^{(k)}\frac{\mathbf{1}}{m}\right\|^2\right] + \frac{L^2}{mK}\sum_{k=1}^{K} \mathbb{E}\left[\left\|X^{(k)}(I-J)\right\|_F^2\right] + \frac{\eta L\sigma^2}{m} \qquad (38)$$
$$\le \frac{2\left[F(\bar{x}^{(1)}) - F_{\mathrm{inf}}\right]}{\eta K} - (1-\eta L)\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\left[\left\|\nabla F^{(k)}\frac{\mathbf{1}}{m}\right\|^2\right] + \frac{L^2}{mK}\sum_{k=1}^{K} \mathbb{E}\left[\left\|X^{(k)}(I-J)\right\|_F^2\right] + \frac{\eta L\sigma^2}{m}. \qquad (39)$$

This completes the first part of the proof. Next, we show that the discrepancy among the local models is upper bounded. According to the update rule of decentralized SGD and the special property of the gossip matrix, $W^{(k)}J = JW^{(k)} = J$, we have

$$X^{(k)}(I-J) = \left(X^{(k-1)} - \eta G^{(k-1)}\right)W^{(k-1)}(I-J) \qquad (40)$$
$$= X^{(k-1)}(I-J)W^{(k-1)} - \eta G^{(k-1)}W^{(k-1)}(I-J) \qquad (41)$$
$$\;\;\vdots \qquad (42)$$
$$= X^{(1)}(I-J)\prod_{q=1}^{k-1}W^{(q)} - \eta\sum_{q=1}^{k-1}G^{(q)}\left(\prod_{l=q}^{k-1}W^{(l)} - J\right). \qquad (43)$$

Since all local models are initialized at the same point, $X^{(1)}(I-J) = 0$. Thus, we can obtain

$$\left\|X^{(k)}(I-J)\right\|_F^2 = \eta^2\left\|\sum_{q=1}^{k-1}G^{(q)}\left(\prod_{l=q}^{k-1}W^{(l)} - J\right)\right\|_F^2 \qquad (44)$$
$$= \eta^2\left\|\sum_{q=1}^{k-1}\left(G^{(q)} - \nabla F^{(q)} + \nabla F^{(q)}\right)\left(\prod_{l=q}^{k-1}W^{(l)} - J\right)\right\|_F^2 \qquad (45)$$
$$\le 2\eta^2\underbrace{\left\|\sum_{q=1}^{k-1}\left(G^{(q)} - \nabla F^{(q)}\right)\left(\prod_{l=q}^{k-1}W^{(l)} - J\right)\right\|_F^2}_{T_1} + 2\eta^2\underbrace{\left\|\sum_{q=1}^{k-1}\nabla F^{(q)}\left(\prod_{l=q}^{k-1}W^{(l)} - J\right)\right\|_F^2}_{T_2}. \qquad (46)$$

For the first term in (46), we have

$$\mathbb{E}\left[T_1\right] = \sum_{q=1}^{k-1}\mathbb{E}\left[\left\|\left(G^{(q)} - \nabla F^{(q)}\right)\left(\prod_{l=q}^{k-1}W^{(l)} - J\right)\right\|_F^2\right] \qquad (47)$$
$$\le \sum_{q=1}^{k-1}\rho^{\,k-q}\,\mathbb{E}\left[\left\|G^{(q)} - \nabla F^{(q)}\right\|_F^2\right] \qquad (48)$$
$$\le m\sigma^2\rho\left(1 + \rho + \rho^2 + \cdots + \rho^{k-2}\right) \qquad (49)$$
$$\le \frac{m\sigma^2\rho}{1-\rho}, \qquad (50)$$

where (48) follows from Lemma 1 in Section B.2. For the second term in (46), define