# Communication-Efficient Network-Distributed Optimization with Differential-Coded Compressors

Network-distributed optimization has attracted significant attention in recent years due to its ever-increasing applications. However, the classic decentralized gradient descent (DGD) algorithm is communication-inefficient for large-scale and high-dimensional network-distributed optimization problems. To address this challenge, many compressed DGD-based algorithms have been proposed. However, most of the existing works have high complexity and assume compressors with bounded noise power. To overcome these limitations, in this paper, we propose a new differential-coded compressed DGD (DC-DGD) algorithm. The key features of DC-DGD include: i) DC-DGD works with general SNR-constrained compressors, relaxing the bounded noise power assumption; ii) The differential-coded design entails the same convergence rate as the original DGD algorithm; and iii) DC-DGD has the same low-complexity structure as the original DGD due to a self-noise-reduction effect. Moreover, the above features inspire us to develop a hybrid compression scheme that offers a systematic mechanism to minimize the communication cost. Finally, we conduct extensive experiments to verify the efficacy of the proposed DC-DGD and hybrid compressor.


## I Introduction

Network-distributed optimization, a canonical topic dating back to [1], has received significant interest in recent years thanks to its ever-increasing applications, e.g., distributed learning [2, 3, 4], multi-agent systems [5], resource allocation [6], and localization [7]. All these applications involve geographically dispersed datasets that are too big to aggregate due to high communication costs or privacy/security risks, hence necessitating distributed optimization over the network. A notable feature of network-distributed optimization is the lack of shared memory due to the absence of a dedicated parameter server, a key component in the hierarchical distributed master/slave architecture. As a result, every node can only exchange and aggregate information with its local neighbors to reach a consensus on a global optimal decision.

In the literature, a classic algorithm for solving network-distributed optimization problems is the decentralized gradient descent (DGD) method proposed by Nedic and Ozdaglar [8]. The enduring popularity of DGD lies in its simple gossip-like structure, which can be easily implemented in networks. Specifically, in each iteration, the update at each node combines a weighted average of the state information from its local neighbors (obtained by gossiping) and a gradient step based on its own local objective function and state information. Further, DGD achieves the same convergence rate as the centralized gradient descent method, implying that distributed computation does not sacrifice convergence rate.
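The DGD update described above can be sketched in a few lines. This is only an illustrative toy under stated assumptions: scalar node states, a hypothetical 3-node fully connected network with hand-picked consensus weights, and hypothetical quadratic local objectives; none of these values come from the paper.

```python
def dgd_step(x, W, grads, alpha):
    """One synchronous DGD iteration for scalar node states.

    x[i]     : current state of node i
    W[i][j]  : consensus weight node i puts on node j (0 if not neighbors)
    grads[i] : callable returning the gradient of node i's local objective
    alpha    : step-size
    """
    n = len(x)
    new_x = []
    for i in range(n):
        # Weighted average of the states gossiped by the neighbors.
        mixed = sum(W[i][j] * x[j] for j in range(n))
        # Gradient step on the node's own objective at its own state.
        new_x.append(mixed - alpha * grads[i](x[i]))
    return new_x


# Hypothetical toy problem: f_i(x) = 0.5 * (x - b_i)^2, minimized at mean(b) = 3.
b = [1.0, 2.0, 6.0]
grads = [lambda v, bi=bi: v - bi for bi in b]
W = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]

x = [0.0, 0.0, 0.0]
for _ in range(200):
    x = dgd_step(x, W, grads, alpha=0.05)
```

With a constant step-size the nodes settle near, but not exactly at, a common value; only their average matches the global minimizer in this toy, which mirrors the error-ball behavior of constant-step DGD.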

However, despite the aforementioned salient features, a major limitation of the DGD method is that it requires a full exchange of the state variables between nodes. Hence, the DGD algorithm is communication-inefficient when solving large-size high-dimensional optimization problems in networks with low-speed communication links. For example, consider a distributed image regression problem over a satellite network, where each satellite has images of typical resolution [9]. In this case, the parameter dimension is and the communication load per DGD iteration is MB (32-bit floating-point). This is problematic for many satellite networks with low-speed RF (radio frequency) links (typically in the range of hundreds of Mbps [10]). To improve DGD’s communication efficiency, recent years have seen a line of research based on exchanging compressed information between nodes (see, e.g., [11, 12, 13, 14]). Specifically, by leveraging various compression techniques (e.g., quantization/rounding [15], sparsification [16]), a high-dimensional state space can be represented by a small codebook, hence alleviating the communication load in the network.

However, although progress has been made to various extents, most of the existing works on compressed DGD algorithms suffer from the following key limitations (see Section II for more in-depth discussions): 1) extra parameter tuning resulting from far more complex algorithmic structures compared to DGD; 2) restrictive assumptions that compressors have bounded compression noise power; 3) convergence that is slow and sensitive to the problem structure; 4) strong i.i.d. (independent and identically distributed) assumptions on datasets at different locations, which often do not hold in practice. In addition, most of the existing works simply treat compressors as “blackbox operators” and do not consider how to minimize the communication load with specific compression coding scheme designs. In light of the ever-increasing demand for large-scale network-distributed data analytics, the above limitations motivate us to develop new compression-based algorithms for communication-efficient network-distributed optimization.

The major contribution of this paper is that we propose a differential-coded compression-based DGD algorithmic framework (DC-DGD), which overcomes the above limitations and offers significant improvements over the existing works. Moreover, based on the proposed DC-DGD framework, we propose a hybrid compression scheme that integrates gradient sparsification and ternary operators, which enables dynamic communication load minimization. Our main technical results and their significance are summarized as follows:

• We propose a new differential-coded DC-DGD algorithmic framework, where “differential-coded” means that the information exchanged between nodes is the differential between two successive iterations of the variables, rather than the variables themselves. We show that DC-DGD allows us to work with a wide range of general compressors that are only constrained by SNR (signal-to-noise ratio) and thus could have unbounded noise power. The use of SNR-constrained compressors relaxes the commonly adopted assumption of bounded compression noise power in the literature [11, 12, 13]. More specifically, we show that if a compressor’s SNR is greater than $(1-\lambda_N)/(1+\lambda_N)$, where $\lambda_N$ is the smallest eigenvalue of the consensus matrix used in all DGD-type algorithms, then our DC-DGD algorithm achieves the same convergence rate as the original DGD method.

• Not only does the use of SNR-constrained compressors make our DC-DGD framework more general and practical, it also induces a nice “self-compression-noise-power-reduction effect” that keeps the algorithmic structure of DC-DGD simple. More specifically, based on a quadratic Lyapunov function of the consensus form of the optimization problem, we show that the accumulated compression noise under DC-DGD shrinks to zero under SNR-constrained compressors and differential-coded information exchange. Hence, there is no need to introduce extra mechanisms or parameters to tame the accumulated compression noise for ensuring convergence. As a result, DC-DGD enjoys the same low-complexity and efficient convergence rate as the original DGD method.

• The insights on the relationship between DC-DGD and SNR-constrained compressors further inspire us to develop a hybrid compression scheme that integrates gradient sparsification and ternary operators to obtain a controllable SNR and a high compression ratio simultaneously. The proposed hybrid compression scheme achieves the best of both worlds through a carefully designed mechanism to minimize the communication load. Specifically, under the hybrid compressor, the communication load minimization can be formulated as an integer programming problem. Based on the special problem structure, we show that the problem can be solved efficiently by a greedy algorithm.

Our results in this paper contribute to the state of the art of theories and algorithm design for communication-efficient network-distributed optimization. The rest of the paper is organized as follows. In Section II, we further review related works on the state of the art of compressed DGD-based optimization algorithms. In Section III, we first present our DC-DGD algorithm and then analyze its convergence guarantees. In Section IV, we develop a family of hybrid operators and propose a greedy algorithm to choose the optimal hybrid operator. Numerical results are provided in Section V. We conclude this paper in Section VI.

## II Related Works

As mentioned earlier, compression-based DGD algorithms have received increasing attention in recent years. In this section, we provide a more in-depth survey on the state of the art in this area to put our work into comparative perspectives. Broadly speaking, compression-based DGD algorithms can be categorized as follows (some fall into multiple categories):

1) Uncoded Noise-Power-Constrained Compressed DGD: In the literature, most of the early attempts on compressed DGD were focused on noise-power-constrained compressors, which are easier to analyze. One notable recent work is the QDGD method proposed by Reisizadeh et al.[11]. The main idea of QDGD is to introduce an -scaled aggregation of compressed local copies coupled with an -scaled local gradient step, where is an extra diminishing parameter introduced in each iteration to dampen the noise power. However, due to the timid gradient step-size ( is the original local gradient step-size in DGD), the convergence rate of QDGD is , which is much slower than the original DGD. Also, the algorithm is more complex to use than DGD due to the sensitivity in tuning the extra parameter . Moreover, QDGD was focused on strongly convex cases and it is unclear whether its performance results can be straightforwardly extended to non-convex cases.

2) Differential-Coded DGD with Noise-Power-Constrained Compressors: Another more recently emerging line of research is the differential-coded DGD approach. For example, in[12], Tang et al. proposed the ECD-PSGD algorithm, where extrapolated information is used in each iteration to reduce compression noise. However, it requires computing an optimized step-size in each iteration, which leads to high per-iteration complexity. Also, the convergence rate of ECD-PSGD is , which is slower than the original DGD and its stochastic variant. Another notable example is the ADC-DGD algorithm proposed by Zhang et al.[13], where a -amplified differential-coded information (with ) is used in each iteration . It is shown in [13] that ADC-DGD achieves the same convergence rate as that of the original DGD. However, ADC-DGD runs the risk of arithmetic overflow due to the asymptotically unbounded -amplification factor. This extra -parameter selection of ADC-DGD also makes it complex to use compared to DGD.

3) Differential-Coded DGD with SNR-Constrained Compressors: The most related algorithm to ours is the DCD-PSGD algorithm proposed by Tang et al. in [12], which is so far the only differential-coded algorithm that can work with SNR-constrained compressors. Although DCD-PSGD shares these features with our work, our DC-DGD algorithm differs from DCD-PSGD in the following key aspects: i) DCD-PSGD is designed for parallel training, where a key assumption is that the data at each node are i.i.d., which guarantees that the local objectives are identical. In contrast, our work relaxes this assumption and allows the local objectives to be non-identically distributed. ii) The final output of DCD-PSGD is the average of all nodes in the network, which could be difficult to implement in network-distributed settings. In contrast, DC-DGD does not require such an averaging at the final output since each node reaches a global optimal consensus. iii) Although both algorithms work with SNR-constrained compressors, the SNR constraint of DCD-PSGD is lower bounded by , while the SNR lower bound of our DC-DGD is , where is the smallest eigenvalue of the consensus matrix. It can be readily verified that our SNR lower bound is much smaller, which implies that our DC-DGD can work with more aggressive compression schemes. iv) To achieve the best convergence rate, DCD-PSGD requires an optimal step-size that is determined by a set of complex parameters (cf. step-size “” in Theorem 1 and Corollary 2 in [12]) and is hard to implement in practice. In contrast, the step-size selection in our DC-DGD uses a simple sublinearly diminishing series and is easy to implement.

## III Differential-Coded Decentralized Gradient Descent with SNR-Constrained Compressors

In this section, we first present the problem formulation of network-distributed optimization in Section III-A. Then, we will present our DC-DGD algorithm in Section III-B and its main theoretical results in Section III-C. Lastly, we provide proof sketches for the main theoretical results in Section III-D.

### III-A Problem Formulation of Network-Distributed Optimization

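We use an undirected connected graph $\mathcal{G} = (\mathcal{N}, \mathcal{L})$ to represent a network, where $\mathcal{N}$ and $\mathcal{L}$ are the sets of nodes and links, respectively, with $|\mathcal{N}| = N$. We let $x \in \mathbb{R}^D$ denote a global decision vector to be optimized. In network-distributed optimization, we want to distributively solve a network-wide optimization problem $\min_{x \in \mathbb{R}^D} f(x)$, where $f(x)$ can be decomposed node-wise as follows:

$$\min_{x\in\mathbb{R}^D} f(x) = \min_{x\in\mathbb{R}^D} \sum_{i=1}^{N} f_i(x), \tag{1}$$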

where each local objective function is only observable to node . Problem (1) has many real-world applications. For example, in the satellite network image regression problem in Section I, each satellite distributively collects image data , where , , and represent the pixels, geographical information, and ground-truth label of the -th image at satellite , respectively, and is the size of the local dataset. Suppose that the regression is based on a linear model with parameters . Then, the problem can be written as: , where . Note that Problem (1) can be written as the following equivalent consensus form:

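$$\begin{aligned} \text{Minimize} \quad & \sum_{i=1}^{N} f_i(x_i) \\ \text{subject to} \quad & x_i = x_j, \quad \forall (i,j)\in\mathcal{L}, \end{aligned} \tag{2}$$

where $x_i$ is the local copy of $x$ at node $i$. The constraints in Problem (2) guarantee that all local copies are equal to each other, hence the name consensus form.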

### III-B The DC-DGD Algorithm

To facilitate the presentation of our DC-DGD algorithm, we first need to formally define two technical notions. The first one is the SNR-constrained unbiased stochastic compressors:

###### Definition 1 (SNR-Constrained Stochastic Unbiased Compressor).

A stochastic compression operator $C(\cdot)$ is said to be unbiased and constrained by an SNR threshold $\eta$ if it satisfies $C(z) = z + \epsilon_z$, with $\mathbb{E}[\epsilon_z] = 0$ and $\mathbb{E}[\|\epsilon_z\|^2] \le \|z\|^2/\eta$, $\forall z$.

We can see from Definition 1 that, for a given compressor, $\eta$ is its lowest SNR, yielded by its largest compression noise power. We note that SNR-constrained stochastic unbiased compressors are much less restrictive than the noise-power-constrained stochastic unbiased compressors previously assumed in the literature (see, e.g., [11, 12, 13]), which satisfy $\mathbb{E}[\epsilon_z] = 0$ and a bound on $\mathbb{E}[\|\epsilon_z\|^2]$ by a fixed constant, $\forall z$. That is, the compression noise power is universally upper bounded by a constant regardless of the input signal. In contrast, the noise power under SNR-constrained compressors could be arbitrarily large as long as it satisfies a certain SNR requirement, hence being more general. For example, the following are two typical SNR-constrained stochastic unbiased compressors:

###### Example 1.

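[The Sparsifier Operator [16]] For any vector $z$, $C(z)$ outputs a sparse vector with the $i$-th element following the Bernoulli distribution:

$$\begin{cases} \Pr\left([C(z)]_i = z_i/p\right) = p, \\ \Pr\left([C(z)]_i = 0\right) = 1-p, \end{cases}$$

where $p \in (0, 1]$ is a constant. The operation is unbiased, and since its noise power is $\mathbb{E}[\|C(z)-z\|^2] = (1/p-1)\|z\|^2$, its SNR is $p/(1-p)$.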

###### Example 2.

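[The Ternary Operator [17]] For any vector $z$, $C(z) = \|z\|_\infty \, \mathrm{sign}(z) \circ b_z$, where $\circ$ is the Hadamard product and $b_z$ is a random vector with the $i$-th element following the Bernoulli distribution:

$$\begin{cases} \Pr\left([b_z]_i = 1\right) = |z_i|/\|z\|_\infty, \\ \Pr\left([b_z]_i = 0\right) = 1 - |z_i|/\|z\|_\infty. \end{cases}$$

The operation is unbiased with noise power $\mathbb{E}[\|C(z)-z\|^2] = \|z\|_\infty\|z\|_1 - \|z\|^2$, and hence its SNR is $\|z\|^2 / (\|z\|_\infty\|z\|_1 - \|z\|^2)$.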
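The two compressors in Examples 1 and 2 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the test vector, and the Monte Carlo unbiasedness check are assumptions made for illustration and are not taken from the paper.

```python
import random


def sparsifier(z, p, rng):
    """Example 1: keep each entry with probability p and rescale it by 1/p."""
    return [zi / p if rng.random() < p else 0.0 for zi in z]


def ternary(z, rng):
    """Example 2: entry i becomes sign(z_i) * ||z||_inf with probability
    |z_i| / ||z||_inf, and 0 otherwise."""
    z_max = max(abs(zi) for zi in z)
    if z_max == 0.0:
        return [0.0 for _ in z]
    out = []
    for zi in z:
        keep = rng.random() < abs(zi) / z_max
        sign = 1.0 if zi > 0 else -1.0
        out.append(sign * z_max if keep else 0.0)
    return out


def empirical_mean(compress, z, trials):
    """Average many compressed copies of z to check unbiasedness numerically."""
    acc = [0.0] * len(z)
    for _ in range(trials):
        for k, ck in enumerate(compress(z)):
            acc[k] += ck / trials
    return acc


rng = random.Random(0)
z = [4.0, -2.0, 1.0, 0.0]
mean_sparse = empirical_mean(lambda v: sparsifier(v, 0.5, rng), z, 20000)
mean_ternary = empirical_mean(lambda v: ternary(v, rng), z, 20000)
```

Both empirical means should land close to `z`, which is what unbiasedness predicts. Note that every nonzero ternary output has magnitude `max(|z_i|)`, so only one floating-point number plus one ternary symbol per entry needs to be transmitted.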

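Next, we introduce the notion of consensus matrix, which is denoted as $W$ in this paper. As will be seen later, the entries in $W$ define the weight parameters used by each node to perform local information aggregation. Mathematically, $W$ satisfies the following properties:

• Doubly Stochastic: $\sum_{i=1}^{N}[W]_{ij} = \sum_{j=1}^{N}[W]_{ij} = 1$.

• Symmetric: $[W]_{ij} = [W]_{ji}$, $\forall i, j$.

• Network-Defined Sparsity Pattern: $[W]_{ij} > 0$ if $(i,j) \in \mathcal{L}$ and $[W]_{ij} = 0$ otherwise, $\forall i \neq j$.

Collectively, the first two properties imply that the spectrum of $W$ (i.e., the set of all eigenvalues) lies in the interval $(-1, 1]$ on the real line, with exactly one eigenvalue being equal to 1. Further, since all eigenvalues are real, they can be sorted as $-1 < \lambda_N \le \cdots \le \lambda_2 < \lambda_1 = 1$. For convenience, we define a parameter $\beta = \max\{|\lambda_2|, |\lambda_N|\}$, i.e., the second-largest eigenvalue of $W$ in magnitude. Simply speaking, the use of the consensus matrix is due to the fact that $(W \otimes I_d)x = x$ if and only if $x_i = x_j$, $\forall (i,j) \in \mathcal{L}$ [8], where $x = [x_1^\top, \ldots, x_N^\top]^\top$ and $\otimes$ represents the Kronecker product. Therefore, Problem (2) can be reformulated as minimizing $\sum_{i=1}^{N} f_i(x_i)$ subject to $(W \otimes I_d)x = x$, which further leads to the original DGD algorithmic design [8].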
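To make the three properties concrete, the sketch below builds a consensus matrix for a ring network and checks them numerically. The ring topology, the weights (self weight 1/2, each neighbor 1/4), and all function names are hypothetical choices for illustration only; the eigenvalue formula is the standard closed form for a symmetric circulant matrix.

```python
import math


def ring_consensus_matrix(n, self_weight=0.5):
    """Consensus matrix for an n-node ring: each node weights itself by
    self_weight and splits the remainder equally between its two neighbors."""
    if n < 3:
        raise ValueError("a ring needs at least 3 nodes")
    side = (1.0 - self_weight) / 2.0
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = self_weight
        W[i][(i + 1) % n] = side
        W[i][(i - 1) % n] = side
    return W


def is_symmetric(W, tol=1e-12):
    n = len(W)
    return all(abs(W[i][j] - W[j][i]) <= tol for i in range(n) for j in range(n))


def is_doubly_stochastic(W, tol=1e-12):
    n = len(W)
    rows_ok = all(abs(sum(W[i]) - 1.0) <= tol for i in range(n))
    cols_ok = all(abs(sum(W[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return rows_ok and cols_ok


def ring_eigenvalues(n, self_weight=0.5):
    """Eigenvalues of the ring matrix above, sorted in descending order."""
    return sorted(
        (self_weight + (1.0 - self_weight) * math.cos(2.0 * math.pi * k / n)
         for k in range(n)),
        reverse=True,
    )


W = ring_consensus_matrix(5)
eigs = ring_eigenvalues(5)
beta = max(abs(eigs[1]), abs(eigs[-1]))  # second-largest eigenvalue in magnitude
lam_min = eigs[-1]                       # smallest eigenvalue
```

For this five-node ring, the largest eigenvalue is 1 and the remaining ones lie strictly inside the unit interval, as the text requires.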

With the notions of SNR-constrained unbiased stochastic compressors and consensus matrix, we are now in a position to present our DC-DGD algorithmic framework. To this end, we let $\mathcal{N}_i$ denote the set of local neighbors of node $i$. Then, our DC-DGD is stated as follows:

Algorithm 1: Differential-Coded Compressed Decentralized Gradient Descent Method (DC-DGD).

Initialization:

1. Set the initial state $x_{i,0} = y_{i,0} = 0$, $\forall i$.

2. Let $z_{i,1} = -\alpha_0 \nabla f_i(x_{i,0})$, $d_{i,1} = z_{i,1} - x_{i,0}$, and $t = 1$.

Main Loop:

3. In the $t$-th iteration, each node $i$ sends the differential-coded compressed information $C(d_{i,t})$ to its neighbors, where $C(\cdot)$ is an SNR-constrained stochastic unbiased compressor. Meanwhile, upon the reception of all neighbors’ information, each node $i$ performs the following updates:

   a) Local inexact update: $$x_{i,t} = x_{i,t-1} + C(d_{i,t}). \tag{3}$$

   b) Weighted local aggregation step: $$y_{i,t} = y_{i,t-1} + \sum_{j\in\mathcal{N}_i} [W]_{ij} C(d_{j,t}). \tag{4}$$

   c) Local gradient step: $$z_{i,t+1} = y_{i,t} - \alpha_t \nabla f_i(x_{i,t}). \tag{5}$$

   d) Local differential update: $$d_{i,t+1} = z_{i,t+1} - x_{i,t}. \tag{6}$$

4. Stop if some preferred convergence criterion is met; otherwise, let $t \leftarrow t+1$ and go to Step 3.
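The four updates (3)-(6) can be traced in a short simulation. The sketch below is illustrative only and rests on several assumptions that are not from the paper: scalar node states, all-zero initial states, a hypothetical five-node ring with hand-picked weights, hypothetical quadratic local objectives, a scalar sparsifier with keep probability 0.8, and an aggregation sum that includes each node's own weight so that the result matches the stacked vector form of the updates.

```python
import random


def dc_dgd(W, grads, alpha, compress, num_iters):
    """Simulate DC-DGD with scalar node states and a constant step-size."""
    n = len(W)
    x = [0.0] * n  # local copies
    y = [0.0] * n  # aggregated states
    z = [y[i] - alpha * grads[i](x[i]) for i in range(n)]
    d = [z[i] - x[i] for i in range(n)]
    for _ in range(num_iters):
        # Each node compresses its differential; this is all it transmits.
        c = [compress(d[i]) for i in range(n)]
        # (3) inexact local update from the node's own compressed differential
        x = [x[i] + c[i] for i in range(n)]
        # (4) weighted aggregation of compressed differentials
        #     (assumption: the sum includes the node's own weight W[i][i])
        y = [y[i] + sum(W[i][j] * c[j] for j in range(n)) for i in range(n)]
        # (5) local gradient step
        z = [y[i] - alpha * grads[i](x[i]) for i in range(n)]
        # (6) next differential
        d = [z[i] - x[i] for i in range(n)]
    return x, y


rng = random.Random(0)
KEEP_PROB = 0.8  # sparsifier SNR = 0.8 / 0.2 = 4


def scalar_sparsifier(v):
    return v / KEEP_PROB if rng.random() < KEEP_PROB else 0.0


n = 5
W = [[0.0] * n for _ in range(n)]
for i in range(n):
    W[i][i] = 0.5
    W[i][(i + 1) % n] = 0.25
    W[i][(i - 1) % n] = 0.25

b = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical f_i(x) = 0.5 * (x - b_i)^2
grads = [lambda v, bi=bi: v - bi for bi in b]

x, y = dc_dgd(W, grads, alpha=0.05, compress=scalar_sparsifier, num_iters=2000)
```

Two things are worth observing in the output: `y` stays equal to `W` applied to `x` at every iteration regardless of the compression noise, and the average of `x` approaches the minimizer of the sum of the local objectives (the mean of `b`, here 4) even though only compressed differentials are exchanged.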

Several important remarks on the DC-DGD algorithm are in order: 1) The combined update structure in Steps 3-b) and 3-c) is the same as the original DGD algorithm, which contains a weighted local aggregation step and a local gradient step. Notably, DC-DGD has only one parameter: the step-size $\alpha_t$ (same as DGD). Thus, DC-DGD has the same structural complexity as the original DGD.

2) DC-DGD is memory-efficient: In DC-DGD, each node only needs to store three local variables. This is in stark contrast to some DGD-based algorithms, e.g., ADC-DGD [13] and DCD-PSGD [12], where each node needs to store all values of the previous iteration from its neighbors, which does not scale to large and dense networks where node degrees are high.

3) Compared to the original DGD algorithm and many of its variants, a notable difference in DC-DGD is that the gradient in Step 3-c) is calculated based on an inexact update from $x_{i,t-1}$ and the compressed differential $C(d_{i,t})$ (i.e., Step 3-a)), rather than using an exact update. This design is derived from the convergence analysis of a chosen Lyapunov function (to be defined soon). Interestingly, we will show that this modification does not harm the algorithm’s convergence speed because the difference between inexact and exact updates is negligible when the Lyapunov function is near convergence.

Before we prove the convergence of DC-DGD, it is insightful to offer some intuitions on why DC-DGD retains most of the simple structural properties of the original DGD and does not need extra mechanism/parameter(s) to tame compression noises. First, we define the following Lyapunov function:

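$$L_{\alpha_t}(x) \triangleq \frac{1}{2} x^\top (I - W\otimes I_d)\, x + \alpha_t f(x). \tag{7}$$

We note that $L_{\alpha_t}(x)$ is also used for proving the convergence of several other DGD-based algorithms (e.g., [18, 19]). To understand our DC-DGD algorithm, we rewrite its updates in Steps 3-a) to 3-d) in the following vector form:

$$\begin{cases} x_t = x_{t-1} + C(d_t), \\ y_t = y_{t-1} + (W\otimes I_d)\, C(d_t), \\ z_{t+1} = y_t - \alpha_t \nabla f(x_t), \\ d_{t+1} = z_{t+1} - x_t, \end{cases} \tag{8}$$

where $x_t = [x_{1,t}^\top, \ldots, x_{N,t}^\top]^\top$ and $\nabla f(x_t) = [\nabla f_1(x_{1,t})^\top, \ldots, \nabla f_N(x_{N,t})^\top]^\top$. Note that with $y_0 = (W\otimes I_d)x_0$ we have $y_t = (W\otimes I_d)x_t$ by induction. Hence, we can rewrite the updates as:

$$\begin{cases} x_t = x_{t-1} + C(d_t) = x_{t-1} + z_t - x_{t-1} + \epsilon_t = z_t + \epsilon_t, \\ z_{t+1} = (W\otimes I_d)\, x_t - \alpha_t \nabla f(x_t) = x_t - \nabla L_{\alpha_t}(x_t), \\ d_{t+1} = z_{t+1} - x_t = -\nabla L_{\alpha_t}(x_t), \end{cases}$$

where $\epsilon_t$ is a compression noise satisfying $\mathbb{E}[\epsilon_t] = 0$ and $\mathbb{E}[\|\epsilon_t\|^2] \le \|d_t\|^2/\eta$. That is, the power of the noise depends on the difference between two successive iterations, which in turn is the gradient of the Lyapunov function. As the algorithm converges (to be proved soon), $\|\nabla L_{\alpha_t}(x_t)\| \to 0$ implies that $\mathbb{E}[\|\epsilon_t\|^2] \to 0$. Hence, no extra effort is required to tame the noise power thanks to this self-compression-noise-power-reduction effect.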
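The step that identifies the gradient-step output with a Lyapunov gradient step can be spelled out as follows; it uses only the definition of the Lyapunov function in (7) and the symmetry of the consensus matrix $W$:

$$\nabla L_{\alpha_t}(x) = (I - W\otimes I_d)\,x + \alpha_t \nabla f(x) \quad\Longrightarrow\quad x_t - \nabla L_{\alpha_t}(x_t) = (W\otimes I_d)\,x_t - \alpha_t \nabla f(x_t) = z_{t+1}.$$

In other words, each noiseless DC-DGD iteration is a unit-step gradient descent step on the Lyapunov function, and the transmitted differential is exactly the negative Lyapunov gradient.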

### III-C Main Theoretical Results

In this subsection, we will establish the convergence of the proposed DC-DGD algorithm. Our convergence results are proved under the following mild assumptions:

###### Assumption 1.

The local objective functions $f_i(\cdot)$ satisfy:

• (Lower boundedness) There exists an optimal $x^*$ with $\|x^*\| < \infty$ such that $f(x) \ge f(x^*)$, $\forall x$;

• (Lipschitz continuous gradient) there exists a constant $L > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$, $\forall x, y$ and $\forall i$;

• (Bounded gradient) there exists a constant $D > 0$ such that for all $i$, $\|\nabla f_i(x)\| \le D$, $\forall x$.

Note that the first two bullets are standard in convergence analysis: The first one ensures the existence of an optimal solution and the second guarantees the smoothness of the local objectives. The third bullet is needed to bound the deviation of local copies from their mean (cf. Theorem 2). It is equivalent to $f_i$ being $D$-Lipschitz continuous. This mild assumption has been widely adopted in analyzing non-convex optimization algorithms in the literature (see, e.g., [20, 21, 22]).

To show the convergence of DC-DGD, we will show that the iterates and the gradient are bounded over all iterations, and the summation of the gradients of the Lyapunov function over the iterations is also bounded.

###### Theorem 1.

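Under Assumption 1, if a constant step-size $\alpha < \frac{1}{L}\left(\lambda_N + \frac{\eta-1}{\eta+1}\right)$ is used, where $\eta$ is the SNR threshold satisfying $\eta > \frac{1-\lambda_N}{1+\lambda_N}$, then the sum of the gradients of the Lyapunov function is bounded, i.e.,

$$\sum_{\tau=0}^{t} \mathbb{E}\left[\|\nabla L_\alpha(x_\tau)\|^2\right] \le \frac{2\alpha\left(f(0) - f(x^*)\right)}{1 + \lambda_N - \alpha L - (1 - \lambda_N + \alpha L)/\eta}.$$

Note that Theorem 1 has a key condition on the SNR threshold: $\eta > (1-\lambda_N)/(1+\lambda_N)$. This SNR lower bound guarantees a nonempty feasible domain for the step-size $\alpha$. Interestingly, it can be seen that as $\lambda_N \to 1$ (i.e., a sparse consensus matrix $W$), the lower bound for the SNR shrinks to zero, meaning that as the network gets sparser, we could adopt compressors with larger compression ratios.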
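The interplay between the SNR threshold and the step-size can be checked numerically. The helper names below are hypothetical, and the two closed forms are obtained simply by requiring the denominator of the bound in Theorem 1, namely 1 + λ_N − αL − (1 − λ_N + αL)/η, to be positive; they are a sketch derived from that expression rather than quoted conditions.

```python
def theorem1_denominator(lam_min, alpha, lipschitz, eta):
    """Denominator of the bound in Theorem 1; it must be positive."""
    return (1.0 + lam_min - alpha * lipschitz
            - (1.0 - lam_min + alpha * lipschitz) / eta)


def snr_lower_bound(lam_min):
    """SNR below which the denominator is non-positive for every alpha > 0."""
    return (1.0 - lam_min) / (1.0 + lam_min)


def max_step_size(lam_min, lipschitz, eta):
    """Step-size at which the denominator hits zero (exclusive upper limit)."""
    return (lam_min + (eta - 1.0) / (eta + 1.0)) / lipschitz


# Hypothetical numbers: smallest eigenvalue 0.1, L = 1, compressor SNR 4.
lam_min, lipschitz, eta = 0.1, 1.0, 4.0
alpha_max = max_step_size(lam_min, lipschitz, eta)
```

For example, with these hypothetical numbers the SNR must exceed 0.9/1.1 ≈ 0.82, and any constant step-size below `alpha_max` = 0.7 keeps the bound finite. As the smallest eigenvalue approaches 1, `snr_lower_bound` tends to 0, matching the remark above.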

Next, we bound the deviation of each local copy from the mean of all local copies in any iteration $t$:

###### Theorem 2.

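Under Assumption 1 and with the same step-size and SNR selections as in Theorem 1, in each iteration $t$, the deviations of local copies from the mean can be bounded as:

$$\mathbb{E}\left[\|x_t - \bar{x}_t\|^2\right] \le \left(\frac{\alpha N D}{1-\beta}\right)^2 + \sum_{\tau=1}^{t} \beta^{2(t-\tau)}\, \mathbb{E}\left[\|\nabla L_\alpha(x_{\tau-1})\|^2\right]/\eta,$$

where $\bar{x}_t = (1/N)\mathbf{1}\mathbf{1}^\top x_t$ and $\beta = \max\{|\lambda_2|, |\lambda_N|\}$.

Theorem 2 requires that $\mathbb{E}[\|\nabla L_\alpha(x_\tau)\|^2]$ is bounded, which is guaranteed by Theorem 1. Lastly, based on Theorems 1 and 2, we show that DC-DGD converges to an error ball of the global objective’s stationary point at rate $O(1/t)$: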

###### Theorem 3.

Under Assumption 1, if the step-size satisfies then it holds that

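$$\sum_{\tau=0}^{t} \mathbb{E}\left[\|\nabla f(\bar{x}_\tau)\|^2\right] \le C_1(\alpha,\beta)\left[f(0) - f(x^*)\right] + \frac{\alpha^2 N^2 D^2 L}{(1-\beta)^2}\, t,$$

where $C_1(\alpha,\beta)$ is a constant that depends on $\alpha$ and $\beta$. Thus, DC-DGD converges at rate $O(1/t)$ to an error ball that depends on the step-size and the network parameters:

$$\min_{\tau=0,\cdots,t} \mathbb{E}\left[\|\nabla f(\bar{x}_\tau)\|^2\right] \le \frac{C_1(\alpha,\beta)\left[f(0) - f(x^*)\right]}{t} + \frac{\alpha^2 N^2 D^2 L}{(1-\beta)^2}.$$

Note that in Theorem 3, similar to the original DGD algorithm, the size of the error ball is determined by two terms: The first one is a convergence error with sublinear diminishing rate $O(1/t)$; the second term is the approximation error affected by the step-size and the network structure (characterized by $N$ and $\beta$). Therefore, to reach an optimal solution, the step-size needs to be small so that the second term is close to zero. However, as $\alpha \to 0$, the coefficient of the convergence error grows, which in turn requires more iterations for shrinking the first term.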

The next result shows that with diminishing step-size DC-DGD converges to a first-order stationary point (optimal solution in convex problems) at rate :

###### Corollary 1.

Let , where and , then the convergence rate of DC-DGD is:

### III-D Proofs of the Main Theoretical Results

Due to space limitation, we provide proof sketches of the main theoretical results in this subsection.

###### Proof Sketch of Theorem 1.

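Let $\mathcal{F}_t$ denote a filtration. It can be shown that the Lyapunov function $L_\alpha(x)$ has $(1-\lambda_N+\alpha L)$-Lipschitz gradients. It then follows that:

$$L_\alpha(x_{t+1}) \le L_\alpha(x_t) - \langle \nabla L_\alpha(x_t), \nabla L_\alpha(x_t) - \epsilon_{t+1} \rangle + \frac{1-\lambda_N+\alpha L}{2}\left[\|\nabla L_\alpha(x_t)\|^2 + \|\epsilon_{t+1}\|^2 - 2\langle \nabla L_\alpha(x_t), \epsilon_{t+1}\rangle\right].$$

Taking conditional expectation and using the properties of SNR-constrained unbiased compressors, i.e., $\mathbb{E}[\epsilon_{t+1}|\mathcal{F}_t] = 0$ and $\mathbb{E}[\|\epsilon_{t+1}\|^2|\mathcal{F}_t] \le \|\nabla L_\alpha(x_t)\|^2/\eta$, yield:

$$\mathbb{E}\left[L_\alpha(x_{t+1})\,|\,\mathcal{F}_t\right] \le L_\alpha(x_t) + \frac{1}{2}\left[\alpha L - \lambda_N - 1 + (1-\lambda_N+\alpha L)/\eta\right]\|\nabla L_\alpha(x_t)\|^2.$$

Then, by setting the step-size as stated in the theorem, we have $\alpha L - \lambda_N - 1 + (1-\lambda_N+\alpha L)/\eta < 0$. Taking full expectation on both sides and telescoping from $\tau = 0$ to $t$, we have:

$$-\left[\alpha L - \lambda_N - 1 + (1-\lambda_N+\alpha L)/\eta\right] \times \sum_{\tau=0}^{t} \mathbb{E}\left[\|\nabla L_\alpha(x_\tau)\|^2\right] \le 2\left(L_\alpha(x_0) - \mathbb{E}\left[L_\alpha(x_{t+1})\right]\right). \tag{9}$$

Since $L_\alpha(x_0) = \alpha f(0)$ and $L_\alpha(x_{t+1}) \ge \alpha f(x^*)$, after rearranging terms, we can conclude that:

$$\sum_{\tau=0}^{t} \mathbb{E}\left[\|\nabla L_\alpha(x_\tau)\|^2\right] \le \frac{2\alpha\left(\sum_{i=1}^{N} f_i(0) - \sum_{i=1}^{N} f_i(x^*)\right)}{1 + \lambda_N - \alpha L - (1-\lambda_N+\alpha L)/\eta},$$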

and the proof is complete. ∎

###### Proof Sketch of Theorem 2.

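For notational convenience, we let $\tilde{W} = W \otimes I_d$. From (8), we can obtain:

$$\begin{cases} x_1 = \tilde{W}x_0 - \alpha_0\nabla f(x_0) - \epsilon_1 = -\alpha_0\nabla f(x_0) - \epsilon_1, \\ x_2 = \tilde{W}x_1 - \alpha_1\nabla f(x_1) - \epsilon_2 = -\tilde{W}\alpha_0\nabla f(x_0) - \alpha_1\nabla f(x_1) - \tilde{W}\epsilon_1 - \epsilon_2, \\ \quad\vdots \\ x_t = -\sum_{\tau=0}^{t-1} \alpha\tilde{W}^{t-\tau-1}\nabla f(x_\tau) - \sum_{\tau=1}^{t} \tilde{W}^{t-\tau}\epsilon_\tau. \end{cases}$$

Using the above equations, we can derive the following inequality for the deviation from the mean $\bar{x}_t$:

$$\|x_t - \bar{x}_t\|^2 = \|x_t - (1/N)\mathbf{1}\mathbf{1}^\top x_t\|^2 \le 2\Big\|\sum_{\tau=0}^{t-1} \alpha\big(\tilde{W}^{t-\tau-1} - (1/N)\mathbf{1}\mathbf{1}^\top\big)\nabla f(x_\tau)\Big\|^2 + 2\sum_{\tau=1}^{t} \big\|\big(\tilde{W}^{t-\tau} - (1/N)\mathbf{1}\mathbf{1}^\top\big)\epsilon_\tau\big\|^2.$$

Taking the expectation on both sides, noting that $\|\tilde{W}^{t-\tau} - (1/N)\mathbf{1}\mathbf{1}^\top\| \le \beta^{t-\tau}$, and after some algebraic manipulations, we arrive at:

$$\mathbb{E}\left[\|x_t - \bar{x}_t\|^2\right] \le \left(\frac{\alpha N D}{1-\beta}\right)^2 + \sum_{\tau=1}^{t} \beta^{2(t-\tau)}\, \mathbb{E}\left[\|\nabla L_\alpha(x_{\tau-1})\|^2\right]/\eta,$$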

which completes the proof. ∎

###### Proof Sketch of Theorem 3.

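First, we prove a key descending inequality on $\bar{f}(\bar{x}_t)$. From the update rule $x_{t+1} = \tilde{W}x_t - \alpha\nabla f(x_t) - \epsilon_{t+1}$, we have $\bar{x}_{t+1} = \bar{x}_t - \frac{\alpha}{N}\sum_{i=1}^{N}\nabla f_i(x_{i,t}) - \bar{\epsilon}_{t+1}$. It then follows that:

$$\bar{f}(\bar{x}_{t+1}) \le \bar{f}(\bar{x}_t) - \Big\langle \nabla\bar{f}(\bar{x}_t), \frac{\alpha}{N}\sum_{i=1}^{N}\nabla f_i(x_{i,t}) + \bar{\epsilon}_{t+1} \Big\rangle + \frac{L}{2}\Big[\Big\|\frac{\alpha}{N}\sum_{i=1}^{N}\nabla f_i(x_{i,t})\Big\|^2 + \|\bar{\epsilon}_{t+1}\|^2 + 2\Big\langle \frac{\alpha}{N}\sum_{i=1}^{N}\nabla f_i(x_{i,t}), \bar{\epsilon}_{t+1} \Big\rangle\Big],$$

where $\bar{\epsilon}_{t+1}$ is the compression noise averaged over all nodes. Taking the conditional expectation on both sides and after some algebraic manipulations, we can show that

$$\mathbb{E}\left[\bar{f}(\bar{x}_{t+1})\,|\,\mathcal{F}_t\right] \le \bar{f}(\bar{x}_t) - \frac{\alpha}{2}\|\nabla\bar{f}(\bar{x}_t)\|^2 + \frac{\alpha}{2}\Big\|\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_{i,t}) - \nabla\bar{f}(\bar{x}_t)\Big\|^2 + \frac{L}{2N^2\eta}\|\nabla L_\alpha(x_t)\|^2.$$

Taking the full expectation, telescoping the inequality from $\tau = 0$ to $t$, and after further algebraic manipulations, we have:

$$\frac{\alpha}{2}\sum_{\tau=0}^{t} \mathbb{E}\left[\|\nabla f(\bar{x}_\tau)\|^2\right] \le \left[\frac{1}{N} + \left(\frac{\alpha L}{(1-\beta^2)N^2\eta} + \frac{L}{2N^2\eta}\right) \times \frac{2\alpha}{1+\lambda_N-\alpha L-(1-\lambda_N+\alpha L)/\eta}\right]\left[f(0) - f(x^*)\right] + \frac{\alpha^3 D^2 L\, t}{(1-\beta)^2},$$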

which, after further rearrangements, yields the result stated in the theorem. This completes the proof. ∎

## IV A Hybrid Compression Design under DC-DGD for Communication Cost Minimization

Inspired by the previous theoretical insights, in this section, our goal is to design a hybrid SNR-constrained compression scheme that achieves high communication cost savings while having a controllable SNR. Recall from Section III-B that the sparsifier can control the compression noise power by adjusting the probability $p$, and the expected communication cost for a -dimensional vector is where is the cost for sending a floating number and is the cost for value . Therefore, if the SNR threshold is large, the communication cost will be close to sending uncompressed copy . For the ternary operator, its compression noise power is which is not directly controllable by any parameter. The communication cost is where is the cost for the ternary values .

In general, the communication cost of a ternary-compressed vector is much smaller than that of the sparse-compressed vector: For example, if using -bit floating numbers and one bit for the zero value, the cost for a -dimensional sparse compressed vector is . In contrast, for the ternary operator, the cost will be if using -bit floating numbers and two bits for the ternary values. With a larger SNR threshold (i.e., larger ) and high dimensionality , the communication cost of the ternary compressor is much smaller. Therefore, to have a controllable compression noise power as well as high communication cost savings, a promising solution is to combine the sparse and the ternary compressors.
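A back-of-the-envelope comparison makes the gap tangible. The cost models below are assumptions made for illustration: 32-bit floating-point numbers, one bit per zero entry for the sparsifier, two bits per ternary symbol plus one 32-bit float for the shared magnitude, and a sparsifier keep probability chosen as p = η/(1 + η) so that its SNR p/(1 − p) equals the target η. The function names are hypothetical.

```python
def keep_prob_for_snr(eta):
    """Sparsifier keep probability whose SNR p / (1 - p) equals eta."""
    return eta / (1.0 + eta)


def sparsifier_bits(d, p, float_bits=32, zero_bits=1):
    """Expected bits for a d-dimensional sparsified vector (assumed model):
    a kept entry costs one float, a dropped entry costs zero_bits."""
    return d * (p * float_bits + (1.0 - p) * zero_bits)


def ternary_bits(d, float_bits=32, symbol_bits=2):
    """Bits for a d-dimensional ternary vector (assumed model):
    one float for the shared magnitude plus symbol_bits per entry."""
    return float_bits + d * symbol_bits


d = 1000
p = keep_prob_for_snr(4.0)           # 0.8
cost_sparse = sparsifier_bits(d, p)  # about 25,800 bits
cost_ternary = ternary_bits(d)       # 2,032 bits
```

Under these assumed models, the ternary code is more than ten times cheaper at this SNR, which is the motivation for routing as many entries as possible through the ternary branch while still meeting the SNR constraint.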

To this end, consider a $d$-dimensional vector $z$. We can sort and rearrange the elements of $z$ in descending order of magnitude to have $z_{[1]}, z_{[2]}, \ldots, z_{[d]}$, with $|z_{[1]}| \ge |z_{[2]}| \ge \cdots \ge |z_{[d]}|$. For the first $s_1$ largest elements, we apply the ternary compressor, while for the rest of the elements, we use the sparse compressor, i.e.,

$$\underbrace{z_{[1]},\ z_{[2]},\ \cdots,\ z_{[s_1-1]},\ z_{[s_1]}}_{\text{ternary compression}},\ \underbrace{z_{[s_1+1]},\ \cdots,\ z_{[d-1]},\ z_{[d]}}_{\text{sparsifier compression}} \ \Rightarrow\ \underbrace{z_{[1]},\ 0,\ \cdots,\ -1,\ 1}_{\text{ternary compressed}},\ \underbrace{\frac{z_{[s_1+1]}}{p},\ \cdots,\ 0,\ \frac{z_{[d]}}{p}}_{\text{sparsifier compressed}}$$

As a result, the compression noise power levels of the first $s_1$ largest elements and the rest are $|z_{[i]}|(|z_{[1]}| - |z_{[i]}|)$ and $(1/p-1)z_{[i]}^2$, respectively. In order to ensure that the effective SNR of the hybrid scheme is greater than $C$ for some lower bound $C$, we have:

$$\text{(ternary)}: \quad |z_{[i]}|\left(|z_{[1]}| - |z_{[i]}|\right) < (1/C)\, z_{[i]}^2, \quad \forall i \le s_1, \tag{10}$$

$$\text{(sparsifier)}: \quad (1/p - 1)\, z_{[i]}^2 < (1/C)\, z_{[i]}^2, \quad \forall i > s_1. \tag{11}$$

To satisfy (10) and (11), we have and , respectively. Then, on average, the compressed vector has floating numbers and ternary values, which is more efficient compared to that under the sparsifier compressor.
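Rearranging (10) gives |z_[i]| > C/(C+1) · |z_[1]| for the ternary branch, and rearranging (11) gives p > C/(C+1) for the sparsifier branch. The sketch below applies these two conditions in a single-anchor hybrid compressor. It is an illustration under assumptions, not the paper's implementation: the function name and test vector are hypothetical, and thresholding on magnitude is used in place of an explicit sort, which selects the same leading elements.

```python
import random


def hybrid_compress(z, snr_c, p, rng):
    """Single-anchor hybrid compressor sketch.

    Entries with |z_i| > C/(C+1) * max|z| go through the ternary operator
    (anchored at max|z|); all other entries go through the sparsifier with
    keep probability p, which must exceed C/(C+1).
    """
    ratio = snr_c / (snr_c + 1.0)
    if not ratio < p <= 1.0:
        raise ValueError("p must satisfy C/(C+1) < p <= 1")
    z_max = max(abs(v) for v in z)
    threshold = ratio * z_max  # from rearranging (10)
    out = []
    for v in z:
        if z_max > 0.0 and abs(v) > threshold:
            # Ternary branch: +/- z_max with probability |v| / z_max, else 0.
            keep = rng.random() < abs(v) / z_max
            sign = 1.0 if v > 0 else -1.0
            out.append(sign * z_max if keep else 0.0)
        else:
            # Sparsifier branch: v / p with probability p, else 0.
            out.append(v / p if rng.random() < p else 0.0)
    return out


rng = random.Random(0)
z = [10.0, 9.5, -9.9, 1.0, -0.5]
compressed = hybrid_compress(z, snr_c=4.0, p=0.9, rng=rng)
```

With C = 4 the magnitude threshold is 8, so the first three entries are ternary-coded and the last two are sparsified. The anchor itself is always reproduced exactly because its keep probability is 1.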

In fact, the hybrid compression idea above can be generalized to achieve further communication cost savings: Instead of just using $z_{[1]}$ for the ternary compression, we can select multiple “anchor elements.” There are elements between and . Thus, a -dimensional vector can be partitioned into groups. For the elements with indices in , we apply the ternary compressor based on . For the remaining elements, we apply the sparsifier operator. Similar to (10), we have

$$|z_{[j]}|\left(|z_{[q_i]}| - |z_{[j]}|\right) < (1/C)\, z_{[j]}^2, \quad \forall j \in (q_i, q_i + s_i). \tag{12}$$

Then, the compressed vector has floating and ternary values. Moreover, we need to save the indices of the anchor elements, for which we need bits per element.

Given an SNR threshold, the communication saving of our hybrid compression scheme is highly dependent on the group number and the positions of the anchor elements, which can be optimized by solving an integer programming problem. Take 32-bit floating numbers and 2-bit ternary values as an example. To achieve the maximum communication saving, the group number and the locations of the anchor elements can be determined by solving:

 (13)

Problem (13) is an integer optimization problem, which can be shown to be equivalent to bin packing problems, thus being NP-hard. However, an efficient greedy heuristic algorithm can be developed by leveraging the special problem structure. Specifically, we note that the objective function is increasing and decreasing with respect to and , respectively. Therefore, we can find anchor points and their corresponding ternary sets (of size ) by checking (12); if the ternary cost of the elements is smaller than the sparsifier cost, we remove these elements from the current vector; otherwise, we use the sparsifier compressor on the current vector. We summarize the greedy algorithm as follows:

Algorithm 2: A greedy algorithm for solving Problem (13).

Initialization:

1. Sort and rearrange the elements of vector in descending order of magnitude.

2. Let . Set the ternary set as empty.

Main Loop:

3. Inner Loop:

   a) For each element , find the set: .

   b) Set and .

4. Compare the ternary cost with the sparsifier cost.

5. If the ternary cost is smaller, then remove the corresponding elements from the current vector and add them to the ternary set, let and go to Step 3; otherwise, break the loop.

Final Step:

6. Apply the ternary operator to each group in the ternary set and the sparsifier operator to the remaining elements.

Now, we analyze the running time complexity of the greedy algorithm. First of all, the sorting requires time. The worst-case number of iterations in the main loop is while in each inner loop, it takes steps to find the ternary set for each element. Hence, the overall time-complexity of Algorithm 2 is .

## V Numerical Results

In this section, we perform extensive numerical experiments to validate the performances of our proposed DC-DGD algorithm and the hybrid compression scheme.

1) Convergence of DC-DGD: In this simulation, we adopt the sparsifier compression in Example 1 and vary the probability parameter to induce different SNR threshold values. Consider a five-node circle network in Fig. 1 with the global objective function: , where

 fi(x)={log