I. Introduction
Network-distributed optimization, a canonical topic dating back to [1], has received significant interest in recent years thanks to its ever-increasing applications, e.g., distributed learning [2, 3, 4], multi-agent systems [5], resource allocation [6], localization [7], etc. All these applications involve geographically dispersed datasets that are too big to aggregate due to high communication costs or privacy/security risks, hence necessitating distributed optimization over the network. A notable feature of network-distributed optimization is the lack of shared memory due to the absence of a dedicated parameter server, a key component of the hierarchical distributed master/slave architecture. As a result, every node can only exchange and aggregate information with its local neighbors to reach a consensus on a globally optimal decision.
In the literature, a classic algorithm for solving network-distributed optimization problems is the decentralized gradient descent (DGD) method proposed by Nedic and Ozdaglar [8]. The enduring popularity of DGD lies in its simple gossip-like structure, which can be easily implemented in networks. Specifically, in each iteration, the update at each node combines a weighted average of the state information from its local neighbors (obtained by gossiping) with a gradient step based on its own local objective function and state information. Further, DGD achieves the same convergence rate as the centralized gradient descent method, implying that distributed computation does not sacrifice the convergence rate.
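To make the gossip-plus-gradient structure concrete, the following is a minimal sketch of one DGD iteration; the function name `dgd_step` and the toy quadratic setup are ours, not from the paper:

```python
import numpy as np

def dgd_step(X, grads, W, alpha):
    """One DGD iteration.

    X     : (n, d) array; row i is node i's local copy of the decision variable
    grads : (n, d) array; row i is node i's local gradient evaluated at X[i]
    W     : (n, n) doubly stochastic consensus matrix
    alpha : gradient step size
    """
    # Each node averages its neighbors' states (gossip) and then takes a
    # local gradient step -- the two ingredients of DGD.
    return W @ X - alpha * grads
```

With a constant step size, the copies reach consensus around the optimum up to an error ball whose size scales with the step size, matching the behavior discussed later in the paper.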
However, despite the aforementioned salient features, a major limitation of the DGD method is that it requires full information exchange of the state variables between nodes. Hence, the DGD algorithm is communication-inefficient when solving large-size, high-dimensional optimization problems in networks with low-speed communication links. For example, consider a distributed image regression problem over a satellite network, where each satellite holds images of typical resolution [9]. In this case, the parameter dimension is large and the communication load per DGD iteration is on the order of megabytes (with 32-bit floating-point representation). This is problematic for many satellite networks with low-speed RF (radio frequency) links (typically in the range of hundreds of Mbps [10]). To improve DGD's communication efficiency, recent years have seen a line of research based on exchanging compressed information between nodes (see, e.g., [11, 12, 13, 14]). Specifically, by leveraging various compression techniques (e.g., quantization/rounding [15], sparsification [16]), a high-dimensional state space can be represented by a small codebook, hence alleviating the communication load in the network.
However, although progress has been made to various extents, most of the existing works on compressed DGD algorithms suffer from the following key limitations (see Section II for more in-depth discussions): 1) extra parameter tuning resulting from far more complex algorithmic structures compared to DGD; 2) restrictive assumptions that compressors have bounded compression noise power; 3) slow convergence that is sensitive to the problem structure; 4) strong i.i.d. (independent and identically distributed) assumptions on the datasets at different locations, which often do not hold in practice. In addition, most of the existing works simply treat compressors as "black-box operators" and do not consider how to minimize the communication load with specific compression coding scheme designs. In light of the ever-increasing demand for large-scale network-distributed data analytics, the above limitations motivate us to develop new compression-based algorithms for communication-efficient network-distributed optimization.
The major contribution of this paper is that we propose a differential-coded compression-based DGD algorithmic framework (DC-DGD), which overcomes the above limitations and offers significant improvements over the existing works. Moreover, based on the proposed DC-DGD framework, we propose a hybrid compression scheme that integrates gradient sparsification and ternary operators, which enables dynamic communication load minimization. Our main technical results and their significance are summarized as follows:

We propose a new differential-coded DC-DGD algorithmic framework, where "differential-coded" means that the information exchanged between nodes is the differential between two successive iterations of the variables, rather than the variables themselves. We show that DC-DGD allows us to work with a wide range of general compressors that are only constrained by their SNR (signal-to-noise ratio) and thus could have unbounded noise power. The use of SNR-constrained compressors relaxes the commonly adopted assumption of bounded compression noise power in the literature [11, 12, 13]. More specifically, we show that if a compressor's SNR is greater than a threshold determined by the smallest eigenvalue of the consensus matrix used in all DGD-type algorithms, then our DC-DGD algorithm achieves the same convergence rate as the original DGD method.
Not only does the use of SNR-constrained compressors make our DC-DGD framework more general and practical, it also induces a nice "self-compression-noise-power-reduction effect" that keeps the algorithmic structure of DC-DGD simple. More specifically, based on a quadratic Lyapunov function of the consensus form of the optimization problem, we show that the accumulated compression noise under DC-DGD shrinks to zero under SNR-constrained compressors and differential-coded information exchange. Hence, there is no need to introduce extra mechanisms or parameters to tame the accumulated compression noise to ensure convergence. As a result, DC-DGD enjoys the same low complexity and efficient convergence rate as the original DGD method.

The insights on the relationship between DC-DGD and SNR-constrained compressors further inspire us to develop a hybrid compression scheme that integrates gradient sparsification and ternary operators to obtain a controllable SNR and a high compression ratio simultaneously. The proposed hybrid compression scheme achieves the best of both worlds through a carefully designed mechanism to minimize the communication load. Specifically, under the hybrid compressor, the communication load minimization can be formulated as an integer programming problem. Based on the special problem structure, we show that the problem can be solved efficiently by a greedy algorithm.
Our results in this paper contribute to the state of the art of theory and algorithm design for communication-efficient network-distributed optimization. The rest of the paper is organized as follows. In Section II, we further review related works on compressed DGD-based optimization algorithms. In Section III, we first present our DC-DGD algorithm and then analyze its convergence guarantees. In Section IV, we develop a family of hybrid operators and propose a greedy algorithm to choose the optimal hybrid operator. Numerical results are provided in Section V. We conclude the paper in Section VI.
II. Related Works
As mentioned earlier, compression-based DGD algorithms have received increasing attention in recent years. In this section, we provide a more in-depth survey of the state of the art in this area to put our work into comparative perspective. Broadly speaking, compression-based DGD algorithms can be categorized as follows (some fall into multiple categories):
1) Uncoded Noise-Power-Constrained Compressed DGD: In the literature, most of the early attempts at compressed DGD focused on noise-power-constrained compressors, which are easier to analyze. One notable recent work is the QDGD method proposed by Reisizadeh et al. [11]. The main idea of QDGD is to introduce a scaled aggregation of compressed local copies coupled with a scaled local gradient step, where an extra diminishing parameter is introduced in each iteration to dampen the noise power. However, due to the timid gradient step-size (relative to the original local gradient step-size in DGD), the convergence rate of QDGD is much slower than that of the original DGD. Also, the algorithm is more complex to use than DGD due to the sensitivity in tuning the extra parameter. Moreover, QDGD focused on strongly convex cases, and it is unclear whether its performance results can be straightforwardly extended to nonconvex cases.
2) Differential-Coded DGD with Noise-Power-Constrained Compressors: Another, more recently emerging line of research is the differential-coded DGD approach. For example, in [12], Tang et al. proposed the ECD-PSGD algorithm, where extrapolated information is used in each iteration to reduce the compression noise. However, it requires computing an optimized step-size in each iteration, which leads to high per-iteration complexity. Also, the convergence rate of ECD-PSGD is slower than that of the original DGD and its stochastic variant. Another notable example is the ADC-DGD algorithm proposed by Zhang et al. [13], where amplified differential-coded information is used in each iteration. It is shown in [13] that ADC-DGD achieves the same convergence rate as the original DGD. However, ADC-DGD runs the risk of arithmetic overflow due to its asymptotically unbounded amplification factor. This extra parameter selection also makes ADC-DGD more complex to use than DGD.
3) Differential-Coded DGD with SNR-Constrained Compressors: The most related algorithm to ours is the DCD-PSGD algorithm proposed by Tang et al. in [12], which is by far the only other differential-coded algorithm that can work with SNR-constrained compressors. Although DCD-PSGD shares the above similarities with our work, our DC-DGD algorithm differs from DCD-PSGD in the following key aspects: i) DCD-PSGD is designed for parallel training, where a key assumption is that the data at each node are i.i.d., which guarantees that the local objectives are identical. Our work relaxes this assumption and allows the local objectives to be non-identically distributed. ii) The final output of DCD-PSGD is the average over all nodes in the network, which could be difficult to implement in network-distributed settings. In contrast, DC-DGD does not require such averaging at the final output, since each node reaches a globally optimal consensus. iii) Although both algorithms work with SNR-constrained compressors, the SNR lower bound of our DC-DGD depends on the smallest eigenvalue of the consensus matrix, and it can be readily verified that our SNR lower bound is much smaller than that of DCD-PSGD, which implies that our DC-DGD can work with more aggressive compression schemes. iv) To achieve its best convergence rate, DCD-PSGD requires an optimal step-size determined by a set of complex parameters (cf. the step-size in Theorem 1 and Corollary 2 in [12]), which is hard to implement in practice. In contrast, the step-size selection in our DC-DGD uses a simple sublinearly diminishing series and is easy to implement.
III. Differential-Coded Decentralized Gradient Descent with SNR-Constrained Compressors
In this section, we first present the problem formulation of network-distributed optimization in Section III-A. Then, we present our DC-DGD algorithm in Section III-B and its main theoretical results in Section III-C. Lastly, we provide proof sketches for the main theoretical results in Section III-D.
III-A. Problem Formulation of Network-Distributed Optimization
We use an undirected connected graph $\mathcal{G} = (\mathcal{N}, \mathcal{L})$ to represent a network, where $\mathcal{N}$ and $\mathcal{L}$ are the sets of nodes and links, respectively, with $|\mathcal{N}| = N$ and $|\mathcal{L}| = L$. We let $\mathbf{x} \in \mathbb{R}^d$ denote a global decision vector to be optimized. In network-distributed optimization, we want to distributively solve the network-wide optimization problem $\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$, where $f(\mathbf{x})$ can be decomposed node-wise as follows:

$f(\mathbf{x}) = \frac{1}{N} \sum_{i \in \mathcal{N}} f_i(\mathbf{x})$,  (1)
where each local objective function $f_i(\cdot)$ is only observable to node $i$. Problem (1) has many real-world applications. For example, in the satellite network image regression problem in Section I, each satellite $i$ distributively collects image data $\{(\mathbf{a}_{ij}, \mathbf{b}_{ij}, c_{ij}), j = 1, \ldots, m_i\}$, where $\mathbf{a}_{ij}$, $\mathbf{b}_{ij}$, and $c_{ij}$ represent the pixels, geographical information, and ground-truth label of the $j$-th image at satellite $i$, respectively, and $m_i$ is the size of the local dataset. Suppose that the regression is based on a linear model with parameter vector $\mathbf{x}$. Then, the problem can be written in the form of Problem (1), where $f_i(\cdot)$ is the local regression loss over satellite $i$'s dataset. Note that Problem (1) can be written as the following equivalent consensus form:
Minimize $\frac{1}{N} \sum_{i \in \mathcal{N}} f_i(\mathbf{x}_i)$  (2)
subject to $\mathbf{x}_i = \mathbf{x}_j$, $\forall (i, j) \in \mathcal{L}$,
where $\mathbf{x}_i$ is the local copy of $\mathbf{x}$ at node $i$. The constraints in Problem (2) guarantee that all local copies are equal to each other, hence the name consensus form.
III-B. The DC-DGD Algorithm
To facilitate the presentation of our DC-DGD algorithm, we first need to formally define two technical notions. The first one is the SNR-constrained unbiased stochastic compressor:
Definition 1 (SNR-Constrained Stochastic Unbiased Compressor).
A stochastic compression operator $C(\cdot)$ is said to be unbiased and constrained by an SNR threshold $\eta > 0$ if it satisfies $\mathbb{E}[C(\mathbf{x})] = \mathbf{x}$ and $\mathbb{E}[\|C(\mathbf{x}) - \mathbf{x}\|^2] \le \|\mathbf{x}\|^2 / \eta$, $\forall \mathbf{x}$.
We can see from Definition 1 that, for a given compressor, $\eta$ is its lowest SNR, yielded by its largest compression noise power $\|\mathbf{x}\|^2 / \eta$. We note that SNR-constrained stochastic unbiased compressors are much less restrictive than the noise-power-constrained stochastic unbiased compressors previously assumed in the literature (see, e.g., [11, 12, 13]), which satisfy $\mathbb{E}[C(\mathbf{x})] = \mathbf{x}$ and $\mathbb{E}[\|C(\mathbf{x}) - \mathbf{x}\|^2] \le \sigma^2$, $\forall \mathbf{x}$. That is, the compression noise power is universally upper bounded by a constant regardless of the input signal. In contrast, the noise power under SNR-constrained compressors could be arbitrarily large as long as it satisfies a certain SNR requirement, hence being more general. For example, the following are two typical SNR-constrained stochastic unbiased compressors:
Example 1.
[The Sparsifier Operator [16]] For any vector $\mathbf{x} \in \mathbb{R}^d$, the sparsifier outputs a sparse vector whose $i$-th element follows the Bernoulli distribution: $[C(\mathbf{x})]_i = x_i / p$ with probability $p$ and $0$ otherwise,
where $p \in (0, 1]$ is a constant. The operation is unbiased and the SNR is lower bounded by $p / (1 - p)$.
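As an illustration, the following is a minimal sketch of such a sparsifier (the function name `sparsify` is ours). With keep-probability $p$ and scaling by $1/p$, the operator is unbiased, and its noise power is $\|\mathbf{x}\|^2 (1-p)/p$, so the SNR is at least $p/(1-p)$:

```python
import numpy as np

def sparsify(x, p, rng=None):
    """Unbiased random sparsifier: keeps each entry with probability p,
    scaled by 1/p so that E[sparsify(x)] = x elementwise."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < p          # Bernoulli(p) keep decisions
    return np.where(mask, x / p, 0.0)
```

Smaller $p$ gives a sparser (cheaper) message but a lower SNR, which is exactly the tunable trade-off exploited later in Section IV.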
Example 2.
[The Ternary Operator [17]] For any vector $\mathbf{x} \in \mathbb{R}^d$, $C(\mathbf{x}) = \|\mathbf{x}\|_\infty \, \mathrm{sign}(\mathbf{x}) \circ \mathbf{b}$, where $\circ$ is the Hadamard product and $\mathbf{b}$ is a random vector with the $i$-th element
following the Bernoulli distribution: $b_i = 1$ with probability $|x_i| / \|\mathbf{x}\|_\infty$ and $b_i = 0$ otherwise.
The operation is unbiased, and the noise power is $\mathbb{E}[\|C(\mathbf{x}) - \mathbf{x}\|^2] = \|\mathbf{x}\|_\infty \|\mathbf{x}\|_1 - \|\mathbf{x}\|^2$, and hence the SNR is at least $1 / (\sqrt{d} - 1)$.
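A minimal sketch of the ternary operator (function name `ternarize` is ours); each output entry lies in $\{-s, 0, +s\}$ with $s = \|\mathbf{x}\|_\infty$, and unbiasedness follows from the Bernoulli probabilities:

```python
import numpy as np

def ternarize(x, rng=None):
    """Unbiased ternary quantizer: each entry becomes -s, 0, or +s with
    s = ||x||_inf, chosen so that E[ternarize(x)] = x elementwise."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.max(np.abs(x))
    if s == 0.0:
        return np.zeros_like(x)
    b = rng.random(x.shape) < np.abs(x) / s   # Bernoulli(|x_i| / s)
    return s * np.sign(x) * b
```

Each entry now needs only a 2-bit symbol plus one shared floating-point scale, which is the source of the ternary operator's large compression ratio.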
Next, we introduce the notion of consensus matrix, which is denoted as $\mathbf{W} \in \mathbb{R}^{N \times N}$ in this paper. As will be seen later, the entries in $\mathbf{W}$ define the weight parameters used by each node to perform local information aggregation. Mathematically, $\mathbf{W}$ satisfies the following properties:

a) Doubly stochastic: $\mathbf{W} \mathbf{1} = \mathbf{1}$ and $\mathbf{1}^{\top} \mathbf{W} = \mathbf{1}^{\top}$;

b) Symmetric: $W_{ij} = W_{ji}$, $\forall i, j \in \mathcal{N}$;

c) Network-defined sparsity pattern: $W_{ij} > 0$ if $(i, j) \in \mathcal{L}$ and $W_{ij} = 0$ otherwise, $\forall i \neq j$.
Collectively, properties a) and b) imply that the spectrum of $\mathbf{W}$ (i.e., the set of all eigenvalues) lies in the interval $(-1, 1]$ on the real line, with exactly one eigenvalue equal to 1. Further, since all eigenvalues are real, they can be sorted as $1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_N > -1$. For convenience, we define $\lambda \triangleq \max\{|\lambda_2|, |\lambda_N|\}$, i.e., the second-largest eigenvalue of $\mathbf{W}$ in magnitude. Simply speaking, the use of the consensus matrix is due to the fact that $\hat{\mathbf{W}} \mathbf{x} = \mathbf{x}$ if and only if $\mathbf{x}_i = \mathbf{x}_j$, $\forall i, j \in \mathcal{N}$ [8], where $\hat{\mathbf{W}} \triangleq \mathbf{W} \otimes \mathbf{I}_d$ and $\otimes$ represents the Kronecker product. Therefore, Problem (2) can be reformulated as minimizing $\frac{1}{N} \sum_{i \in \mathcal{N}} f_i(\mathbf{x}_i)$ subject to $\hat{\mathbf{W}} \mathbf{x} = \mathbf{x}$, which further leads to the original DGD algorithmic design [8].
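One standard way (not prescribed by the paper) to construct a matrix satisfying properties a)-c) is the Metropolis-Hastings rule; the sketch below builds it from an undirected edge list, with the function name ours:

```python
import numpy as np

def metropolis_weights(edges, n):
    """Build a symmetric, doubly stochastic consensus matrix W for an
    undirected graph using the Metropolis-Hastings rule."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        # Off-diagonal weight for each link, respecting the sparsity pattern.
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    # Diagonal entries make each row sum to one.
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W
```

Because the construction only uses local node degrees, each node can compute its own row of $\mathbf{W}$ by exchanging degrees with its neighbors, which fits the decentralized setting.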
With the notions of SNR-constrained unbiased stochastic compressors and the consensus matrix, we are now in a position to present our DC-DGD algorithmic framework. To this end, we let $\mathcal{N}_i$ denote the set of local neighbors of node $i$. Then, our DC-DGD is stated as follows:
Algorithm 1: Differential-Coded Compressed Decentralized Gradient Descent Method (DC-DGD). Initialization:


Set the initial state $\mathbf{x}_i^0$ at each node $i \in \mathcal{N}$.

Initialize the auxiliary variables and let the iteration index $t = 0$.
Main Loop:


In the $t$-th iteration, each node $i$ sends the differential-coded compressed information to its neighbors, where the compressor $C(\cdot)$ is an SNR-constrained stochastic unbiased compressor. Meanwhile, upon reception of all neighbors' information, each node performs the following updates:
a) Inexact state update step: (3); b) Weighted local aggregation step: (4); c) Local gradient step: (5); d) Auxiliary variable update step: (6).
Stop if some preferred convergence criterion is met; otherwise, let $t \leftarrow t + 1$ and go to Step 3.
Several important remarks on the DC-DGD algorithm are in order: 1) The combined update structure in Steps 3b) and 3c) is the same as in the original DGD algorithm, containing a weighted local aggregation step and a local gradient step. Notably, DC-DGD has only one parameter: the step-size (same as DGD). Thus, DC-DGD enjoys the same structural complexity as the original DGD.
2) DC-DGD is memory-efficient: in DC-DGD, each node only needs to store three local variables. This is in stark contrast to some DGD-based algorithms, e.g., ADC-DGD [13] and DCD-PSGD [12], where each node needs to store all values of the previous iteration from its neighbors, which is unscalable for large and dense networks where node degrees are high.
3) Compared to the original DGD algorithm and many of its variants, a notable difference in DC-DGD is that the gradient in Step 3c) is calculated based on an inexact update obtained from the compressed differential (i.e., Step 3a)), rather than an exact update. This choice is derived from the convergence analysis of a chosen Lyapunov function (to be defined soon). Interestingly, we will show that this modification does not harm the algorithm's convergence speed, because the difference between the inexact and exact updates is negligible when the Lyapunov function is near convergence.
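Since the exact update equations (3)-(6) are not reproduced here, the following is only a plausible sketch of a differential-coded loop consistent with the remarks above: each node keeps its true state and an inexact shared copy that all neighbors update via compressed differentials. All names and the interface are ours, not the paper's:

```python
import numpy as np

def dc_dgd(grads, x0, W, alpha, compress, T, rng=None):
    """Sketch of a differential-coded compressed DGD loop.

    grads    : list of callables; grads[i](x) returns node i's local gradient
    x0       : (n, d) array of initial local states
    W        : (n, n) consensus matrix
    compress : unbiased SNR-constrained compressor, compress(v, rng) -> v_hat
    T        : number of iterations
    """
    rng = np.random.default_rng() if rng is None else rng
    X = x0.copy()    # true local states
    H = x0.copy()    # inexact shared copies, known to all neighbors
    n = X.shape[0]
    for _ in range(T):
        # 3a) broadcast the compressed differential and apply it to the
        #     shared copies (inexact update)
        for i in range(n):
            H[i] = H[i] + compress(X[i] - H[i], rng)
        # 3b) weighted local aggregation and 3c) gradient step, both
        #     evaluated at the inexact copies
        X = W @ H - alpha * np.stack([grads[i](H[i]) for i in range(n)])
    return X
```

Note that with a lossless (identity) compressor, the differentials are applied exactly, the shared copies equal the true states, and the loop reduces to plain DGD.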
Before we prove the convergence of DC-DGD, it is insightful to offer some intuition on why DC-DGD retains most of the simple structural properties of the original DGD and does not need extra mechanisms/parameters to tame compression noise. First, we define the following Lyapunov function:
(7) 
We note that this Lyapunov function is also used in proving the convergence of several other DGD-based algorithms (e.g., [18, 19]). To understand our DC-DGD algorithm, we rewrite its updates in Steps 3a)-3d) in the following vector form:
(8) 
where the stacked state variables and compressed differentials are defined accordingly. By induction on the update rule, the auxiliary variables track the true states. Hence, we can rewrite the updates as:
where the compression noise is zero-mean and its power depends on the difference between two successive iterates, which in turn is the gradient of the Lyapunov function. As the algorithm converges (to be proved soon), this difference vanishes, which implies that the compression noise power also shrinks to zero. Hence, no extra effort is required to tame the noise power, thanks to this self-compression-noise-power-reduction effect.
III-C. Main Theoretical Results
In this subsection, we will establish the convergence of the proposed DC-DGD algorithm. Our convergence results are proved under the following mild assumptions:
Assumption 1.
The local objective functions satisfy:

(Lower boundedness) There exists an optimal solution $\mathbf{x}^*$ with $f(\mathbf{x}^*) > -\infty$ such that $f(\mathbf{x}) \ge f(\mathbf{x}^*)$, $\forall \mathbf{x}$;

(Lipschitz continuous gradient) There exists a constant $L > 0$ such that $\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$, $\forall \mathbf{x}, \mathbf{y}$, $\forall i \in \mathcal{N}$;

(Bounded gradient) There exists a constant $G > 0$ such that $\|\nabla f_i(\mathbf{x})\| \le G$, $\forall \mathbf{x}$, $\forall i \in \mathcal{N}$.
Note that the first two bullets are standard in convergence analysis: the first ensures the existence of an optimal solution, and the second guarantees the smoothness of the local objectives. The third bullet is needed to bound the deviation of the local copies from their mean (cf. Theorem 2). It is equivalent to $f_i$ being Lipschitz continuous. This mild assumption has been widely adopted in analyzing nonconvex optimization algorithms in the literature (see, e.g., [20, 21, 22]).
To show the convergence of DC-DGD, we will show that the iterates and their gradients are bounded over all iterations, and that the sum of the gradients of the Lyapunov function over all iterations is also bounded.
Theorem 1.
Under Assumption 1, if a constant step-size is used, where the SNR threshold satisfies the stated lower bound, then the gradients of the Lyapunov function are bounded, i.e.,
Note that Theorem 1 has a key condition on the SNR threshold. This SNR lower bound guarantees a feasible domain for the step-size. Interestingly, it can be seen that as the consensus matrix becomes sparser, the lower bound for the SNR shrinks to zero, meaning that as the network gets sparser, we could adopt compressors with larger compression ratios.
Next, we bound the deviation of each local copy from the mean of all local copies in any iteration $t$:
Theorem 2.
Theorem 2 requires the boundedness guaranteed by Theorem 1. Lastly, based on Theorems 1 and 2, we show that DC-DGD converges at a sublinear rate to an error ball around a stationary point of the global objective:
Theorem 3.
Under Assumption 1, if the step-size satisfies the stated condition, then it holds that
where the constants are as specified in the theorem. Thus, DC-DGD converges at a sublinear rate to an error ball whose size depends on the problem and network parameters:
Note that in Theorem 3, similar to the original DGD algorithm, the size of the error ball is determined by two terms: the first is a convergence error with a sublinearly diminishing rate; the second is an approximation error affected by the step-size and the network structure (characterized by the spectral properties of the consensus matrix). Therefore, to reach an optimal solution, the step-size needs to be small so that the second term is close to zero. However, a smaller step-size enlarges the coefficient of the convergence error, which in turn requires more iterations to shrink the first term.
The next result shows that, with a diminishing step-size, DC-DGD converges to a first-order stationary point (an optimal solution in convex problems) at a sublinear rate:
Corollary 1.
Let the step-size follow a sublinearly diminishing series as specified; then the convergence rate of DC-DGD is:
III-D. Proofs of the Main Theoretical Results
Due to space limitations, we provide proof sketches of the main theoretical results in this subsection.
Proof Sketch of Theorem 1.
Let $\mathcal{F}_t$ denote the filtration generated by the iterates up to iteration $t$. It can be shown that the Lyapunov function has Lipschitz-continuous gradients. It then follows that:
Taking the conditional expectation and using the properties of SNR-constrained unbiased compressors yields a descent inequality. Then, by setting the step-size as stated in the theorem, taking the full expectation on both sides, and telescoping over all iterations, we have:
(9) 
Since the Lyapunov function is lower bounded, after rearranging terms, we can conclude that:
and the proof is complete. ∎
Proof Sketch of Theorem 2.
For notational convenience, we let the mean of the local copies be denoted accordingly. From (8), we can obtain:
Using the above equations, we can derive the following inequality for the deviation from the mean:
Taking the expectation on both sides, and after some algebraic manipulations, we arrive at:
which completes the proof. ∎
Proof Sketch of Theorem 3.
First, we prove a key descent inequality. From the update rule, it then follows that:
where the constants are defined accordingly. Taking the conditional expectation on both sides and after some algebraic manipulations, we can show that
Taking the full expectation, telescoping the inequality over all iterations, and after further algebraic manipulations, we have:
which, after further rearrangements, yields the result stated in the theorem. This completes the proof. ∎
IV. A Hybrid Compression Design under DC-DGD for Communication Cost Minimization
Inspired by the previous theoretical insights, in this section, our goal is to design a hybrid SNR-constrained compression scheme that achieves high communication cost savings while having a controllable SNR. Recall from Section III-B that the sparsifier can control the compression noise power by adjusting the probability $p$, and the expected communication cost for a $d$-dimensional vector is $d(p c_f + (1 - p) c_0)$, where $c_f$ is the cost for sending a floating-point number and $c_0$ is the cost for the zero value. Therefore, if the SNR threshold is large, the communication cost will be close to that of sending an uncompressed copy. For the ternary operator, its compression noise power is fixed by the input vector and is not directly controllable by any parameter. Its communication cost is $c_f + d c_t$, where $c_t$ is the cost for the ternary values. In general, the communication cost of a ternary-compressed vector is much smaller than that of a sparse-compressed vector. For example, if using 32-bit floating-point numbers and one bit for the zero value, the cost for a $d$-dimensional sparse-compressed vector is $(32p + (1 - p))d$ bits. In contrast, for the ternary operator, the cost is only $32 + 2d$ bits if using 32-bit floating-point numbers and two bits for the ternary values. With a larger SNR threshold (i.e., larger $p$) and high dimensionality $d$, the communication cost of the ternary compressor is much smaller. Therefore, to have a controllable compression noise power as well as high communication cost savings, a promising solution is to combine the sparse and ternary compressors.
To this end, consider a $d$-dimensional vector $\mathbf{x}$. We can sort and rearrange the elements of $\mathbf{x}$ in descending order of magnitude to obtain $|x_{(1)}| \ge |x_{(2)}| \ge \cdots \ge |x_{(d)}|$. To the $k$ largest-magnitude elements, we apply the ternary compressor, while for the rest of the elements, we use the sparse compressor, i.e.,
As a result, the compression noise power levels of the $k$ largest elements and of the remaining elements follow from the two operators above, respectively. In order to ensure that the effective SNR of the hybrid scheme satisfies a given lower bound, we require:
(10)  
(11) 
To satisfy (10) and (11), we choose the group size $k$ and the sparsification probability $p$ accordingly. Then, on average, the compressed vector contains fewer floating-point numbers and more ternary values, which is more communication-efficient than the pure sparsifier compressor.
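A minimal sketch of the single-anchor hybrid compressor described above (ternary quantization on the $k$ largest-magnitude entries, sparsification on the rest); the function name and exact interface are ours, and unbiasedness of the hybrid follows because both component operators are unbiased:

```python
import numpy as np

def hybrid_compress(x, k, p, rng=None):
    """Hybrid compressor sketch: ternary-quantize the k largest-magnitude
    entries (scaled by their own max), sparsify the remainder with prob. p."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros_like(x)
    idx = np.argsort(-np.abs(x))         # indices sorted by magnitude, desc.
    top, rest = idx[:k], idx[k:]
    # Ternary part on the k largest entries.
    s = np.max(np.abs(x[top])) if k > 0 else 0.0
    if s > 0:
        b = rng.random(top.shape) < np.abs(x[top]) / s
        out[top] = s * np.sign(x[top]) * b
    # Sparsifier part on the remaining entries.
    mask = rng.random(rest.shape) < p
    out[rest] = np.where(mask, x[rest] / p, 0.0)
    return out
```

Raising $k$ shifts more entries into the cheap ternary alphabet, while $p$ controls the noise power of the tail, which is the controllable-SNR/high-compression trade-off the hybrid scheme targets.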
In fact, the hybrid compression idea above can be generalized to achieve further communication cost savings: instead of using just one anchor for the ternary compression, we can select multiple "anchor elements" that partition the sorted $d$-dimensional vector into groups. For the elements with indices inside each group, we apply the ternary compressor based on that group's anchor element. For the remaining elements, we apply the sparsifier operator. Similar to (10), we have
(12) 
Then, the compressed vector consists of a small number of floating-point numbers (one anchor per group) and ternary values. Moreover, we need to store the indices of the anchor elements, which requires $\lceil \log_2 d \rceil$ bits per anchor element.
Given an SNR threshold, the communication saving of our hybrid compression scheme highly depends on the number of groups and the positions of the anchor elements, which can be optimized by solving an integer programming problem. Take 32-bit floating-point numbers and 2-bit ternary values as an example. To achieve the maximum communication saving, the number of groups and the locations of the anchor elements can be determined by solving:
(13) 
Problem (13) is an integer optimization problem, which can be shown to be equivalent to a bin packing problem and is thus NP-hard. However, an efficient greedy heuristic algorithm can be developed by leveraging the special problem structure. Specifically, we note that the objective function is monotone in the group number and the anchor positions, respectively. Therefore, we can find anchor points and their corresponding ternary sets by checking (12): if the ternary cost of the elements is smaller than the sparsifier cost, we remove these elements from the current vector; otherwise, we use the sparsifier compressor on the current vector. We summarize the greedy algorithm as follows:

Algorithm 2: A greedy algorithm for solving Problem (13). Initialization:

Sort and rearrange the elements of the vector in descending order of magnitude.

Initialize the group counter and set the ternary set to be empty.
Main Loop:


Inner Loop:


For each element, find its candidate ternary set by checking (12).

Set and .


Compare the ternary cost with the sparsifier cost.

If the ternary cost is smaller, then remove the corresponding elements from the current vector, add them to the ternary set, and go to Step 3; otherwise, break the loop.
Final Step:


Apply the ternary operator to each group in the ternary set and the sparse operator to the remaining elements.
Now, we analyze the running-time complexity of the greedy algorithm. First of all, the sorting requires $O(d \log d)$ time. The worst-case number of iterations in the main loop is $O(d)$, while each inner loop takes $O(d)$ steps to find the ternary set for each element. Hence, the overall time complexity of Algorithm 2 is $O(d^2)$.
V. Numerical Results
In this section, we perform extensive numerical experiments to validate the performance of our proposed DC-DGD algorithm and the hybrid compression scheme.
[Figure: The black solid curve is the original DGD algorithm. The other curves represent the error averaged over 50 trials, and the shaded regions indicate the standard deviations over the random trials.]