# Quantized Decentralized Consensus Optimization

We consider the problem of decentralized consensus optimization, where the sum of $n$ convex functions is minimized over $n$ distributed agents that form a connected network. In particular, we consider the case that the local decision variables communicated among nodes are quantized in order to alleviate the communication bottleneck in distributed optimization. We propose the Quantized Decentralized Gradient Descent (QDGD) algorithm, in which nodes update their local decision variables by combining the quantized information received from their neighbors with their local information. We prove that under standard strong convexity and smoothness assumptions for the objective function, QDGD achieves a vanishing mean solution error. To the best of our knowledge, this is the first algorithm that achieves vanishing consensus error in the presence of quantization noise. Moreover, we provide simulation results that show tight agreement between our derived theoretical convergence rate and the experimental results.


## I Introduction

Distributed optimization of a sum of convex functions has a variety of applications in different areas including decentralized control systems [cao2013overview], wireless systems [ribeiro2010ergodic], sensor networks [rabbat2004distributed], networked multiagent systems [olfati2007consensus], multirobot networks [ren2007information], and large scale machine learning [tsianos2012consensus]. In such problems, one aims to solve a consensus optimization problem to minimize $f(x)=\sum_{i=1}^{n} f_i(x)$ cooperatively over $n$ nodes or agents that form a connected network. The function $f_i(\cdot)$ represents the local cost function of node $i$ that is only known by this node.

Distributed optimization has been largely studied in the literature starting from seminal works in the 80s [tsitsiklis1986distributed, tsitsiklis1984problems]. Since then, various algorithms have been proposed to address decentralized consensus optimization in multiagent systems. The most commonly used algorithms are decentralized gradient descent or gradient projection method [nedic2009distributed, yuan2016convergence, jakovetic2014fast, ram2010distributed], distributed alternating direction method of multipliers (ADMM) [boyd2011distributed, shi2014linear, mokhtari64dqm], decentralized dual averaging [duchi2012dual, tsianos2012push], and distributed Newton optimization method [wei2013distributed, mokhtari2017network]. Furthermore, the decentralized consensus optimization problem has been considered in online or dynamic settings, where the dynamic cost function becomes an online regret function [yan2013distributed].

A major bottleneck in achieving fast convergence in decentralized consensus optimization is limited communication bandwidth among nodes. As the dimension of input data increases (which is the current trend in large-scale distributed machine learning), a considerable amount of information must be exchanged among nodes, over many iterations of the consensus algorithm. This causes a significant communication bottleneck that can substantially slow down the convergence time of the algorithm [seide20141, chowdhury2011managing]. Quantized communication among agents was first studied for bounded and stable control systems [yuksel2003quantization]. Furthermore, consensus distributed averaging algorithms have been studied under discretized message passing [Kashyap2006QuantizedC].
Motivated by the energy and bandwidth-constrained wireless sensor networks, the work in [rabbat2005quantized] proposes distributed optimization algorithms under quantized variables and guarantees convergence within a non-vanishing error. Deterministic quantization has been considered in distributed averaging algorithms [el2016design] where the iterations converge to a neighborhood of the average of the initial values. However, randomized quantization schemes are shown to achieve the average of the initial values, in expectation [aysal2007distributed]. The work in [nedic2008distributed] also considers a consensus distributed optimization problem over a cooperative network of agents restricted to quantized communication. The proposed algorithm guarantees convergence to the optima within an error which depends on the network size and the number of quantization levels. Aligned with the communication bottleneck described earlier, [gravelle2014quantized] provides a quantized distributed load balancing scheme that converges to a set of desired states while the nodes are constrained to remain under maximum load capacities.

More recently, 1-Bit SGD [seide20141] was introduced, in which at each time step the agents sequentially quantize their local gradient vectors by entry-wise signs while adding back the quantization error induced in the previous iteration. Moreover, in [alistarh2017qsgd], the authors propose Quantized-SGD (QSGD), a class of compression schemes based on a stochastic and unbiased quantizer of the vector to be transmitted. QSGD provides provable convergence guarantees, as well as good practical performance. Recently, a different line of work has proposed the use of coding theoretic techniques to alleviate the communication bottleneck in distributed computation [li2016fundamental, lee2016speeding, ezzeldin2017communication, prakash2018coded].

In this paper, our goal is to analyze the quantized decentralized consensus optimization problem, where node $i$ transmits a quantized version $Q(x_i)$ of its local decision variable to the neighboring nodes instead of the exact decision variable $x_i$. Motivated by the stochastic quantizer proposed in [alistarh2017qsgd], we consider two classes of unbiased random quantizers. While they both share the unbiasedness assumption, i.e. $\mathbb{E}[Q(x)\mid x]=x$, the corresponding variance differs for the two classes. We first consider variance bounded quantizers in which we have $\mathbb{E}[\|Q(x)-x\|^2\mid x]\le\sigma^2$ for some fixed constant $\sigma$. Furthermore, we consider random quantizers for which the variance is bounded proportionally to the squared norm of the quantizer's input, that is $\mathbb{E}[\|Q(x)-x\|^2\mid x]\le\eta^2\|x\|^2$ for a constant $\eta$.

Our main contribution is to propose a Quantized Decentralized Gradient Descent (QDGD) method, which involves a novel way of updating the local decision variables by combining the quantized message received from the neighbors and the local information such that proper averaging is performed over the local decision variable and the neighbors' quantized vectors. We prove that under standard strong convexity and smoothness assumptions, for any unbiased and variance bounded quantizer, QDGD achieves a vanishing mean solution error: for all nodes $i$ we obtain that for any arbitrary $\delta\in(0,1/2)$ and large enough $T$, $\mathbb{E}[\|x_{i,T}-\tilde{x}^*\|^2]\le O(T^{-\delta})$, where $x_{i,T}$ is the local decision variable of node $i$ at iteration $T$ and $\tilde{x}^*$ is the global optimum. To the best of our knowledge, this is the first decentralized gradient-based algorithm that achieves vanishing consensus error in the presence of non-vanishing quantization noise. We further generalize the convergence result to the second class of unbiased quantizers, for which the variance is bounded proportionally to the squared norm of the quantizer's input, and prove that the proposed algorithm attains the same convergence rate. We also provide simulation results, for both synthetic and real data, that corroborate our theoretical results.

Notation. In this paper, we denote by $[n]$ the set $\{1,\dots,n\}$ for any natural number $n$. The gradient of a function $f(x)$ is denoted by $\nabla f(x)$. For non-negative functions $g$ and $h$ of $t$, we denote $g(t)=O(h(t))$ if there exist $t_0$ and a constant $c$ such that $g(t)\le c\,h(t)$ for any $t\ge t_0$. We use $\lceil x\rceil$ to indicate the least integer greater than or equal to $x$.

Paper Organization. The rest of the paper is organized as follows. In Section II, we precisely formulate the quantized decentralized consensus optimization problem. We provide the description of the Quantized Decentralized Gradient Descent algorithm in Section III. The main theorems of the paper are stated and proved in Section IV. In Section V, we study the trade-off between communication cost and accuracy of the algorithm. We provide numerical studies in Section VI. Finally, we conclude the paper and discuss future directions in Section VII.

## II Problem Formulation

In this section, we formally define the consensus optimization problem that we aim to solve. Consider a set of $n$ nodes that communicate over a connected and undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of nodes and edges, respectively. We assume that nodes are only allowed to exchange information with their neighbors and use the notation $\mathcal{N}_i$ for the set of node $i$'s neighbors. In our setting, we assume that each node $i$ has access to a local convex function $f_i:\mathbb{R}^p\to\mathbb{R}$, and nodes in the network cooperate to minimize the aggregate objective function $f:\mathbb{R}^p\to\mathbb{R}$ taking values $f(x)=\sum_{i=1}^{n}f_i(x)$. In other words, nodes aim to solve the optimization problem

$$\min_{x\in\mathbb{R}^p} f(x) = \min_{x\in\mathbb{R}^p} \sum_{i=1}^{n} f_i(x). \tag{1}$$

We assume the local objective functions $f_i$ are strongly convex and smooth, and, therefore, the aggregate function $f$ is also strongly convex and smooth. In the rest of the paper, we use $\tilde{x}^*$ to denote the unique minimizer of Problem (1). In decentralized settings, each node has access to a single summand of the global objective function, so communication with neighboring nodes is inevitable to reach the optimal solution $\tilde{x}^*$. To be more precise, nodes need to minimize their local objective functions, while they ensure that their local decision variables are equal to their neighbors'. This interpretation leads to an equivalent formulation of Problem (1). If we define $x_i$ as the decision variable of node $i$, the alternative formulation of Problem (1) can be written as

$$\min_{x_1,\dots,x_n\in\mathbb{R}^p} \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_i=x_j, \ \text{for all } i,\ j\in\mathcal{N}_i. \tag{2}$$

Since we assume that the underlying network is a connected graph, the constraint in (2) implies that any feasible solution should satisfy $x_1=\dots=x_n$. Under this condition the objective function values in (1) and (2) are equivalent. Hence, it follows that the optimal solutions of Problem (2) are equal to the optimal solution of Problem (1), i.e., if we denote $\{x_i^*\}_{i=1}^{n}$ as the optimal solutions of Problem (2) it holds that $x_1^*=\dots=x_n^*=\tilde{x}^*$. Therefore, we proceed to solve Problem (2), which is naturally formulated for decentralized optimization, in lieu of Problem (1).

The problem formulation in (2) suggests that each node $i$ should minimize its local objective function $f_i$ while keeping its decision variable $x_i$ close to the decision variables $x_j$ of its neighbors $j\in\mathcal{N}_i$. This goal can be achieved by exchanging local variables among neighboring nodes to enforce consensus on the decision variables. However, exchanging updated local vectors between the distributed nodes induces a potentially heavy communication load on the shared bus. To address this issue, we assume that each node provides a randomly quantized variant of its local updated variable to the neighboring nodes. That is, if we denote by $x_i$ the decision variable of node $i$, then the corresponding quantized variant $z_i=Q(x_i)$ is communicated to the neighboring nodes $j\in\mathcal{N}_i$. Exchanging quantized vectors instead of the true vectors reduces the communication burden at the cost of injecting noise into the information received by the nodes in the network. The main challenge in this setting is to ensure that nodes can still converge to the optimal solution of Problem (2), while they only have access to a quantized variant of their neighbors' true decision variables.

## III QDGD Algorithm

In this section, we propose a quantized gradient based method to solve the decentralized optimization problem in (2), and consequently the original problem in (1), in a fully decentralized fashion. To do so, consider $x_{i,t}$ as the decision variable of node $i$ at step $t$ and $z_{i,t}=Q(x_{i,t})$ as the quantized version of the vector $x_{i,t}$. In the proposed Quantized Decentralized Gradient Descent (QDGD) method, nodes update their local decision variables by combining the quantized information received from their neighbors with their local information. To formally state the update of QDGD, we first define $w_{ij}$ as the weight that node $i$ assigns to node $j$. If nodes $i$ and $j$ are not neighbors then $w_{ij}=0$, and if they are neighbors the weight $w_{ij}$ is nonnegative. At each time step $t$, each node $i$ sends the quantized variant $z_{i,t}$ of its local vector $x_{i,t}$ to its neighbors $j\in\mathcal{N}_i$ and receives their corresponding vectors $z_{j,t}$. Then, using the received information, it updates its local decision variable according to the update

$$x_{i,t+1}=(1-\varepsilon+\varepsilon w_{ii})\,x_{i,t}+\varepsilon\sum_{j\in\mathcal{N}_i}w_{ij}z_{j,t}-\alpha\varepsilon\nabla f_i(x_{i,t}), \tag{3}$$

where $\varepsilon$ and $\alpha$ are positive step-sizes. The update of QDGD in (3) shows that the updated decision variable is obtained by proper averaging over the local decision variable $x_{i,t}$ and the neighbors' quantized vectors $z_{j,t}$, and by descending along the negative local gradient $-\nabla f_i(x_{i,t})$ with a proper stepsize. Note that the quantized decision variables of the neighboring nodes contribute to the descent direction proportionally to the step-size $\varepsilon$, unlike the noiseless local gradient which is scaled by $\alpha\varepsilon$. The steps of the proposed QDGD method are summarized in Algorithm 1.
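As an illustration of update (3), the following is a minimal NumPy sketch of one synchronous QDGD iteration for all nodes at once. The function name `qdgd_step`, the array layout, and the additive-noise stand-in quantizer in the demo are assumptions made for this sketch; they are not taken from Algorithm 1.

```python
import numpy as np

def qdgd_step(X, W, grads, quantize, eps, alpha):
    """One synchronous iteration of update (3) for all n nodes.

    X        : (n, p) array, row i is the local variable x_{i,t}
    W        : (n, n) weight matrix, W[i, j] = 0 if i and j are not neighbors
    grads    : (n, p) array, row i is the gradient of f_i at x_{i,t}
    quantize : callable mapping a length-p vector to its quantized version
    """
    Z = np.stack([quantize(x) for x in X])      # z_{j,t} = Q(x_{j,t})
    w_diag = np.diag(W)                         # self-weights w_ii
    W_off = W - np.diag(w_diag)                 # neighbor weights only
    own = (1.0 - eps + eps * w_diag)[:, None] * X
    return own + eps * (W_off @ Z) - alpha * eps * grads

# Hypothetical demo: f_i(x) = 0.5 * ||x - a_i||^2 on a 3-node complete graph.
rng = np.random.default_rng(0)
W = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
targets = rng.normal(size=(3, 2))
noisy = lambda v: v + 0.1 * rng.normal(size=v.shape)  # stand-in unbiased quantizer
X = np.zeros((3, 2))
for _ in range(2000):
    X = qdgd_step(X, W, X - targets, noisy, eps=0.01, alpha=0.1)
```

The neighbors' quantized vectors enter only through the term scaled by `eps`, while each node's own unquantized variable keeps weight `1 - eps + eps * w_ii`, which mirrors the structure of (3).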

###### Remark 1.

The proposed QDGD algorithm can be interpreted as a variant of the decentralized (sub)gradient descent (DGD) method [nedic2009distributed, yuan2016convergence] for quantized decentralized optimization (see Section IV). Note that the vanilla DGD method converges to a neighborhood of the optimal solution in the presence of quantization noise, where the radius of convergence depends on the variance of the quantization error [nedic2009distributed, yuan2016convergence, rabbat2005quantized, nedic2008distributed]. QDGD improves the inexact convergence of quantized DGD by modifying the contribution of the quantized information received from neighboring nodes, as described in update (3). In particular, as we show in Theorem 1, the sequence of iterates generated by QDGD converges to the optimal solution of Problem (1) in expectation.

Note that the proposed QDGD algorithm does not restrict the quantizer, except for a few customary conditions. The design of efficient quantizers has nevertheless been studied in its own right; the following example describes one such quantizer.

###### Example 1.

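Consider a low-precision representation specified by $\gamma\in\mathbb{R}$ and $b\in\mathbb{N}$. The range representable by scale factor $\gamma$ and $b$ bits is $\{-\gamma\cdot 2^{b-1},\dots,-\gamma,0,\gamma,\dots,\gamma\cdot(2^{b-1}-1)\}$. For any $x\in[k\gamma,(k+1)\gamma)$ in the representable range, the low-precision quantizer outputs

$$Q_{(\gamma,b)}(x)=\begin{cases} k\gamma & \text{w.p. } 1-\dfrac{x-k\gamma}{\gamma},\\[4pt] (k+1)\gamma & \text{w.p. } \dfrac{x-k\gamma}{\gamma}. \end{cases} \tag{4}$$

For any $x$ in the range, the quantizer is unbiased and variance bounded, i.e. $\mathbb{E}[Q_{(\gamma,b)}(x)]=x$ and $\mathbb{E}[\|Q_{(\gamma,b)}(x)-x\|^2]\le\gamma^2/4$.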
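The stochastic rounding in (4) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `low_precision_quantize` is hypothetical, the quantizer is applied entry-wise, and clipping the grid index to $\{-2^{b-1},\dots,2^{b-1}-1\}$ is an assumption added here as a safeguard rather than part of (4).

```python
import numpy as np

def low_precision_quantize(x, gamma, b, rng):
    """Stochastically round each entry of x to the grid {k * gamma}, as in (4)."""
    x = np.asarray(x, dtype=float)
    k = np.floor(x / gamma)                      # lower grid index
    prob_up = x / gamma - k                      # (x - k*gamma) / gamma
    k = k + (rng.random(x.shape) < prob_up)      # round up with probability prob_up
    # Assumed b-bit index range; inputs outside it are clipped (not part of (4)).
    k = np.clip(k, -2 ** (b - 1), 2 ** (b - 1) - 1)
    return gamma * k

rng = np.random.default_rng(0)
z = low_precision_quantize(np.array([0.3, -0.7, 1.01]), gamma=0.25, b=8, rng=rng)
```

Inside the representable range the output is always one of the two grid points surrounding the input, and the rounding probabilities are chosen so that the expectation equals the input.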

In Section IV, we formally state the required conditions for the quantization scheme used in QDGD and show that a large class of well-known quantizers satisfy the required conditions.

## IV Convergence Analysis

In this section, we prove that for a sufficiently large number of iterations, the sequence of local iterates generated by QDGD converges to an arbitrarily precise approximation of the optimal solution of Problem (2) and consequently Problem (1). The following assumptions hold throughout the analysis of the algorithm.

###### Assumption 1.

Local objective functions $f_i$ are differentiable and smooth with parameter $L$, i.e.,

$$\|\nabla f_i(x)-\nabla f_i(y)\|\le L\,\|x-y\|, \tag{5}$$

for any $x,y\in\mathbb{R}^p$. (Local objectives may have different smoothness parameters; however, without loss of generality one can take the largest smoothness parameter as the one for all the objectives.)

###### Assumption 2.

Local objective functions $f_i$ are strongly convex with parameter $\mu$, i.e.,

$$\langle\nabla f_i(x)-\nabla f_i(y),\,x-y\rangle\ge\mu\,\|x-y\|^2, \tag{6}$$

for any $x,y\in\mathbb{R}^p$. (Local objectives may have different strong convexity parameters; however, without loss of generality one can take the smallest strong convexity parameter as the one for all the objectives.)

###### Assumption 3.

The random quantizer $Q(\cdot)$ is unbiased and has a bounded variance, i.e.,

$$\mathbb{E}[Q(x)\mid x]=x, \quad\text{and}\quad \mathbb{E}\big[\|Q(x)-x\|^2\mid x\big]\le\sigma^2, \tag{7}$$

for any $x\in\mathbb{R}^p$; and quantizations are carried out independently on distributed nodes.

###### Assumption 4.

The weight matrix $W\in\mathbb{R}^{n\times n}$ with entries $w_{ij}$ satisfies the following conditions

$$W=W^\top,\quad W\mathbf{1}=\mathbf{1}, \quad\text{and}\quad \operatorname{null}(I-W)=\operatorname{span}(\mathbf{1}). \tag{8}$$

The conditions in Assumptions 1 and 2 imply that the global objective function $f$ is strongly convex with parameter $n\mu$ and its gradients are Lipschitz continuous with constant $nL$. Assumption 3 poses two customary conditions on the quantizer, namely unbiasedness and variance boundedness. Assumption 4 implies that the weight matrix $W$ is symmetric and doubly stochastic. The largest eigenvalue of $W$ is $\lambda_1(W)=1$ and all the eigenvalues belong to $(-1,1]$, i.e., the ordered sequence of eigenvalues of $W$ is $1=\lambda_1(W)\ge\lambda_2(W)\ge\dots\ge\lambda_n(W)>-1$. We denote by $1-\beta$ the spectral gap associated to the stochastic matrix $W$, where $\beta=\max\{|\lambda_2(W)|,|\lambda_n(W)|\}$ is the second largest magnitude of the eigenvalues of matrix $W$. It is also customary to assume $\operatorname{rank}(I-W)=n-1$ such that $\operatorname{null}(I-W)=\operatorname{span}(\mathbf{1})$. In the following theorem we show that the local iterations generated by QDGD converge to the global optima, as close as desired.
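For a concrete weight matrix, the conditions in (8) and the spectral gap can be checked numerically. The helper below is a small sketch under the assumption that $\beta$ is the second largest eigenvalue magnitude of $W$, as described above; the name `spectral_gap` and the tolerance are choices made for this illustration.

```python
import numpy as np

def spectral_gap(W, tol=1e-9):
    """Check the conditions in (8) numerically and return (beta, 1 - beta)."""
    n = W.shape[0]
    assert np.allclose(W, W.T, atol=tol), "W must be symmetric"
    assert np.allclose(W @ np.ones(n), np.ones(n), atol=tol), "rows must sum to one"
    eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]   # descending order
    # null(I - W) = span(1) means that eigenvalue 1 is simple.
    assert eigvals[1] < 1 - tol, "eigenvalue 1 must be simple"
    beta = max(abs(eigvals[1]), abs(eigvals[-1]))    # second largest magnitude
    return beta, 1.0 - beta

W = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
beta, gap = spectral_gap(W)
```

A larger gap `1 - beta` corresponds to faster mixing of information across the network, which is why it appears in the denominator of the consensus term in the bounds below.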

###### Theorem 1.

Consider the distributed consensus optimization Problem (1) and suppose Assumptions 1 to 4 hold. Consider $\delta$ as an arbitrary scalar in $(0,1/2)$ and set $\varepsilon=c_1/T^{3\delta/2}$ and $\alpha=c_2/T^{\delta/2}$, where $c_1$ and $c_2$ are arbitrary positive constants (independent of $T$). Then, for each node $i$, the expected squared difference between the output of Algorithm 1 after $T$ iterations and the solution of Problem (1) is upper bounded by

$$\mathbb{E}\big[\|x_{i,T}-\tilde{x}^*\|^2\big]\le O\!\left(\left(\frac{4nc_2^2D^2(3+2L/\mu)^2}{(1-\beta)^2}+\frac{2c_1n\sigma^2\|W-W_D\|^2}{\mu c_2}\right)\frac{1}{T^{\delta}}\right), \tag{9}$$

if the total number of iterations satisfies $T\ge T_{\min}$, where $T_{\min}$ is a function of $\delta$, $c_1$, $c_2$, $L$, $\mu$, and $\lambda_n(W)$, and $W_D$ denotes the diagonal matrix formed by the main diagonal of $W$. Moreover,

$$D^2=2L\sum_{i=1}^{n}\big(f_i(0)-f_i^*\big),\qquad f_i^*=\min_{x\in\mathbb{R}^p}f_i(x). \tag{10}$$

Theorem 1 demonstrates that the proposed QDGD provides an approximate solution with vanishing deviation from the optimal solution, despite the fact that the quantization noise does not vanish as the number of iterations progresses. At first glance, the expression in (9) suggests setting $\delta=1/2$ to obtain the best possible sublinear convergence rate, which is $O(1/\sqrt{T})$. However, $T_{\min}$, which is a lower bound on the total number of iterations $T$, is an increasing function of $1/(1-2\delta)$, and by choosing $\delta$ very close to $1/2$, the total number of iterations should be very large to obtain a fast convergence rate close to $O(1/\sqrt{T})$. Therefore, there is a trade-off between the convergence rate and the minimum number of required iterations. By setting $\delta$ close to $1/2$ we obtain a fast convergence rate but at the cost of running the algorithm for a large number of iterations, and by selecting $\delta$ close to $0$ the lower bound on the total number of iterations becomes smaller at the cost of having a slower convergence rate. We will illustrate this trade-off in the numerical experiments.

Moreover, note that the result in (9) shows a balance between the variance of quantization and the mixing matrix. To be more precise, if the variance of quantization $\sigma^2$ is small, nodes should assign larger weights to their neighbors, which decreases $(1-\beta)^{-2}$ and increases $\|W-W_D\|^2$. Conversely, when the variance $\sigma^2$ is large, to balance the terms in (9) nodes should assign larger weights to their local decision variables, which decreases the term $\|W-W_D\|^2$ and increases $(1-\beta)^{-2}$.

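### IV-A Proof of Theorem 1

To analyze the proposed QDGD method, we start by rewriting the update rule (3) as follows

$$x_{i,t+1}=x_{i,t}-\varepsilon\Big((1-w_{ii})\,x_{i,t}-\sum_{j\neq i}w_{ij}z_{j,t}+\alpha\nabla f_i(x_{i,t})\Big). \tag{11}$$

Note that to derive the expression in (11), we simply use the fact that $w_{ij}=0$ when $j\notin\mathcal{N}_i$. The next step is to write the update (11) in a matrix form. To do so, we define the function $F:\mathbb{R}^{np}\to\mathbb{R}$ as $F(\mathbf{x})=\sum_{i=1}^{n}f_i(x_i)$, where $x_i\in\mathbb{R}^p$ and $\mathbf{x}=[x_1;\dots;x_n]\in\mathbb{R}^{np}$ is the concatenation of the local variables $x_i$. It is easy to verify that the gradient of the function $F$ is the concatenation of local gradients evaluated at the local variables, that is $\nabla F(\mathbf{x}_t)=[\nabla f_1(x_{1,t});\dots;\nabla f_n(x_{n,t})]$. We also define the matrix $\mathbf{W}=W\otimes I\in\mathbb{R}^{np\times np}$ as the Kronecker product of the weight matrix $W\in\mathbb{R}^{n\times n}$ and the identity matrix $I\in\mathbb{R}^{p\times p}$. Similarly, define $\mathbf{W}_D=W_D\otimes I$, where $W_D=[w_{ii}]$ denotes the diagonal matrix of the entries on the main diagonal of $W$. For the sake of consistency, we denote by the boldface $\mathbf{I}$ the identity matrix of size $np$. According to the above definitions, we can write the concatenated version of (11) as follows,

$$\mathbf{x}_{t+1}=\mathbf{x}_t-\varepsilon\Big((\mathbf{I}-\mathbf{W}_D)\,\mathbf{x}_t+(\mathbf{W}_D-\mathbf{W})\,\mathbf{z}_t+\alpha\nabla F(\mathbf{x}_t)\Big). \tag{12}$$

As we discussed in Section II, the distributed consensus optimization Problem (1) can be equivalently written as Problem (2). The constraint in the latter restricts the feasible set to the consensus vectors, that is $\{\mathbf{x}=[x_1;\dots;x_n]: x_1=\dots=x_n\}$. According to the discussion on the rank of the weight matrix $W$, the null space of the matrix $I-W$ is $\operatorname{span}(\mathbf{1})$. Hence, the null space of $\mathbf{I}-\mathbf{W}$ is the set of all consensus vectors, i.e., $\mathbf{x}$ is feasible for Problem (2) if and only if $(\mathbf{I}-\mathbf{W})\mathbf{x}=0$, or equivalently $(\mathbf{I}-\mathbf{W})^{1/2}\mathbf{x}=0$. Therefore, the alternative Problem (2) can be compactly represented as the following linearly-constrained problem,

$$\min_{\mathbf{x}\in\mathbb{R}^{np}} F(\mathbf{x})=\sum_{i=1}^{n}f_i(x_i) \quad\text{subject to}\quad (\mathbf{I}-\mathbf{W})^{1/2}\mathbf{x}=0. \tag{13}$$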

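We denote by $\mathbf{x}^*=[\tilde{x}^*;\dots;\tilde{x}^*]$ the unique solution to (13). Now, for a given penalty parameter $\alpha>0$, one can define the quadratic penalty function corresponding to the linearly constrained problem (13) as follows,

$$h_\alpha(\mathbf{x})=\frac{1}{2}\mathbf{x}^\top(\mathbf{I}-\mathbf{W})\,\mathbf{x}+\alpha F(\mathbf{x}). \tag{14}$$

Since $\mathbf{I}-\mathbf{W}$ is a positive semi-definite matrix and $F$ is $L$-smooth and $\mu$-strongly convex, the function $h_\alpha$ is $L_\alpha$-smooth and $\mu_\alpha$-strongly convex on $\mathbb{R}^{np}$, with $L_\alpha=1-\lambda_n(W)+\alpha L$ and $\mu_\alpha=\alpha\mu$. We denote by $\mathbf{x}^*_\alpha$ the unique minimizer of $h_\alpha$, i.e.,

$$\mathbf{x}^*_\alpha=\operatorname*{argmin}_{\mathbf{x}\in\mathbb{R}^{np}}h_\alpha(\mathbf{x})=\operatorname*{argmin}_{\mathbf{x}\in\mathbb{R}^{np}}\ \frac{1}{2}\mathbf{x}^\top(\mathbf{I}-\mathbf{W})\,\mathbf{x}+\alpha F(\mathbf{x}). \tag{15}$$

In the following, we link the solution of Problem (15) to the local variable iterations provided by Algorithm 1. Specifically, for a sufficiently large number of iterations $T$, we demonstrate that for a proper choice of step-sizes, the expected squared deviation of $\mathbf{x}_T$ from $\mathbf{x}^*_\alpha$ vanishes sub-linearly. This result follows from the fact that the descent direction in (12) is an unbiased estimator of the gradient of the function $h_\alpha$.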
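To spell out this unbiasedness step, here is a short derivation in the notation of (12) and (14), where $\mathbf{z}_t$ denotes the concatenation of the quantized vectors $z_{i,t}$. It uses only the unbiasedness part of Assumption 3, namely $\mathbb{E}[\mathbf{z}_t\mid\mathbf{x}_t]=\mathbf{x}_t$:

$$\begin{aligned}
\mathbb{E}\big[(\mathbf{I}-\mathbf{W}_D)\,\mathbf{x}_t+(\mathbf{W}_D-\mathbf{W})\,\mathbf{z}_t+\alpha\nabla F(\mathbf{x}_t)\,\big|\,\mathbf{x}_t\big]
&=(\mathbf{I}-\mathbf{W}_D)\,\mathbf{x}_t+(\mathbf{W}_D-\mathbf{W})\,\mathbf{x}_t+\alpha\nabla F(\mathbf{x}_t)\\
&=(\mathbf{I}-\mathbf{W})\,\mathbf{x}_t+\alpha\nabla F(\mathbf{x}_t)\\
&=\nabla h_\alpha(\mathbf{x}_t).
\end{aligned}$$

The last equality holds because $\mathbf{I}-\mathbf{W}$ is symmetric, so the gradient of the quadratic term in (14) is $(\mathbf{I}-\mathbf{W})\,\mathbf{x}_t$.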

###### Lemma 1.

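Consider the optimization Problem (15) and suppose Assumptions 1 to 4 hold. Then, the expected squared deviation of the output of QDGD from the solution to Problem (15) is upper bounded by

$$\mathbb{E}\big[\|\mathbf{x}_T-\mathbf{x}^*_\alpha\|^2\big]\le O\!\left(\frac{c_1n\sigma^2\|W-W_D\|^2}{\mu c_2}\cdot\frac{1}{T^{\delta}}\right), \tag{16}$$

for $\varepsilon=c_1/T^{3\delta/2}$, $\alpha=c_2/T^{\delta/2}$, any $\delta\in(0,1/2)$ and $T\ge T_1$, where $c_1$ and $c_2$ are positive constants independent of $T$, and

$$T_1:=\max\left\{e^{e^{\frac{1}{1-2\delta}}},\ \left\lceil(c_1c_2\mu)^{\frac{1}{2\delta}}\right\rceil,\ \left\lceil\left(\frac{c_1(2+c_2L)^2}{c_2\mu}\right)^{\frac{1}{\delta}}\right\rceil\right\}. \tag{17}$$

###### Proof.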

See Appendix A. ∎

Lemma 1 guarantees that the iterates generated by the update in (3) converge to the solution of the penalized Problem (15). Loosely speaking, Lemma 1 ensures that $\mathbf{x}_T$ is close to $\mathbf{x}^*_\alpha$ for large $T$. So, in order to capture the deviation of $\mathbf{x}_T$ from the global optimum $\mathbf{x}^*$, it suffices to show that $\mathbf{x}^*_\alpha$ is close to $\mathbf{x}^*$ as well. As the problem in (15) is a penalized version of the constrained program in (13), the solutions to these two problems should not be significantly different if the penalty coefficient $\alpha$ is small. We formalize this claim in the following lemma.

###### Lemma 2.

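Consider the distributed consensus optimization Problem (1) and the problem defined in (15). If Assumptions 1, 2 and 4 hold, then the difference between the optimal solutions to (13) and its penalized version (15) is bounded above by

$$\|\mathbf{x}^*_\alpha-\mathbf{x}^*\|\le O\!\left(\frac{\sqrt{2n}\,c_2D(3+2L/\mu)}{1-\beta}\cdot\frac{1}{T^{\delta/2}}\right), \tag{18}$$

for $\alpha=c_2/T^{\delta/2}$ and $T\ge T_2$, where $c_2$ is a positive constant independent of $T$, $\delta$ is an arbitrary constant in $(0,1/2)$, and

$$T_2:=\max\left\{\left\lceil\left(\frac{c_2L}{1+\lambda_n(W)}\right)^{\frac{2}{\delta}}\right\rceil,\ \left\lceil c_2^{4}(\mu+L)^{\frac{2}{\delta}}\right\rceil\right\}. \tag{19}$$

###### Proof.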

See Appendix B. ∎

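The result in Lemma 2 shows that if we set the penalty coefficient $\alpha$ small enough, i.e., $\alpha=O(T^{-\delta/2})$, then the distance between the optimal solutions of the constrained problem in (13) and the penalized problem in (15) is of $O(T^{-\delta/2})$. With the main lemmas in place, it is now straightforward to prove the claim of Theorem 1. For the specified step-sizes $\varepsilon$ and $\alpha$ and a large enough number of iterations $T\ge\max\{T_1,T_2\}$, Lemmas 1 and 2 are applicable and we have

$$\mathbb{E}\big[\|\mathbf{x}_T-\mathbf{x}^*\|^2\big]=\mathbb{E}\big[\|\mathbf{x}_T-\mathbf{x}^*_\alpha+\mathbf{x}^*_\alpha-\mathbf{x}^*\|^2\big]\le 2\,\mathbb{E}\big[\|\mathbf{x}_T-\mathbf{x}^*_\alpha\|^2\big]+2\,\|\mathbf{x}^*_\alpha-\mathbf{x}^*\|^2=O\!\left(\frac{1}{T^{\delta}}\right), \tag{20}$$

where we used $\|a+b\|^2\le 2\|a\|^2+2\|b\|^2$ to derive the first inequality; the constants can be found in the proofs of the two lemmas. Since $\mathbb{E}[\|x_{i,T}-\tilde{x}^*\|^2]\le\mathbb{E}[\|\mathbf{x}_T-\mathbf{x}^*\|^2]$ for any $i\in[n]$, the inequality in (20) implies the claim of Theorem 1.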

### IV-B Extension to more quantizers

Based on the condition in Assumption 3, so far we have been considering only unbiased quantizers for which the variance of quantization is bounded by a constant scalar, i.e., $\mathbb{E}[\|Q(x)-x\|^2\mid x]\le\sigma^2$. However, there are widely used representative quantizers where the quantization noise induced on the input is bounded proportionally to the input's magnitude, i.e., $\mathbb{E}[\|Q(x)-x\|^2\mid x]\le\eta^2\|x\|^2$ [alistarh2017qsgd]. This condition is more challenging, since the norms of the iterates are not necessarily bounded and we therefore cannot uniformly bound the variance of the noise induced by quantization. In this subsection, we show that the proposed algorithm converges at the same rate for quantizers satisfying this new assumption. Let us first formally state this assumption.

###### Assumption 5.

The random quantizer $Q(\cdot)$ is unbiased and its variance is proportionally bounded by the input's squared norm, that is,

$$\mathbb{E}[Q(x)\mid x]=x, \quad\text{and}\quad \mathbb{E}\big[\|Q(x)-x\|^2\mid x\big]\le\eta^2\|x\|^2, \tag{21}$$

for a constant $\eta$ and any $x\in\mathbb{R}^p$; and quantizations are carried out independently on distributed nodes.

Before characterizing the convergence properties of the proposed QDGD method under the conditions in Assumption 5, let us review a subset of quantizers that satisfy this condition.

###### Example 2 (Low-precision quantizer).

Consider the low precision quantizer $Q^{LP}:\mathbb{R}^p\to\mathbb{R}^p$ which is defined entry-wise as

$$Q^{LP}_i(x)=\|x\|\cdot\operatorname{sign}(x_i)\cdot\xi_i(x,s), \tag{22}$$

where $\xi_i(x,s)$ is a random variable defined as

$$\xi_i(x,s)=\begin{cases}\dfrac{l}{s} & \text{w.p. } 1-q\!\left(\dfrac{|x_i|}{\|x\|},s\right),\\[6pt] \dfrac{l+1}{s} & \text{w.p. } q\!\left(\dfrac{|x_i|}{\|x\|},s\right),\end{cases} \tag{23}$$

and $q(a,s)=as-l$ for any $a\in[0,1]$. Above, the tuning parameter $s$ corresponds to the number of quantization levels and $l\in[0,s)$ is an integer such that $|x_i|/\|x\|\in[l/s,(l+1)/s]$. It is not hard to check [alistarh2017qsgd] that the low precision quantizer defined in (22) is an unbiased estimator of the vector $x$ and that its variance is bounded above by

$$\mathbb{E}\big[\|Q^{LP}(x)-x\|^2\big]\le\min\!\left(\frac{p}{s^2},\frac{\sqrt{p}}{s}\right)\|x\|^2. \tag{24}$$

The bound in (24) illustrates the trade-off between communication cost and quantization variance. Choosing a large $s$ reduces the variance of quantization at the cost of increasing the number of quantization levels and therefore increasing the communication cost.
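A direct NumPy transcription of (22) and (23) can make the role of the level parameter concrete. This is a sketch only: the function name `qsgd_quantize` is hypothetical, and mapping the zero vector to itself is a convention assumed here because (22) is not defined for a zero norm.

```python
import numpy as np

def qsgd_quantize(x, s, rng):
    """Unbiased s-level quantizer following (22)-(23)."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)          # assumed convention for the zero vector
    a = np.abs(x) / norm                 # |x_i| / ||x||, a value in [0, 1]
    l = np.floor(a * s)                  # lower level index
    q = a * s - l                        # probability of rounding up, q(a, s)
    xi = (l + (rng.random(x.shape) < q)) / s
    return norm * np.sign(x) * xi

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
z = qsgd_quantize(x, s=4, rng=rng)
```

Each transmitted entry is a signed multiple of `norm / s`, so a receiver needs only the norm, the signs, and the integer levels; increasing `s` shrinks the rounding step and hence the variance in (24).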

The following example provides another quantizer which satisfies the conditions in Assumption 5.

###### Example 3 (Gradient sparsifier).

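The gradient sparsifier denoted by $Q^{GS}:\mathbb{R}^p\to\mathbb{R}^p$ is defined entry-wise as

$$Q^{GS}_i(x)=\begin{cases}x_i/q_i & \text{w.p. } q_i,\\ 0 & \text{otherwise},\end{cases} \tag{25}$$

where $q_i$ is the probability that coordinate $i\in[p]$ is selected. It is easy to verify that this quantizer is unbiased, as for each $i$, $\mathbb{E}[Q^{GS}_i(x)]=q_i\cdot(x_i/q_i)=x_i$. Moreover, one can show that the variance of this quantizer is bounded as follows,

$$\mathbb{E}\big[\|Q^{GS}(x)-x\|^2\big]=\sum_{i=1}^{p}\left(\frac{1}{q_i}-1\right)x_i^2\le\left(\frac{1}{q_{\min}}-1\right)\|x\|^2, \tag{26}$$

where $q_{\min}$ denotes the minimum of the probabilities $\{q_1,\dots,q_p\}$.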
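The sparsifier in (25) is short enough to sketch directly. The function name `sparsify` and the choice to pass the selection probabilities as an array `q` with strictly positive entries are assumptions of this illustration.

```python
import numpy as np

def sparsify(x, q, rng):
    """Keep coordinate i with probability q[i] and rescale it by 1/q[i], as in (25)."""
    x = np.asarray(x, dtype=float)
    q = np.asarray(q, dtype=float)       # selection probabilities, assumed in (0, 1]
    keep = rng.random(x.shape) < q
    return np.where(keep, x / q, 0.0)

rng = np.random.default_rng(0)
x = np.array([2.0, -1.0, 0.5, 4.0])
z = sparsify(x, q=np.full(4, 0.5), rng=rng)
```

The rescaling by `1 / q[i]` is what keeps the estimator unbiased; smaller probabilities send fewer coordinates on average but inflate the variance, in line with the factor $1/q_{\min}-1$ in (26).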

In the following theorem, we extend our result in Theorem 1 to the case that variance of quantizer may not be uniformly bounded and is proportional to the squared norm of quantizer’s input.

###### Theorem 2.

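Consider the distributed consensus optimization Problem (1) and suppose Assumptions 1, 2, 4 and 5 hold. Then, for each node $i$, the expected squared difference between the output of the QDGD method outlined in Algorithm 1 and the optimal solution of Problem (1) is upper bounded by

$$\mathbb{E}\big[\|x_{i,T}-\tilde{x}^*\|^2\big]\le O\!\left(\left(\frac{4nc_2^2D^2(3+2L/\mu)^2}{(1-\beta)^2}+\frac{4c_1n\tilde{B}^2\eta^2\|W-W_D\|^2}{\mu c_2}\right)\frac{1}{T^{\delta}}\right), \tag{27}$$

for $\varepsilon=c_1/T^{3\delta/2}$, $\alpha=c_2/T^{\delta/2}$, any $\delta\in(0,1/2)$ and $T\ge\tilde{T}_{\min}$, where $c_1$, $c_2$ and $\tilde{B}$ are positive constants independent of $T$.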

###### Proof.

See Appendix C. ∎

The result in Theorem 2 shows that under Assumption 5, the proposed QDGD method converges to the optimal solution at a sublinear rate of $O(T^{-\delta})$, which matches the result in Theorem 1. However, the lower bound $\tilde{T}_{\min}$ on the total number of iterations for the result in Theorem 2 is in general larger than the lower bound $T_{\min}$ for the result in Theorem 1. The exact expression of $\tilde{T}_{\min}$ can be found in Appendix C.

## V Optimal quantization level for reducing overall communication cost

In this section, we study the trade-off between the number of iterations needed to achieve a target accuracy and the number of quantization levels. By increasing the number of quantization levels, the variance of quantization reduces and the total number of iterations to reach a specific accuracy decreases, but the communication overhead of each round is higher as we have to transmit more bits. Conversely, if we use a quantizer with a small number of levels, the communication cost per iteration will be low; however, the total number of iterations could be very large. The fundamental question here is how to choose the quantization levels to optimize the overall communication cost, which is the product of the number of iterations and the communication cost of each iteration.

In this section, we only focus on unbiased quantizers for which the variance is proportionally bounded by the squared norm of the quantizer's input vector, i.e., for any $x\in\mathbb{R}^p$ it holds that $\mathbb{E}[Q(x)\mid x]=x$ and $\mathbb{E}[\|Q(x)-x\|^2\mid x]\le\eta^2\|x\|^2$ for some fixed constant $\eta$. Theorem 2 characterizes the (order-wise) convergence of the proposed algorithm under this assumption. More precisely, for each node $i$ with the step-size choices in Theorem 2 we can write

$$\mathbb{E}\big[\|x_{i,T}-\tilde{x}^*\|^2\big]\le 2\,\mathbb{E}\big[\|\mathbf{x}_T-\mathbf{x}^*_\alpha\|^2\big]+2\,\|\mathbf{x}^*_\alpha-\mathbf{x}^*\|^2\le 2B_1(T)+2B_2(T)\approx\left[\frac{4nc_2^2D^2(3+2L/\mu)^2}{(1-\beta)^2}+\frac{4c_1n\tilde{B}^2\eta^2\|W-W_D\|^2}{\mu c_2}\right]\frac{1}{T^{\delta}}, \tag{28}$$

where the approximation is due to considering the dominant terms in $B_1(T)$ and $B_2(T)$ (see Appendices B and C for notations and details of derivations). Therefore, given a target relative deviation error $\rho$, the algorithm needs to iterate at least $T(\rho)$ times, where

$$T(\rho):=\Big[\frac{4nc_2^2D^2(3+2L/\mu)^2}{(1-\beta)^2}\ \cdots \tag{29}$$

It is shown in [alistarh2017qsgd] that for the low-precision quantizer defined in (22) and (23) there exists an encoding scheme such that for any $x\in\mathbb{R}^p$ and $s^2+\sqrt{p}\le p/2$, the communication cost of the quantized vector satisfies

$$\mathbb{E}\big[|\mathrm{Code}_s(Q^{LP}(x))|\big]\le b+\left(3+\frac{3}{2}\log^*\!\left(\frac{2(s^2+p)}{s^2+\sqrt{p}}\right)\right)(s^2+\sqrt{p}), \tag{30}$$

where $\log^*(x)=(1+o(1))\log(x)$ and $b$ denotes the number of bits for representing one floating point number ($b\in\{32,64\}$ are typical values). For large $s$, [alistarh2017qsgd] also proposes a simple encoding scheme which is proved to impose no more than the following communication cost on the quantized vector

$$\mathbb{E}\big[|\mathrm{Code}'_s(Q^{LP}(x))|\big]\le b+\left(\frac{5}{2}+\frac{1}{2}\log^*\!\left(1+\frac{s^2+\min(p,s\sqrt{p})}{p}\right)\right)p. \tag{31}$$

Now we can easily derive the expected total communication cost (in bits) of a quantized decentralized consensus optimization in order for each agent to achieve a predefined target error. For instance, assume that the low-precision quantizer described above is employed for the quantization operations. Using this quantizer, the expected communication cost (in bits) for transmitting a single $p$-dimensional real vector is given in (30) and (31) for the two sparsity regimes of the tuning parameter $s$. On the other hand, in order for each agent to obtain a relative error $\rho$, the proposed algorithm iterates $T(\rho)$ times as denoted in (29). Therefore, the total (expected) communication cost across all of the agents is $n\,T(\rho)\,\mathbb{E}[|\mathrm{Code}_s(Q^{LP}(x))|]$ and $n\,T(\rho)\,\mathbb{E}[|\mathrm{Code}'_s(Q^{LP}(x))|]$ for small and large $s$, respectively. In the following, we numerically evaluate the communication cost for the following least squares problem

$$\min_{x\in\mathbb{R}^p}f(x)=\sum_{i=1}^{n}\frac{1}{2}\|A_ix-b_i\|^2. \tag{32}$$

We assume that the network contains agents that collaboratively aim to solve problem (32) over the real field of size . The elements of the random matrices and the solution are picked from the normal distribution . Moreover, we let . All nodes update their local variables with respect to the proposed algorithm and send the quantized updates to the neighbors using a low-precision quantizer with quantization levels and bits for representing one floating point number, until they satisfy the predefined relative error . The underlying graph is an Erdős-Rényi graph with edge probability . The edge weight matrix is picked as where is the Laplacian with as its largest eigenvalue. We also set .

Table V represents the total expected communication cost (in bits, as computed using (29), (30) and (31)) induced by the proposed algorithm to solve (32) using the low-precision quantizer, as described above, for four representative cases. As observed from this table and expected from the theoretical derivations, a larger number of quantization levels translates to less noisy quantization and hence fewer iterations. Also, a larger number of quantization levels induces more communication cost for each transmitted quantized vector, which results in a larger code length per vector. However, the average total communication cost does not necessarily follow a monotonic trend. As Table V shows, the optimal induces the smallest total communication cost among all levels .

## VI Numerical Experiments

In this section, we evaluate the performance of the proposed QDGD algorithm on decentralized quadratic minimization and ridge regression problems and demonstrate the effect of various parameters on the relative expected error rate. We carry out the simulations on artificial and real data sets corresponding to quadratic minimization and ridge regression problems, respectively. In both cases, the graph of agents is a connected Erdős-Rényi graph with edge probability . We set the edge weight matrix to be where is the Laplacian with as its largest eigenvalue.
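As an illustration of how an edge weight matrix can be built from the graph Laplacian and its largest eigenvalue, consider the sketch below. The exact scaling used in the experiments is not reproduced here; the constant `c` is a hypothetical placeholder, and the helper names are assumptions. The adjacency matrix is assumed symmetric with a zero diagonal, and the graph is assumed to have at least one edge.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A from a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def largest_eigenvalue(mat, iters=500):
    """Power iteration with a Rayleigh-quotient estimate (symmetric PSD input)."""
    n = len(mat)
    # Start vector; it must not be orthogonal to the top eigenvector.
    v = [1.0] + [0.0] * (n - 1)
    lam = 0.0
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
        lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(n))
                  for i in range(n))
    return lam


def weight_matrix(adj, c=0.5):
    """W = I - (c / lambda_max) L; c is an illustrative constant in (0, 2)."""
    lap = laplacian(adj)
    lam_max = largest_eigenvalue(lap)
    n = len(adj)
    return [[(1.0 if i == j else 0.0) - c * lap[i][j] / lam_max
             for j in range(n)] for i in range(n)]
```

Because the rows of the Laplacian sum to zero, any matrix of this form is symmetric with rows summing to one, and its eigenvalues lie in the interval from 1 - c to 1.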

### VI-A Decentralized quadratic minimization

In this section, we evaluate the performance of the proposed QDGD algorithm on minimizing a distributed quadratic objective. We illustrate the effect of quantization noise and graph topology on the relative expected error rate. Consider the quadratic optimization problem

$$\min_{x \in \mathbb{R}^p} f(x) = \sum_{i=1}^{n} \left( \frac{1}{2} x^\top A_i x + b_i^\top x \right), \tag{33}$$

where $f_i(x) = \frac{1}{2} x^\top A_i x + b_i^\top x$ denotes the local objective function of node $i$. Setting the gradient of (33) to zero, the unique solution to (33) is therefore $x^* = -\left(\sum_{i=1}^{n} A_i\right)^{-1} \sum_{i=1}^{n} b_i$. We pick diagonal matrices such that of the diagonal entries of each are drawn from the set and the other diagonal entries are drawn from the set , all uniformly at random. Entries of vectors are randomly picked from the interval . In our simulations, we let an additive noise model the quantization error, i.e. where .

We first consider a connected Erdős-Rényi graph of nodes and connectivity probability of and dimension . Fig. 1 shows the convergence rate corresponding to three values of quantization noise and , compared to the theoretical upper bound derived in Theorem 1 in the logarithmic scale. As expected, Fig. 1 shows that the error rate scales linearly with the quantization noise; however, it does not saturate around a non-vanishing residual, regardless of the variance. Moreover, Fig. 1 demonstrates that the convergence rate closely follows the upper bound derived in Theorem 1. For instance, for the plot corresponding to , the relative errors are evaluated as and for and , respectively. Therefore, which is upper bounded by .

To observe the effect of graph topology, we fix the quantization noise variance to and vary the connectivity ratio by picking three different values, i.e. where corresponds to the complete graph case. We also fix the parameter . As Fig. 2 depicts, for the same number of iterations, the deviation from the optimal solution tends to increase as the graph gets sparser. In other words, even noisy information from the neighbor nodes improves the gradient estimate for local nodes. It also highlights the fact that, regardless of the sparsity of the graph, the proposed QDGD algorithm guarantees consensus to the optimal solution for each local node, as long as the graph is connected.
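The experiment described above can be sketched in a few lines for diagonal matrices. Setting the gradient of (33) to zero gives the minimizer $x^* = -(\sum_i A_i)^{-1} \sum_i b_i$, which for diagonal $A_i$ is an elementwise division. The sketch assumes that the QDGD update has the form $x_{i,t+1} = (1 - \varepsilon + \varepsilon w_{ii}) x_{i,t} + \varepsilon \sum_{j \neq i} w_{ij} z_{j,t} - \alpha \varepsilon \nabla f_i(x_{i,t})$ with $z_{j,t} = x_{j,t} + \eta_{j,t}$. The constant step sizes, the zero initialization and the error normalization are illustrative assumptions rather than the settings used for Fig. 1 and Fig. 2.

```python
import random


def minimizer(A, b):
    """x* = -(sum_i A_i)^{-1} sum_i b_i, with each diagonal A_i stored as a list."""
    p = len(b[0])
    return [-sum(bi[k] for bi in b) / sum(Ai[k] for Ai in A) for k in range(p)]


def qdgd(A, b, W, eps, alpha, sigma, T, seed=0):
    """Run T iterations of the assumed QDGD update with additive Gaussian noise."""
    rng = random.Random(seed)
    n, p = len(A), len(b[0])
    x = [[0.0] * p for _ in range(n)]
    for _ in range(T):
        # Each node broadcasts a noisy copy z_j = x_j + eta_j of its iterate.
        z = [[v + rng.gauss(0.0, sigma) for v in xi] for xi in x]
        new_x = []
        for i in range(n):
            row = []
            for k in range(p):
                neigh = sum(W[i][j] * z[j][k] for j in range(n) if j != i)
                grad = A[i][k] * x[i][k] + b[i][k]  # gradient of f_i in (33)
                row.append((1 - eps + eps * W[i][i]) * x[i][k]
                           + eps * neigh - alpha * eps * grad)
            new_x.append(row)
        x = new_x
    return x


def relative_error(x, x_star):
    """Average squared distance to x*, normalized by ||x*||^2 (x_0 = 0 assumed)."""
    num = sum((xi[k] - x_star[k]) ** 2 for xi in x for k in range(len(x_star)))
    return num / (len(x) * sum(v * v for v in x_star))
```

With constant step sizes the iterates settle in a neighborhood of the optimum whose size depends on `alpha` and `sigma`. The vanishing error established in Theorem 1 relies on step sizes that are tuned to the total number of iterations, which this sketch does not attempt to reproduce.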

### VI-B Decentralized ridge regression

Consider the ridge regression problem:

$$\min_{x \in \mathbb{R}^p} f(x) = \sum_{j=1}^{D} \|a_j x - b_j\|^2 + \frac{\lambda}{2} \|x\|_2^2, \tag{34}$$

over the data set where each pair denotes the predictors-response variables corresponding to data point where and is the regularization parameter. To make this problem decentralized, we pick agents and uniformly divide the data set among the agents, i.e., each agent is assigned data points. Therefore, (34) can be decomposed as follows:

$$\min_{x \in \mathbb{R}^p} f(x) = \sum_{i=1}^{n} f_i(x), \tag{35}$$

where the local function corresponding to agent is

$$f_i(x) = \|A_i x - b_i\|^2 + \frac{\lambda}{2n} \|x\|^2, \tag{36}$$

and

$$A_i = [a_{(i-1)d+1}; \cdots; a_{id}] \in \mathbb{R}^{d \times p}, \tag{37}$$
$$b_i = [b_{(i-1)d+1}; \cdots; b_{id}] \in \mathbb{R}^{d}. \tag{38}$$
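The decomposition in (35)-(38) can be checked numerically: splitting the data into contiguous blocks and giving each agent a 1/n share of the regularizer recovers the global objective (34). The sketch below is a minimal illustration with hypothetical helper names; the data points are stored as plain Python lists, and the number of data points is assumed to be divisible by the number of agents.

```python
def ridge_objective(a, b, lam, x):
    """f(x) from (34): sum_j (a_j x - b_j)^2 + (lam / 2) * ||x||^2."""
    fit = sum((sum(aj[k] * x[k] for k in range(len(x))) - bj) ** 2
              for aj, bj in zip(a, b))
    return fit + 0.5 * lam * sum(v * v for v in x)


def local_objectives(a, b, lam, n):
    """Split D data points into n contiguous blocks of d = D / n rows, as in (37)-(38)."""
    D = len(a)
    if D % n != 0:
        raise ValueError("D must be divisible by n")
    d = D // n

    def make(i):
        rows, targets = a[i * d:(i + 1) * d], b[i * d:(i + 1) * d]
        # Each agent carries lam / (2n) of the regularizer, matching (36).
        return lambda x: ridge_objective(rows, targets, lam / n, x)

    return [make(i) for i in range(n)]
```

For any test point `x`, summing the returned local functions should agree with `ridge_objective` evaluated on the full data set up to floating-point rounding.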

The unique solution to (35) is

$$\tilde{x}^* = \left( \sum_{i=1}^{n} A_i^\top A_i + \lambda I \right)^{-1} \left( \sum_{i=1}^{n} A_i^\top b_i \right). \tag{39}$$

To simulate the decentralized ridge regression (35), we pick the “Pen-Based Recognition of Handwritten Digits Data Set” [Dua:2017] and use training samples with features and possible labels corresponding to digits . We pick and consider a connected Erdős-Rényi graph with agents and edge probability , i.e., each agent is assigned data points. The decision variables are quantized according to the low-precision quantizer with quantization level , as described in Example 2. First, we fix and and vary the tuning parameter . Fig. 3 depicts the convergence trend corresponding to two values . Second, to observe the effect of graph density, we let the quantization level be and vary the graph configuration. For , Fig. 4 shows the resulting convergence rates for Erdős-Rényi random graphs with two values of graph connectivity ratio , the complete graph, and the cycle graph.

## VII Conclusion

We proposed the QDGD algorithm to tackle the problem of quantized decentralized consensus optimization. The algorithm updates the local decision variables by combining the quantized messages received from the neighbors with the local information, such that proper averaging is performed over the local decision variable and the neighbors’ quantized vectors. Under customary conditions for quantizers, we proved that the QDGD algorithm achieves a vanishing consensus error in the mean-squared sense, and verified our theoretical results with numerical studies. An interesting future direction is to establish a fundamental trade-off between the convergence rate of quantized consensus algorithms and the communication cost. More precisely, given a target convergence rate, what is the minimum number of bits that one should communicate in decentralized consensus? Another interesting line of research is to develop novel source coding (quantization) schemes that have low computational complexity and are information-theoretically near-optimal, in the sense that they have a small communication load and a fast convergence rate.

## Appendix A Proof of Lemma 1

To prove the claim in Lemma 1, we first prove the following intermediate lemma.

###### Lemma 3.

Consider the non-negative sequence satisfying the inequality

$$e_{t+1} \leq \left( 1 - \frac{a}{T^{2\delta}} \right) e_t + \frac{b}{T^{3\delta}}, \tag{40}$$

for , where and are positive constants, , and is the total number of iterations. Then, after iterations the iterate satisfies

$$e_T \leq \mathcal{O}\left( \frac{b}{a T^{\delta}} \right). \tag{41}$$

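A quick numerical check makes the scale in (41) plausible. Iterating (40) with equality, which is the worst case the inequality allows, drives the sequence toward the fixed point $(b / T^{3\delta}) / (a / T^{2\delta}) = b / (a T^{\delta})$. The values of `a`, `b`, `delta`, `T` and `e0` used with this sketch are illustrative choices, and it assumes $a / T^{2\delta} < 1$ so that the recursion is a contraction.

```python
def iterate_recursion(a, b, delta, T, e0):
    """Iterate (40) with equality for T steps, starting from e0."""
    rate = 1.0 - a / T ** (2 * delta)   # contraction factor in (40)
    drift = b / T ** (3 * delta)        # additive term in (40)
    e = e0
    for _ in range(T):
        e = rate * e + drift
    return e


def lemma_bound(a, b, delta, T):
    """The scale b / (a * T^delta) appearing inside the O(.) of (41)."""
    return b / (a * T ** delta)
```

For instance, comparing `iterate_recursion(1.0, 1.0, 0.25, 10000, 1.0)` with `lemma_bound(1.0, 1.0, 0.25, 10000)` shows the two quantities agreeing closely, since the contribution of the initial value has decayed geometrically by then.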
###### Proof.

Use the expression in (40) for steps and to obtain

 et+1 ≤(1−a