I Introduction
In recent years, network consensus optimization has received increasing attention thanks to its generality and wide applicability. To date, network consensus optimization has found important applications in many scientific and engineering fields, e.g., distributed sensing in wireless sensor networks [1, 2, 3, 4], decentralized machine learning [5, 6], multi-agent robotic systems [7, 8, 9], and smart grids [10, 11], to name just a few. Simply speaking, in a network consensus optimization problem, each node only has access to some component of the global objective function. That is, the global objective function is only partially known at each node. Through communications with local neighbors, all nodes in the network collaborate with each other and try to reach a consensus on an optimal solution, which minimizes the global objective function.
Among various algorithms for solving network consensus optimization problems, one of the most effective methods is the distributed gradient descent (DGD) algorithm, a first-order iterative method developed by Nedic and Ozdaglar approximately a decade ago [12]. The enduring popularity of DGD is primarily due to its implementation simplicity and elegant networking interpretation: in each iteration of DGD, each node performs an update by using a linear combination of a gradient step with respect to its local objective function and a weighted average from its local neighbors (also termed a consensus step). It has been shown that DGD enjoys the same convergence speed as the classical gradient descent method, where denotes the number of iterations [12]. The simplicity and salient features of DGD have further inspired a large number of extensions to various network settings (see Section II for more in-depth discussions).
However, despite its theoretical and engineering appeal, the performance of DGD may not always be satisfactory in practice. This is particularly true for solving a high-dimensional consensus problem over a network with low communication speed. In this case, due to the large amount of data sharing and the communication bottleneck, exchanging full high-dimensional information between neighboring nodes is time-consuming (or even infeasible), which significantly hinders the overall convergence of DGD. To improve the convergence speed, several second-order approaches using Hessian approximation (with respect to the local objective function) have been proposed (see, e.g., [13, 14]). Although these second-order methods converge in fewer iterations (and hence require fewer information exchanges), they require matrix inversion in each iteration, implying a per-iteration complexity for a dimensional problem. Hence, for high-dimensional consensus problems (i.e., large ), low-complexity first-order methods remain preferable in practice.
To address DGD’s limitations in high-dimensional network consensus over low-speed networks, a naturally emerging idea is to compress the information exchanged between nodes. Specifically, by compressing the information in a high-dimensional state space to a smaller set of quantized states, each node can use a codebook to represent the quantized states with a small number of bits. Then, rather than directly transmitting full information, each node can just transmit the small-size codewords, which significantly reduces the communication burden. Moreover, from a cybersecurity standpoint, transmitting compressed information is also very helpful because each node can encrypt its codebook and avoid revealing full information to potential eavesdroppers in the network.
However, with compressed information being adopted in DGD, several fundamental questions immediately arise: i) Will DGD with compressed information exchanges still converge? ii) If the answer to i) is no, could we modify DGD to make it work with compressed information? iii) If the answer to ii) is yes, how fast does this modified DGD method converge? Answering these questions is highly nontrivial, and they constitute the main subjects of this paper. The main contribution of this paper is that we provide concrete answers to all three fundamental questions. Our key results and their significance are summarized as follows:

First, we show that DGD with straightforward compressed information exchange fails to converge because of a nonvanishing accumulated noise term resulting from compression over iterations. This motivates us to develop a noise variance reduction method. To this end, we propose a new idea called “amplified-differential compression DGD” (ADC-DGD), where, instead of directly exchanging compressed estimates of the global optimization variable in DGD, we exchange an amplified version of the state differential between consecutive iterations, hence the name. We show that ADC-DGD effectively diminishes the accumulated noise from compression and induces convergence.
We show that, under any unbiased compression operator, our ADC-DGD method converges at rate to an neighborhood of an optimal solution with a constant step-size . Under diminishing step-sizes, ADC-DGD converges asymptotically at rate to an optimal solution. We note that these convergence rates are the best possible in the sense that they match those of the original DGD without compression. This result is surprising since the information loss due to compression could be large. We also note that the convergence rate of ADC-DGD outperforms those of other existing distributed first-order methods with compression (see Section II for detailed discussions).

Based on the above convergence results of ADC-DGD, we further investigate the impacts of ADC-DGD’s amplifying factor on convergence speed and communication load. Interestingly, we reveal a phase transition phenomenon of the convergence speed with respect to the amplification exponent in ADC-DGD. Specifically, when (sublinear growth of amplification), convergence speed approaches that of DGD as increases. However, as soon as , there is no further convergence speed improvement but network communication load continues to grow. This shows that is a critical point, under which we can trade communication overhead for convergence speed.
Collectively, our results contribute to a growing theoretical foundation of network consensus optimization. The rest of the paper is organized as follows. In Section II, we review related work. In Section III, we introduce the network consensus optimization problem and show that DGD with compressed information exchange fails to converge. In Section IV, we present our ADC-DGD algorithm and its convergence performance analysis. Numerical results are provided in Section V, and Section VI concludes this paper.
II Related Work
In this section, we first provide a quick overview of the historical development of DGD-type algorithms. We then focus on the recent advances in communication-conscious network consensus optimization, including related work that utilizes compression.
1) DGD-Based Algorithms for Network Consensus: Network consensus optimization can trace its roots to the seminal work by Tsitsiklis [15], where the system model and the analysis framework were first developed. As mentioned earlier, a well-known method for solving network consensus optimization is the distributed (sub)gradient descent (DGD) method, which was proposed by Nedic and Ozdaglar in [12]. DGD was recently re-examined in [16] by Yuan et al. using a new Lyapunov technique, which offers further mathematical understanding of its convergence performance. In their follow-up work [17], the convergence behavior of DGD was further analyzed for nonconvex problems. Recently, several DGD variants have been proposed to enhance the convergence performance (e.g., achieving the same convergence rate with constant step-size [18] or even under time-varying network graphs [19]).
2) Communication-Conscious Distributed Optimization: As mentioned earlier, studies have shown that communication costs of DGD could be a major concern in practice. To this end, Chow et al. [20] studied the tradeoff between communication requirements and prescribed accuracy. In [21], Berahas et al. developed an adaptive DGD framework called to balance the costs between communication and computation. Here, the parameter represents the number of consensus steps performed per gradient descent step ( corresponding to the original DGD). The larger the value, the cheaper the communication cost, and vice versa. The work most closely related to ours is by Tang et al. [22], which, to our knowledge, is also the only work in the literature that considers adopting compression in DGD. However, our algorithm differs from [22] in the following key aspects: i) The compression in [22] uses a quantized extrapolation between two successive iterates, which can be viewed as a diminishing step-size strategy. In contrast, our ADC-DGD algorithm uses an amplified differential of two successive iterates. As will be shown later, our algorithm can be interpreted as a variance reduction method; ii) Our convergence rate outperforms that of [22]. The fastest convergence rate of the algorithms in [22] is , while the convergence rate of our ADC-DGD algorithm is ; iii) To reach the best convergence rate in [22], the extrapolation compression algorithm needs to solve a complex equation to obtain an optimal step-size. In contrast, our ADC-DGD algorithm uses the standard sublinearly diminishing step-sizes, which are of much lower complexity and can be easily implemented in practice.
III Network Consensus Optimization and Distributed Gradient Descent
In Section III-A, we first introduce the network consensus optimization problem, followed by the basic version of the DGD method. Then, in Section III-B, we illustrate an example where DGD with directly compressed information fails to converge, which motivates our subsequent ADC-DGD approach in Section IV.
III-A Consensus Optimization over Networks: A Primer
Consider an undirected connected graph , where and are the sets of nodes and links, respectively, with and . Let be some global decision variable to be optimized. Each node has a local objective function (only available to node ). The global objective function is the sum of all local objectives, i.e., . Our goal is to solve the following network-wide optimization problem in a distributed fashion:
(1) 
Problem (1) has a wide range of applications in practice. For example, consider a wireless sensor network, where each sensor node collects some locally monitored temporal data and the nodes collaborate to detect the change-point in the global temporal data. This problem can be formulated as: , where is the CUSUM (cumulative sum control chart) statistic. Note that Problem (1) can be equivalently written in the following consensus form:
Minimize  (2)  
subject to 
where is the local copy of at node . In Problem (2), the constraints enforce that the local copy at each node is equal to those of its neighbors, hence the name consensus. It is well-known [12] that Problem (2) can be reformulated as:
Minimize  (3)  
subject to 
where , denotes the dimensional identity matrix, and the operator denotes the Kronecker product. In (3), is referred to as the consensus matrix and satisfies the following properties:
1) is doubly stochastic: .
2) The sparsity pattern of follows the network topology: for and otherwise.
3) is symmetric and hence it has real eigenvalues.
The doubly stochastic property in 1) ensures that all eigenvalues of are in and exactly one eigenvalue is equal to 1. Hence, it follows from Property 3) that one can sort eigenvalues as Let . Clearly, we have It is shown in [12] that if and only if , . Therefore, Problems (2) and (3) are equivalent.
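For reference, Problems (1)–(3) can be written out in one common notation. The symbols below are assumptions chosen for illustration rather than taken from the text: $N$ nodes, a link set $\mathcal{L}$, a decision variable $x \in \mathbb{R}^P$, local objectives $f_i$, local copies $x_i$, a stacked vector $\mathbf{x}$, and a consensus matrix $W$. This is a sketch of the standard formulation and may differ in detail from the statement in [12].

```latex
\begin{align*}
\text{(1)}\quad & \min_{x \in \mathbb{R}^P} \; f(x) = \sum_{i=1}^{N} f_i(x), \\
\text{(2)}\quad & \min_{x_1,\dots,x_N} \; \sum_{i=1}^{N} f_i(x_i)
   \quad \text{subject to } x_i = x_j \;\; \forall\, (i,j) \in \mathcal{L}, \\
\text{(3)}\quad & \min_{\mathbf{x}} \; \sum_{i=1}^{N} f_i(x_i)
   \quad \text{subject to } (W \otimes I_P)\,\mathbf{x} = \mathbf{x},
   \qquad \mathbf{x} = [x_1^\top, \dots, x_N^\top]^\top .
\end{align*}
```

Under this reading, the constraint in (3) holds exactly when all local copies agree, which is the equivalence between Problems (2) and (3) stated above.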
The equivalent network consensus formulation in Problem (3) motivates the design of the decentralized gradient descent (DGD) method as stated in Algorithm 1:
Algorithm 1: Decentralized Gradient Descent (DGD) [12].
Initialization:
1) Let . Choose initial values for and step-size .
Main Loop:
2) In the th iteration, each node sends its local copy to its neighbors. Also, upon reception of all local copies from its neighbors, each node updates its local copy as follows:
(4)
where is the entry in the th row and th column in , and represent ’s value and step-size in the th iteration, respectively, and .
3) Stop if a desired convergence criterion is met; otherwise, let and go to Step 2.
We can see that the DGD update in (4) consists of a consensus step and a local gradient step, which can be easily implemented in a network. Also, DGD achieves the same convergence rate as in the classical gradient descent method. However, as mentioned in Section I, DGD may not work well for high-dimensional consensus problems in low-speed networks. Hence, we are interested in developing a DGD-type algorithm with compressed information exchanges in this paper. In what follows, we first show that DGD fails to converge if compressed information is directly adopted in the consensus step.
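The two parts of the DGD update can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the standard form of the update (a weighted average of the neighbors' copies minus a step along the local gradient); the function name `dgd`, the two-node consensus matrix, and the quadratic local objectives are hypothetical choices made for this example.

```python
import numpy as np

def dgd(W, grads, x0, step_size, num_iters):
    """Run DGD: x_i <- sum_j W[i, j] * x_j - step_size * grad_i(x_i)."""
    x = np.array(x0, dtype=float)          # shape (N, P): one row per node
    for _ in range(num_iters):
        mixed = W @ x                      # consensus step
        g = np.stack([grads[i](x[i]) for i in range(len(grads))])
        x = mixed - step_size * g          # local gradient step
    return x

# Hypothetical two-node example: f_1(x) = 0.5 (x - 1)^2, f_2(x) = 0.5 (x - 3)^2,
# so the global objective is minimized at x = 2.
W = np.array([[0.5, 0.5], [0.5, 0.5]])
grads = [lambda x: x - 1.0, lambda x: x - 3.0]
x_final = dgd(W, grads, x0=np.zeros((2, 1)), step_size=0.05, num_iters=500)
```

With a constant step-size, the two local copies in this toy problem settle near the minimizer but do not coincide with it exactly, while their average does reach it.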
III-B DGD with Directly Compressed Information Exchange Does Not Converge: A Motivating Example
We first introduce the notion of an unbiased stochastic compression operator, which has been widely used to represent compression in the literature (see, e.g., [20, 23, 21, 24, 25, 26]).
Definition 1 (Unbiased Stochastic Compression Operator).
A stochastic compression operator is unbiased if it satisfies , with and , .
Definition 1 guarantees that the noise caused by the compression has no effect on the mean of the parameter and that its variance is bounded. Many compression operators satisfy the above definition. The following is an example:
Example 1 (The Quantized Compressed Operator [24]).
For the th element of is:
where represents the largest integer smaller than and the probability .
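A stochastic rounding operator of the kind described in Example 1 can be sketched as follows. The rounding-up probability is an assumption here: it is taken to be the fractional part of each entry, which is the choice that makes the operator unbiased. The name `stochastic_round` is hypothetical.

```python
import numpy as np

def stochastic_round(z, rng):
    """Round each entry of z down or up at random so that E[C(z)] = z."""
    z = np.asarray(z, dtype=float)
    low = np.floor(z)
    prob_up = z - low                      # in [0, 1); zero for integer entries
    return low + (rng.random(z.shape) < prob_up)

rng = np.random.default_rng(0)
z = np.array([0.25, -1.5, 3.0])
samples = np.stack([stochastic_round(z, rng) for _ in range(20000)])
empirical_mean = samples.mean(axis=0)      # should be close to z
```

Every output entry is an integer within one unit of the input, so the compression error has bounded variance, in line with Definition 1.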
Now, we consider the convergence of DGD with unbiased stochastic compressions. If local copies are compressed and then directly used in the consensus step in the DGD algorithm, then Eq. (4) in Algorithm 1 can be modified as:
(5) 
which shows that there is a nonvanishing noise term accumulated over iterations, which prevents the DGD algorithm from converging. For example, consider a simple 2-node network with local objectives and . The quantized compressed operator [25] is adopted in DGD. The simulation results are illustrated in Fig. 1, where we can see that DGD fails to converge after 1000 iterations even for such a small-size network consensus problem. This motivates us to pursue a new algorithmic design in Section IV.
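The failure mode can be reproduced qualitatively with a small simulation. This is only a sketch under assumptions: the quadratic local objectives, the consensus matrix, the step-size, and the stochastic rounding operator are hypothetical stand-ins and are not the settings behind Fig. 1.

```python
import numpy as np

def stochastic_round(z, rng):
    """Unbiased rounding: round up with probability equal to the fractional part."""
    low = np.floor(z)
    return low + (rng.random(z.shape) < (z - low))

def dgd_direct_compression(W, b, step_size, num_iters, rng):
    """DGD where the consensus step only sees C(x); f_i(x) = 0.5 * (x - b_i)^2."""
    x = np.zeros_like(b)
    means = []
    for _ in range(num_iters):
        x = W @ stochastic_round(x, rng) - step_size * (x - b)
        means.append(x.mean())
    return np.array(means)

W = np.array([[0.5, 0.5], [0.5, 0.5]])
b = np.array([1.0, 3.0])                   # global minimizer is x = 2
means = dgd_direct_compression(W, b, 0.05, 1000, np.random.default_rng(0))
tail_spread = means[-500:].max() - means[-500:].min()
```

In this toy run the network average keeps jumping in the last 500 iterations instead of settling, because a fresh rounding error of the same order enters the consensus step at every iteration.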
IV Amplified-Differential Distributed Gradient Descent Method (ADC-DGD)
In this section, we first introduce our ADC-DGD algorithm in Section IV-A. Then, we present the main theoretical results and their intuitions in Section IV-B. The proofs for the main results are provided in Section IV-C.
IV-A The ADC-DGD Algorithm
Our ADC-DGD algorithm is stated in Algorithm 2:
Algorithm 2: Amplified-Differential Compression DGD.
Initialization:
1) Let . Let , . Choose initial values for step-size and the amplification exponent . Let , .
Main Loop:
2) In the th iteration, each node sends the compressed amplified-differential to its neighbors. Also, upon collecting all neighbors’ information, each node estimates its neighbors’ (imprecise) values: . Then, each node updates its local value:
(6)
Each node then updates its local differential: .
3) Stop if a desired convergence criterion is met; otherwise, let and go to Step 2.
Several important remarks on Algorithm 2 are in order: i) Compared to the original DGD, each node under ADC-DGD requires additional memory to store the (imprecise) values of its neighbors in the previous iteration: . This additional memory allows the neighbors to transmit only the difference between successive iterations rather than directly. Note that this memory requirement is modest in practice since many computer networks are scale-free (i.e., the node degree distribution follows a power law and hence most nodes have low degrees); ii) Each node sends out a compressed version of the amplified-differential . This information is then de-amplified at the receiving nodes as , which is a noisy version of . Based on the memory of the previous version, each node obtains an estimate of its neighbors’ values , . Clearly, ADC-DGD is more communication-efficient than the original DGD; iii) Once , , are available, the update in (6) follows the same structure as in DGD, which also contains a consensus step and a local gradient step. Therefore, the complexity of ADC-DGD is almost identical to that of the original DGD, which means that ADC-DGD enjoys the same low complexity.
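The steps of Algorithm 2 and the remarks above can be sketched as follows. This is one reading of the prose description, not a verified transcription of the algorithm: it assumes the amplification factor in iteration k is k raised to the amplification exponent, uses stochastic rounding as the compression operator, and uses hypothetical scalar quadratic objectives. All names (`adc_dgd`, `x_tilde`, `gamma`) are illustrative.

```python
import numpy as np

def stochastic_round(z, rng):
    """Unbiased rounding: round up with probability equal to the fractional part."""
    low = np.floor(z)
    return low + (rng.random(z.shape) < (z - low))

def adc_dgd(W, b, step_size, gamma, num_iters, rng):
    """ADC-DGD sketch for f_i(x) = 0.5 * (x - b_i)^2 (one scalar per node)."""
    x = np.zeros_like(b)        # true local values
    x_tilde = np.zeros_like(b)  # imprecise values held by the neighbors
    y = x - x_tilde             # local differentials
    for k in range(1, num_iters + 1):
        amp = float(k) ** gamma              # assumed amplification k**gamma
        d = stochastic_round(amp * y, rng)   # integers that are transmitted
        x_tilde = x_tilde + d / amp          # receivers de-amplify and accumulate
        x_next = W @ x_tilde - step_size * (x - b)  # consensus + gradient step
        y = x_next - x_tilde                 # differential for the next round
        x = x_next
    return x

W = np.array([[0.5, 0.5], [0.5, 0.5]])
b = np.array([1.0, 3.0])                     # global minimizer is x = 2
x_final = adc_dgd(W, b, 0.05, 1.0, 2000, np.random.default_rng(0))
```

In this sketch the imprecise value equals the true value plus a rounding error divided by the amplification factor, so the error injected into the consensus step shrinks as the iteration count grows.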
IV-B Main Convergence Results
Before presenting the convergence results of ADC-DGD, we first state several needed assumptions:
Assumption 1.
The local objective functions satisfy:

(Lower boundedness) There exists an optimal with such that

(Lipschitz continuous gradient) There exists a constant such that .
Assumption 2 (Growth rate at infinity).
If the domain for is unbounded, then there exists a constant such that
where and
Assumption 1 is standard in the convergence analysis of gradient-descent-type algorithms: the first bullet ensures the existence of an optimal solution and the second bullet guarantees the smoothness of the local objectives. Assumption 2 is a technical condition arising from our proofs and guarantees that, at infinity, the growth rate of the objective function is at least faster than linear. We note that Assumption 2 is a mild assumption, which is evidenced by the following lemma (proof details are relegated to Appendix A).
In addition to convex objectives, many nonconvex functions also satisfy Assumption 2, as shown below and in Fig. 2:
Example 2.
(Nonconvex functions satisfying Assumption 2):

with but is smaller than when

with but is smaller than when
Our first key result is on the convergence of local variables to the mean vector across nodes:
Theorem 1.
Let the mean vector at the th iteration be defined as with Under Assumption 1, if is bounded by and the amplifying exponent is then:

For constant step-size , , ;

For diminishing step-size with some , .
Remark 1.
Theorem 1 says that the local copies will converge to the mean vector asymptotically with a diminishing step-size, or stay within a bounded error ball of the mean vector if a constant step-size is adopted.
Our second key convergence result is on the convergence rate of ADC-DGD under constant step-sizes:
Theorem 2 (Constant Step-Size).
Remark 2.
Under the same conditions of Theorem 2, we immediately have that Algorithm 2 has an ergodic convergence rate until reaching the error ball and the fastest rate is
Our third key convergence result is concerned with the convergence rate of ADC-DGD under diminishing step-sizes:
Theorem 3 (Diminishing Step-Sizes).
Remark 3.
In Theorem 3, the exponent for the diminishing rate of step-size is lower bounded (. Thus, the best convergence rate for this algorithm is which is faster than the rate in [22]. We also note that our convergence result is in “Small-O”, which is stronger than conventional “Big-O” convergence results.
Remark 4 (Intuition and Design Rationale of ADC-DGD).
To understand why ADC-DGD converges, we take a closer look at (6) in Algorithm 2, which reveals that:
(8) 
Thanks to the properties of the unbiased stochastic operator (cf. Definition 1), the noise term in the last step of (8) has zero mean and a vanishing variance as gets large. This is in contrast to the accumulated nonvanishing noise term in DGD (cf. Eq. (5)). Eq. (8) also shows that our ADC-DGD algorithm can be interpreted as a variance reduction method. Indeed, our proofs in Section IV-C are based on these intuitions.
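The vanishing-variance argument can be spelled out in a short derivation. The notation is assumed for illustration and is not taken from the text: $x_i^k$ is the local value, $\tilde{x}_i^k$ the imprecise value held by the neighbors, $y_i^k = x_i^k - \tilde{x}_i^{k-1}$ the differential, $k^{\gamma}$ the amplification factor, $C(\cdot)$ the compression operator, $\epsilon_i^k = C(k^{\gamma} y_i^k) - k^{\gamma} y_i^k$ its error, and $\sigma^2$ the variance bound from Definition 1.

```latex
\begin{align*}
\tilde{x}_i^{k}
  &= \tilde{x}_i^{k-1} + \frac{1}{k^{\gamma}}\, C\!\left(k^{\gamma} y_i^{k}\right)
   = \tilde{x}_i^{k-1} + y_i^{k} + \frac{\epsilon_i^{k}}{k^{\gamma}}
   = x_i^{k} + \frac{\epsilon_i^{k}}{k^{\gamma}}, \\
\mathbb{E}\!\left[\frac{\epsilon_i^{k}}{k^{\gamma}}\right] &= 0,
\qquad
\mathbb{E}\!\left[\left\|\frac{\epsilon_i^{k}}{k^{\gamma}}\right\|^{2}\right]
  \le \frac{\sigma^{2}}{k^{2\gamma}} \;\longrightarrow\; 0
  \quad \text{as } k \to \infty .
\end{align*}
```

Under these assumptions, the error seen by the consensus step in iteration $k$ is a single de-amplified compression error rather than a sum of past errors, which is why a larger exponent suppresses the noise faster.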
IV-C Proofs of the Main Theorems
Due to space limitations, in this subsection we outline the key steps of the proofs of Theorems 1–3 and relegate proof details to the appendices. Some appendices provide proof sketches due to the lengths of the proofs.
Step 1): Introducing a Lyapunov Function. Consider the following Lyapunov function, which is also used in [16, 21]:
(9) 
where and so that . The following lemma is from [16], which says that the Lyapunov function has a Lipschitz-continuous gradient.
Lemma 2.
Under Assumption 1, the Lyapunov function has Lipschitz gradient, i.e.
Note that, using the notation , we can compactly rewrite the updating step (6) in Algorithm 2 as follows:
(10) 
where is the parameter in the th iteration, is the vector of imprecise parameters, and and . It can be seen that Eq. (10) is one-step stochastic gradient descent for and the noise term has zero mean and variance with diminishing bound , i.e.,
(11)
(12) 
where follows from the fact that the eigenvalues of are in and
Step 2) Convergence of the Objective Value. Note from (8) that the noise caused by compression is similar to the noise in the standard stochastic gradient descent method (SGD). Hence, we can apply similar analysis techniques from SGD on the iterations of ADC-DGD to obtain the following results:
Theorem 4 (Bounded Gradient).
Theorem 4 shows that with an appropriate step-size and an amplifying exponent, Algorithm 2 converges. But due to the compression noise, the convergence rate is sublinear. To see this, note that and Thus, which implies From Theorem 4, the convergence rate of is also
Step 3) Proving Theorem 1. Note from Algorithm 2 and (10) that the following hold:
(13) 
Eq. (13) characterizes the trajectory of the iterates. Each iterate consists of two parts, one from gradients and the other from noise. Note that in (13), the variance of the accumulated noise is of the form . Next, we prove an interesting lemma for , which is useful in proving Theorem 1.
Lemma 3.
Define where and It follows that
Lemma 3 implies that the negative effect of compression noises can be ignored asymptotically, which induces convergence. With (13), Theorem 4 and Lemma 3, we can finally prove Theorem 1 and the details are relegated to Appendix D.
Step 4) Proving Theorems 2 and 3. With some algebraic derivation, we can show the following fundamental result:
Lemma 4.
IV-D Understanding the Role of the Amplifying Exponent
In our algorithm, the amplifying exponent is a key component to adjust the communication rate. From Theorems 2 and 3, it can be seen that within the larger means the faster convergence. However, since the transmitted value is we can see that a larger leads to a larger which may lead to overflow errors (for example, the ‘int8’ type in MATLAB can only represent integers in [-128, 127]). Hence, it is necessary to guarantee that would not grow too fast. Recalling Eqs. (10) and (6) in Algorithm 2, we have
Under the expectation, the transmitted value is bounded by
From Definition 1, we have that each element of is bounded by From Theorem 4, we have Thus, is bounded by . We state this result in the following proposition:
The insight from Proposition 5 is that with the growth speed for the transmitted value is slower than which is not very fast.
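The overflow concern can be checked numerically before a value is cast to a fixed-width integer type. The sketch below is illustrative only: the helper names are hypothetical, and the amplification factor is assumed to be the iteration index raised to the amplifying exponent.

```python
import numpy as np

def fits_in_dtype(values, dtype):
    """Return True if every entry of `values` lies in the range of integer `dtype`."""
    info = np.iinfo(dtype)
    values = np.asarray(values)
    return bool(np.all((values >= info.min) & (values <= info.max)))

def amplified_differential(y, k, gamma):
    """Amplify the differential y by k**gamma (the value before compression)."""
    return (float(k) ** gamma) * np.asarray(y, dtype=float)

# Hypothetical differential of 0.5 at iteration 100 for two exponents.
slow = amplified_differential([0.5], k=100, gamma=1.0)   # 50.0
fast = amplified_differential([0.5], k=100, gamma=2.0)   # 5000.0
```

Here the slowly amplified value still fits in an 8-bit integer, while the faster growth needs at least 16 bits, which illustrates why the exponent cannot be increased freely.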
V Numerical Results
In this section, we present several numerical experiments to further validate the performance of ADC-DGD.
1) Effect of Compression: First, we compare ADC-DGD with some existing methods to show its convergence rate and communication-efficiency. Consider a four-node network as shown in Fig. 4 with the following global objective function: , where , , , and . It can be seen that is nonconvex, while the rest are convex. The communication consensus matrix used in this experiment is shown in Fig. 4.
In our simulation, we compare our ADC-DGD with the conventional DGD and For we consider two cases: and In ADC-DGD, the amplifying exponent is set to . We use two step-size strategies: 1) constant step-size (i.e., ) and 2) diminishing step-size (i.e., ). We adopt the quantized operator in [24] as the compression operator. After compression, the values are integers. Hence, they can be stored as type ‘int16’, which takes 2 bytes. In contrast, the uncompressed values are stored as type ‘double’, which takes 8 bytes. The convergence results for one trial are illustrated in Fig. 6 and Fig. 6.
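The storage figures above translate directly into a per-iteration payload. The sketch below is a back-of-the-envelope calculation; the function name, the count of directed links, and the scalar dimension are hypothetical and do not describe the network in Fig. 4.

```python
import numpy as np

def bytes_per_iteration(num_directed_links, dim, dtype):
    """Payload bytes sent network-wide in one iteration (headers ignored)."""
    return num_directed_links * dim * np.dtype(dtype).itemsize

# Hypothetical setting: 8 directed links (e.g., a four-node ring), scalar variable.
compressed = bytes_per_iteration(8, 1, np.int16)      # 2 bytes per value
uncompressed = bytes_per_iteration(8, 1, np.float64)  # 8 bytes per value
ratio = uncompressed // compressed
```

Under these assumptions each iteration with 16-bit integers sends a quarter of the bytes needed for double-precision values.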
From the simulations, we can see that: 1) With a fixed step-size, all algorithms converge to an error ball, while the radii of the conventional DGD and ADC-DGD are relatively smaller. This is because with a larger becomes smaller and hence the error ball for becomes larger; 2) By using compression, the convergence process of ADC-DGD is relatively less smooth. But the compression noise does not affect convergence. With the same step-size, the conventional DGD and ADC-DGD have almost the same convergence rate; 3) By using diminishing step-sizes, the convergence speed for ADC-DGD becomes slower. However, the objective value keeps decreasing; 4) By comparing the amount of exchanged information, ADC-DGD with the fixed step-size converges the fastest, using only 2000 bytes. This shows that our algorithm is the most communication-efficient among the compared methods.
2) Effect of the Amplifying Exponent: Next, we show the effect of the amplifying exponent As discussed in Section IV-D, with a small the noise caused by compression could lead to slow convergence. On the other hand, with a large the transmitted value could be too large and cause overflow, especially for the quantized compressed operator. Here, we change using and keep the rest of the parameters the same. For each we repeat the algorithm times and compute the average objective values, as well as the maximum transmitted value from all the nodes in each iteration. The simulation results are shown in Figs. 8 and 8. We can see that, with a larger value, the algorithm converges faster and the curve is smoother, while the transmitted values increase slightly faster. In this example, we can see that strikes a good balance between convergence and maximum transmitted value.
3) Effect of Network Size: The following simulations indicate that our algorithm can be scaled to large-size networks. In our simulation, we consider the ‘circle’ system: each node only connects with two neighboring nodes and forms a circle. For example, Fig. 10 shows a five-node circle. We set to be , , , in our experiment. The local objectives are in the form of In our simulation, are independently randomly generated: and . For each value of , we repeat trials and compute the average gradient norm. The convergence results are shown in Fig. 10. It can be seen that our algorithm works well as the network size increases, demonstrating the scalability of ADC-DGD.
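A consensus matrix for the circle topology can be built and checked against the three properties from Section III-A. The equal weights of one third on each node and its two neighbors are an assumption made for this sketch, since the weights used in the experiment are not stated; the function name is hypothetical.

```python
import numpy as np

def ring_consensus_matrix(n):
    """Consensus matrix for an n-node circle: weight 1/3 on self and each neighbor."""
    if n < 3:
        raise ValueError("a circle needs at least 3 nodes")
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

W = ring_consensus_matrix(5)
# Symmetric matrix, so eigvalsh applies; sort eigenvalues in decreasing order.
eigenvalues = np.sort(np.linalg.eigvalsh(W))[::-1]
```

For the five-node circle, the resulting matrix is symmetric and doubly stochastic, its nonzero pattern matches the links of the circle, and exactly one eigenvalue equals 1.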
VI Conclusion
In this paper, we considered designing communication-efficient network consensus optimization algorithms in networks with slow communication rates. We proposed a new algorithm called amplified-differential compression decentralized gradient descent (ADC-DGD), which is based on compression to reduce communication costs. We investigated the convergence behavior of ADC-DGD on smooth but possibly nonconvex objectives in this work. We showed that: 1) by employing a fixed step-size , ADC-DGD converges with the ergodic rate until reaching an error ball of size with the amplified parameter ; 2) ADC-DGD enjoys the best convergence rate and converges to a stationary point almost surely with diminishing step-sizes. Consensus optimization with compressed information is an important and underexplored area. An interesting future topic is to generalize our ADC-DGD algorithmic framework to analyze cases with local stochastic gradients, which could further lower the implementation complexity of ADC-DGD.
References
 [1] Q. Ling and Z. Tian, “Decentralized sparse signal recovery for compressive sleeping wireless sensor networks,” IEEE Transactions on Signal Processing, vol. 58, no. 7, pp. 3816–3827, 2010.
 [2] J. B. Predd, S. R. Kulkarni, and H. V. Poor, “Distributed learning in wireless sensor networks,” arXiv preprint cs/0503072, 2005.
 [3] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, “Consensus in ad hoc wsns with noisy links—part i: Distributed estimation of deterministic signals,” IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 350–364, 2008.
 [4] F. Zhao, J. Shin, and J. Reich, “Information-driven dynamic sensor collaboration,” IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 61–72, 2002.
 [5] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
 [6] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning,” in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on. IEEE, 2012, pp. 1543–1550.
 [7] Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progress in the study of distributed multi-agent coordination,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 427–438, 2013.
 [8] W. Ren, R. W. Beard, and E. M. Atkins, “Information consensus in multivehicle cooperative control,” IEEE Control Systems, vol. 27, no. 2, pp. 71–82, 2007.
 [9] K. Zhou, S. I. Roumeliotis et al., “Multirobot active target tracking with combinations of relative observations,” IEEE Transactions on Robotics, vol. 27, no. 4, pp. 678–695, 2011.
 [10] G. B. Giannakis, V. Kekatos, N. Gatsis, S.-J. Kim, H. Zhu, and B. F. Wollenberg, “Monitoring and optimization for power grids: A signal processing perspective,” IEEE Signal Processing Magazine, vol. 30, no. 5, pp. 107–128, 2013.
 [11] V. Kekatos and G. B. Giannakis, “Distributed robust power system state estimation,” IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 1617–1626, 2013.
 [12] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
 [13] M. I. Jordan, J. D. Lee, and Y. Yang, “Communication-efficient distributed statistical inference,” arXiv preprint arXiv:1605.07689, 2016.
 [14] J. Wang, M. Kolar, N. Srebro, and T. Zhang, “Efficient distributed learning with sparsity,” arXiv preprint arXiv:1605.07991, 2016.
 [15] J. N. Tsitsiklis, “Problems in decentralized decision making and computation,” Massachusetts Institute of Technology, Laboratory for Information and Decision Systems, Cambridge, Tech. Rep., 1984.
 [16] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
 [17] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” arXiv preprint arXiv:1608.05766, 2016.
 [18] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
 [19] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
 [20] Y.-T. Chow, W. Shi, T. Wu, and W. Yin, “Expander graph and communication-efficient decentralized optimization,” in Signals, Systems and Computers, 2016 50th Asilomar Conference on. IEEE, 2016, pp. 1715–1720.
 [21] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei, “Balancing communication and computation in distributed optimization,” arXiv preprint arXiv:1709.02999, 2017.
 [22] H. Tang, C. Zhang, S. Gan, T. Zhang, and J. Liu, “Decentralization meets quantization,” arXiv preprint arXiv:1803.06443, 2018.
 [23] Y. Zhang, M. J. Wainwright, and J. C. Duchi, “Communication-efficient algorithms for statistical optimization,” in Advances in Neural Information Processing Systems, 2012, pp. 1502–1510.
 [24] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1707–1718.
 [25] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 1508–1518.
 [26] J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” arXiv preprint arXiv:1710.09854, 2017.
 [27] X. Zhang, J. Liu, and Z. Zhu, “Taming convergence for asynchronous stochastic gradient descent with unbounded delay in nonconvex learning,” arXiv preprint arXiv:1805.09470, 2018.
 [28] R. Hannah and W. Yin, “On unbounded delays in asynchronous parallel fixed-point algorithms,” Journal of Scientific Computing, pp. 1–28, 2016.
Appendix A Proof for Lemma 1
Without loss of generality, we prove the case of onedimensional objective. Firstly, we consider reach the minimal at and With the convexity of Consider with and define Consider there exists a constant such that and thus Hence, It is easy to obtain the same result for negative values Therefore, Next, if and Consider the transformation, then is the minimal solution for and also maintains the convexity. From the above, we know that Therefore, Denote we have Consider the following limits:
which implies that Similarly, we can show