# Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients


## 1 Introduction

Considering the massive number of mobile devices, centralized machine learning via cloud computing incurs considerable communication overhead and raises serious privacy concerns. Today, the widespread consensus is that, besides the cloud centers, future machine learning tasks have to be performed starting from the network edge, namely the devices [19, 16]. Typically, distributed learning tasks can be formulated as an optimization problem of the form

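$$\min_{\theta}\ \sum_{m\in\mathcal{M}} f_m(\theta) \quad\text{with}\quad f_m(\theta) := \sum_{n=1}^{N_m} \ell(x_{m,n};\theta) \tag{1}$$

where $\theta\in\mathbb{R}^p$ denotes the parameter to be learned, $\mathcal{M}$ with $|\mathcal{M}| = M$ denotes the set of workers, $x_{m,n}$ represents the $n$-th data vector at worker $m$ (e.g., feature and label), and $N_m$ is the number of data samples at worker $m$. In (1), $\ell(x;\theta)$ denotes the loss associated with $\theta$ and $x$, and $f_m(\theta)$ denotes the aggregated loss corresponding to $\theta$ and all data at worker $m$. For ease of exposition, we also define $f(\theta) := \sum_{m\in\mathcal{M}} f_m(\theta)$ as the overall loss function.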

In the commonly employed worker-server setup, the server collects local gradients from the workers and updates the parameter using a gradient descent (GD) iteration given by

$$\text{GD iteration}\qquad \theta^{k+1} = \theta^k - \alpha \sum_{m\in\mathcal{M}} \nabla f_m(\theta^k) \tag{2}$$

where $\theta^k$ denotes the parameter value at iteration $k$, $\alpha$ is the stepsize, and $\nabla f(\theta^k) = \sum_{m\in\mathcal{M}} \nabla f_m(\theta^k)$ is the aggregated gradient. When the data samples are distributed across workers, each worker computes the corresponding local gradient $\nabla f_m(\theta^k)$ and uploads it to the server. Only when all the local gradients are collected can the server obtain the full gradient and update the parameter. To implement (2), however, the server has to communicate with all workers to obtain fresh gradients $\{\nabla f_m(\theta^k)\}_{m\in\mathcal{M}}$. In several settings, though, communication is much slower than computation [15]. Thus, as the number of workers grows, worker-server communications become the bottleneck [9]. This becomes more challenging when incorporating popular deep learning-based models with high-dimensional parameters, and correspondingly large-scale gradients.
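The GD iteration (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy local losses $f_m(\theta) = \tfrac12\|\theta - c_m\|_2^2$, the centers `centers`, and the helper name `gd_step` are hypothetical and do not come from the paper. The point it makes is that every iteration needs one upload from each worker.

```python
import numpy as np

def gd_step(theta, local_grads, alpha):
    """One server update of (2): theta - alpha * (sum of all local gradients)."""
    # Every worker must upload a fresh gradient before the server can move.
    full_grad = sum(g(theta) for g in local_grads)
    return theta - alpha * full_grad

# Hypothetical toy problem: worker m holds f_m(theta) = 0.5 * ||theta - c_m||^2,
# so its local gradient is theta - c_m and the minimizer is the mean of the c_m.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
local_grads = [lambda th, c=c: th - c for c in centers]

theta = np.zeros(2)
for _ in range(100):
    theta = gd_step(theta, local_grads, alpha=0.1)
```

With three workers, 100 iterations cost 300 uploads; this is the count that the lazy aggregation discussed later tries to reduce.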

### 1.1 Prior art

Communication-efficient distributed learning methods have gained popularity recently [22, 9]. Most popular methods build on simple gradient updates, and are centered around the key idea of gradient compression to save communication, including gradient quantization and sparsification.

Quantization. Quantization aims to compress gradients by limiting the number of bits that represent floating point numbers during communication, and has been successfully applied to several engineering tasks employing wireless sensor networks [21]. In the context of distributed machine learning, a 1-bit binary quantization method has been developed in [24, 5]. Multi-bit quantization schemes have been studied in [2, 18], where an adjustable quantization level can endow additional flexibility to control the tradeoff between the per-iteration communication cost and the convergence rate. Other variants of quantized gradient schemes include error compensation [32], variance-reduced quantization [34], quantization to a ternary vector [31], and quantization of gradient difference [20].

Sparsification. Sparsification amounts to transmitting only gradient coordinates with large enough magnitudes exceeding a certain threshold [27]. Empirically, the desired accuracy can be attained even after dropping 99% of the gradients [1]. To avoid losing information, small gradient components are accumulated and then applied when they are large enough [17]. The accumulated gradient offers variance reduction of the sparsified stochastic (S)GD iterates [26, 11]. With its impressive empirical performance granted, except recent efforts [3], deterministic sparsification schemes lack performance analysis guarantees. However, randomized counterparts that come with the so-termed unbiased sparsification have been developed to offer convergence guarantees [30, 28].

Quantization and sparsification have also been employed simultaneously [12, 8, 13]. Nevertheless, they both introduce noise to (S)GD updates, and thus deteriorate convergence in general. For problems with strongly convex losses, gradient compression algorithms either converge to a neighborhood of the optimal solution, or converge at a sublinear rate. The exception is [18], where the first linear convergence rate has been established for quantized gradient-based approaches. However, [18] only focuses on reducing the required bits per communication, but not the total number of rounds. Nevertheless, for exchanging messages, e.g., the $p$-dimensional $\theta$ or its gradient, other latencies (initiating communication links, queueing, and propagating the message) are at least comparable to the message size-dependent transmission latency [23]. This motivates reducing the number of communication rounds, sometimes even more so than the bits per round.

Distinct from the aforementioned gradient compression schemes, communication-efficient schemes that aim to reduce the number of communication rounds have been developed by leveraging higher-order information [25, 36], periodic aggregation [35, 19, 33], and recently by adaptive aggregation [6, 29, 10]; see also [4] for a lower bound on communication rounds. However, whether we can save communication bits and rounds simultaneously without sacrificing the desired convergence properties remains unresolved. This paper aims to address this issue.

### 1.2 Our contributions

Before introducing our approach, we revisit the canonical form of popular quantized (Q) GD methods [24]-[20] in the simple setup of (1) with one server and $M$ workers:

$$\text{QGD iteration}\qquad \theta^{k+1} = \theta^k - \alpha \sum_{m\in\mathcal{M}} Q_m(\theta^k) \tag{3}$$

where $Q_m(\theta^k)$ is the quantized gradient that coarsely approximates the local gradient $\nabla f_m(\theta^k)$. While the exact quantization scheme differs across algorithms, transmitting $Q_m(\theta^k)$ generally requires fewer bits than transmitting $\nabla f_m(\theta^k)$. Similar to GD, however, only when all the local quantized gradients $\{Q_m(\theta^k)\}$ are collected can the server update the parameter $\theta$.

In this context, the present paper puts forth a quantized gradient innovation method (as simple as QGD) that can skip communication in certain rounds. Specifically, in contrast to the server-to-worker downlink communication that can be performed simultaneously (e.g., by broadcasting $\theta^k$), the server has to receive the workers' gradients sequentially to avoid interference from other workers, which leads to extra latency. For this reason, our focus here is on reducing the number of worker-to-server uplink communications, which we will also refer to as uploads. Our algorithm Lazily Aggregated Quantized gradient descent (LAQ) resembles (3), and it is given by

$$\text{LAQ iteration}\qquad \theta^{k+1} = \theta^k - \alpha \nabla^k \quad\text{with}\quad \nabla^k = \nabla^{k-1} + \sum_{m\in\mathcal{M}^k} \delta Q_m^k \tag{4}$$

where $\nabla^k$ is an approximate aggregated gradient that summarizes the parameter change at iteration $k$, and $\delta Q_m^k := Q_m(\theta^k) - Q_m(\hat\theta_m^{k-1})$ is the difference between two quantized gradients of $f_m$ at the current iterate $\theta^k$ and the old copy $\hat\theta_m^{k-1}$. With a judicious selection criterion that will be introduced later, $\mathcal{M}^k$ denotes the subset of workers whose local $\delta Q_m^k$ is uploaded in iteration $k$, while parameter iterates are given by $\hat\theta_m^k := \theta^k$, $\forall m\in\mathcal{M}^k$, and $\hat\theta_m^k := \hat\theta_m^{k-1}$, $\forall m\notin\mathcal{M}^k$. Instead of requesting a fresh quantized gradient from every worker as in (3), the trick is to obtain $\nabla^k$ by refining the previous aggregated gradient $\nabla^{k-1}$; that is, using only the new gradients from the selected workers in $\mathcal{M}^k$, while reusing the outdated gradients from the rest of the workers. If $\nabla^{k-1}$ is stored in the server, this simple modification scales down the per-iteration communication rounds from QGD's $M$ to LAQ's $|\mathcal{M}^k|$. Throughout the paper, one round of communication means one worker's upload.

Compared to the existing quantization schemes, LAQ first quantizes the gradient innovation — the difference of current gradient and previous quantized gradient, and then skips the gradient communication — if the gradient innovation of a worker is not large enough, the communication of this worker is skipped. We will rigorously establish that LAQ achieves the same linear convergence as GD under the strongly convex assumption of the loss function. Numerical tests will demonstrate that our approach outperforms existing methods in terms of both communication bits and rounds.

Notation. Bold lowercase letters denote column vectors; $\|x\|_2$ and $\|x\|_\infty$ denote the $\ell_2$-norm and $\ell_\infty$-norm of $x$, respectively; $[x]_i$ represents the $i$-th entry of $x$; $\lfloor a \rfloor$ denotes downward rounding of $a$; and $|\cdot|$ denotes the cardinality of a set or vector.

## 2 LAQ: Lazily aggregated quantized gradient

To reduce the communication overhead, two complementary stages are integrated in our algorithm design: 1) gradient innovation-based quantization; and 2) gradient innovation-based uploading or aggregation — giving the name Lazily Aggregated Quantized gradient (LAQ). The former reduces the number of bits per upload, while the latter cuts down the number of uploads, which together guarantee parsimonious communication. This section explains the principles of our two-stage design.

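Quantization limits the number of bits to represent a gradient vector during communication. Suppose we use $b$ bits to quantize each coordinate of the gradient vector, in contrast to $32$ bits as in most computers. With $Q$ denoting the quantization operator, the quantized gradient for worker $m$ at iteration $k$ is $Q_m(\theta^k) = Q\big(\nabla f_m(\theta^k), Q_m(\hat\theta_m^{k-1})\big)$, which depends on the gradient $\nabla f_m(\theta^k)$ and the previous quantization $Q_m(\hat\theta_m^{k-1})$. The gradient is element-wise quantized by projecting to the closest point in a uniformly discretized grid. The grid is a $p$-dimensional hypercube which is centered at $Q_m(\hat\theta_m^{k-1})$ with the radius $R_m^k = \|\nabla f_m(\theta^k) - Q_m(\hat\theta_m^{k-1})\|_\infty$. With $\tau := 1/(2^b - 1)$ defining the quantization granularity, the gradient innovation $\nabla f_m(\theta^k) - Q_m(\hat\theta_m^{k-1})$ can be quantized by $b$ bits per coordinate at worker $m$ as:

$$[q_m(\theta^k)]_i = \left\lfloor \frac{[\nabla f_m(\theta^k)]_i - [Q_m(\hat\theta_m^{k-1})]_i + R_m^k}{2\tau R_m^k} + \frac{1}{2} \right\rfloor, \quad i = 1,\cdots,p \tag{5}$$

which is an integer within $[0, 2^b - 1]$, and thus can be encoded by $b$ bits. Note that adding $R_m^k$ in the numerator ensures the non-negativity of $[q_m(\theta^k)]_i$, and adding $1/2$ in (5) guarantees rounding to the closest point. Hence, the quantized gradient innovation at worker $m$ is (with $\mathbf{1} := [1 \cdots 1]^\top$)

$$\delta Q_m^k = Q_m(\theta^k) - Q_m(\hat\theta_m^{k-1}) = 2\tau R_m^k\, q_m(\theta^k) - R_m^k \mathbf{1}: \quad \text{transmit } R_m^k \text{ and } q_m(\theta^k) \tag{6}$$

which can be transmitted by $32 + bp$ bits ($32$ bits for $R_m^k$ and $bp$ bits for $q_m(\theta^k)$) instead of the original $32p$ bits. With the outdated gradients $Q_m(\hat\theta_m^{k-1})$ stored in the memory and known a priori, after receiving $\delta Q_m^k$ the server can recover the quantized gradient as $Q_m(\theta^k) = Q_m(\hat\theta_m^{k-1}) + \delta Q_m^k$.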
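The quantizer in (5) and (6) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `quantize_innovation`, the guard for a zero radius, and the clipping against floating-point round-off are assumptions added here.

```python
import numpy as np

def quantize_innovation(grad, q_prev, b):
    """Quantize grad relative to the previous quantized gradient q_prev.

    Returns (R, q, Q_new): the radius, the integer codes in [0, 2**b - 1],
    and the quantized gradient that the server can reconstruct.
    """
    tau = 1.0 / (2**b - 1)                    # quantization granularity
    R = np.max(np.abs(grad - q_prev))         # infinity-norm radius of the grid
    if R == 0.0:                              # assumed guard: nothing to send
        return R, np.zeros(grad.shape, dtype=np.int64), q_prev.copy()
    # Eq. (5): shift into [0, 2R], scale to [0, 2**b - 1], round to nearest.
    q = np.floor((grad - q_prev + R) / (2 * tau * R) + 0.5).astype(np.int64)
    q = np.clip(q, 0, 2**b - 1)               # assumed guard against round-off
    delta_Q = 2 * tau * R * q - R             # eq. (6): quantized innovation
    return R, q, q_prev + delta_Q

rng = np.random.default_rng(0)
grad = rng.normal(size=8)                     # hypothetical local gradient
q_prev = np.zeros(8)                          # hypothetical previous quantization
R, q, Q_new = quantize_innovation(grad, q_prev, b=3)
```

Only `R` and `q` need to be uploaded. Assuming `R` is sent as a 32-bit float, the example with 8 coordinates and 3 bits per coordinate costs 32 + 3·8 = 56 bits instead of 32·8 = 256 bits for the raw gradient.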

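Figure 1 gives an example for quantizing one coordinate of the gradient with $b$ bits. The original value is quantized to one of $2^b$ values, each of which covers a range of length $2\tau R_m^k$ centered at itself. With $\varepsilon_m^k := \nabla f_m(\theta^k) - Q_m(\theta^k)$ denoting the local quantization error, it is clear that the quantization error is no more than half of the length of the range that each value covers, namely, $\|\varepsilon_m^k\|_\infty \le \tau R_m^k$. The aggregated quantized gradient is $Q(\theta^k) = \sum_{m\in\mathcal{M}} Q_m(\theta^k)$, and the aggregated quantization error is $\varepsilon^k := \nabla f(\theta^k) - Q(\theta^k) = \sum_{m\in\mathcal{M}} \varepsilon_m^k$; that is, $Q(\theta^k) = \nabla f(\theta^k) - \varepsilon^k$.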

The idea of lazy gradient aggregation is that if the difference of two consecutive locally quantized gradients is small, it is safe to skip the redundant gradient upload, and reuse the previous one at the server. In addition, we also ensure the server has a relatively "fresh" gradient for each worker by enforcing communication if any worker has not uploaded during the last $\bar t$ rounds. We set a clock $t_m$ for worker $m$ counting the number of iterations since the last time it uploaded information. Equipped with the quantization and selection, our LAQ update takes the form of (4).

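Now it only remains to design the selection criterion to decide which workers upload the quantized gradient or its innovation. We propose the following communication criterion: worker $m$ skips the upload at iteration $k$ if it satisfies

$$\|Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\|_2^2 \le \frac{1}{\alpha^2 M^2} \sum_{d=1}^{D} \xi_d \|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3\big(\|\varepsilon_m^k\|_2^2 + \|\hat\varepsilon_m^{k-1}\|_2^2\big); \tag{7a}$$

$$t_m \le \bar t \tag{7b}$$

where $D$ and $\{\xi_d\}_{d=1}^{D}$ are predetermined constants, $\varepsilon_m^k$ is the current quantization error, and $\hat\varepsilon_m^{k-1} := \nabla f_m(\hat\theta_m^{k-1}) - Q_m(\hat\theta_m^{k-1})$ is the error of the last uploaded quantized gradient. In the next section we will prove the convergence and communication properties of LAQ under criterion (7).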
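The worker-side test in (7a) and (7b) can be written as a small predicate. The sketch below is illustrative only: the function name `should_skip`, the argument names, and the convention that the parameter history is stored most recent first are assumptions made here.

```python
import numpy as np

def should_skip(Q_old, Q_new, theta_hist, xi, alpha, M,
                err_new_sq, err_old_sq, t_m, t_bar):
    """Return True if a worker may skip its upload under (7a) and (7b).

    theta_hist: [theta^k, theta^{k-1}, ..., theta^{k-D}] (most recent first).
    xi:         weights xi_1, ..., xi_D.
    err_new_sq: squared l2 norm of the current quantization error.
    err_old_sq: squared l2 norm of the error of the last uploaded quantization.
    """
    D = len(xi)
    # Weighted sum of recent squared parameter changes; xi[d] is xi_{d+1}.
    drift = sum(xi[d] * np.sum((theta_hist[d] - theta_hist[d + 1]) ** 2)
                for d in range(D))
    rhs = drift / (alpha**2 * M**2) + 3.0 * (err_new_sq + err_old_sq)
    lhs = np.sum((Q_old - Q_new) ** 2)
    return bool(lhs <= rhs and t_m <= t_bar)

# Hypothetical one-dimensional example with D = 1 and a single worker.
hist = [np.array([1.0]), np.array([0.0])]
small = should_skip(np.array([0.0]), np.array([0.5]), hist, [1.0], 1.0, 1,
                    0.0, 0.0, t_m=1, t_bar=10)
large = should_skip(np.array([0.0]), np.array([2.0]), hist, [1.0], 1.0, 1,
                    0.0, 0.0, t_m=1, t_bar=10)
stale = should_skip(np.array([0.0]), np.array([0.5]), hist, [1.0], 1.0, 1,
                    0.0, 0.0, t_m=11, t_bar=10)
```

In this example the small innovation is skipped, the large one is uploaded, and the third call is forced to upload because the clock exceeds the limit even though the innovation is small.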

### 2.3 LAQ algorithm development

In summary, as illustrated in Figure 2, LAQ can be implemented as follows. At iteration $k$, the server broadcasts the learning parameter to all workers. Each worker calculates the gradient, and then quantizes it to judge whether it needs to upload the quantized gradient innovation $\delta Q_m^k$. Then the server updates the learning parameter after it receives the gradient innovation from the selected workers. The algorithm is summarized in Algorithm 2.
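The steps above can be put together in a compact end-to-end sketch. Everything specific in it is an assumption made for illustration: the toy quadratic losses $f_m(\theta) = \tfrac{a_m}{2}\|\theta - c_m\|_2^2$, the constants, the helper names `quantize` and `laq`, the clock convention, and the forced upload at the first iteration. It is not the authors' reference implementation.

```python
import numpy as np

def quantize(grad, q_prev, b):
    """Innovation-based quantizer, following (5) and (6)."""
    tau = 1.0 / (2**b - 1)
    R = np.max(np.abs(grad - q_prev))
    if R == 0.0:                                   # assumed guard
        return q_prev.copy()
    q = np.floor((grad - q_prev + R) / (2 * tau * R) + 0.5)
    return q_prev + 2 * tau * R * q - R

def laq(grads, theta0, alpha, xi, b, t_bar, num_iters):
    M, D = len(grads), len(xi)
    theta = theta0.copy()
    hist = [theta.copy() for _ in range(D + 1)]    # theta^k, ..., theta^{k-D}
    Q_prev = [np.zeros_like(theta) for _ in range(M)]  # last uploaded Q_m
    err_prev = [0.0] * M                           # squared error of last upload
    clock = [t_bar + 1] * M                        # forces an upload at k = 0
    agg = np.zeros_like(theta)                     # server's aggregated gradient
    uploads = 0
    for _ in range(num_iters):
        drift = sum(xi[d] * np.sum((hist[d] - hist[d + 1]) ** 2) for d in range(D))
        for m in range(M):
            g = grads[m](theta)
            Q_new = quantize(g, Q_prev[m], b)
            err_new = np.sum((g - Q_new) ** 2)
            rhs = drift / (alpha**2 * M**2) + 3.0 * (err_new + err_prev[m])
            skip = np.sum((Q_prev[m] - Q_new) ** 2) <= rhs and clock[m] <= t_bar
            if skip:
                clock[m] += 1                      # server reuses the stale Q_m
            else:
                agg += Q_new - Q_prev[m]           # server adds the innovation
                Q_prev[m], err_prev[m], clock[m] = Q_new, err_new, 1
                uploads += 1
        theta = theta - alpha * agg                # update (4)
        hist = [theta.copy()] + hist[:-1]
    return theta, uploads

# Hypothetical toy problem with heterogeneous curvatures a_m.
a_vals = [0.1, 1.0, 2.0]
centers = [np.array([1.0, -1.0]), np.array([2.0, 0.0]), np.array([0.0, 3.0])]
grads = [lambda th, a=a, c=c: a * (th - c) for a, c in zip(a_vals, centers)]
theta_star = sum(a * c for a, c in zip(a_vals, centers)) / sum(a_vals)

num_iters = 1000
theta, uploads = laq(grads, np.zeros(2), alpha=0.02, xi=[1.0 / 160] * 5,
                     b=4, t_bar=20, num_iters=num_iters)
```

GD on the same problem would use `3 * num_iters` uploads. Here the worker with the smallest curvature can skip several consecutive rounds, so `uploads` is smaller, while `theta` still approaches `theta_star`.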

To make the difference between LAQ and GD clear, we re-write (4) as:

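$$\theta^{k+1} = \theta^k - \alpha\Big[Q(\theta^k) + \sum_{m\in\mathcal{M}_c^k} \big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big] \tag{8a}$$

$$\phantom{\theta^{k+1}} = \theta^k - \alpha\Big[\nabla f(\theta^k) - \varepsilon^k + \sum_{m\in\mathcal{M}_c^k} \big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big] \tag{8b}$$

where $\mathcal{M}_c^k := \mathcal{M} \setminus \mathcal{M}^k$ is the subset of workers which skip communication with the server at iteration $k$. Compared with the GD iteration in (2), the gradient employed here degrades due to the quantization error $\varepsilon^k$ and the missed gradient innovation $\sum_{m\in\mathcal{M}_c^k} \big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)$. It is clear that if a large enough number of bits is used to quantize the gradient, and all $\{\xi_d\}_{d=1}^{D}$ are set to 0 so that $\mathcal{M}_c^k = \emptyset$, then LAQ reduces to GD. Thus, adjusting $b$ and $\{\xi_d\}_{d=1}^{D}$ directly influences the performance of LAQ.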

The rationale behind selection criterion (7) lies in the judicious comparison between the descent amount of GD and that of LAQ. To compare the descent amount, we first establish the one step descent amount of both algorithms. For all the results in this paper, the following assumption holds.

###### Assumption 1.

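The local gradient $\nabla f_m(\cdot)$ is $L_m$-Lipschitz continuous and the global gradient $\nabla f(\cdot)$ is $L$-Lipschitz continuous, i.e., there exist constants $L_m$ and $L$ such that

$$\|\nabla f_m(\theta_1) - \nabla f_m(\theta_2)\|_2 \le L_m \|\theta_1 - \theta_2\|_2, \quad \forall \theta_1, \theta_2; \tag{9a}$$

$$\|\nabla f(\theta_1) - \nabla f(\theta_2)\|_2 \le L \|\theta_1 - \theta_2\|_2, \quad \forall \theta_1, \theta_2. \tag{9b}$$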

Building upon Assumption 1, the next lemma describes the descent in objective by GD.

###### Lemma 1.

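The gradient descent update yields the following descent:

$$f(\theta^{k+1}) - f(\theta^k) \le \Delta_{\text{GD}}^k \tag{10}$$

where $\Delta_{\text{GD}}^k := -\big(1 - \tfrac{\alpha L}{2}\big)\alpha \|\nabla f(\theta^k)\|_2^2$.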

The descent of LAQ differs from that of GD due to the quantization and selection, as specified in the following lemma.

###### Lemma 2.

The LAQ update yields the following descent:

$$f(\theta^{k+1}) - f(\theta^k) \le \Delta_{\text{LAQ}}^k + \alpha \|\varepsilon^k\|_2^2 \tag{11}$$

where $\Delta_{\text{LAQ}}^k := -\tfrac{\alpha}{2}\|\nabla f(\theta^k)\|_2^2 + \alpha \Big\|\sum_{m\in\mathcal{M}_c^k} \big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\|_2^2 + \big(\tfrac{L}{2} - \tfrac{1}{2\alpha}\big)\|\theta^{k+1} - \theta^k\|_2^2$.

In lazy aggregation, we consider only $\Delta_{\text{LAQ}}^k$, with the quantization error in (11) ignored. A rigorous theorem on the properties of LAQ that takes the quantization error into account will be established in the next section.

The following part gives the intuition behind criterion (7a); the argument is not mathematically rigorous. The lazy aggregation mechanism selects the quantized gradient innovation by judging its contribution to decreasing the loss function. LAQ is expected to be more communication-efficient than GD, that is, each upload results in more descent, which translates to:

$$\frac{\Delta_{\text{LAQ}}^k}{|\mathcal{M}^k|} \le \frac{\Delta_{\text{GD}}^k}{M} \tag{12}$$

which is tantamount to (see the derivations in the supplementary materials)

$$\|Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\|_2^2 \le \|\nabla f(\theta^k)\|_2^2 / (2M^2), \quad \forall m \in \mathcal{M}_c^k. \tag{13}$$

However, it is impossible for each worker to check (13) locally, because the fully aggregated gradient $\nabla f(\theta^k)$ is required, which is exactly what we want to avoid. Moreover, it does not make sense to reduce uploads if the fully aggregated gradient has already been obtained. Therefore, we bypass directly calculating $\|\nabla f(\theta^k)\|_2^2$ by using the approximation below.

$$\|\nabla f(\theta^k)\|_2^2 \approx \frac{2}{\alpha^2} \sum_{d=1}^{D} \xi_d \|\theta^{k+1-d} - \theta^{k-d}\|_2^2 \tag{14}$$

where $\{\xi_d\}_{d=1}^{D}$ are constants. The fundamental reason why (14) holds is that $\nabla f(\theta^k)$ can be approximated by weighted previous gradients or parameter differences, since $f(\cdot)$ is $L$-smooth. Combining (13) and (14) leads to our communication criterion (7a) with the quantization error ignored.
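The combination step can be spelled out in one line. Substituting the approximation of the squared gradient norm into the per-worker bound, the factor $2$ cancels and the right-hand side of (7a) appears without its quantization error terms (a heuristic chain, since one step is an approximation rather than an inequality):

$$\|Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\|_2^2 \;\le\; \frac{1}{2M^2}\,\|\nabla f(\theta^k)\|_2^2 \;\approx\; \frac{1}{2M^2}\cdot\frac{2}{\alpha^2}\sum_{d=1}^{D}\xi_d\|\theta^{k+1-d}-\theta^{k-d}\|_2^2 \;=\; \frac{1}{\alpha^2M^2}\sum_{d=1}^{D}\xi_d\|\theta^{k+1-d}-\theta^{k-d}\|_2^2 .$$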

We conclude this section by a comparison between LAQ and error-feedback (quantized) schemes.

Comparison with error-feedback schemes. Our LAQ approach is related to the error-feedback schemes, e.g., [24, 27, 32, 3, 26, 11]. Both lines of approaches accumulate either errors or delayed innovation incurred by communication reduction (e.g., quantization, sparsification, or skipping), and upload them in the next communication round. However, the error-feedback schemes skip communicating certain entries of the gradient, yet communicate with all workers. LAQ skips communicating with certain workers, but communicates all (quantized) entries. The two methods are not mutually exclusive, and can be used jointly.

## 3 Convergence and communication analysis

Our subsequent convergence analysis of LAQ relies on the following assumption on $f(\cdot)$:

###### Assumption 2.

The function $f(\cdot)$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that

$$f(\theta_1) - f(\theta_2) \ge \langle \nabla f(\theta_2), \theta_1 - \theta_2 \rangle + \frac{\mu}{2}\|\theta_1 - \theta_2\|_2^2, \quad \forall \theta_1, \theta_2. \tag{15}$$

With $\theta^*$ denoting the optimal solution of (1), we define the Lyapunov function of LAQ as:

$$V(\theta^k) = f(\theta^k) - f(\theta^*) + \sum_{d=1}^{D} \sum_{j=d}^{D} \frac{\xi_j}{\alpha} \|\theta^{k+1-d} - \theta^{k-d}\|_2^2 \tag{16}$$

The design of the Lyapunov function is coupled with the communication rule (7a), which contains the parameter difference term. Intuitively, if no communication is skipped at the current iteration, LAQ behaves like GD and decreases the objective residual in $V(\theta^k)$; if certain uploads are skipped, LAQ's rule (7a) guarantees that the error of using stale gradients is comparable to the parameter difference in $V(\theta^k)$, which ensures its descent. The following lemma captures the progress of the Lyapunov function.

###### Lemma 3.

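Under Assumptions 1 and 2, if the stepsize $\alpha$ and the parameters $\{\xi_d\}_{d=1}^{D}$ are selected as (with any $0 < \rho_1 < 1$ and $\rho_2 > 0$)

$$\sum_{d=1}^{D} \xi_d \le \min\left\{\frac{1-\rho_1}{4(1+\rho_2)},\ \frac{1}{2(1+\rho_2^{-1})}\right\} \tag{17a}$$

$$\alpha \le \min\left\{\frac{2}{L}\left(\frac{1-\rho_1}{4(1+\rho_2)} - \sum_{d=1}^{D} \xi_d\right),\ \frac{2}{L}\left(\frac{1}{2(1+\rho_2^{-1})} - \sum_{d=1}^{D} \xi_d\right)\right\} \tag{17b}$$

then the Lyapunov function follows

$$V(\theta^{k+1}) \le \sigma_1 V(\theta^k) + B\Big[\|\varepsilon^k\|_2^2 + \sum_{m\in\mathcal{M}_c^k} \big(\|\varepsilon_m^k\|_2^2 + \|\hat\varepsilon_m^{k-1}\|_2^2\big)\Big] \tag{18}$$

where constants $\sigma_1$ and $B$ depend on $\alpha$ and $\{\xi_d\}_{d=1}^{D}$; see details in supplementary materials.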

For the tight analysis, (17) appear to be involved, but it admits simple choices. For example, when we choose  and , respectively, then  and  satisfy (17).
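Whether a candidate choice is admissible under (17) is easy to check numerically. The helper below is a sketch: the name `step_size_bound` and the example values are hypothetical and are not the choices made in the paper.

```python
def step_size_bound(xi, L, rho1, rho2):
    """Largest stepsize allowed by (17b), or None if (17a) is violated."""
    s = sum(xi)
    c1 = (1.0 - rho1) / (4.0 * (1.0 + rho2))     # first term inside the min
    c2 = 1.0 / (2.0 * (1.0 + 1.0 / rho2))        # second term inside the min
    if s > min(c1, c2):                          # condition (17a) fails
        return None
    return (2.0 / L) * min(c1 - s, c2 - s)       # condition (17b)

# Hypothetical example: D = 10 equal weights summing to 1/32, with L = 1.
ok = step_size_bound([1.0 / 320] * 10, L=1.0, rho1=0.5, rho2=1.0)
too_large = step_size_bound([0.01] * 10, L=1.0, rho1=0.5, rho2=1.0)
```

With $\rho_1 = 1/2$ and $\rho_2 = 1$ the two terms inside the minimum of (17a) are $1/16$ and $1/4$, so weights summing to $1/32$ leave a stepsize budget of $2(1/16 - 1/32)/L = 1/(16L)$, while weights summing to $1/10$ violate (17a).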

If the quantization error in (18) is null, Lemma 3 readily implies that the Lyapunov function enjoys a linear convergence rate. In the following, we will demonstrate that under certain conditions, the LAQ algorithm can still guarantee linear convergence even if we consider the quantization error.

###### Theorem 1.

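Under the same assumptions and the parameters in Lemma 3, the Lyapunov function and the quantization error converge at a linear rate; that is, there exists a constant $\sigma_2 \in (0, 1)$ such that

$$V(\theta^k) \le \sigma_2^k P; \tag{19a}$$

$$\|\varepsilon_m^k\|_\infty^2 \le \tau^2 \sigma_2^k P, \quad \forall m \in \mathcal{M}. \tag{19b}$$

where $P$ is a constant depending on the parameters in (17); see details in supplementary materials.

From the definition of the Lyapunov function, it is clear that $f(\theta^k) - f(\theta^*) \le V(\theta^k) \le \sigma_2^k P$, i.e., the risk error converges linearly. The $L$-smoothness results in $\|\nabla f(\theta^k)\|_2^2 \le 2L[f(\theta^k) - f(\theta^*)] \le 2L\sigma_2^k P$, i.e., the gradient norm converges linearly. Similarly, the $\mu$-strong convexity implies $\frac{\mu}{2}\|\theta^k - \theta^*\|_2^2 \le f(\theta^k) - f(\theta^*) \le \sigma_2^k P$, i.e., $\|\theta^k - \theta^*\|_2^2$ also converges linearly.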

Compared to the previous analysis for LAG [6], the analysis for LAQ is more involved, since it needs to deal with not only outdated but also quantized (inexact) gradients. This modification deteriorates the monotonic property of the Lyapunov function in (18), which is the building block of analysis in [6]. We tackle this issue by i) considering the outdated gradient in the quantization (6); and, ii) incorporating quantization error in the new selection criterion (7). As a result, Theorem 1 demonstrates that LAQ is able to keep the linear convergence rate even with the presence of the quantization error. This is because the properly controlled quantization error also converges at a linear rate; see the illustration in Figure 3.

###### Proposition 1.

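Under Assumption 1, if we choose the constants $\{\xi_d\}_{d=1}^{D}$ satisfying $\xi_1 \ge \xi_2 \ge \cdots \ge \xi_D$ and define $d_m$, $m \in \mathcal{M}$ as:

$$d_m := \max_d \left\{ d \;\middle|\; L_m^2 \le \xi_d / (3\alpha^2 M^2 D),\ d \in \{1, 2, \cdots, D\} \right\} \tag{20}$$

then, worker $m$ has at most $k/(d_m + 1)$ communications with the server until the $k$-th iteration.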

This proposition implies that the smoothness of the local loss function determines the communication intensity of the local worker.
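The quantity in (20) is straightforward to compute for given constants. The sketch below uses a hypothetical helper name `max_skip_window`, and returning 0 when no index qualifies is a convention assumed here, not one stated in the text.

```python
def max_skip_window(L_m, xi, alpha, M):
    """d_m from (20): the largest d with L_m^2 <= xi_d / (3 alpha^2 M^2 D)."""
    D = len(xi)
    feasible = [d for d in range(1, D + 1)
                if L_m**2 <= xi[d - 1] / (3 * alpha**2 * M**2 * D)]
    return max(feasible) if feasible else 0      # 0 is an assumed convention

# Hypothetical constants: equal weights, alpha = 0.02, M = 3 workers.
xi_equal = [1.0 / 160] * 5
smooth = max_skip_window(0.1, xi_equal, alpha=0.02, M=3)   # small L_m
rough = max_skip_window(1.0, xi_equal, alpha=0.02, M=3)    # large L_m

# Decreasing weights: only the first two thresholds exceed L_m^2 = 0.09.
partial = max_skip_window(0.3, [0.01, 0.005, 0.001], alpha=0.02, M=3)
```

A worker with a small smoothness constant obtains a large window and can stay silent for several rounds, whereas a worker with a large constant obtains no window from this bound.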

## 4 Numerical tests and conclusions

To validate our performance analysis and verify its communication savings in practical machine learning problems, we evaluate the performance of the algorithm for regularized logistic regression, which is strongly convex, and for a neural network, which is nonconvex. The dataset we use is MNIST [14], which is uniformly distributed across the workers; see the detailed setup in the supplementary materials. To benchmark LAQ, we compare it with two classes of algorithms, gradient-based algorithms and minibatch stochastic gradient-based algorithms, corresponding to the following two tests.

Gradient-based tests. We consider GD, QGD [18] and lazily aggregated gradient (LAG) [6]. The number of bits per coordinate is set separately for logistic regression and for the neural network, and a common stepsize is used across algorithms. Figure 4 shows the objective convergence for the logistic regression task. Clearly, Figure 4(a) verifies Theorem 1, i.e., the linear convergence rate under a strongly convex loss function. As shown in Figure 4(b), LAQ requires fewer communication rounds than GD and QGD thanks to our selection rule, but more rounds than LAG due to the gradient quantization. Nevertheless, the total number of transmitted bits of LAQ is significantly smaller than that of LAG, as demonstrated in Figure 4(c). For the neural network model, Figure 5 reports the convergence of the gradient norm, where LAQ also shows competitive performance for the nonconvex problem. Similar to the results for the logistic model, LAQ requires the fewest bits. Table 2 summarizes the number of iterations, uploads and bits needed to reach a given accuracy.

Figure 6 exhibits the test accuracy of above compared algorithms on three commonly used datasets, MNIST, ijcnn1 and covtype. Applied to all these datasets, LAQ saves transmitted bits and meanwhile maintains the same accuracy.

Stochastic gradient-based tests. The stochastic gradient-based algorithms considered include quantized stochastic gradient descent (QSGD) [2], sparsified stochastic gradient descent (SSGD) [30], and the stochastic version of LAQ abbreviated as SLAQ. The mini-batch size is fixed, and the number of bits per coordinate is set separately for logistic regression and for the neural network. As shown in Figures 7 and 8, SLAQ incurs the lowest number of communication rounds and bits. In this stochastic gradient test, although the communication reduction of SLAQ is not as significant as that of LAQ compared with gradient-based algorithms, SLAQ still outperforms the state-of-the-art algorithms, e.g., QSGD and SSGD. The results are summarized in Table 3. More results under different numbers of bits and levels of heterogeneity are reported in the supplementary materials.

This paper studied the communication-efficient distributed learning problem, and proposed LAQ that simultaneously quantizes and skips the communication based on gradient innovation. Compared to the original GD method, linear convergence rate is still maintained for strongly convex loss function. This is remarkable since LAQ saves both communication bits and rounds significantly. Numerical tests using (strongly convex) regularized logistic regression and (nonconvex) neural network models demonstrate the advantages of LAQ over existing popular approaches.

#### Acknowledgments

This work by J. Sun and Z. Yang is supported in part by the Shenzhen Committee on Science and Innovations under Grant GJHZ20180411143603361, in part by the Department of Science and Technology of Guangdong Province under Grant 2018A050506003, and in part by the Natural Science Foundation of China under Grant 61873118. The work by J. Sun is also supported by China Scholarship Council. The work by G. Giannakis is supported in part by NSF 1500713, and 1711471.

## References

• [1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proc. Conf. Empi. Meth. Natural Language Process., Copenhagen, Denmark, Sep 2017.
• [2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. Advances in Neural Info. Process. Syst., pages 1709–1720, Long Beach, CA, Dec 2017.
• [3] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Proc. Advances in Neural Info. Process. Syst., pages 5973–5983, Montreal, Canada, Dec 2018.
• [4] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Proc. Advances in Neural Info. Process. Syst., pages 1756–1764, Montreal, Canada, Dec 2015.
• [5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SignSGD: Compressed optimisation for non-convex problems. In Proc. Intl. Conf. Machine Learn., pages 559–568, Stockholm, Sweden, Jul 2018.
• [6] Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Proc. Advances in Neural Info. Process. Syst., pages 5050–5060, Montreal, Canada, Dec 2018.
• [7] Mert Gurbuzbalaban, Asuman Ozdaglar, and Pablo A Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.
• [8] Peng Jiang and Gagan Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Proc. Advances in Neural Info. Process. Syst., pages 2525–2536, Montreal, Canada, Dec 2018.
• [9] Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. J. American Statistical Association, to appear, 2018.
• [10] Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, and Stefan Wrobel. Efficient decentralized deep learning by dynamic model averaging. In Euro. Conf. Machine Learn. Knowledge Disc. Data., pages 393–409, Dublin, Ireland, 2018.
• [11] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In Proc. Intl. Conf. Machine Learn., pages 3252–3261, Long Beach, CA, Jun 2019.
• [12] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint:1610.05492, Oct 2016.
• [13] Jakub Konečnỳ and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs communication. Frontiers in Applied Mathematics and Statistics, 4:62, Dec 2018.
• [14] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.
• [15] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Proc. Advances in Neural Info. Process. Syst., pages 19–27, Montreal, Canada, Dec 2014.
• [16] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proc. Advances in Neural Info. Process. Syst., pages 5330–5340, Long Beach, CA, Dec 2017.
• [17] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In Proc. Intl. Conf. Learn. Represent., Vancouver, Canada, Apr 2018.
• [18] Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communication. arXiv preprint arXiv:1902.11163, 2019.
• [19] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. Intl. Conf. Artificial Intell. and Stat., pages 1273–1282, Fort Lauderdale, FL, April 2017.
• [20] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint:1901.09269, Jan 2019.
• [21] Eric J Msechu and Georgios B Giannakis. Sensor-centric data reduction for estimation with WSNs via censoring and quantization. IEEE Trans. Sig. Proc., 60(1):400–414, Jan 2011.
• [22] Angelia Nedić, Alex Olshevsky, and Michael Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, May 2018.
• [23] Larry L Peterson and Bruce S Davie. Computer Networks: A Systems Approach. Morgan Kaufman, Burlington, MA, 2007.
• [24] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Proc. Conf. Intl. Speech Comm. Assoc., Singapore, Sept 2014.
• [25] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In Proc. Intl. Conf. Machine Learn., pages 1000–1008, Beijing, China, Jun 2014.
• [26] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proc. Advances in Neural Info. Process. Syst., pages 4447–4458, Montreal, Canada, Dec 2018.
• [27] Nikko Strom. Scalable distributed DNN training using commodity gpu cloud computing. In Proc. Conf. Intl. Speech Comm. Assoc., Dresden, Germany, Sept 2015.
• [28] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. Atomo: Communication-efficient learning via atomic sparsification. In Proc. Advances in Neural Info. Process. Syst., pages 9850–9861, Montreal, Canada, Dec 2018.
• [29] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint:1808.07576, August 2018.
• [30] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proc. Advances in Neural Info. Process. Syst., pages 1299–1309, Montreal, Canada, Dec 2018.
• [31] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Advances in Neural Info. Process. Syst., pages 1509–1519, Long Beach, CA, Dec 2017.
• [32] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
• [33] Hao Yu and Rong Jin. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. arXiv preprint:1905.04346, May 2019.
• [34] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. Intl. Conf. Machine Learn., pages 4035–4043, Sydney, Australia, Aug 2017.
• [35] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Proc. Advances in Neural Info. Process. Syst., pages 685–693, Montreal, Canada, Dec 2015.
• [36] Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In Proc. Intl. Conf. Machine Learn., pages 362–370, Lille, France, June 2015.

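## Appendix A Proof of Lemma 2

With the LAQ update, we have:

$$\begin{aligned}
f(\theta^{k+1}) - f(\theta^k) &\le -\alpha\|\nabla f(\theta^k)\|_2^2 + \alpha\Big\langle \nabla f(\theta^k),\ \varepsilon^k - \sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\rangle + \frac{L}{2}\|\theta^{k+1} - \theta^k\|_2^2 \\
&= -\alpha\|\nabla f(\theta^k)\|_2^2 + \frac{\alpha}{2}\Big[\|\nabla f(\theta^k)\|_2^2 + \Big\|\varepsilon^k - \sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\|_2^2 - \frac{\|\theta^{k+1} - \theta^k\|_2^2}{\alpha^2}\Big] + \frac{L}{2}\|\theta^{k+1} - \theta^k\|_2^2 \\
&\le -\frac{\alpha}{2}\|\nabla f(\theta^k)\|_2^2 + \alpha\Big\|\sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\|_2^2 + \Big(\frac{L}{2} - \frac{1}{2\alpha}\Big)\|\theta^{k+1} - \theta^k\|_2^2 + \alpha\|\varepsilon^k\|_2^2
\end{aligned}$$

where the second equality follows from $\langle a, b\rangle = \frac{1}{2}\big(\|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2\big)$, and the last inequality results from $\big\|\sum_{i=1}^{n} a_i\big\|_2^2 \le n\sum_{i=1}^{n}\|a_i\|_2^2$.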

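## Appendix B Proof of Lemma 3

With Assumption 1, under the LAQ update we have:

$$f(\theta^{k+1}) - f(\theta^k) \le -\langle \nabla f(\theta^k), \alpha Q(\theta^k)\rangle + \frac{\alpha}{2}\|\nabla f(\theta^k)\|_2^2 + \frac{\alpha}{2}\Big\|\sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\|_2^2 + \frac{L}{2}\|\theta^{k+1} - \theta^k\|_2^2$$

For ease of expression, we define $\beta_d := \frac{1}{\alpha}\sum_{j=d}^{D}\xi_j$. Then the Lyapunov function defined in (16) can be written as

$$V(\theta^k) = f(\theta^k) - f(\theta^*) + \sum_{d=1}^{D}\beta_d\|\theta^{k+1-d} - \theta^{k-d}\|_2^2. \tag{21}$$

Thus, we have

$$\begin{aligned}
V(\theta^{k+1}) - V(\theta^k)
&\le -\alpha\langle \nabla f(\theta^k), Q(\theta^k)\rangle + \frac{\alpha}{2}\|\nabla f(\theta^k)\|_2^2 + \frac{\alpha}{2}\Big\|\sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\|_2^2 + \Big(\frac{L}{2} + \beta_1\Big)\|\theta^{k+1} - \theta^k\|_2^2 \\
&\quad + \sum_{d=1}^{D-1}(\beta_{d+1} - \beta_d)\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 - \beta_D\|\theta^{k+1-D} - \theta^{k-D}\|_2^2 \\
&\le -\alpha\langle \nabla f(\theta^k), Q(\theta^k)\rangle + \frac{\alpha}{2}\|\nabla f(\theta^k)\|_2^2 + \Big[\frac{\alpha}{2} + \Big(\frac{L}{2} + \beta_1\Big)(1 + \rho_2^{-1})\alpha^2\Big]\Big\|\sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\|_2^2 \\
&\quad + \Big(\frac{L}{2} + \beta_1\Big)(1 + \rho_2)\alpha^2\|Q(\theta^k)\|_2^2 + \sum_{d=1}^{D-1}(\beta_{d+1} - \beta_d)\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 - \beta_D\|\theta^{k+1-D} - \theta^{k-D}\|_2^2 \\
&\le -\alpha\langle \nabla f(\theta^k), Q(\theta^k)\rangle + \frac{\alpha}{2}\|\nabla f(\theta^k)\|_2^2 + \Big[\frac{\alpha}{2} + \Big(\frac{L}{2} + \beta_1\Big)(1 + \rho_2^{-1})\alpha^2\Big]\frac{1}{\alpha^2}\sum_{d=1}^{D}\xi_d\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 \\
&\quad + \Big(\frac{L}{2} + \beta_1\Big)(1 + \rho_2)\alpha^2\|Q(\theta^k)\|_2^2 + \sum_{d=1}^{D-1}(\beta_{d+1} - \beta_d)\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 - \beta_D\|\theta^{k+1-D} - \theta^{k-D}\|_2^2 \\
&\quad + \Big[\frac{3\alpha}{2} + \Big(\frac{3L}{2} + 3\beta_1\Big)(1 + \rho_2^{-1})\alpha^2\Big] M \sum_{m\in\mathcal{M}_c^k}\big(\|\varepsilon_m^k\|_2^2 + \|\hat\varepsilon_m^{k-1}\|_2^2\big)
\end{aligned} \tag{22}$$

where the second inequality follows from Young's inequality $\|a + b\|_2^2 \le (1 + \rho)\|a\|_2^2 + (1 + \rho^{-1})\|b\|_2^2$, applied to $\|\theta^{k+1} - \theta^k\|_2^2$ with $\rho = \rho_2$. The last inequality results from

$$\begin{aligned}
\Big\|\sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\big)\Big\|_2^2
&\le |\mathcal{M}_c^k|\sum_{m\in\mathcal{M}_c^k}\|Q_m(\hat\theta_m^{k-1}) - Q_m(\theta^k)\|_2^2 \\
&\le \frac{|\mathcal{M}_c^k|^2}{\alpha^2|\mathcal{M}|^2}\sum_{d=1}^{D}\xi_d\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3|\mathcal{M}_c^k|\sum_{m\in\mathcal{M}_c^k}\big(\|\varepsilon_m^k\|_2^2 + \|\hat\varepsilon_m^{k-1}\|_2^2\big) \\
&\le \frac{1}{\alpha^2}\sum_{d=1}^{D}\xi_d\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3M\sum_{m\in\mathcal{M}_c^k}\big(\|\varepsilon_m^k\|_2^2 + \|\hat\varepsilon_m^{k-1}\|_2^2\big)
\end{aligned} \tag{23}$$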

where the second inequality follows from (7a). Substituting  into (22) gives

 V(θk+1)−V(θk) (24) ≤−α2∥∇f(θk)∥