1 Introduction
Considering the massive number of mobile devices, centralized machine learning via cloud computing incurs considerable communication overhead and raises serious privacy concerns. Today, the widespread consensus is that, besides in cloud centers, future machine learning tasks have to be performed starting from the network edge, namely on devices [19, 16]. Typically, distributed learning tasks can be formulated as an optimization problem of the form
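$$\min_{\boldsymbol{\theta}}\; \sum_{m\in\mathcal{M}} f_m(\boldsymbol{\theta}) \quad \text{with} \quad f_m(\boldsymbol{\theta}) := \sum_{n=1}^{N_m} \ell(\mathbf{x}_{m,n};\boldsymbol{\theta}) \tag{1}$$
where $\boldsymbol{\theta}\in\mathbb{R}^p$ denotes the parameter to be learned, $\mathcal{M}$ with $|\mathcal{M}| = M$ denotes the set of workers, $\mathbf{x}_{m,n}$ represents the $n$-th data vector at worker $m$ (e.g., feature and label), and $N_m$ is the number of data samples at worker $m$. In (1), $\ell(\mathbf{x}_{m,n};\boldsymbol{\theta})$ denotes the loss associated with $\boldsymbol{\theta}$ and $\mathbf{x}_{m,n}$, and $f_m(\boldsymbol{\theta})$ denotes the aggregated loss corresponding to $\boldsymbol{\theta}$ and all data at worker $m$. For ease of exposition, we also define $f(\boldsymbol{\theta}) := \sum_{m\in\mathcal{M}} f_m(\boldsymbol{\theta})$ as the overall loss function.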
In the commonly employed worker-server setup, the server collects local gradients from the workers and updates the parameter using a gradient descent (GD) iteration given by
$$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \nabla f(\boldsymbol{\theta}^{k}) \tag{2}$$
where $\boldsymbol{\theta}^k$ denotes the parameter value at iteration $k$, $\alpha$ is the stepsize, and $\nabla f(\boldsymbol{\theta}^k) = \sum_{m\in\mathcal{M}} \nabla f_m(\boldsymbol{\theta}^k)$ is the aggregated gradient. When the data samples are distributed across workers, each worker computes the corresponding local gradient $\nabla f_m(\boldsymbol{\theta}^k)$ and uploads it to the server. Only when all the local gradients are collected can the server obtain the full gradient and update the parameter. To implement (2), however, the server has to communicate with all workers to obtain fresh gradients $\{\nabla f_m(\boldsymbol{\theta}^k)\}_{m\in\mathcal{M}}$. In several settings though, communication is much slower than computation [15]. Thus, as the number of workers grows, worker-server communications become the bottleneck [9]. This becomes more challenging when incorporating popular deep learning-based models with high-dimensional parameters, and correspondingly large-scale gradients.
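As a point of reference for the communication cost of iteration (2), the following is a minimal sketch of worker-server GD. It assumes a synthetic least-squares loss at each worker; the names `local_gradient`, `gd_step`, and the problem sizes are illustrative choices, not part of the original setup.

```python
import numpy as np

def local_gradient(theta, X, y):
    """Gradient of an assumed local loss f_m(theta) = 0.5 * ||X theta - y||^2."""
    return X.T @ (X @ theta - y)

def gd_step(theta, workers, alpha):
    """One iteration of (2): every worker uploads, the server sums and updates."""
    full_grad = sum(local_gradient(theta, X, y) for X, y in workers)
    return theta - alpha * full_grad

rng = np.random.default_rng(0)
p, M = 5, 4                                  # hypothetical dimension and worker count
theta_true = rng.normal(size=p)
workers = []
for _ in range(M):
    X = rng.normal(size=(20, p))
    workers.append((X, X @ theta_true))      # noiseless labels, so theta_true is optimal

# Smoothness constant of f: largest eigenvalue of sum_m X_m^T X_m
L = np.linalg.eigvalsh(sum(X.T @ X for X, _ in workers)).max()

theta = np.zeros(p)
iters = 200
for k in range(iters):
    theta = gd_step(theta, workers, alpha=1.0 / L)

uploads = iters * M                          # GD needs M uploads in every iteration
```

Every iteration costs $M$ uploads regardless of how much each local gradient has changed, which is the overhead the rest of the paper targets.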
1.1 Prior art
Communication-efficient distributed learning methods have gained popularity recently [22, 9]. Most popular methods build on simple gradient updates, and are centered around the key idea of gradient compression to save communication, including gradient quantization and sparsification.
Quantization. Quantization aims to compress gradients by limiting the number of bits that represent floating point numbers during communication, and has been successfully applied to several engineering tasks employing wireless sensor networks [21]. In the context of distributed machine learning, a 1-bit binary quantization method has been developed in [24, 5]. Multi-bit quantization schemes have been studied in [2, 18], where an adjustable quantization level endows additional flexibility to control the tradeoff between the per-iteration communication cost and the convergence rate. Other variants of quantized gradient schemes include error compensation [32], variance-reduced quantization [34], quantization to a ternary vector [31], and quantization of gradient differences [20].
Sparsification. Sparsification amounts to transmitting only gradient coordinates whose magnitudes exceed a certain threshold [27]. Empirically, the desired accuracy can be attained even after dropping 99% of the gradients [1]. To avoid losing information, small gradient components are accumulated and then applied when they are large enough [17]. The accumulated gradient offers variance reduction of the sparsified stochastic (S)GD iterates [26, 11]. Despite their impressive empirical performance, deterministic sparsification schemes lack performance guarantees, with the exception of recent efforts [3]. However, randomized counterparts that come with so-termed unbiased sparsification have been developed to offer convergence guarantees [30, 28].
Quantization and sparsification have also been employed simultaneously [12, 8, 13]. Nevertheless, they both introduce noise to (S)GD updates, and thus deteriorate convergence in general. For problems with strongly convex losses, gradient compression algorithms either converge to a neighborhood of the optimal solution, or converge at a sublinear rate. The exception is [18], where the first linear convergence rate has been established for quantized gradient-based approaches. However, [18] only focuses on reducing the required bits per communication, but not the total number of rounds. Yet for exchanging messages, e.g., the $p$-dimensional $\boldsymbol{\theta}$ or its gradient, other latencies (initiating communication links, queueing, and propagating the message) are at least comparable to the message size-dependent transmission latency [23]. This motivates reducing the number of communication rounds, sometimes even more so than the bits per round.
Distinct from the aforementioned gradient compression schemes, communication-efficient schemes that aim to reduce the number of communication rounds have been developed by leveraging higher-order information [25, 36], periodic aggregation [35, 19, 33], and recently by adaptive aggregation [6, 29, 10]; see also [4] for a lower bound on communication rounds. However, whether we can save communication bits and rounds simultaneously without sacrificing the desired convergence properties remains unresolved. This paper aims to address this issue.
1.2 Our contributions
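Before introducing our approach, we revisit the canonical form of popular quantized (Q) GD methods [24, 20] in the simple setup of (1) with one server and $M$ workers:
$$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \sum_{m\in\mathcal{M}} Q_m(\boldsymbol{\theta}^{k}) \tag{3}$$
where $Q_m(\boldsymbol{\theta}^k)$ is the quantized gradient that coarsely approximates the local gradient $\nabla f_m(\boldsymbol{\theta}^k)$. While the exact quantization scheme differs across algorithms, transmitting $Q_m(\boldsymbol{\theta}^k)$ generally requires fewer bits than transmitting $\nabla f_m(\boldsymbol{\theta}^k)$. Similar to GD however, only when all the local quantized gradients $\{Q_m(\boldsymbol{\theta}^k)\}$ are collected can the server update the parameter $\boldsymbol{\theta}$.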
In this context, the present paper puts forth a quantized gradient innovation method (as simple as QGD) that can skip communication in certain rounds. Specifically, in contrast to the server-to-worker downlink communication that can be performed simultaneously (e.g., by broadcasting $\boldsymbol{\theta}^k$), the server has to receive the workers' gradients sequentially to avoid interference from other workers, which leads to extra latency. For this reason, our focus here is on reducing the number of worker-to-server uplink communications, which we will also refer to as uploads. Our algorithm, Lazily Aggregated Quantized gradient descent (LAQ), resembles (3), and it is given by
$$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \nabla^{k} \quad \text{with} \quad \nabla^{k} = \nabla^{k-1} + \sum_{m\in\mathcal{M}^k} \delta Q_m^k \tag{4}$$
where $\nabla^k$ is an approximate aggregated gradient that summarizes the parameter change at iteration $k$, and $\delta Q_m^k := Q_m(\boldsymbol{\theta}^k) - Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})$ is the difference between two quantized gradients of $f_m$ at the current iterate $\boldsymbol{\theta}^k$ and the old copy $\hat{\boldsymbol{\theta}}_m^{k-1}$. With a judicious selection criterion that will be introduced later, $\mathcal{M}^k$ denotes the subset of workers whose local $\delta Q_m^k$ is uploaded in iteration $k$, while the parameter iterates are given by $\hat{\boldsymbol{\theta}}_m^k := \boldsymbol{\theta}^k$, $\forall m\in\mathcal{M}^k$, and $\hat{\boldsymbol{\theta}}_m^k := \hat{\boldsymbol{\theta}}_m^{k-1}$, $\forall m\notin\mathcal{M}^k$. Instead of requesting a fresh quantized gradient from every worker as in (3), the trick is to obtain $\nabla^k$ by refining the previous aggregated gradient $\nabla^{k-1}$; that is, using only the new gradients from the selected workers in $\mathcal{M}^k$, while reusing the outdated gradients from the rest of the workers. If $\nabla^{k-1}$ is stored in the server, this simple modification scales down the per-iteration communication rounds from QGD's $M$ to LAQ's $|\mathcal{M}^k|$. Throughout the paper, one round of communication means one worker's upload.
Compared to the existing quantization schemes, LAQ first quantizes the gradient innovation, that is, the difference between the current gradient and the previous quantized gradient. It then skips the gradient communication selectively: if the gradient innovation of a worker is not large enough, the communication of this worker is skipped. We will rigorously establish that LAQ achieves the same linear convergence as GD when the loss function is strongly convex. Numerical tests will demonstrate that our approach outperforms existing methods in terms of both communication bits and rounds.
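The server-side bookkeeping behind the recursion in (4) can be sketched as follows. This is only an illustration under stated assumptions: the class name `LazyServer` and the dictionary of uploaded innovations are hypothetical, and the innovations are passed in directly rather than produced by a quantizer.

```python
import numpy as np

class LazyServer:
    """Keeps the aggregated gradient and refines it with the uploaded innovations only."""

    def __init__(self, theta0, alpha):
        self.theta = np.array(theta0, dtype=float)
        self.alpha = alpha
        self.agg = np.zeros_like(self.theta)   # stored aggregated gradient from the last iteration
        self.uploads = 0                       # one round = one worker's upload

    def step(self, innovations):
        """innovations: {worker id: delta_Q} for the workers that upload in this iteration."""
        for delta in innovations.values():
            self.agg += delta                  # refine the stored aggregate
            self.uploads += 1
        self.theta = self.theta - self.alpha * self.agg
        return self.theta

server = LazyServer(np.zeros(3), alpha=0.1)
# Iteration 1: workers 0 and 1 upload.
server.step({0: np.array([1.0, 0.0, 0.0]), 1: np.array([0.0, 2.0, 0.0])})
# Iteration 2: only worker 1 uploads; worker 0's outdated gradient is reused.
server.step({1: np.array([0.0, -0.5, 0.0])})
```

Workers that skip contribute nothing to the dictionary, so their last uploaded quantized gradients stay inside the stored aggregate at no communication cost.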
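Notation. Bold lowercase letters denote column vectors; $\|\mathbf{x}\|_2$ and $\|\mathbf{x}\|_\infty$ denote the $\ell_2$-norm and $\ell_\infty$-norm of $\mathbf{x}$, respectively; $[\mathbf{x}]_i$ represents the $i$-th entry of $\mathbf{x}$; $\lfloor a \rfloor$ denotes downward rounding of $a$; and $|\cdot|$ denotes the cardinality of a set or vector.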
2 LAQ: Lazily aggregated quantized gradient
To reduce the communication overhead, two complementary stages are integrated in our algorithm design: 1) gradient innovation-based quantization; and 2) gradient innovation-based uploading or aggregation, which together give the name Lazily Aggregated Quantized gradient (LAQ). The former reduces the number of bits per upload, while the latter cuts down the number of uploads, which together guarantee parsimonious communication. This section explains the principles of our two-stage design.
2.1 Gradient innovation-based quantization
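Quantization limits the number of bits to represent a gradient vector during communication. Suppose we use $b$ bits to quantize each coordinate of the gradient vector, in contrast to $32$ bits as in most computers. With $\mathcal{Q}$ denoting the quantization operator, the quantized gradient for worker $m$ at iteration $k$ is $Q_m(\boldsymbol{\theta}^k) = \mathcal{Q}\big(\nabla f_m(\boldsymbol{\theta}^k), Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})\big)$, which depends on the gradient $\nabla f_m(\boldsymbol{\theta}^k)$ and the previous quantization $Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})$. The gradient is element-wise quantized by projecting to the closest point in a uniformly discretized grid. The grid is a $p$-dimensional hypercube centered at $Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})$ with radius $R_m^k = \|\nabla f_m(\boldsymbol{\theta}^k) - Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})\|_\infty$. With $\tau := 1/(2^b - 1)$ defining the quantization granularity, the gradient innovation $\nabla f_m(\boldsymbol{\theta}^k) - Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})$ can be quantized by $b$ bits per coordinate at worker $m$ as:
$$[q_m(\boldsymbol{\theta}^k)]_i = \left\lfloor \frac{[\nabla f_m(\boldsymbol{\theta}^k)]_i - [Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})]_i + R_m^k}{2\tau R_m^k} + \frac{1}{2} \right\rfloor, \quad i = 1,\ldots,p \tag{5}$$
which is an integer within $[0, 2^b - 1]$, and thus can be encoded by $b$ bits. Note that adding $R_m^k$ in the numerator ensures the nonnegativity of $[q_m(\boldsymbol{\theta}^k)]_i$, and adding $1/2$ in (5) guarantees rounding to the closest point. Hence, the quantized gradient innovation at worker $m$ is (with $\mathbf{1} := [1 \cdots 1]^\top$)
$$\delta Q_m^k = Q_m(\boldsymbol{\theta}^k) - Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) = 2\tau R_m^k\, q_m(\boldsymbol{\theta}^k) - R_m^k \mathbf{1} \tag{6}$$
which can be transmitted by $32 + bp$ bits ($32$ bits for $R_m^k$ and $bp$ bits for $q_m(\boldsymbol{\theta}^k)$) instead of the original $32p$ bits. With the outdated gradients $Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})$ stored in the memory and $\tau$ known a priori, after receiving $\delta Q_m^k$ the server can recover the quantized gradient as $Q_m(\boldsymbol{\theta}^k) = Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) + \delta Q_m^k$.
Figure 1 gives an example for quantizing one coordinate of the gradient with $b$ bits. The original value is quantized with $b$ bits and $2^b$ values, each of which covers a range of length $2\tau R_m^k$ centered at itself. With $\boldsymbol{\varepsilon}_m^k := \nabla f_m(\boldsymbol{\theta}^k) - Q_m(\boldsymbol{\theta}^k)$ denoting the local quantization error, it is clear that the quantization error is at most half of the length of the range that each value covers, namely, $\|\boldsymbol{\varepsilon}_m^k\|_\infty \le \tau R_m^k$. The aggregated quantized gradient is $Q(\boldsymbol{\theta}^k) = \sum_{m\in\mathcal{M}} Q_m(\boldsymbol{\theta}^k)$, and the aggregated quantization error is $\boldsymbol{\varepsilon}^k := \nabla f(\boldsymbol{\theta}^k) - Q(\boldsymbol{\theta}^k) = \sum_{m\in\mathcal{M}} \boldsymbol{\varepsilon}_m^k$; that is, $Q(\boldsymbol{\theta}^k) = \nabla f(\boldsymbol{\theta}^k) - \boldsymbol{\varepsilon}^k$.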
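The quantizer described above can be sketched in a few lines. This is a minimal illustration, assuming the uniform grid of $2^b$ points on the interval of radius $R$ around the previous quantized gradient, with $\tau = 1/(2^b - 1)$ and 32-bit floats for the radius. The function names are hypothetical, and the `np.clip` call is an added numerical guard rather than part of the scheme.

```python
import numpy as np

def quantize_innovation(grad, q_prev, b):
    """Quantize grad - q_prev with b bits per coordinate, in the spirit of (5)-(6).

    Returns the radius R, the integer codes q, and the quantized innovation delta_q.
    """
    tau = 1.0 / (2 ** b - 1)                       # assumed grid granularity
    R = np.max(np.abs(grad - q_prev))              # infinity-norm radius of the grid
    if R == 0.0:                                   # nothing changed, nothing to encode
        return R, np.zeros(grad.shape, dtype=np.int64), np.zeros_like(grad)
    q = np.floor((grad - q_prev + R) / (2 * tau * R) + 0.5)
    q = np.clip(q, 0, 2 ** b - 1).astype(np.int64) # guard against round-off at the edges
    delta_q = 2 * tau * R * q - R
    return R, q, delta_q

def recover(q_prev, R, q, b):
    """Server side: rebuild the quantized gradient from the stored copy and the upload."""
    tau = 1.0 / (2 ** b - 1)
    return q_prev + 2 * tau * R * q - R

rng = np.random.default_rng(0)
p, b = 1000, 3
grad = rng.normal(size=p)
q_prev = np.zeros(p)                               # previous quantized gradient (here: none yet)
R, q, delta_q = quantize_innovation(grad, q_prev, b)
Q = recover(q_prev, R, q, b)

max_error = np.max(np.abs(grad - Q))               # should not exceed tau * R
bits_quantized = 32 + b * p                        # radius plus b bits per coordinate
bits_raw = 32 * p                                  # full-precision upload
```

Because the grid is re-centered at the last uploaded quantized gradient and its radius shrinks with the innovation, the absolute quantization error shrinks as the iterates settle, which is the property the convergence analysis relies on.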
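2.2 Gradient innovation-based aggregation
The idea of lazy gradient aggregation is that if the difference between two consecutive locally quantized gradients is small, it is safe to skip the redundant gradient upload and reuse the previous one at the server. In addition, we ensure that the server has a relatively "fresh" gradient for each worker by enforcing communication if any worker has not uploaded during the last $\bar{t}$ rounds. We set a clock $t_m$ for worker $m$ that counts the number of iterations since it last uploaded information. Equipped with the quantization and selection, our LAQ update takes the form of (4).
Now it only remains to design the selection criterion that decides which workers upload their quantized gradient innovation. We propose the following communication criterion: worker $m$ skips the upload at iteration $k$ if it satisfies
\begin{align}
\|Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\boldsymbol{\theta}^k)\|_2^2 &\le \frac{1}{\alpha^2 M^2} \sum_{d=1}^{D} \xi_d \|\boldsymbol{\theta}^{k+1-d} - \boldsymbol{\theta}^{k-d}\|_2^2 + 3\left(\|\boldsymbol{\varepsilon}_m^k\|_2^2 + \|\hat{\boldsymbol{\varepsilon}}_m^{k-1}\|_2^2\right) \tag{7a}\\
t_m &\le \bar{t} \tag{7b}
\end{align}
where $D$ and $\{\xi_d\}_{d=1}^{D}$ are predetermined constants, $\boldsymbol{\varepsilon}_m^k := \nabla f_m(\boldsymbol{\theta}^k) - Q_m(\boldsymbol{\theta}^k)$ is the current quantization error, and $\hat{\boldsymbol{\varepsilon}}_m^{k-1} := \nabla f_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\hat{\boldsymbol{\theta}}_m^{k-1})$ is the error of the last uploaded quantized gradient. In the next section we will prove the convergence and communication properties of LAQ under criterion (7).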
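A worker-side check of the skipping rule might look as follows. This is a sketch under assumptions: the right-hand side is taken to be a weighted sum of the last $D$ squared parameter differences scaled by $1/(\alpha^2 M^2)$, plus three times the two squared quantization errors, and the function name `should_skip` and all argument names are hypothetical.

```python
import numpy as np

def should_skip(delta_q, theta_hist, xi, alpha, M, err_now, err_prev, t_m, t_bar):
    """Return True if the worker may skip its upload (both parts of the criterion hold).

    theta_hist: [theta^k, theta^{k-1}, ..., theta^{k-D}], newest first.
    xi:         [xi_1, ..., xi_D], the weights on past parameter differences.
    """
    D = len(xi)
    # Weighted sum of squared parameter differences over the last D iterations.
    drift = sum(xi[d] * np.sum((theta_hist[d] - theta_hist[d + 1]) ** 2) for d in range(D))
    # Assumed constants: 1/(alpha^2 M^2) scaling and a factor 3 on the error terms.
    rhs = drift / (alpha ** 2 * M ** 2) + 3.0 * (np.sum(err_now ** 2) + np.sum(err_prev ** 2))
    innovation_small = np.sum(delta_q ** 2) <= rhs
    recently_uploaded = t_m <= t_bar
    return bool(innovation_small and recently_uploaded)

xi = [0.4, 0.4]                                            # D = 2, illustrative weights
hist = [np.array([1.0, 1.0]), np.array([0.9, 1.0]), np.array([0.8, 1.0])]
zero = np.zeros(2)

small = should_skip(np.array([1e-3, 0.0]), hist, xi, 0.1, 2, zero, zero, t_m=1, t_bar=10)
large = should_skip(np.array([1.0, 0.0]), hist, xi, 0.1, 2, zero, zero, t_m=1, t_bar=10)
stale = should_skip(np.array([1e-3, 0.0]), hist, xi, 0.1, 2, zero, zero, t_m=11, t_bar=10)
```

The check uses only quantities available at the worker (its own quantized gradients and errors, plus the broadcast iterates), so no extra communication is needed to decide whether to upload.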
2.3 LAQ algorithm development
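In summary, as illustrated in Figure 2, LAQ can be implemented as follows. At iteration $k$, the server broadcasts the learning parameter $\boldsymbol{\theta}^k$ to all workers. Each worker calculates the gradient and quantizes it to judge whether it needs to upload the quantized gradient innovation $\delta Q_m^k$. Then the server updates the learning parameter after it receives the gradient innovations from the selected workers. The algorithm is summarized in Algorithm 2.
To make the difference between LAQ and GD clear, we rewrite (4) as:
\begin{align}
\boldsymbol{\theta}^{k+1} &= \boldsymbol{\theta}^{k} - \alpha\Big[Q(\boldsymbol{\theta}^k) + \sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\boldsymbol{\theta}^k)\big)\Big] \tag{8a}\\
&= \boldsymbol{\theta}^{k} - \alpha\Big[\nabla f(\boldsymbol{\theta}^k) - \boldsymbol{\varepsilon}^k + \sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\boldsymbol{\theta}^k)\big)\Big] \tag{8b}
\end{align}
where $\mathcal{M}_c^k := \mathcal{M}\setminus\mathcal{M}^k$ is the subset of workers that skip communication with the server at iteration $k$, while $Q(\boldsymbol{\theta}^k)$ and $\boldsymbol{\varepsilon}^k$ denote the aggregated quantized gradient and the aggregated quantization error. Compared with the GD iteration in (2), the gradient employed here degrades due to the quantization error $\boldsymbol{\varepsilon}^k$ and the missed gradient innovation $\sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\boldsymbol{\theta}^k)\big)$. It is clear that if a large enough number of bits is used to quantize the gradient, and all $\{\xi_d\}_{d=1}^{D}$ are set to 0 so that $\mathcal{M}^k = \mathcal{M}$, then LAQ reduces to GD. Thus, adjusting $b$ and $\{\xi_d\}_{d=1}^{D}$ directly influences the performance of LAQ.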
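The rationale behind selection criterion (7) lies in the judicious comparison between the descent amount of GD and that of LAQ. To compare the descent amounts, we first establish the one-step descent of both algorithms. The following assumption holds for all the results in this paper.
Assumption 1.
The local gradient $\nabla f_m(\cdot)$ is $L_m$-Lipschitz continuous and the global gradient $\nabla f(\cdot)$ is $L$-Lipschitz continuous, i.e., there exist constants $L_m$ and $L$ such that
\begin{align}
\|\nabla f_m(\boldsymbol{\theta}_1) - \nabla f_m(\boldsymbol{\theta}_2)\|_2 &\le L_m \|\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\|_2, \quad \forall \boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \tag{9a}\\
\|\nabla f(\boldsymbol{\theta}_1) - \nabla f(\boldsymbol{\theta}_2)\|_2 &\le L \|\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\|_2, \quad \forall \boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \tag{9b}
\end{align}
Building upon Assumption 1, the next lemma describes the descent in objective by GD.
Lemma 1.
The gradient descent update yields the following descent:
$$f(\boldsymbol{\theta}^{k+1}) - f(\boldsymbol{\theta}^{k}) \le \Delta_{\rm GD}^k \tag{10}$$
where $\Delta_{\rm GD}^k := -\left(1 - \frac{\alpha L}{2}\right)\alpha \|\nabla f(\boldsymbol{\theta}^k)\|_2^2$.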
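The GD descent bound can be checked numerically on a small example. The sketch below assumes a least-squares objective, for which the gradient Lipschitz constant $L$ is the largest eigenvalue of $A^\top A$, and takes the bound in the standard descent-lemma form $-(1 - \alpha L/2)\,\alpha\,\|\nabla f(\boldsymbol{\theta}^k)\|_2^2$. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 4))
y = rng.normal(size=30)

def f(theta):
    """Assumed smooth objective: 0.5 * ||A theta - y||^2."""
    return 0.5 * np.sum((A @ theta - y) ** 2)

def grad(theta):
    return A.T @ (A @ theta - y)

L = np.linalg.eigvalsh(A.T @ A).max()          # Lipschitz constant of the gradient
alpha = 1.0 / L

theta = rng.normal(size=4)
theta_next = theta - alpha * grad(theta)       # one GD step

actual_descent = f(theta_next) - f(theta)
delta_gd = -(1.0 - alpha * L / 2.0) * alpha * np.sum(grad(theta) ** 2)
```

For any stepsize in $(0, 2/L)$ the bound is negative, so each GD step is guaranteed to decrease the objective; the same quantity serves as the per-iteration benchmark against which LAQ's descent is compared.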
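The descent of LAQ differs from that of GD due to the quantization and selection, as specified in the following lemma.
Lemma 2.
The LAQ update yields the following descent:
$$f(\boldsymbol{\theta}^{k+1}) - f(\boldsymbol{\theta}^{k}) \le \Delta_{\rm LAQ}^k + \alpha \|\boldsymbol{\varepsilon}^k\|_2^2 \tag{11}$$
where $\Delta_{\rm LAQ}^k := -\frac{\alpha}{2}\|\nabla f(\boldsymbol{\theta}^k)\|_2^2 + \alpha\Big\|\sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\boldsymbol{\theta}^k)\big)\Big\|_2^2 + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\|\boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k\|_2^2$, with $\mathcal{M}_c^k$ the set of workers that skip the upload at iteration $k$ and $\boldsymbol{\varepsilon}^k$ the aggregated quantization error.
In lazy aggregation, we consider only $\Delta_{\rm LAQ}^k$, with the quantization error in (11) ignored. A rigorous theorem showing the properties of LAQ while taking the quantization error into account will be established in the next section.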
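Algorithm 1 QGD
  Input: stepsize $\alpha > 0$, quantization bit $b$.
  Initialize: $\boldsymbol{\theta}^1$.
  for $k = 1, 2, \ldots, K$ do
    Server broadcasts $\boldsymbol{\theta}^k$ to all workers.
    for $m = 1, \ldots, M$ do
      Worker $m$ computes $\nabla f_m(\boldsymbol{\theta}^k)$ and $Q_m(\boldsymbol{\theta}^k)$.
      Worker $m$ uploads $\delta Q_m^k$ via (6).
    end for
    Server updates $\boldsymbol{\theta}$ following (4) with $\mathcal{M}^k = \mathcal{M}$.
  end for

Algorithm 2 LAQ
  Input: stepsize $\alpha > 0$, $b$, $\bar{t}$, and $\{\xi_d\}_{d=1}^{D}$.
  Initialize: $\boldsymbol{\theta}^1$, and $\{Q_m(\hat{\boldsymbol{\theta}}_m^{0}), t_m\}_{m\in\mathcal{M}}$.
  for $k = 1, 2, \ldots, K$ do
    Server broadcasts $\boldsymbol{\theta}^k$ to all workers.
    for $m = 1, \ldots, M$ do
      Worker $m$ computes $\nabla f_m(\boldsymbol{\theta}^k)$ and $Q_m(\boldsymbol{\theta}^k)$.
      if (7) holds for worker $m$ then
        Worker $m$ uploads nothing.
        Set $\hat{\boldsymbol{\theta}}_m^{k} = \hat{\boldsymbol{\theta}}_m^{k-1}$ and $t_m \leftarrow t_m + 1$.
      else
        Worker $m$ uploads $\delta Q_m^k$ via (6).
        Set $\hat{\boldsymbol{\theta}}_m^{k} = \boldsymbol{\theta}^k$, and $t_m = 0$.
      end if
    end for
    Server updates $\boldsymbol{\theta}$ according to (4).
  end for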
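To see the two stages working together, the following is a compact simulation sketch of the LAQ loop on a synthetic least-squares problem. Several ingredients are assumptions made for illustration: the local losses, the problem sizes, the parameter values ($b = 8$, $D = 10$, $\xi_d = 0.01$, $\bar{t} = 10$, stepsize $0.5/L$), and the exact constants in the skipping rule. The helper names `quantize` and `run_laq` are hypothetical.

```python
import numpy as np

def quantize(grad, q_prev, b):
    """b-bit uniform quantization of grad on a grid centered at q_prev."""
    tau = 1.0 / (2 ** b - 1)
    R = np.max(np.abs(grad - q_prev))
    if R == 0.0:
        return q_prev.copy()
    q = np.clip(np.floor((grad - q_prev + R) / (2 * tau * R) + 0.5), 0, 2 ** b - 1)
    return q_prev + 2 * tau * R * q - R

def run_laq(workers, alpha, b, xi, t_bar, iters):
    M, p, D = len(workers), workers[0][0].shape[1], len(xi)
    theta = np.zeros(p)
    hist = [theta.copy() for _ in range(D + 1)]   # theta^k, ..., theta^{k-D}
    q_hat = [np.zeros(p) for _ in range(M)]       # last uploaded quantized gradients
    err_hat = [np.zeros(p) for _ in range(M)]     # errors of those uploads
    clock = [t_bar + 1] * M                       # forces an upload in the first iteration
    agg = np.zeros(p)                             # aggregated gradient kept at the server
    uploads = 0
    for _ in range(iters):
        drift = sum(xi[d] * np.sum((hist[d] - hist[d + 1]) ** 2) for d in range(D))
        for m, (X, y) in enumerate(workers):
            grad = X.T @ (X @ theta - y)          # assumed least-squares local gradient
            q_new = quantize(grad, q_hat[m], b)
            err = grad - q_new
            lhs = np.sum((q_hat[m] - q_new) ** 2)
            rhs = drift / (alpha ** 2 * M ** 2) + 3 * (np.sum(err ** 2) + np.sum(err_hat[m] ** 2))
            if lhs <= rhs and clock[m] <= t_bar:  # skip: server reuses the old copy
                clock[m] += 1
            else:                                 # upload the quantized innovation
                agg += q_new - q_hat[m]
                q_hat[m], err_hat[m], clock[m] = q_new, err, 0
                uploads += 1
        theta = theta - alpha * agg
        hist = [theta.copy()] + hist[:-1]
    return theta, uploads

rng = np.random.default_rng(0)
p, M, iters = 5, 4, 300
theta_true = rng.normal(size=p)
workers = []
for _ in range(M):
    X = rng.normal(size=(20, p))
    workers.append((X, X @ theta_true))
L = np.linalg.eigvalsh(sum(X.T @ X for X, _ in workers)).max()

theta_laq, uploads = run_laq(workers, alpha=0.5 / L, b=8, xi=[0.01] * 10, t_bar=10, iters=iters)
gd_uploads = iters * M                            # what GD or QGD would need
```

Comparing `uploads` with `gd_uploads`, and the per-upload size of $32 + bp$ bits with $32p$ bits, gives a rough feel for the two sources of savings; the numbers from such a toy run are not representative of the experiments reported later.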
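The following part gives the intuition behind criterion (7a); the argument is not mathematically rigorous. The lazy aggregation mechanism selects the quantized gradient innovations by judging their contribution to decreasing the loss function. LAQ is expected to be more communication-efficient than GD, that is, each upload results in more descent, which translates to:
$$\frac{\Delta_{\rm LAQ}^k}{|\mathcal{M}^k|} \le \frac{\Delta_{\rm GD}^k}{M} \tag{12}$$
which is tantamount to (see the derivations in the supplementary materials)
$$\|Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\boldsymbol{\theta}^k)\|_2^2 \le \frac{\|\nabla f(\boldsymbol{\theta}^k)\|_2^2}{2M^2}, \quad \forall m \in \mathcal{M}_c^k \tag{13}$$
However, it is impossible for each worker to check (13) locally because the fully aggregated gradient $\nabla f(\boldsymbol{\theta}^k)$ is required, which is exactly what we want to avoid. Moreover, it does not make sense to reduce uploads if the fully aggregated gradient has been obtained. Therefore, we bypass directly calculating $\|\nabla f(\boldsymbol{\theta}^k)\|_2^2$ by using its approximation below.
$$\|\nabla f(\boldsymbol{\theta}^k)\|_2^2 \approx \frac{2}{\alpha^2} \sum_{d=1}^{D} \xi_d \|\boldsymbol{\theta}^{k+1-d} - \boldsymbol{\theta}^{k-d}\|_2^2 \tag{14}$$
where $\{\xi_d\}_{d=1}^{D}$ are constants. The fundamental reason why (14) holds is that $\nabla f(\boldsymbol{\theta}^k)$ can be approximated by weighted previous gradients or parameter differences since $f(\cdot)$ is smooth. Combining (13) and (14) leads to our communication criterion (7a) with the quantization error ignored.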
We conclude this section with a comparison between LAQ and error-feedback (quantized) schemes.
Comparison with error-feedback schemes. Our LAQ approach is related to the error-feedback schemes, e.g., [24, 27, 32, 3, 26, 11]. Both lines of approaches accumulate either errors or delayed innovation incurred by communication reduction (e.g., quantization, sparsification, or skipping), and upload them in the next communication round. However, the error-feedback schemes skip communicating certain entries of the gradient, yet communicate with all workers. LAQ skips communicating with certain workers, but communicates all (quantized) entries. The two methods are not mutually exclusive, and can be used jointly.
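For contrast, a generic error-feedback step can be sketched as below. It is a simplified top-$k$ sparsifier with memory, written only to illustrate the idea of skipping entries while accumulating what was dropped; it is not the exact method of any of the cited works, and the function name is hypothetical.

```python
import numpy as np

def topk_with_error_feedback(grad, memory, k):
    """Send the k largest-magnitude entries of grad + memory; keep the rest for later."""
    corrected = grad + memory                   # add back what was dropped before
    idx = np.argsort(np.abs(corrected))[-k:]    # indices of the k largest magnitudes
    sent = np.zeros_like(corrected)
    sent[idx] = corrected[idx]
    new_memory = corrected - sent               # dropped entries are accumulated
    return sent, new_memory

grad = np.array([0.1, -3.0, 0.2, 2.0])
sent, memory = topk_with_error_feedback(grad, np.zeros(4), k=2)
```

Here every worker still communicates in every round but sends only some coordinates, whereas LAQ sends all (quantized) coordinates but only from some workers.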
3 Convergence and communication analysis
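Our subsequent convergence analysis of LAQ relies on the following assumption on $f(\cdot)$:
Assumption 2.
The function $f(\cdot)$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that
$$f(\boldsymbol{\theta}_1) - f(\boldsymbol{\theta}_2) \ge \langle \nabla f(\boldsymbol{\theta}_2), \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2 \rangle + \frac{\mu}{2}\|\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\|_2^2, \quad \forall \boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \tag{15}$$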
With $\boldsymbol{\theta}^*$ denoting the optimal solution of (1), we define the Lyapunov function of LAQ as:
(16) 
The design of the Lyapunov function is coupled with the communication rule (7a), which contains the parameter difference term. Intuitively, if no communication is skipped at the current iteration, LAQ behaves like GD, which decreases the objective residual in the Lyapunov function; if certain uploads are skipped, LAQ's rule (7a) guarantees that the error of using stale gradients is comparable to the parameter difference in the Lyapunov function, which ensures its descent. The following lemma captures the progress of the Lyapunov function.
Lemma 3.
For the tight analysis, (17) appears to be involved, but it admits simple choices. For example, when we choose and , respectively, then and satisfy (17).
If the quantization error in (18) is null, Lemma 3 readily implies that the Lyapunov function enjoys a linear convergence rate. In the following, we will demonstrate that under certain conditions, the LAQ algorithm can still guarantee linear convergence even if we consider the quantization error.
Theorem 1.
From the definition of the Lyapunov function, it is clear that the risk error $f(\boldsymbol{\theta}^k) - f(\boldsymbol{\theta}^*)$ converges linearly. Smoothness gives $\|\nabla f(\boldsymbol{\theta}^k)\|_2^2 \le 2L\,[f(\boldsymbol{\theta}^k) - f(\boldsymbol{\theta}^*)]$, so the gradient norm converges linearly. Similarly, strong convexity implies $\frac{\mu}{2}\|\boldsymbol{\theta}^k - \boldsymbol{\theta}^*\|_2^2 \le f(\boldsymbol{\theta}^k) - f(\boldsymbol{\theta}^*)$, so the distance to the optimal solution also converges linearly.
Compared to the previous analysis for LAG [6], the analysis for LAQ is more involved, since it needs to deal with not only outdated but also quantized (inexact) gradients. This modification deteriorates the monotonic property of the Lyapunov function in (18), which is the building block of the analysis in [6]. We tackle this issue by i) considering the outdated gradient in the quantization (6); and, ii) incorporating the quantization error in the new selection criterion (7). As a result, Theorem 1 demonstrates that LAQ is able to keep the linear convergence rate even in the presence of the quantization error. This is because the properly controlled quantization error also converges at a linear rate; see the illustration in Figure 3.
Proposition 1.
Under Assumption 1, if we choose the constants satisfying and define , as:
(20) 
then, worker has at most communications with the server until the th iteration.
This proposition implies that the smoothness of the local loss function determines the communication intensity of the local worker.
4 Numerical tests and conclusions
To validate our performance analysis and verify the communication savings in practical machine learning problems, we evaluate the algorithm on regularized logistic regression, which is strongly convex, and on a neural network, which is nonconvex. The dataset we use is MNIST [14], which is uniformly distributed across workers. In the experiments, we set , ; see the detailed setup in the supplementary materials. To benchmark LAQ, we compare it with two classes of algorithms, gradient-based algorithms and minibatch stochastic gradient-based algorithms, corresponding to the following two tests.

| Algorithm | Model | Iteration # | Communication # | Bit # | Accuracy |
|---|---|---|---|---|---|
| LAQ | logistic | | | | |
| LAQ | neural network | | | | |
| GD | logistic | | | | |
| GD | neural network | | | | |
| QGD | logistic | | | | |
| QGD | neural network | | | | |
| LAG | logistic | | | | |
| LAG | neural network | | | | |
Gradient-based tests. We consider GD, QGD [18] and lazily aggregated gradient (LAG) [6]. The number of bits per coordinate is set as for logistic regression and for neural network, respectively. Stepsize is set as for both algorithms. Figure 4 shows the objective convergence for the logistic regression task. Clearly, Figure 4(a) verifies Theorem 1, i.e., the linear convergence rate under a strongly convex loss function. As shown in Figure 4(b), LAQ requires fewer communication rounds than GD and QGD thanks to our selection rule, but more rounds than LAG due to the gradient quantization. Nevertheless, the total number of transmitted bits of LAQ is significantly smaller than that of LAG, as demonstrated in Figure 4(c). For the neural network model, Figure 5 reports the convergence of the gradient norm, where LAQ also shows competitive performance on the nonconvex problem. Similar to the results for the logistic model, LAQ requires the fewest bits. Table 2 summarizes the number of iterations, uploads and bits needed to reach a given accuracy.
Figure 6 exhibits the test accuracy of the compared algorithms on three commonly used datasets: MNIST, ijcnn1 and covtype. On all these datasets, LAQ saves transmitted bits while maintaining the same accuracy.
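The relation between rounds and bits in this comparison is simple arithmetic, sketched below. The per-upload sizes assume 32-bit floats ($32p$ bits for a full-precision gradient, $32 + bp$ bits for a quantized one), and the upload counts, dimension, and bit width are purely hypothetical placeholders, not measured results.

```python
def total_bits(uploads, p, b=None):
    """Total uplink bits: 32*p per full-precision upload, 32 + b*p per quantized upload."""
    per_upload = 32 * p if b is None else 32 + b * p
    return uploads * per_upload

p, b = 1000, 4                      # hypothetical dimension and bits per coordinate
bits = {
    "GD":  total_bits(1000, p),     # uploads every round, full precision
    "QGD": total_bits(1000, p, b),  # uploads every round, quantized
    "LAG": total_bits(300, p),      # skips rounds, full precision
    "LAQ": total_bits(350, p, b),   # skips rounds and quantizes
}
```

With these placeholder counts, the scheme that both skips and quantizes can use somewhat more rounds than the one that only skips and still transmit far fewer bits, which mirrors the qualitative trend described above.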
Stochastic gradient-based tests. We test stochastic gradient descent (SGD), quantized stochastic gradient descent (QSGD) [2], sparsified stochastic gradient descent (SSGD) [30], and the stochastic version of LAQ, abbreviated as SLAQ. The minibatch size is , , and the number of bits per coordinate is set as for logistic regression and for neural network. As shown in Figures 7 and 8, SLAQ incurs the lowest number of communication rounds and bits. In this stochastic gradient test, although the communication reduction of SLAQ is not as significant as that of LAQ relative to gradient-based algorithms, SLAQ still outperforms the state-of-the-art algorithms, e.g., QSGD and SSGD. The results are summarized in Table 3. More results under different numbers of bits and levels of heterogeneity are reported in the supplementary materials.

| Algorithm | Model | Iteration # | Communication # | Bit # | Accuracy |
|---|---|---|---|---|---|
| SLAQ | logistic | | | | |
| SLAQ | neural network | | | | |
| SGD | logistic | | | | |
| SGD | neural network | | | | |
| QSGD | logistic | | | | |
| QSGD | neural network | | | | |
| SSGD | logistic | 1000 | | | |
| SSGD | neural network | | | | |
This paper studied the communication-efficient distributed learning problem, and proposed LAQ, which simultaneously quantizes and skips the communication based on gradient innovation. Compared to the original GD method, the linear convergence rate is still maintained for strongly convex loss functions. This is remarkable since LAQ saves both communication bits and rounds significantly. Numerical tests using (strongly convex) regularized logistic regression and (nonconvex) neural network models demonstrate the advantages of LAQ over existing popular approaches.
Acknowledgments
This work by J. Sun and Z. Yang is supported in part by the Shenzhen Committee on Science and Innovations under Grant GJHZ20180411143603361, in part by the Department of Science and Technology of Guangdong Province under Grant 2018A050506003, and in part by the Natural Science Foundation of China under Grant 61873118. The work by J. Sun is also supported by China Scholarship Council. The work by G. Giannakis is supported in part by NSF 1500713, and 1711471.
References

[1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proc. Conf. Empi. Meth. Natural Language Process., Copenhagen, Denmark, Sep 2017.
[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. Advances in Neural Info. Process. Syst., pages 1709–1720, Long Beach, CA, Dec 2017.
[3] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Proc. Advances in Neural Info. Process. Syst., pages 5973–5983, Montreal, Canada, Dec 2018.
[4] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Proc. Advances in Neural Info. Process. Syst., pages 1756–1764, Montreal, Canada, Dec 2015.
[5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SignSGD: Compressed optimisation for nonconvex problems. In Proc. Intl. Conf. Machine Learn., pages 559–568, Stockholm, Sweden, Jul 2018.
[6] Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Proc. Advances in Neural Info. Process. Syst., pages 5050–5060, Montreal, Canada, Dec 2018.
[7] Mert Gurbuzbalaban, Asuman Ozdaglar, and Pablo A Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.
[8] Peng Jiang and Gagan Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Proc. Advances in Neural Info. Process. Syst., pages 2525–2536, Montreal, Canada, Dec 2018.
[9] Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. J. American Statistical Association, to appear, 2018.
[10] Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, and Stefan Wrobel. Efficient decentralized deep learning by dynamic model averaging. In Euro. Conf. Machine Learn. Knowledge Disc. Data., pages 393–409, Dublin, Ireland, 2018.
[11] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proc. Intl. Conf. Machine Learn., pages 3252–3261, Long Beach, CA, Jun 2019.
[12] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint:1610.05492, Oct 2016.
[13] Jakub Konečnỳ and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs communication. Frontiers in Applied Mathematics and Statistics, 4:62, Dec 2018.
[14] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.
[15] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Proc. Advances in Neural Info. Process. Syst., pages 19–27, Montreal, Canada, Dec 2014.
[16] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proc. Advances in Neural Info. Process. Syst., pages 5330–5340, Long Beach, CA, Dec 2017.
[17] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In Proc. Intl. Conf. Learn. Represent., Vancouver, Canada, Apr 2018.
[18] Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communication. arXiv preprint arXiv:1902.11163, 2019.
[19] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. Intl. Conf. Artificial Intell. and Stat., pages 1273–1282, Fort Lauderdale, FL, April 2017.
[20] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint:1901.09269, Jan 2019.
[21] Eric J Msechu and Georgios B Giannakis. Sensor-centric data reduction for estimation with WSNs via censoring and quantization. IEEE Trans. Sig. Proc., 60(1):400–414, Jan 2011.
[22] Angelia Nedić, Alex Olshevsky, and Michael Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, May 2018.
[23] Larry L Peterson and Bruce S Davie. Computer Networks: A Systems Approach. Morgan Kaufmann, Burlington, MA, 2007.
[24] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. Conf. Intl. Speech Comm. Assoc., Singapore, Sept 2014.
[25] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proc. Intl. Conf. Machine Learn., pages 1000–1008, Beijing, China, Jun 2014.
[26] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proc. Advances in Neural Info. Process. Syst., pages 4447–4458, Montreal, Canada, Dec 2018.
[27] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proc. Conf. Intl. Speech Comm. Assoc., Dresden, Germany, Sept 2015.
[28] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. Atomo: Communication-efficient learning via atomic sparsification. In Proc. Advances in Neural Info. Process. Syst., pages 9850–9861, Montreal, Canada, Dec 2018.
[29] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint:1808.07576, August 2018.
[30] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proc. Advances in Neural Info. Process. Syst., pages 1299–1309, Montreal, Canada, Dec 2018.
[31] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Advances in Neural Info. Process. Syst., pages 1509–1519, Long Beach, CA, Dec 2017.
[32] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
[33] Hao Yu and Rong Jin. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic nonconvex optimization. arXiv preprint:1905.04346, May 2019.
[34] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. Intl. Conf. Machine Learn., pages 4035–4043, Sydney, Australia, Aug 2017.
[35] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Proc. Advances in Neural Info. Process. Syst., pages 685–693, Montreal, Canada, Dec 2015.
[36] Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In Proc. Intl. Conf. Machine Learn., pages 362–370, Lille, France, June 2015.
Appendix A Proof of Lemma 2
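With the LAQ update (8b), and writing $\mathbf{s}^k := \sum_{m\in\mathcal{M}_c^k}\big(Q_m(\hat{\boldsymbol{\theta}}_m^{k-1}) - Q_m(\boldsymbol{\theta}^k)\big)$ for the missed gradient innovation, we have:
\begin{align*}
f(\boldsymbol{\theta}^{k+1}) - f(\boldsymbol{\theta}^{k}) &\le \langle \nabla f(\boldsymbol{\theta}^k), \boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k \rangle + \frac{L}{2}\|\boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k\|_2^2 \\
&= -\alpha \langle \nabla f(\boldsymbol{\theta}^k), \nabla f(\boldsymbol{\theta}^k) - \boldsymbol{\varepsilon}^k + \mathbf{s}^k \rangle + \frac{L}{2}\|\boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k\|_2^2 \\
&= -\frac{\alpha}{2}\|\nabla f(\boldsymbol{\theta}^k)\|_2^2 + \frac{\alpha}{2}\|\mathbf{s}^k - \boldsymbol{\varepsilon}^k\|_2^2 + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\|\boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k\|_2^2 \\
&\le -\frac{\alpha}{2}\|\nabla f(\boldsymbol{\theta}^k)\|_2^2 + \alpha\|\mathbf{s}^k\|_2^2 + \alpha\|\boldsymbol{\varepsilon}^k\|_2^2 + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\|\boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k\|_2^2
\end{align*}
where the second equality follows from $\langle \mathbf{a}, \mathbf{b} \rangle = \frac{1}{2}\left(\|\mathbf{a}\|_2^2 + \|\mathbf{b}\|_2^2 - \|\mathbf{a} - \mathbf{b}\|_2^2\right)$ together with $\boldsymbol{\theta}^{k+1} - \boldsymbol{\theta}^k = -\alpha\big(\nabla f(\boldsymbol{\theta}^k) - \boldsymbol{\varepsilon}^k + \mathbf{s}^k\big)$, and the last inequality results from $\|\mathbf{a} + \mathbf{b}\|_2^2 \le 2\|\mathbf{a}\|_2^2 + 2\|\mathbf{b}\|_2^2$.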
Appendix B Proof of Lemma 3
With Assumption 1, under the LAQ we have:
For the ease of expression, we define . Then the Lyapunov function defined in (16) can be written as
(21) 
Thus, we have
(22)  
where the second inequality follows from Young's inequality: . The last inequality results from