1 Introduction
With the proliferation of edge devices, such as mobile phones, wearables and smart home appliances, comes an increase in the amount of data rich in potential information which can be mined for the benefit of humankind. One of the approaches of turning the raw data into information is via federated learning (Konečný et al., 2016; McMahan et al., 2017), where typically a single global supervised model is trained in a massively distributed manner over a network of heterogeneous devices.
Training supervised distributed/federated learning models is typically performed by solving an optimization problem of the form

(1) $\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right\},$

where $n$ denotes the number of devices/machines/workers/clients, and $f_i : \mathbb{R}^d \to \mathbb{R}$ is a loss function associated with the data stored on device $i$. We will write $x^* \in \arg\min_{x} f(x)$; if more than one minimizer exists, $x^*$ denotes an arbitrary but fixed solution. We will rely on the solution concept captured in the following definition:
Definition 1 ($\epsilon$-solution)
A (possibly random) vector $\hat{x} \in \mathbb{R}^d$ is called an $\epsilon$-solution of problem (1) if $\mathbb{E}[f(\hat{x})] - f(x^*) \le \epsilon$.
In distributed and federated learning problems of the form (1), communication of messages across the network typically forms the key bottleneck of the training system. In the modern practice of supervised learning in general and deep learning in particular, this is exacerbated by the reliance on massive models described by millions or even billions of parameters. For these reasons, it is very important to devise novel and more efficient training algorithms capable of decreasing the overall communication cost, which can be formalized as the product of the number of communication rounds necessary to train a model of sufficient quality, and the computation and communication cost associated with a typical communication round.
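To make the setup concrete, the finite-sum structure of problem (1) can be sketched in a few lines of code. The quadratic local losses, the data sizes, and all variable names below are illustrative assumptions of ours, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3  # number of devices and problem dimension (illustrative)

# Device i privately holds data (A_i, b_i), inducing the local loss
# f_i(x) = 0.5 * ||A_i x - b_i||^2, which is convex and smooth.
A = [rng.standard_normal((4, d)) for _ in range(n)]
b = [rng.standard_normal(4) for _ in range(n)]

def f_i(i, x):
    r = A[i] @ x - b[i]
    return 0.5 * r @ r

def grad_f_i(i, x):
    return A[i].T @ (A[i] @ x - b[i])

# The global objective and its gradient are averages over the n devices,
# exactly as in the finite-sum form (1).
def f(x):
    return sum(f_i(i, x) for i in range(n)) / n

def grad_f(x):
    return sum(grad_f_i(i, x) for i in range(n)) / n
```

A centered finite-difference check confirms that `grad_f` is indeed the gradient of the averaged objective.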
1.1 Methods with compressed communication
One of the most common strategies for improving communication complexity is communication compression (Seide et al., 2014; Alistarh et al., 2017; Wangni et al., 2018; Horváth et al., 2019a; Mishchenko et al., 2019; Horváth et al., 2019b; Li et al., 2020; Li and Richtárik, 2020). This strategy is based on the reduction of the size of communicated messages via the application of a suitably chosen lossy compression mechanism, saving precious time spent in each communication round, and hoping that this will not increase the total number of communication rounds.
Several recent theoretical results suggest that by combining an appropriate (randomized) compression operator with a suitably designed gradient-type method, one can obtain improvement in the total communication complexity over comparable baselines not performing any compression. For instance, this is the case for distributed compressed gradient descent (CGD) (Alistarh et al., 2017; Khirirat et al., 2018; Horváth et al., 2019a; Li and Richtárik, 2020), and for distributed CGD methods which employ variance reduction to tame the variance introduced by compression (Hanzely et al., 2018; Mishchenko et al., 2019; Horváth et al., 2019b; Li and Richtárik, 2020; Gorbunov et al., 2021).

1.2 Methods with acceleration
An alternative approach to communication complexity reduction is based on Nesterov/Polyak acceleration/momentum (Nesterov, 1983, 2004), which aims to reduce the total number of communication rounds.
Acceleration of gradient-type methods is widely studied for standard optimization problems (Lan and Zhou, 2015; Lin et al., 2015; Allen-Zhu, 2017; Lan et al., 2019; Li and Li, 2020; Kovalev et al., 2020; Li, 2021a). Deep learning practitioners typically rely on Adam (Kingma and Ba, 2014), or one of its many variants, which besides other tricks also adopts momentum.
1.3 Can communication compression and acceleration be combined?
Encouraged by the recent theoretical success of communication compression, and the widespread success of accelerated methods, in this paper we seek to further enhance CGD methods with acceleration/momentum, with the aim to obtain provable improvements in overall communication complexity.
Can distributed gradienttype methods theoretically benefit from the combination of gradient compression and acceleration/momentum? To the best of our knowledge, no such results exist in the general convex regime, and in this paper we close this gap by designing a method that can provably enjoy the advantages of both compression (compressed communication in each round) and acceleration (much fewer communication rounds).
While there is an abundance of research studying communication compression and acceleration in isolation, there is very limited work on the combination of both approaches. The first successful combination of gradient compression and acceleration/momentum was recently achieved by the ADIANA method of Li et al. (2020). However, Li et al. (2020) only provide theoretical results for strongly convex problems, and their method is not applicable to (general) convex problems. So, one needs to both design a new method to handle the convex case, and perform its analysis. A priori, it is not clear at all what approach would work.
To the best of our knowledge, besides the initial work (Li et al., 2020), we are only aware of two other works addressing this question (Ye et al., 2020; Qian et al., 2020). However, both of these works still focus only on the simpler and less practically relevant strongly convex setting. Thus, this line of research is still largely unexplored. For instance, the well-known logistic regression problem is convex but not strongly convex. Finally, even if a problem is strongly convex, the modulus of strong convexity is typically not known, or is hard to estimate properly.
2 Summary of Contributions
In this paper we propose and analyze an accelerated gradient method with compressed communication, which we call CANITA (described in Algorithm 1), for solving distributed general convex optimization problems of the form (1). In particular, CANITA can loosely be seen as a combination of the accelerated gradient method ANITA of Li (2021a), and the variance-reduced compressed gradient method DIANA of Mishchenko et al. (2019). Ours is the first work combining the benefits of communication compression and acceleration in the general convex regime. For example, the work of Li et al. (2020) critically relies on strong convexity, which is very restrictive.
2.1 First accelerated rate for compressed gradient methods in the convex regime
For general convex problems, CANITA is the first compressed communication gradient method with an accelerated rate. In particular, CANITA solves the distributed problem (1) in

$O\left( \sqrt{\left(1 + \frac{\omega^3}{n}\right)\frac{L}{\epsilon}} + \omega\left(\frac{1}{\epsilon}\right)^{1/3} \right)$

communication rounds, which improves upon the current state-of-the-art result $O\left( \left(1 + \frac{\omega}{n}\right)\frac{L}{\epsilon} \right)$ achieved by the DIANA method (Khaled et al., 2020b). See Table 1 for more comparisons.
Let us now illustrate the improvement coming from this new bound with concrete numerical values. Suppose random sparsification or quantization is used for compression, so that the compression variance parameter grows as the compression ratio shrinks (see Section 3.1). If the number of devices/machines $n$ is large relative to the compression variance, and the target error tolerance is $\epsilon = 10^{-6}$, then the accelerated $O(\sqrt{L/\epsilon})$ bound of our CANITA method requires on the order of $10^3$ communication rounds (treating $L$ as a constant), while the non-accelerated $O(L/\epsilon)$ bound of the previous state-of-the-art method DIANA (Khaled et al., 2020b) requires on the order of $10^6$, i.e., $10^3$ vs. $10^6$ rounds. This is an improvement of three orders of magnitude.
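The three-orders-of-magnitude gap is easy to check numerically. A minimal sketch, under our own simplifying assumptions (absolute constants and the smoothness factor $L$ suppressed, target accuracy $\epsilon = 10^{-6}$):

```python
import math

eps = 1e-6  # target error tolerance (our illustrative choice)

# A non-accelerated O(1/eps) method needs on the order of 1/eps rounds,
# while an accelerated O(1/sqrt(eps)) method needs on the order of
# sqrt(1/eps) rounds (constants and L suppressed).
rounds_non_accelerated = 1.0 / eps
rounds_accelerated = math.sqrt(1.0 / eps)

improvement = rounds_non_accelerated / rounds_accelerated
print(improvement)  # 1000.0, i.e., three orders of magnitude
```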
2.2 Accelerated rate with limited compression for free
For strongly convex problems, Li et al. (2020) showed that if the number of devices/machines $n$ is large, or the compression variance parameter $\omega$ is not very high, then their ADIANA method enjoys the benefits of both compression and acceleration (i.e., the accelerated $O(\sqrt{L/\mu}\log\frac{1}{\epsilon})$ rate of ADIANA vs. the non-accelerated $O(\frac{L}{\mu}\log\frac{1}{\epsilon})$ rate of previous works, where $\mu$ denotes the strong convexity parameter).
In this paper, we consider the general convex setting and show that the proposed CANITA also enjoys the benefits of both compression and acceleration. Similarly, if $\omega^3 \le n$ (i.e., many devices, or limited compression variance), CANITA achieves the accelerated rate $O(\sqrt{L/\epsilon} + \omega(1/\epsilon)^{1/3})$ vs. the $O(L/\epsilon)$ rate of previous works. This means that compression does not hurt the accelerated rate at all. Note that the second term $\omega(1/\epsilon)^{1/3}$ is of lower order compared with the first term $\sqrt{L/\epsilon}$.
Table 1: Comparison of distributed gradient methods with compressed communication. A "✓" indicates that the corresponding setting is analyzed, or that the method supports the corresponding feature.

| Algorithms | Strongly convex¹ | General convex | Compression | Acceleration |
|---|---|---|---|---|
| QSGD (Alistarh et al., 2017) | — | ✓² | ✓ | ✗ |
| DIANA (Mishchenko et al., 2019) | ✓ | — | ✓ | ✗ |
| DIANA (Horváth et al., 2019b) | ✓ | ✓ | ✓ | ✗ |
| DIANA (Khaled et al., 2020b) | — | ✓ | ✓ | ✗ |
| ADIANA (Li et al., 2020) | ✓ | — | ✓ | ✓ |
| CANITA (this paper) | — | ✓ | ✓ | ✓ |

¹ In the strongly convex column, the condition number is $\kappa := L/\mu$, where $L$ is the smoothness parameter and $\mu$ is the strong convexity parameter.
² QSGD (Alistarh et al., 2017) needs an additional bounded gradient assumption, i.e., $\|\nabla f_i(x)\|^2 \le G^2$ for all $i$ and $x$.
2.3 Novel proof technique
The proof behind the analysis of CANITA is significantly different from that of ADIANA (Li et al., 2020), which critically relies on strong convexity. Moreover, the theoretical rate in the strongly convex case is linear, i.e., $O(\log\frac{1}{\epsilon})$, while it is sublinear $O(1/\epsilon)$, or $O(1/\sqrt{\epsilon})$ if accelerated, in the general convex case. We hope that our novel analysis can provide new insights and shed light on future work.
3 Preliminaries
Let $[n]$ denote the set $\{1, 2, \ldots, n\}$, and let $\|\cdot\|$ denote the Euclidean norm for a vector and the spectral norm for a matrix. Let $\langle u, v\rangle$ denote the standard Euclidean inner product of two vectors $u$ and $v$. We use $O(\cdot)$ notation to hide absolute constants.
3.1 Assumptions about the compression operators
We now introduce the notion of a randomized compression operator which we use to compress the gradients to save on communication. We rely on a standard class of unbiased compressors (see Definition 2) that was used in the context of distributed gradient methods before (Alistarh et al., 2017; Khirirat et al., 2018; Horváth et al., 2019b; Li and Richtárik, 2020; Li et al., 2020).
Definition 2 (Compression operator)
A randomized map $C: \mathbb{R}^d \to \mathbb{R}^d$ is an $\omega$-compression operator if for all $x \in \mathbb{R}^d$,

(2) $\mathbb{E}[C(x)] = x, \qquad \mathbb{E}\left[\|C(x) - x\|^2\right] \le \omega \|x\|^2.$

In particular, no compression ($C(x) \equiv x$) implies $\omega = 0$.
It is well known that the conditions (2) are satisfied by many practically useful compression operators (see Table 1 in Beznosikov et al. (2020) and Safaryan et al. (2021)). For illustration purposes, we now present a couple of canonical examples: sparsification and quantization.
Example 1 (Random sparsification).
Given $k \in [d]$, the random-$k$ sparsification operator is defined by

$C(x) := \frac{d}{k} \, (\xi \odot x),$

where $\odot$ denotes the Hadamard (element-wise) product and $\xi \in \{0,1\}^d$ is a uniformly random binary vector with $k$ nonzero entries ($\|\xi\|_0 = k$). This random sparsification operator satisfies (2) with $\omega = \frac{d}{k} - 1$. By setting $k = d$, this reduces to the identity compressor, whose variance is obviously zero: $\omega = 0$.
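A minimal implementation of a random sparsification operator of this kind is straightforward; the function name and the use of NumPy are our own choices:

```python
import numpy as np

def rand_k_sparsify(x, k, rng):
    """Random-k sparsification: keep k coordinates chosen uniformly at
    random, zero out the rest, and rescale by d/k so that the operator
    is unbiased, i.e., E[C(x)] = x."""
    d = x.size
    xi = np.zeros(d)
    xi[rng.choice(d, size=k, replace=False)] = 1.0  # binary mask with k ones
    return (d / k) * xi * x
```

Setting `k = d` recovers the identity compressor, and averaging many samples illustrates unbiasedness empirically.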
Example 2 (Quantization).
Given $x \in \mathbb{R}^d$, the $(s, p)$-quantization operator is defined by

$C(x) := \mathrm{sign}(x) \cdot \|x\|_p \cdot \frac{\xi}{s},$

where $s \ge 1$ and $p \ge 1$ are integers, and $\xi \in \mathbb{R}^d$ is a random vector with $i$-th element

$\xi_i = \begin{cases} \ell + 1, & \text{with probability } \frac{s|x_i|}{\|x\|_p} - \ell, \\ \ell, & \text{otherwise.} \end{cases}$

The level $\ell$ satisfies $\frac{|x_i|}{\|x\|_p} \in \left[\frac{\ell}{s}, \frac{\ell+1}{s}\right]$. The probability is chosen so that $\mathbb{E}[\xi_i] = \frac{s|x_i|}{\|x\|_p}$, which makes $C(x)$ an unbiased estimator of $x$. This $(s, p)$-quantization operator satisfies (2) with $\omega = 2 + \frac{d^{1/p} + d^{1/2}}{s}$. In particular, QSGD (Alistarh et al., 2017) used $p = 2$ and $s = \sqrt{d}$ (i.e., $(\sqrt{d}, 2)$-quantization) and proved that the expected sparsity of $C(x)$ is $\mathbb{E}\left[\|C(x)\|_0\right] = O(\sqrt{d})$.
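A sketch of this quantization operator, fixing $p = 2$ for simplicity; the implementation and variable names are our own, not the paper's:

```python
import numpy as np

def quantize(x, s, rng):
    """(s, 2)-quantization: each magnitude |x_i| / ||x||_2 is randomly
    rounded to one of the levels l/s (l = 0, 1, ..., s), with the
    rounding probability chosen so that the operator is unbiased."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    ratio = np.abs(x) / norm              # lies in [0, 1]
    low = np.floor(ratio * s)             # lower level l with l/s <= ratio
    p_up = ratio * s - low                # P(round up), makes E[xi_i] = s * ratio_i
    xi = low + (rng.random(x.size) < p_up)
    return np.sign(x) * norm * xi / s
```

Every output coordinate lands on the grid $\{\pm \|x\| \, \ell/s\}$, and the randomized rounding makes the operator unbiased.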
3.2 Assumptions about the functions
Throughout the paper, we assume that the functions $f_i$ are convex and have Lipschitz continuous gradients.

Assumption 1
The functions $f_i: \mathbb{R}^d \to \mathbb{R}$ are convex, differentiable, and $L$-smooth. The last condition means that there exists a constant $L > 0$ such that for all $x, y \in \mathbb{R}^d$ we have

(3) $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|.$
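The smoothness inequality (3) can be verified numerically for a concrete convex quadratic; the choice of objective and all names here are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

def grad(x):
    # gradient of the convex quadratic f(x) = 0.5 * ||A x||^2
    return A.T @ (A @ x)

# For this f, the smallest valid smoothness constant L in (3) is the
# largest eigenvalue of A^T A.
L = np.linalg.eigvalsh(A.T @ A).max()

# Check ||grad(x) - grad(y)|| <= L ||x - y|| on random pairs of points.
for _ in range(100):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-10
```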
4 The CANITA Algorithm
In this section, we describe our method, for which we coin the name CANITA, designed for solving problem (1), which is of importance in distributed and federated learning, and contrast it to the most closely related methods ANITA, DIANA and ADIANA.
4.1 CANITA: description of the method
Our proposed method CANITA, formally described in Algorithm 1, is an accelerated gradient method supporting compressed communication. It is the first method combining the benefits of acceleration and compression in the general convex regime (without strong convexity).
In each communication round, each machine computes its local gradient, and then a shifted version of it is compressed and sent to the server (see Line 5 of Algorithm 1). The local shifts are adaptively changing throughout the iterative process (Line 6), and have the role of reducing the variance introduced by compression. If no compression is used, we may simply set all the shifts to zero. The server subsequently aggregates all received messages to obtain the gradient estimator, maintains the average of the local shifts (Line 8), performs a gradient update step (Line 9), and updates the momentum sequences (Lines 10 and 3). Finally, Line 11 adopts a randomized update rule for the auxiliary vectors, which simplifies the algorithm and its analysis, resembling the workings of the loopless SVRG method used in (Kovalev et al., 2020; Li, 2021a).
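The shifted-compression mechanics of Lines 5–8 can be sketched as follows. This is not the full CANITA (the momentum steps and the randomized auxiliary update are omitted), and the function and variable names are our own, chosen to mirror the DIANA-style update described above:

```python
import numpy as np

def shifted_compression_round(local_grads, shifts, compress, alpha):
    """One communication round of DIANA-style shifted compression.

    Each machine i sends the compressed difference C(g_i - h_i) between
    its local gradient g_i and its shift h_i; the server forms the
    gradient estimator from the average shift plus the average message,
    and every shift is moved a step of size alpha toward its gradient.
    """
    messages = [compress(g - h) for g, h in zip(local_grads, shifts)]
    g_est = np.mean(shifts, axis=0) + np.mean(messages, axis=0)
    new_shifts = [h + alpha * m for h, m in zip(shifts, messages)]
    return g_est, new_shifts
```

With the identity compressor the estimator coincides with the exact average gradient, which matches the remark above that the shifts are only needed to tame compression variance.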
4.2 CANITA vs. existing methods
CANITA can be loosely seen as a combination of the accelerated gradient method ANITA of Li (2021a), and the variance-reduced compressed gradient method DIANA of Mishchenko et al. (2019). In particular, CANITA uses momentum/acceleration steps (see Lines 3 and 10 of Algorithm 1) inspired by those of ANITA (Li, 2021a), and adopts the shifted compression framework on each machine (see Lines 5 and 6 of Algorithm 1) pioneered in the DIANA method (Mishchenko et al., 2019).
We prove that CANITA enjoys the benefits of both methods simultaneously, i.e., convergence acceleration of ANITA and gradient compression of DIANA.
Although CANITA can conceptually be seen as a combination of ANITA (Li, 2021a) and DIANA (Mishchenko et al., 2019; Horváth et al., 2019b; Khaled et al., 2020b) from an algorithmic perspective, the analysis of CANITA is entirely different. Let us now briefly outline some of the main differences.

For example, compared with ANITA (Li, 2021a), CANITA needs to deal with the extra compression of shifted local gradients in the distributed network. Thus, the gradient estimator obtained in Line 8 of Algorithm 1 is substantially different from, and more complicated than, the one in ANITA, which necessitates a novel proof technique.

Compared with DIANA (Mishchenko et al., 2019; Horváth et al., 2019b; Khaled et al., 2020b), the extra momentum steps in Lines 3 and 10 of Algorithm 1 make the analysis of CANITA more complicated than that of DIANA. We obtain the accelerated $O(1/\sqrt{\epsilon})$ rate rather than the non-accelerated $O(1/\epsilon)$ rate of DIANA, and this is impossible without a substantially different proof technique.

Compared with the accelerated DIANA method ADIANA of Li et al. (2020), the analysis of CANITA is also substantially different since CANITA cannot exploit the strong convexity assumed therein.
Finally, we refer the reader to Section 2, where our contributions are summarized, for additional discussion.
5 Convergence Results for the CANITA Algorithm
In this section, we provide convergence results for CANITA (Algorithm 1). In order to simplify the expressions appearing in our main result (see Theorem 1 in Section 5.1) and in the lemmas needed to prove it (see Section 6), it will be convenient to let
(4) 
5.1 Generic convergence result
We first present the main convergence theorem of CANITA for solving the distributed optimization problem (1) in the general convex regime.
Theorem 1
Suppose that Assumption 1 holds and that the compression operators used in Algorithm 1 satisfy (2) of Definition 2. Let the two positive sequences, the probabilities, and the positive stepsizes of Algorithm 1 satisfy the following relations:

(5) 

for all rounds, and

(6) 

for all rounds. Then the iterates of CANITA (Algorithm 1) satisfy the inequality
(7) 
where the quantities are defined in (4).
Proof: The proof of Theorem 1, which relies on six lemmas, is provided in Section 6. In particular, the theorem follows directly from the key Lemma 6 (see Section 6.2), while Lemma 6 in turn relies on the previous five Lemmas 1–5 (see Section 6.1); the proofs of these lemmas are deferred to Appendix A.
As we shall see in detail in Section 5.2, the two positive sequences can be fixed to constants (with the possible exception of the initial round, where the value may differ). However, the relaxation parameter needs to be decreasing, and the stepsize may be increasing until a certain threshold. In particular, we choose
(8) 
where the constants may depend on the compression parameter $\omega$ and the number of devices/machines $n$. As a result, the right-hand side of (7) will be of the order $O(1/K^2)$ after $K$ rounds, which indicates an accelerated rate. Hence, in order to find an $\epsilon$-solution of problem (1), i.e., a vector $\hat{x}$ such that
(9) $\mathbb{E}[f(\hat{x})] - f(x^*) \le \epsilon,$
the number of communication rounds of CANITA (Algorithm 1) is at most $O\left( \sqrt{\left(1 + \frac{\omega^3}{n}\right)\frac{L}{\epsilon}} + \omega\left(\frac{1}{\epsilon}\right)^{1/3} \right)$.
While the above rate has an accelerated dependence on $\epsilon$, it is crucial to study the omitted constants (see (8)), and in particular their dependence on the compression parameter $\omega$ and the number of devices/machines $n$. As expected, for any fixed target error $\epsilon$, the number of communication rounds (sufficient to guarantee that (9) holds) may grow with increasing levels of compression, i.e., with increasing $\omega$. However, at the same time, the communication cost of each round decreases with increasing $\omega$. It is easy to see that this trade-off works in favor of compression. In particular, as we mention in Section 2, if the number of devices is large, or the compression variance is not very high, then compression does not hurt the accelerated rate of communication rounds at all.
5.2 Detailed convergence result
We now instantiate Theorem 1 into a concrete Theorem 2, which leads to a detailed convergence result for CANITA (Algorithm 1) by specifying the choice of the parameters. The detailed proof of Theorem 2 is deferred to Appendix B.
Theorem 2
Suppose that Assumption 1 holds and that the compression operators used in Algorithm 1 satisfy (2) of Definition 2. Choose the two positive sequences as follows:
(10) 
Set the probabilities and positive stepsizes of Algorithm 1 as follows:
(11) 
and
(12) 
Then CANITA (Algorithm 1) satisfies
(13) 
According to (13), the number of communication rounds for CANITA (Algorithm 1) to find an $\epsilon$-solution of the distributed problem (1), i.e., $\mathbb{E}[f(\hat{x})] - f(x^*) \le \epsilon$, is at most

$O\left( \sqrt{\left(1 + \frac{\omega^3}{n}\right)\frac{L}{\epsilon}} + \omega\left(\frac{1}{\epsilon}\right)^{1/3} \right).$
6 Proof of Theorem 1
In order to prove Theorem 1, we first formulate six auxiliary results (Lemmas 1–6) in Section 6.1. The detailed proofs of these lemmas are deferred to Appendix A. Then in Section 6.2 we show that Theorem 1 follows from Lemma 6.
6.1 Six lemmas
First, we need a useful Lemma 1 which captures the change of the function value after a single gradient update step.
Lemma 1
Note that the iterates of CANITA are linked through the two momentum/interpolation steps (Lines 3 and 10 of Algorithm 1) and the gradient update step (Line 9 of Algorithm 1). The proof of Lemma 1 uses these relations together with the smoothness Assumption 1.

In the next lemma, we bound the last variance term appearing in (14) of Lemma 1. To simplify the notation, from now on we will write
(15) 
and recall the notation defined in (4).
Lemma 2
This lemma is proved by using the definition of the compression operator (i.e., (2)).
Now, we need to bound the two remaining terms in (16) of Lemma 2. We first show how to handle the first of them in the following Lemma 3.
Lemma 3
This lemma is proved by using the updates in Line 11 and Line 6 of Algorithm 1, the property (2) of the compression operator, and the smoothness Assumption 1.
Lemma 4
Suppose that Assumption 1 holds. Then for any two points the following inequality holds:
(18) 
The proof of this lemma follows directly from a standard result characterizing the smoothness of convex functions (see, e.g., Lemma 1 in Lan et al. (2019) or Li (2021a)).
Finally, we also need a result connecting the function values appearing in (14) of Lemma 1 and in (7) of Theorem 1 (recall the notation defined in (4)).
Now, we combine Lemmas 1–5 to obtain our final key lemma, which describes the recursive form of the objective function value after a single round.
Lemma 6
Suppose that Assumption 1 holds and that the compression operators used in Algorithm 1 satisfy (2) of Definition 2. Let the two positive sequences, the probabilities, and the positive stepsizes of Algorithm 1 satisfy the following relations:

(20) 

for all rounds, and

(21) 

for all rounds. Then the iterates of CANITA (Algorithm 1) satisfy the inequality
(22) 
6.2 Proof of Theorem 1
Now we are ready to prove the main convergence result, Theorem 1. According to Lemma 6, we know how the objective value changes after each round. Dividing both sides of (22) by the normalizing factor, we obtain
(23) 
Then according to the following conditions on the parameters (see (6) of Theorem 1):
(24) 
The proof of Theorem 1 is finished by telescoping (23) over the rounds via (24), noting that the same inequality (23) also holds in the initial round:
(25) 
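The telescoping step can be written generically as follows; the symbols $c_k$ and $\Psi_k$ are our own placeholders for the paper's per-round normalization and per-round quantity, not notation from the text. If (24) guarantees a per-round contraction of the form

```latex
c_k \Psi_k \;\le\; c_{k-1} \Psi_{k-1} \qquad \text{for all } k \ge 1,
```

then chaining these inequalities from $k = K$ down to $k = 1$ gives $c_K \Psi_K \le c_0 \Psi_0$, and dividing by $c_K$ yields a bound of the form (7).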
7 Conclusion
In this paper, we proposed CANITA: the first gradient method for distributed general convex optimization provably enjoying the benefits of both communication compression and convergence acceleration. There is very limited work on combining compression and acceleration; indeed, previous works focus only on the (much simpler) strongly convex setting. We hope that our novel algorithm and analysis can provide new insights and shed light on future work in this line of research. We leave further improvements to future work. For example, one may ask whether our approach can be combined with the benefits provided by multiple local update steps (McMahan et al., 2017; Stich, 2019; Khaled et al., 2020a; Karimireddy et al., 2020) or with additional variance reduction techniques (Horváth et al., 2019b; Li and Richtárik, 2020), and to what extent one can extend our results to structured nonconvex problems (Li et al., 2021; Li, 2021b; Li and Richtárik, 2021; Gorbunov et al., 2021; Richtárik et al., 2021).
References
 Alistarh et al. (2017) Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communicationefficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

Allen-Zhu (2017) Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
 Beznosikov et al. (2020) Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
 Gorbunov et al. (2021) Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, and Peter Richtárik. MARINA: Faster nonconvex distributed learning with compression. In International Conference on Machine Learning, pages 3788–3798. PMLR, arXiv:2102.07845, 2021.
 Hanzely et al. (2018) Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. SEGA: variance reduction via gradient sketching. In Advances in Neural Information Processing Systems 31, pages 2082–2093, 2018.
 Horváth et al. (2019a) Samuel Horváth, Chen-Yu Ho, Ľudovít Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019a.
 Horváth et al. (2019b) Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019b.

Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
 Khaled et al. (2020a) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529. PMLR, 2020a.
 Khaled et al. (2020b) Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M Gower, and Peter Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. arXiv preprint arXiv:2006.11573, 2020b.
 Khirirat et al. (2018) Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In The 3rd International Conference on Learning Representations, 2014.
 Konečný et al. (2016) Jakub Konečný, H. Brendan McMahan, Felix Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.
 Kovalev et al. (2020) Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.
 Lan and Zhou (2015) Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.
 Lan et al. (2019) Guanghui Lan, Zhize Li, and Yi Zhou. A unified variancereduced accelerated gradient method for convex optimization. In Advances in Neural Information Processing Systems, pages 10462–10472, 2019.
 Li (2021a) Zhize Li. ANITA: An optimal loopless accelerated variancereduced gradient method. arXiv preprint arXiv:2103.11333, 2021a.
 Li (2021b) Zhize Li. A short note of PAGE: Optimal convergence rates for nonconvex optimization. arXiv preprint arXiv:2106.09663, 2021b.
 Li and Li (2020) Zhize Li and Jian Li. A fast Anderson–Chebyshev acceleration for nonlinear optimization. In International Conference on Artificial Intelligence and Statistics, pages 1047–1057. PMLR, arXiv:1809.02341, 2020.
 Li and Richtárik (2020) Zhize Li and Peter Richtárik. A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013, 2020.
 Li and Richtárik (2021) Zhize Li and Peter Richtárik. ZeroSARAH: Efficient nonconvex finitesum optimization with zero full gradient computation. arXiv preprint arXiv:2103.01447, 2021.
 Li et al. (2020) Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richtárik. Acceleration for compressed gradient descent in distributed and federated optimization. In International Conference on Machine Learning, pages 5895–5904. PMLR, arXiv:2002.11364, 2020.
 Li et al. (2021) Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning, pages 6286–6295. PMLR, arXiv:2008.10898, 2021.
 Lin et al. (2015) Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for firstorder optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
 McMahan et al. (2017) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communicationefficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
 Mishchenko et al. (2019) Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
 Nesterov (1983) Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. In Doklady AN USSR, volume 269, pages 543–547, 1983.
 Nesterov (2004) Yurii Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004.
 Qian et al. (2020) Xun Qian, Peter Richtárik, and Tong Zhang. Error compensated distributed SGD can be accelerated. arXiv preprint arXiv:2010.00091, 2020.
 Richtárik et al. (2021) Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. arXiv preprint arXiv:2106.05203, 2021.
 Safaryan et al. (2021) Mher Safaryan, Egor Shulgin, and Peter Richtárik. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA, 2021.

Seide et al. (2014) Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 Stich (2019) Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.
 Wangni et al. (2018) Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communicationefficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1306–1316, 2018.
 Ye et al. (2020) Tian Ye, Peijun Xiao, and Ruoyu Sun. DEED: A general quantization scheme for communication efficiency in bits. arXiv preprint arXiv:2006.11401, 2020.
Appendix A Missing Proofs for Lemmas in Section 6
In Section 6, we provided the proof of Theorem 1 using six lemmas. We now present the omitted proofs of Lemmas 1–6.
A.1 Proof of Lemma 1
According to the smoothness of the objective (Assumption 1), we have
(26)  
(27)  
(28)  
(29)  