# CANITA: Faster Rates for Distributed Convex Optimization with Communication Compression

Due to the high communication cost in distributed and federated learning, methods relying on compressed communication are becoming increasingly popular. Moreover, the best theoretically and practically performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of communications (faster convergence), e.g., Nesterov's accelerated gradient descent (Nesterov, 2004) and Adam (Kingma and Ba, 2014). In order to combine the benefits of communication compression and convergence acceleration, we propose a compressed and accelerated gradient method for distributed optimization, which we call CANITA. Our CANITA achieves the first accelerated rate $O\big(\sqrt{\big(1+\sqrt{\omega^3/n}\big)\frac{L}{\epsilon}}+\omega\big(\frac{1}{\epsilon}\big)^{1/3}\big)$, which improves upon the state-of-the-art non-accelerated rate $O\big(\big(1+\frac{\omega}{n}\big)\frac{L}{\epsilon}+\frac{\omega^2+n}{\omega+n}\cdot\frac{1}{\epsilon}\big)$ of DIANA (Khaled et al., 2020b) for distributed general convex problems, where $\epsilon$ is the target error, $L$ is the smoothness parameter of the objective, $n$ is the number of machines/devices, and $\omega$ is the compression parameter (larger $\omega$ means more compression can be applied, and no compression corresponds to $\omega=0$). Our results show that as long as the number of devices $n$ is large (often true in distributed/federated learning), or the compression $\omega$ is not very high, CANITA achieves the faster convergence rate $O\big(\sqrt{L/\epsilon}\big)$, i.e., the number of communication rounds is $O\big(\sqrt{L/\epsilon}\big)$ (vs. $O(L/\epsilon)$ achieved by previous works). As a result, CANITA enjoys the advantages of both compression (compressed communication in each round) and acceleration (much fewer communication rounds).


## 1 Introduction

With the proliferation of edge devices, such as mobile phones, wearables and smart home appliances, comes an increase in the amount of data rich in potential information which can be mined for the benefit of humankind. One of the approaches of turning the raw data into information is via federated learning (Konečný et al., 2016; McMahan et al., 2017), where typically a single global supervised model is trained in a massively distributed manner over a network of heterogeneous devices.

Training supervised distributed/federated learning models is typically performed by solving an optimization problem of the form

 $\min_{x\in\mathbb{R}^d}\Big\{f(x):=\frac{1}{n}\sum_{i=1}^n f_i(x)\Big\},$ (1)

where $n$ denotes the number of devices/machines/workers/clients, and $f_i:\mathbb{R}^d\to\mathbb{R}$ is a loss function associated with the data stored on device $i$. We will write

 $x^*:=\arg\min_{x\in\mathbb{R}^d}f(x).$

If more than one minimizer exists, $x^*$ denotes an arbitrary but fixed solution. We will rely on the solution concept captured in the following definition:

###### Definition 1

A random vector $\hat{x}\in\mathbb{R}^d$ is called an $\epsilon$-solution of the distributed problem (1) if

 $\mathbb{E}[f(\hat{x})]-f(x^*)\le\epsilon,$

where the expectation is with respect to the randomness inherent in the algorithm used to produce $\hat{x}$.

In distributed and federated learning problems of the form (1), communication of messages across the network typically forms the key bottleneck of the training system. In the modern practice of supervised learning in general and deep learning in particular, this is exacerbated by the reliance on massive models described by millions or even billions of parameters. For these reasons, it is very important to devise novel and more efficient training algorithms capable of decreasing the overall communication cost, which can be formalized as the product of the number of communication rounds necessary to train a model of sufficient quality, and the computation and communication cost associated with a typical communication round.

### 1.1 Methods with compressed communication

One of the most common strategies for improving communication complexity is communication compression (Seide et al., 2014; Alistarh et al., 2017; Wangni et al., 2018; Horváth et al., 2019a; Mishchenko et al., 2019; Horváth et al., 2019b; Li et al., 2020; Li and Richtárik, 2020). This strategy is based on the reduction of the size of communicated messages via the application of a suitably chosen lossy compression mechanism, saving precious time spent in each communication round, and hoping that this will not increase the total number of communication rounds.

Several recent theoretical results suggest that by combining an appropriate (randomized) compression operator with a suitably designed gradient-type method, one can obtain improvement in the total communication complexity over comparable baselines not performing any compression. For instance, this is the case for distributed compressed gradient descent (CGD) (Alistarh et al., 2017; Khirirat et al., 2018; Horváth et al., 2019a; Li and Richtárik, 2020), and for distributed CGD methods which employ variance reduction to tame the variance introduced by compression (Hanzely et al., 2018; Mishchenko et al., 2019; Horváth et al., 2019b; Li and Richtárik, 2020; Gorbunov et al., 2021).

### 1.2 Methods with acceleration

An alternative approach to communication complexity reduction is based on the Nesterov/Polyak acceleration/momentum (Nesterov, 1983, 2004), which aims to reduce the total number of communication rounds.

Acceleration of gradient-type methods is widely studied for standard optimization problems (Lan and Zhou, 2015; Lin et al., 2015; Allen-Zhu, 2017; Lan et al., 2019; Li and Li, 2020; Kovalev et al., 2020; Li, 2021a). Deep learning practitioners typically rely on Adam (Kingma and Ba, 2014), or one of its many variants, which besides other tricks also adopts momentum.

### 1.3 Can communication compression and acceleration be combined?

Encouraged by the recent theoretical success of communication compression, and the widespread success of accelerated methods, in this paper we seek to further enhance CGD methods with acceleration/momentum, with the aim to obtain provable improvements in overall communication complexity.

Can distributed gradient-type methods theoretically benefit from the combination of gradient compression and acceleration/momentum? To the best of our knowledge, no such results exist in the general convex regime, and in this paper we close this gap by designing a method that can provably enjoy the advantages of both compression (compressed communication in each round) and acceleration (much fewer communication rounds).

While there is an abundance of research studying communication compression and acceleration in isolation, there is very limited work on the combination of the two approaches. The first successful combination of gradient compression and acceleration/momentum was recently achieved by the ADIANA method of Li et al. (2020). However, Li et al. (2020) only provide theoretical results for strongly convex problems, and their method is not applicable to (general) convex problems. So, one needs to both design a new method to handle the convex case and perform its analysis. A priori, it is not at all clear what approach would work.

To the best of our knowledge, besides the initial work (Li et al., 2020), we are only aware of two other works addressing this question (Ye et al., 2020; Qian et al., 2020). However, both of these works still focus only on the simpler and less practically relevant strongly convex setting. Thus, this line of research is still largely unexplored. For instance, the well-known logistic regression problem is convex but not strongly convex. Finally, even if a problem is strongly convex, the modulus of strong convexity is typically not known, or is hard to estimate properly.

## 2 Summary of Contributions

In this paper we propose and analyze an accelerated gradient method with compressed communication, which we call CANITA (described in Algorithm 1), for solving distributed general convex optimization problems of the form (1). In particular, CANITA can loosely be seen as a combination of the accelerated gradient method ANITA of Li (2021a), and the variance-reduced compressed gradient method DIANA of Mishchenko et al. (2019). Ours is the first work combining the benefits of communication compression and acceleration in the general convex regime. For example, the work of Li et al. (2020) critically relies on strong convexity, which is very restrictive.

### 2.1 First accelerated rate for compressed gradient methods in the convex regime

For general convex problems, CANITA is the first compressed communication gradient method with an accelerated rate. In particular, our CANITA solves the distributed problem (1) in

 $O\left(\sqrt{\left(1+\sqrt{\frac{\omega^3}{n}}\right)\frac{L}{\epsilon}}+\omega\left(\frac{1}{\epsilon}\right)^{1/3}\right)$

communication rounds, which improves upon the current state-of-the-art result

 $O\left(\left(1+\frac{\omega}{n}\right)\frac{L}{\epsilon}+\frac{\omega^2+n}{\omega+n}\cdot\frac{1}{\epsilon}\right)$

achieved by the DIANA method (Khaled et al., 2020b). See Table 1 for more comparisons.

Let us now illustrate the improvements coming from this new bound on a concrete example. Suppose the messages are compressed via random sparsification or quantization at a given compression ratio, with the corresponding compression parameter $\omega$ (see Section 3.1), and fix the number of devices/machines $n$ and the target error tolerance $\epsilon$. Plugging concrete values into the two bounds above, the number of communication rounds of our CANITA method can be smaller than that of the previous state-of-the-art method DIANA (Khaled et al., 2020b) by three orders of magnitude.

### 2.2 Compression does not hurt the accelerated rate

For strongly convex problems, Li et al. (2020) showed that if the number of devices/machines $n$ is large, or the compression variance parameter $\omega$ is not very high, then their ADIANA method enjoys the benefits of both compression and acceleration (i.e., an accelerated $\sqrt{L/\mu}$-type number of communication rounds for ADIANA vs. the non-accelerated $L/\mu$-type number of previous works, where $\mu$ denotes the strong convexity modulus).

In this paper, we consider the general convex setting and show that the proposed CANITA also enjoys the benefits of both compression and acceleration. Similarly, if $\omega^3=O(n)$ (i.e., many devices, or limited compression variance), CANITA achieves the accelerated rate $O\big(\sqrt{L/\epsilon}\big)$ vs. the $O(L/\epsilon)$ rate of previous works. This means that the compression does not hurt the accelerated rate at all. Note that the second term $\omega\big(\frac{1}{\epsilon}\big)^{1/3}$ of our bound is of a lower order compared with the first term $\sqrt{L/\epsilon}$.
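To get a feel for these regimes, one can compare the dominant terms of the two complexity bounds numerically. The sketch below drops the constants hidden by the $O(\cdot)$ notation, and the values of $L$, $\epsilon$, $\omega$, $n$ are hypothetical, chosen only for illustration; the function names are ours.

```python
import math

def canita_rounds(L, eps, omega, n):
    # O( sqrt((1 + sqrt(omega^3/n)) * L/eps) + omega * (1/eps)^(1/3) ), constants dropped
    return math.sqrt((1 + math.sqrt(omega**3 / n)) * L / eps) + omega * (1 / eps) ** (1 / 3)

def diana_rounds(L, eps, omega, n):
    # O( (1 + omega/n) * L/eps + (omega^2 + n)/(omega + n) * 1/eps ), constants dropped
    return (1 + omega / n) * L / eps + (omega**2 + n) / (omega + n) / eps

# Hypothetical values: many devices, moderate compression (omega^3 comparable to n).
L, eps, omega, n = 1.0, 1e-6, 10, 10**6
print(f"CANITA: ~{canita_rounds(L, eps, omega, n):.0f} rounds")  # ~ 1/sqrt(eps) scale
print(f"DIANA:  ~{diana_rounds(L, eps, omega, n):.0f} rounds")   # ~ 1/eps scale
```

With these values $\omega^3/n=10^{-3}$, so the first term of the CANITA bound is essentially $\sqrt{L/\epsilon}$; the gap to DIANA comes from the square root in $\epsilon$.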

### 2.3 Novel proof technique

The proof behind the analysis of CANITA is significantly different from that of ADIANA (Li et al., 2020), which critically relies on strong convexity. Moreover, the theoretical rate in the strongly convex case is linear (i.e., $O(\log\frac{1}{\epsilon})$ communication rounds), while it is sublinear $O(\frac{1}{\epsilon})$, or $O(\frac{1}{\sqrt{\epsilon}})$ if accelerated, in the general convex case. We hope that our novel analysis can provide new insights and shed light on future work.

## 3 Preliminaries

Let $[n]$ denote the set $\{1,2,\ldots,n\}$, and let $\|\cdot\|$ denote the Euclidean norm for a vector and the spectral norm for a matrix. Let $\langle u,v\rangle$ denote the standard Euclidean inner product of two vectors $u$ and $v$. We use $O(\cdot)$ and $\Omega(\cdot)$ to hide absolute constants.

### 3.1 Assumptions about the compression operators

We now introduce the notion of a randomized compression operator which we use to compress the gradients to save on communication. We rely on a standard class of unbiased compressors (see Definition 2) that was used in the context of distributed gradient methods before (Alistarh et al., 2017; Khirirat et al., 2018; Horváth et al., 2019b; Li and Richtárik, 2020; Li et al., 2020).

###### Definition 2 (Compression operator)

A randomized map $C:\mathbb{R}^d\to\mathbb{R}^d$ is an $\omega$-compression operator if

 $\mathbb{E}[C(x)]=x,\qquad \mathbb{E}[\|C(x)-x\|^2]\le\omega\|x\|^2,\qquad\forall x\in\mathbb{R}^d.$ (2)

In particular, no compression ($C(x)\equiv x$) implies $\omega=0$.

It is well known that the conditions (2) are satisfied by many practically useful compression operators (see Table 1 in (Beznosikov et al., 2020; Safaryan et al., 2021)). For illustration purposes, we now present a couple of canonical examples: sparsification and quantization.

#### Example 1 (Random sparsification).

Given $k\in\{1,2,\ldots,d\}$, the random-$k$ sparsification operator is defined by

 $C(x):=\frac{d}{k}\cdot(\xi_k\odot x),$

where $\odot$ denotes the Hadamard (element-wise) product and $\xi_k\in\{0,1\}^d$ is a uniformly random binary vector with $k$ nonzero entries ($\|\xi_k\|_0=k$). This random-$k$ sparsification operator satisfies (2) with $\omega=\frac{d}{k}-1$. By setting $k=d$, this reduces to the identity compressor, whose variance is obviously zero: $\omega=0$.
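A minimal sketch of this operator in code, together with a Monte Carlo check of unbiasedness and of the variance parameter $\omega=d/k-1$ (the function name and the numerical check are ours):

```python
import numpy as np

def rand_k(x, k, rng):
    """Random-k sparsification: keep k coordinates chosen uniformly at random
    and rescale by d/k so that E[C(x)] = x; satisfies (2) with omega = d/k - 1."""
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

# Empirical check of unbiasedness and of the variance parameter.
rng = np.random.default_rng(0)
d, k = 100, 10
omega = d / k - 1  # = 9
x = rng.standard_normal(d)
samples = np.stack([rand_k(x, k, rng) for _ in range(20000)])
bias = np.linalg.norm(samples.mean(axis=0) - x)
var = np.mean(np.sum((samples - x) ** 2, axis=1))
print(bias)                          # close to 0 (unbiased)
print(var / (omega * np.sum(x**2)))  # close to 1 (the bound (2) is tight for rand-k)
```

For this compressor the inequality in (2) in fact holds with equality, which the ratio printed above reflects.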

#### Example 2 (Quantization).

Given $p\ge 1$ and a positive integer $s$, the $(p,s)$-quantization operator is defined by

 $C(x):=\mathrm{sign}(x)\cdot\|x\|_p\cdot\frac{1}{s}\cdot\xi_s,$

where $\xi_s$ is a random vector with $i$-th element

 $\xi_s(i):=\begin{cases}l+1, & \text{with probability } \frac{|x_i|}{\|x\|_p}s-l,\\ l, & \text{otherwise.}\end{cases}$

Here $l\in\{0,1,\ldots,s-1\}$ is the integer level satisfying $\frac{|x_i|}{\|x\|_p}\in\big[\frac{l}{s},\frac{l+1}{s}\big]$. The probability is chosen so that $\mathbb{E}[\xi_s(i)]=\frac{|x_i|}{\|x\|_p}s$, which makes the operator unbiased. In particular, QSGD (Alistarh et al., 2017) used $p=2$ (i.e., $(2,s)$-quantization), showed that it satisfies (2) with $\omega=\min\{d/s^2,\sqrt{d}/s\}$, and proved that the expected sparsity of $C(x)$ is $\mathbb{E}[\|C(x)\|_0]=O\big(s(s+\sqrt{d})\big)$.
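A minimal sketch of this operator in code (the function name and the Monte Carlo unbiasedness check are ours; the level $l$ is obtained with a floor, and round-up happens with probability $r-l$):

```python
import numpy as np

def quantize(x, s, p, rng):
    """(p,s)-quantization sketch: |x_i|/||x||_p is randomly rounded to one of the
    s+1 levels l/s, upward with probability |x_i|/||x||_p * s - l, so E[C(x)] = x."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    r = np.abs(x) / norm * s                 # lies in [0, s]
    l = np.floor(r)                          # lower quantization level
    xi = l + (rng.random(x.size) < (r - l))  # round up w.p. r - l
    return np.sign(x) * norm * xi / s

# Empirical check of unbiasedness, E[C(x)] = x.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
samples = np.stack([quantize(x, s=4, p=2, rng=rng) for _ in range(20000)])
print(np.linalg.norm(samples.mean(axis=0) - x))  # close to 0
```

Note that each output coordinate is one of the discrete values $\pm\|x\|_p\, l/s$, which is what makes the message cheap to encode.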

### 3.2 Assumptions about the functions

Throughout the paper, we assume that the functions $f_i$ are convex and have Lipschitz continuous gradients.

###### Assumption 1

Functions $f_1,\ldots,f_n$ are convex, differentiable, and $L$-smooth. The last condition means that there exists a constant $L>0$ such that for all $i\in[n]$ we have

 $\|\nabla f_i(x)-\nabla f_i(y)\|\le L\|x-y\|,\qquad\forall x,y\in\mathbb{R}^d.$ (3)

It is easy to see that the objective $f$ in (1) satisfies (3) provided that the constituent functions $f_1,\ldots,f_n$ do.

## 4 The Canita Algorithm

In this section, we describe our method, for which we coin the name CANITA, designed for solving problem (1), which is of importance in distributed and federated learning, and contrast it to the most closely related methods ANITA, DIANA and ADIANA.

### 4.1 Canita: description of the method

Our proposed method CANITA, formally described in Algorithm 1, is an accelerated gradient method supporting compressed communication. It is the first method combining the benefits of acceleration and compression in the general convex regime (without strong convexity).

In each round $t$, each machine $i$ computes its local gradient (e.g., $\nabla f_i(y^t)$), and then a shifted version $\nabla f_i(y^t)-h_i^t$ is compressed and sent to the server (see Line 5 of Algorithm 1). The local shifts $h_i^t$ are adaptively changing throughout the iterative process (Line 6), and have the role of reducing the variance introduced by the compression operators. If no compression is used, we may simply set the shifts to be $h_i^t\equiv 0$ for all $i$ and $t$. The server subsequently aggregates all received messages to obtain the gradient estimator $g^t$ and to maintain the average of the local shifts (Line 8), then performs a gradient update step (Line 9) and updates the momentum sequences (Lines 10 and 3). Finally, Line 11 adopts a randomized update rule for the auxiliary vectors $w^{t+1}$, which simplifies the algorithm and the analysis, resembling the workings of the loopless SVRG method used in (Kovalev et al., 2020; Li, 2021a).
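The shifted-compression mechanism described above can be sketched as follows. This is a simplified sketch, not the full CANITA: the momentum steps and the parameter schedule are omitted and a plain gradient step is used on the server side; random-$k$ sparsification stands in for an arbitrary $\omega$-compression operator, and all function and variable names are ours.

```python
import numpy as np

def rand_k(v, k, rng):
    # stand-in for an arbitrary omega-compression operator (here omega = d/k - 1)
    d = v.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * v

def shifted_compression_round(x, grads, shifts, h_avg, eta, alpha, k, rng):
    """One communication round: device i uploads only C(grad_i - h_i); the server
    forms the estimator g = h_avg + mean_i C(grad_i - h_i), which is unbiased for
    the full gradient, and both sides move the shifts toward the local gradients."""
    msgs = [rand_k(g_i - h_i, k, rng) for g_i, h_i in zip(grads, shifts)]
    m_avg = np.mean(msgs, axis=0)
    g = x_step = h_avg + m_avg                              # gradient estimator
    shifts = [h_i + alpha * m for h_i, m in zip(shifts, msgs)]
    return x - eta * g, shifts, h_avg + alpha * m_avg

# Toy run: f_i(x) = 0.5 * ||x - c_i||^2, so the minimizer is the mean of the c_i.
rng = np.random.default_rng(0)
n, d, k = 10, 20, 5
alpha = 1.0 / (1.0 + (d / k - 1))   # alpha <= 1/(1 + omega)
c = rng.standard_normal((n, d))
x, shifts, h_avg = np.zeros(d), [np.zeros(d) for _ in range(n)], np.zeros(d)
for _ in range(400):
    grads = [x - c[i] for i in range(n)]
    x, shifts, h_avg = shifted_compression_round(x, grads, shifts, h_avg, 0.2, alpha, k, rng)
print(np.linalg.norm(x - c.mean(axis=0)))  # small: the shifts remove the compression variance
```

In the full method the estimator $g^t$ enters the gradient update of Line 9 and is followed by the momentum updates, rather than being used in a plain gradient step as above.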

### 4.2 Canita vs existing methods

CANITA can be loosely seen as a combination of the accelerated gradient method ANITA of Li (2021a), and the variance-reduced compressed gradient method DIANA of Mishchenko et al. (2019). In particular, CANITA uses momentum/acceleration steps (see Lines 3 and 10 of Algorithm 1) inspired by those of ANITA (Li, 2021a), and adopts the shifted compression framework for each machine (see Lines 5 and 6 of Algorithm 1) as pioneered in the DIANA method (Mishchenko et al., 2019).

We prove that CANITA enjoys the benefits of both methods simultaneously, i.e., convergence acceleration of ANITA and gradient compression of DIANA.

Although CANITA can conceptually be seen as a combination of ANITA (Li, 2021a) and DIANA (Mishchenko et al., 2019; Horváth et al., 2019b; Khaled et al., 2020b) from an algorithmic perspective, the analysis of CANITA is entirely different. Let us now briefly outline some of the main differences.

• For example, compared with ANITA (Li, 2021a), CANITA needs to deal with the extra compression of shifted local gradients in the distributed network. Thus, the gradient estimator $g^t$ obtained in Line 8 of Algorithm 1 is substantially different from, and more complicated than, the one in ANITA, which necessitates a novel proof technique.

• Compared with DIANA (Mishchenko et al., 2019; Horváth et al., 2019b; Khaled et al., 2020b), the extra momentum steps in Lines 3 and 10 of Algorithm 1 make the analysis of CANITA more complicated than that of DIANA. We obtain the accelerated rate $O\big(\sqrt{L/\epsilon}\big)$ rather than the non-accelerated rate $O(L/\epsilon)$ of DIANA, and this is impossible without a substantially different proof technique.

• Compared with the accelerated DIANA method ADIANA of Li et al. (2020), the analysis of CANITA is also substantially different since CANITA cannot exploit the strong convexity assumed therein.

Finally, please refer to Section 2 where we summarize our contributions for additional discussions.

## 5 Convergence Results for the Canita Algorithm

In this section, we provide convergence results for CANITA (Algorithm 1). In order to simplify the expressions appearing in our main result (see Theorem 1 in Section 5.1) and in the lemmas needed to prove it (see Section 6), it will be convenient to let

 $F_t:=f(w^t)-f(x^*),\qquad H_t:=\frac{1}{n}\sum_{i=1}^n\|\nabla f_i(w^t)-h_i^t\|^2,\qquad D_t:=\frac{1}{2}\|x^t-x^*\|^2.$ (4)

### 5.1 Generic convergence result

We first present the main convergence theorem of CANITA for solving the distributed optimization problem (1) in the general convex regime.

###### Theorem 1

Suppose that Assumption 1 holds and the compression operators used in Algorithm 1 satisfy (2) of Definition 2. For any two positive sequences $\{\beta_t\}$ and $\{\gamma_t\}$ such that the probabilities $\{p_t\}$, the parameters $\{\alpha_t\},\{\theta_t\}$, and the positive stepsizes $\{\eta_t\}$ of Algorithm 1 satisfy the following relations

 $\alpha_t\le\frac{1}{1+\omega},\qquad \eta_t\le\frac{1}{L\left(1+\beta_t+4p_t\gamma_t\left(1+\frac{2p_t}{\alpha_t}\right)\right)}$ (5)

for all $t\ge 0$, and

 $\frac{2\omega\beta_t}{n}+4p_t\gamma_t\left(1+\frac{2p_t}{\alpha_t}\right)\le 1-\theta_t,\qquad \frac{(1-p_t\theta_t)\eta_t}{p_t\theta_t^2}\le\frac{\eta_{t-1}}{p_{t-1}\theta_{t-1}^2},\qquad \left(\frac{\omega\beta_t}{n}+\left(1-\frac{\alpha_t}{2}\right)\gamma_t\right)\frac{\eta_t}{\theta_t^2}\le\frac{\gamma_{t-1}\eta_{t-1}}{\theta_{t-1}^2}$ (6)

for all $t\ge 1$. Then the sequences of CANITA (Algorithm 1) satisfy, for all $t\ge 0$, the inequality

 $\mathbb{E}\left[F_{t+1}+\frac{\gamma_tp_t}{L}H_{t+1}\right]\le\frac{\theta_t^2p_t}{\eta_t}\left(\frac{(1-\theta_0p_0)\eta_0}{\theta_0^2p_0}F_0+\left(\frac{\omega\beta_0}{n}+\left(1-\frac{\alpha_0}{2}\right)\gamma_0\right)\frac{\eta_0}{\theta_0^2L}H_0+D_0\right),$ (7)

where the quantities $F_t$, $H_t$ and $D_t$ are defined in (4).

Proof: The proof of Theorem 1, which relies on six lemmas, is provided in Section 6. In particular, the proof follows directly from the key Lemma 6 (see Section 6.2), while Lemma 6 in turn relies on the previous five lemmas, Lemmas 1–5 (see Section 6.1). Note that we defer the proofs of these lemmas to Appendix A.

As we shall see in detail in Section 5.2, the sequences $\{\beta_t\}$ and $\{\gamma_t\}$ can be fixed to some constants (with one exception: while we indeed choose $\beta_t\equiv\beta$ for all $t\ge 1$, the value of $\beta_0$ may be different). However, the relaxation parameter $\theta_t$ needs to be decreasing, and the stepsize $\eta_t$ may be increasing until a certain threshold. In particular, we choose

 $\beta_t\equiv c_1,\quad \gamma_t\equiv c_2,\quad p_t\equiv c_3,\quad \alpha_t\equiv c_4,\quad \theta_t=\frac{c_5}{t+c_6},\quad \eta_t=\min\left\{\left(1+\frac{1}{t+c_7}\right)\eta_{t-1},\ \frac{1}{c_8L}\right\},$ (8)

where the constants $c_1,\ldots,c_8$ may depend on the compression parameter $\omega$ and the number of devices/machines $n$. As a result, the right-hand side of (7) will be of the order $O\big(\frac{L}{T^2}+\frac{\omega^3}{T^3}\big)$, which indicates an accelerated rate. Hence, in order to find an $\epsilon$-solution of problem (1), i.e., a vector $w^{T+1}$ such that

 $\mathbb{E}[f(w^{T+1})]-f(x^*)\overset{(4)}{=}\mathbb{E}[F_{T+1}]\le\epsilon,$ (9)

the number of communication rounds of CANITA (Algorithm 1) is at most

 $T=O\left(\sqrt{\left(1+\sqrt{\frac{\omega^3}{n}}\right)\frac{L}{\epsilon}}+\omega\left(\frac{1}{\epsilon}\right)^{1/3}\right).$

While the above rate has an accelerated dependence on $\epsilon$, it will be crucial to study the omitted constants (see (8)), and in particular their dependence on the compression parameter $\omega$ and the number of devices/machines $n$. As expected, for any fixed target error $\epsilon$, the number of communication rounds (sufficient to guarantee that (9) holds) may grow with increasing levels of compression, i.e., with increasing $\omega$. However, at the same time, the communication cost in each round decreases with increasing $\omega$. It is easy to see that this trade-off benefits compression. In particular, as we mention in Section 2, if the number of devices $n$ is large, or the compression variance $\omega$ is not very high, then compression does not hurt the accelerated rate of communication rounds at all.

### 5.2 Detailed convergence result

We now formulate a concrete Theorem 2 from Theorem 1 which leads to a detailed convergence result for CANITA (Algorithm 1) by specifying the choice of the parameters $\{\beta_t\}$, $\{\gamma_t\}$, $\{p_t\}$, $\{\alpha_t\}$, $\{\theta_t\}$ and $\{\eta_t\}$. The detailed proof of Theorem 2 is deferred to Appendix B.

###### Theorem 2

Suppose that Assumption 1 holds and the compression operators used in Algorithm 1 satisfy (2) of Definition 2. Let $b\ge 0$ be a parameter depending on $\omega$ and $n$ (its precise value is specified in the proof in Appendix B), and choose the two positive sequences $\{\beta_t\}$ and $\{\gamma_t\}$ as follows:

 $\beta_t=\begin{cases}\beta_0=\frac{9(1+b+\omega)^2}{1+b} & \text{for } t=0,\\[2pt] \beta\equiv\frac{48\omega(1+\omega)(1+b+2(1+\omega))}{n(1+b)^2} & \text{for } t\ge 1,\end{cases}\qquad \gamma_t=\gamma\equiv\frac{(1+b)^2}{8(1+b+2(1+\omega))}\quad\text{for } t\ge 0.$ (10)

If we set the probabilities and positive stepsizes of Algorithm 1 as follows:

 $p_t\equiv\frac{1}{1+b},\qquad \alpha_t\equiv\frac{1}{1+\omega},\qquad \theta_t=\frac{3(1+b)}{t+9(1+b+\omega)},\qquad\text{for } t\ge 0,$ (11)

and

 $\eta_t=\begin{cases}\frac{1}{L(\beta_0+3/2)} & \text{for } t=0,\\[2pt] \min\left\{\left(1+\frac{1}{t+9(1+b+\omega)}\right)\eta_{t-1},\ \frac{1}{L(\beta+3/2)}\right\} & \text{for } t\ge 1.\end{cases}$ (12)

Then CANITA (Algorithm 1), for all $T\ge 0$, satisfies

 $\mathbb{E}[F_{T+1}]\le O\left(\left(1+\sqrt{\frac{\omega^3}{n}}\right)\frac{L}{T^2}+\frac{\omega^3}{T^3}\right).$ (13)

According to (13), the number of communication rounds for CANITA (Algorithm 1) to find an -solution of the distributed problem (1), i.e.,

 $\mathbb{E}[f(w^{T+1})]-f(x^*)\overset{(4)}{=}\mathbb{E}[F_{T+1}]\le\epsilon,$

is at most

 $T=O\left(\sqrt{\left(1+\sqrt{\frac{\omega^3}{n}}\right)\frac{L}{\epsilon}}+\omega\left(\frac{1}{\epsilon}\right)^{1/3}\right).$
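For completeness, the round complexity displayed above follows from (13) by requiring each of its two terms to be at most $\epsilon/2$:

```latex
\frac{\left(1+\sqrt{\omega^3/n}\right)L}{T^2}\le\frac{\epsilon}{2}
\quad\Longleftarrow\quad
T\ge\sqrt{\frac{2\left(1+\sqrt{\omega^3/n}\right)L}{\epsilon}},
\qquad\qquad
\frac{\omega^3}{T^3}\le\frac{\epsilon}{2}
\quad\Longleftarrow\quad
T\ge\omega\left(\frac{2}{\epsilon}\right)^{1/3},
```

so $\mathbb{E}[F_{T+1}]\le\epsilon$ once $T$ exceeds the sum of the two thresholds, which is exactly the stated bound up to absolute constants.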

## 6 Proof of Theorem 1

In order to prove Theorem 1, we first formulate six auxiliary results (Lemmas 1–6) in Section 6.1. The detailed proofs of these lemmas are deferred to Appendix A. Then in Section 6.2 we show that Theorem 1 follows from Lemma 6.

### 6.1 Six lemmas

First, we need a useful Lemma 1 which captures the change of the function value after a single gradient update step.

###### Lemma 1

Suppose that Assumption 1 holds. For any $\beta_t>0$, the following inequality holds for CANITA (Algorithm 1) for any round $t\ge 0$:

 $\mathbb{E}[f(z^{t+1})]\le\mathbb{E}\Big[f(y^t)+\langle\nabla f(y^t),\theta_t(x^*-x^t)\rangle+\frac{\theta_t^2}{\eta_t}(D_t-D_{t+1})-\Big(\frac{\theta_t^2}{2\eta_t}-\frac{L(1+\beta_t)\theta_t^2}{2}\Big)\|x^{t+1}-x^t\|^2+\frac{1}{2L\beta_t}\|\nabla f(y^t)-g^t\|^2\Big].$ (14)

Note that

 $z^{t+1}-y^t=\theta_t(x^{t+1}-x^t)=-\eta_tg^t$

according to the two momentum/interpolation steps of CANITA (see Lines 3 and 10 of Algorithm 1) and the gradient update step (see Line 9 of Algorithm 1). The proof of Lemma 1 uses these relations and the smoothness Assumption 1.

In the next lemma, we bound the last variance term $\|\nabla f(y^t)-g^t\|^2$ appearing in (14) of Lemma 1. To simplify the notation, from now on we will write

 $Y_t:=\frac{1}{n}\sum_{i=1}^n\|\nabla f_i(w^t)-\nabla f_i(y^t)\|^2,$ (15)

and recall that $H_t=\frac{1}{n}\sum_{i=1}^n\|\nabla f_i(w^t)-h_i^t\|^2$ is defined in (4).

###### Lemma 2

If $g^t$ is as defined in Line 8 of Algorithm 1, and the compression operators satisfy (2) of Definition 2, we have

 $\mathbb{E}[\|\nabla f(y^t)-g^t\|^2]\le\frac{2\omega}{n}(Y_t+H_t).$ (16)

This lemma is proved by using the definition of the $\omega$-compression operator (i.e., (2)).
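The bound (16) can also be sanity-checked numerically for a concrete compressor. Below is a sketch under stated assumptions: random-$k$ sparsification plays the role of the compressor, random vectors stand in for $\nabla f_i(y^t)$, $\nabla f_i(w^t)$ and the shifts $h_i^t$, and $g^t$ is formed as described in Section 4.1 (shift average plus the average of the compressed shifted gradients); all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 8, 30, 6
omega = d / k - 1  # rand-k satisfies (2) with this omega

def compress(v):
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * v

grad_y = rng.standard_normal((n, d))  # stand-ins for grad f_i(y^t)
grad_w = rng.standard_normal((n, d))  # stand-ins for grad f_i(w^t)
h = rng.standard_normal((n, d))       # stand-ins for the shifts h_i^t
full_grad = grad_y.mean(axis=0)
h_avg = h.mean(axis=0)

# Monte Carlo estimate of E||grad f(y^t) - g^t||^2 with
# g^t = h_avg + (1/n) sum_i C(grad f_i(y^t) - h_i^t)
lhs = np.mean([
    np.sum((full_grad - (h_avg
        + np.mean([compress(grad_y[i] - h[i]) for i in range(n)], axis=0))) ** 2)
    for _ in range(3000)
])

Y = np.mean(np.sum((grad_w - grad_y) ** 2, axis=1))  # Y_t as in (15)
H = np.mean(np.sum((grad_w - h) ** 2, axis=1))       # H_t as in (4)
print(lhs <= 2 * omega / n * (Y + H))  # the bound (16) holds
```

The slack in the check comes from the inequality $\|a+b\|^2\le 2\|a\|^2+2\|b\|^2$ used in the proof to split $\nabla f_i(y^t)-h_i^t$ into the $Y_t$ and $H_t$ parts.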

Now, we need to bound the terms $Y_t$ and $H_t$ appearing in (16) of Lemma 2. We first show how to handle the term $H_t$ in the following Lemma 3.

###### Lemma 3

Suppose that Assumption 1 holds and let $\alpha_t\le\frac{1}{1+\omega}$. According to the probabilistic update of $w^{t+1}$ in Line 11 of Algorithm 1, we have

 $\mathbb{E}[H_{t+1}]\le\Big(1-\frac{\alpha_t}{2}\Big)H_t+2p_t\Big(1+\frac{2p_t}{\alpha_t}\Big)Y_t+2p_tL^2\theta_t^2\Big(1+\frac{2p_t}{\alpha_t}\Big)\mathbb{E}[\|x^{t+1}-x^t\|^2].$ (17)

This lemma is proved by using the updates of $w^{t+1}$ (Line 11 of Algorithm 1) and $h_i^{t+1}$ (Line 6 of Algorithm 1), the property of the $\omega$-compression operator (i.e., (2)), and the smoothness Assumption 1.

To deal with the term $Y_t$ appearing in Lemmas 2 and 3, we need the following result.

###### Lemma 4

Suppose that Assumption 1 holds. For any $t\ge 0$, the following inequality holds:

 $Y_t\le 2L\big(f(w^t)-f(y^t)-\langle\nabla f(y^t),w^t-y^t\rangle\big).$ (18)

The proof of this lemma directly follows from a standard result characterizing the $L$-smoothness of convex functions (see, e.g., Lemma 1 of (Lan et al., 2019; Li, 2021a)).

Finally, we also need a result connecting the function values $f(z^{t+1})$ in (14) of Lemma 1 and $F_{t+1}$ in (7) of Theorem 1 (recall that $F_{t+1}=f(w^{t+1})-f(x^*)$ by (4)).

###### Lemma 5

According to the probabilistic update of $w^{t+1}$ in Line 11 of Algorithm 1, we have

 $\mathbb{E}[f(w^{t+1})]=p_t\mathbb{E}[f(z^{t+1})]+(1-p_t)\mathbb{E}[f(w^t)].$ (19)

Now, we combine Lemmas 1–5 to obtain our final key lemma, which describes the recursive form of the objective function value after a single round.

###### Lemma 6

Suppose that Assumption 1 holds and the compression operators used in Algorithm 1 satisfy (2) of Definition 2. For any two positive sequences $\{\beta_t\}$ and $\{\gamma_t\}$ such that the probabilities $\{p_t\}$, the parameters $\{\alpha_t\},\{\theta_t\}$, and the positive stepsizes $\{\eta_t\}$ of Algorithm 1 satisfy the following relations

 $\alpha_t\le\frac{1}{1+\omega},\qquad \eta_t\le\frac{1}{L\left(1+\beta_t+4p_t\gamma_t\left(1+\frac{2p_t}{\alpha_t}\right)\right)}$ (20)

for all $t\ge 0$, and

 $\frac{2\omega\beta_t}{n}+4p_t\gamma_t\left(1+\frac{2p_t}{\alpha_t}\right)\le 1-\theta_t$ (21)

for all $t\ge 0$. Then the sequences of CANITA (Algorithm 1) satisfy, for all $t\ge 0$, the inequality

 $\mathbb{E}\Big[F_{t+1}+\frac{\gamma_tp_t}{L}H_{t+1}\Big]\le\mathbb{E}\Big[(1-\theta_tp_t)F_t+\Big(\frac{\omega\beta_t}{n}+\Big(1-\frac{\alpha_t}{2}\Big)\gamma_t\Big)\frac{p_t}{L}H_t+\frac{\theta_t^2p_t}{\eta_t}(D_t-D_{t+1})\Big].$ (22)

### 6.2 Proof of Theorem 1

Now, we are ready to prove the main convergence Theorem 1. According to Lemma 6, we know how the function value changes after each round. By dividing both sides of (22) by $\frac{\theta_t^2p_t}{\eta_t}$, we obtain

 $\mathbb{E}\Big[\frac{\eta_t}{\theta_t^2p_t}F_{t+1}+\frac{\gamma_t\eta_t}{\theta_t^2L}H_{t+1}\Big]\le\mathbb{E}\Big[(1-\theta_tp_t)\frac{\eta_t}{\theta_t^2p_t}F_t+\Big(\frac{\omega\beta_t}{n}+\Big(1-\frac{\alpha_t}{2}\Big)\gamma_t\Big)\frac{\eta_t}{\theta_t^2L}H_t+D_t-D_{t+1}\Big].$ (23)

Then, according to the following conditions on the parameters (see (6) of Theorem 1):

 $\frac{(1-p_t\theta_t)\eta_t}{p_t\theta_t^2}\le\frac{\eta_{t-1}}{p_{t-1}\theta_{t-1}^2}\quad\text{and}\quad\Big(\frac{\omega\beta_t}{n}+\Big(1-\frac{\alpha_t}{2}\Big)\gamma_t\Big)\frac{\eta_t}{\theta_t^2}\le\frac{\gamma_{t-1}\eta_{t-1}}{\theta_{t-1}^2},\qquad\forall t\ge 1,$ (24)

the proof of Theorem 1 is finished by telescoping (23) from $t=T$ down to $t=0$ via (24), and using the same inequality (23) for $t=0$:

 $\mathbb{E}\Big[F_{T+1}+\frac{\gamma_Tp_T}{L}H_{T+1}\Big]\le\frac{\theta_T^2p_T}{\eta_T}\Big(\frac{(1-\theta_0p_0)\eta_0}{\theta_0^2p_0}F_0+\Big(\frac{\omega\beta_0}{n}+\Big(1-\frac{\alpha_0}{2}\Big)\gamma_0\Big)\frac{\eta_0}{\theta_0^2L}H_0+D_0\Big).$ (25)

## 7 Conclusion

In this paper, we proposed CANITA: the first gradient method for distributed general convex optimization provably enjoying the benefits of both communication compression and convergence acceleration. There is very limited work on combining compression and acceleration. Indeed, previous works only focus on the (much simpler) strongly convex setting. We hope that our novel algorithm and analysis can provide new insights and shed light on future work in this line of research. We leave further improvements to future work. For example, one may ask whether our approach can be combined with the benefits provided by multiple local update steps (McMahan et al., 2017; Stich, 2019; Khaled et al., 2020a; Karimireddy et al., 2020), with additional variance reduction techniques (Horváth et al., 2019b; Li and Richtárik, 2020), and to what extent one can extend our results to structured nonconvex problems (Li et al., 2021; Li, 2021b; Li and Richtárik, 2021; Gorbunov et al., 2021; Richtárik et al., 2021).

## Appendix A Missing Proofs for Lemmas in Section 6

In Section 6, we provided the proof of Theorem 1 using six lemmas. Now we present the omitted proofs of these Lemmas 1–6.

### A.1 Proof of Lemma 1

According to the $L$-smoothness of $f$ (Assumption 1), we have

 $\begin{aligned}
 \mathbb{E}[f(z^{t+1})]&\le\mathbb{E}\Big[f(y^t)+\langle\nabla f(y^t),z^{t+1}-y^t\rangle+\frac{L}{2}\|z^{t+1}-y^t\|^2\Big]\\
 &=\mathbb{E}\Big[f(y^t)+\langle\nabla f(y^t),\theta_t(x^{t+1}-x^t)\rangle+\frac{L\theta_t^2}{2}\|x^{t+1}-x^t\|^2\Big]\quad(26)\\
 &=\mathbb{E}\Big[f(y^t)+\langle\nabla f(y^t)-g^t,\theta_t(x^{t+1}-x^t)\rangle+\langle g^t,\theta_t(x^{t+1}-x^t)\rangle+\frac{L\theta_t^2}{2}\|x^{t+1}-x^t\|^2\Big]\\
 &\le\mathbb{E}\Big[f(y^t)+\frac{1}{2L\beta_t}\|\nabla f(y^t)-g^t\|^2+\frac{L\beta_t\theta_t^2}{2}\|x^{t+1}-x^t\|^2+\frac{L\theta_t^2}{2}\|x^{t+1}-x^t\|^2+\langle g^t,\theta_t(x^{t+1}-x^t)\rangle\Big]\quad(27)\\
 &=\mathbb{E}\Big[f(y^t)+\frac{1}{2L\beta_t}\|\nabla f(y^t)-g^t\|^2+\frac{L(1+\beta_t)\theta_t^2}{2}\|x^{t+1}-x^t\|^2+\langle g^t,\theta_t(x^*-x^t)\rangle+\langle g^t,\theta_t(x^{t+1}-x^*)\rangle\Big]\\
 &=\mathbb{E}\Big[f(y^t)+\frac{1}{2L\beta_t}\|\nabla f(y^t)-g^t\|^2+\frac{L(1+\beta_t)\theta_t^2}{2}\|x^{t+1}-x^t\|^2+\langle\nabla f(y^t),\theta_t(x^*-x^t)\rangle+\langle g^t,\theta_t(x^{t+1}-x^*)\rangle\Big]\quad(28)\\
 &=\mathbb{E}\Big[f(y^t)+\frac{1}{2L\beta_t}\|\nabla f(y^t)-g^t\|^2+\frac{L(1+\beta_t)\theta_t^2}{2}\|x^{t+1}-x^t\|^2+\langle\nabla f(y^t),\theta_t(x^*-x^t)\rangle+\frac{\theta_t^2}{\eta_t}\langle x^t-x^{t+1},x^{t+1}-x^*\rangle\Big]\quad(29)\\
 &=\mathbb{E}\Big[f(y^t)+\frac{1}{2L\beta_t}\|\nabla f(y^t)-g^t\|^2+\frac{L(1+\beta_t)\theta_t^2}{2}\|x^{t+1}-x^t\|^2+\langle\nabla f(y^t),\theta_t(x^*-x^t)\rangle+\frac{\theta_t^2}{2\eta_t}\big(\|x^t-x^*\|^2-\|x^t-x^{t+1}\|^2-\|x^{t+1}-x^*\|^2\big)\Big]\\
 &=\mathbb{E}\Big[f(y^t)+\langle\nabla f(y^t),\theta_t(x^*-x^t)\rangle+\frac{\theta_t^2}{2\eta_t}\big(\|x^t-x^*\|^2-\|x^{t+1}-x^*\|^2\big)-\Big(\frac{\theta_t^2}{2\eta_t}-\frac{L(1+\beta_t)\theta_t^2}{2}\Big)\|x^{t+1}-x^t\|^2+\frac{1}{2L\beta_t}\|\nabla f(y^t)-g^t\|^2\Big],
 \end{aligned}$