
# Communication-Efficient Distributed SGD with Compressed Sensing

We consider large-scale distributed optimization over a set of edge devices connected to a central server, where the limited communication bandwidth between the server and the edge devices imposes a significant bottleneck on the optimization procedure. Inspired by recent advances in federated learning, we propose a distributed stochastic gradient descent (SGD) type algorithm that exploits the sparsity of the gradient, when possible, to reduce the communication burden. At the heart of the algorithm is the use of compressed sensing techniques: the local stochastic gradients are compressed at the device side, and at the server side a sparse approximation of the global stochastic gradient is recovered from the noisy aggregated compressed local gradients. We conduct theoretical analysis of the convergence of our algorithm in the presence of noise perturbation incurred by the communication channels, and also conduct numerical experiments to corroborate its effectiveness.


## 1 Introduction

Large-scale distributed stochastic optimization plays a fundamental role in the recent advances of machine learning, allowing models with vast sizes to be trained on massive datasets by multiple machines. In the meantime, the past few years have witnessed an explosive growth of networks of IoT devices such as smart phones, self-driving cars, robots, unmanned aerial vehicles (UAVs), etc., which are capable of data collection and processing for many learning tasks. In many of these applications, due to privacy concerns, it is preferable that the local edge devices learn the model by cooperating with the central server but without sending their own data to the server. Moreover, the communication between the edge devices and the server is often through wireless channels, which are lossy and unreliable in nature and have limited bandwidth, imposing significant challenges, especially for large dimensional problems.

To address the communication bottlenecks, researchers have investigated communication-efficient distributed optimization methods for large-scale problems, for both the device-server setting [1, 2] and the peer-to-peer setting [3, 4]. In this paper, we consider the device-server setting where a group of edge devices are coordinated by a central server.

Most existing techniques for the device-server setting can be classified into two categories. The first category aims to reduce the number of communication rounds, based on the idea that each edge device runs multiple local SGD steps in parallel before sending the local updates to the server for aggregation. This approach is also known as FedAvg [1] in federated learning, and its convergence has been studied in [5, 6, 7]. Another line of work investigates lazy/adaptive upload of information, i.e., local gradients are uploaded only when found to be informative enough [8].

The second category focuses on efficient compression of gradient information transmitted from edge devices to the server. Commonly adopted compression techniques include quantization [9, 10, 11] and sparsification [12, 13, 14]. These techniques can be further classified according to whether the gradient compression yields biased [9, 14] or unbiased [10, 13] gradient estimators. To handle the bias and boost convergence, [12, 15] introduced the error feedback method that accumulates and corrects the error caused by gradient compression at each step.

Two recent papers [16, 17] employ sketching methods for gradient compression. Specifically, each device compresses its local stochastic gradient by count sketch [18] via a common sketching operator, and the server recovers the indices and the values of large entries of the aggregated stochastic gradient from the gradient sketches. However, theoretical guarantees of count sketch were developed for recovering one fixed signal by randomly generating a sketching operator from a given probability distribution. During SGD, gradient signals are constantly changing, making it impractical to generate a new sketching operator for every SGD iteration. Thus these papers apply a single sketching operator to all the gradients throughout the optimization procedure, at the cost of sacrificing theoretical guarantees. Further, there is limited understanding of the performance when there is transmission error or noise in the uploading links.

Our Contributions. We propose a distributed SGD-type algorithm that employs compressed sensing for gradient compression. Specifically, we adopt compressed sensing techniques for the compression of local stochastic gradients at the device side, and the reconstruction of the aggregated stochastic gradients at the server side. The use of compressed sensing enables the server to approximately identify the top entries of the aggregated gradient without directly querying each local gradient. Our algorithm also integrates error feedback strategies at the server side to handle the bias introduced by compression, while keeping the edge devices stateless. We provide convergence analysis of our algorithm in the presence of additive noise incurred by the uploading communication channels, and conduct numerical experiments that justify the effectiveness of our algorithm.

Besides the related work discussed above, it is worth noting that a recent paper [19] uses compressed sensing for zeroth-order optimization, which exhibits a mathematical structure similar to this study. However, [19] considers the centralized setting and only establishes convergence to a neighborhood of the minimizer.

Notations: For $x\in\mathbb{R}^d$, $\|x\|_p$ denotes its $\ell_p$-norm, and $x^{[K]}$ denotes its best-$K$ approximation, i.e., the vector that keeps the top $K$ entries of $x$ in magnitude with other entries set to $0$.

## 2 Problem Setup

Consider a group of $n$ edge devices and a server. Each device $i$ is associated with a differentiable local objective function $f_i:\mathbb{R}^d\to\mathbb{R}$, and is able to query a stochastic gradient $g_i(x)$ such that $\mathbb{E}[g_i(x)]=\nabla f_i(x)$. Between each device and the server are an uploading communication link and a broadcasting communication link. The goal is to solve

$$\min_{x\in\mathbb{R}^d}\ f(x)\coloneqq\frac{1}{n}\sum_{i=1}^{n} f_i(x) \tag{1}$$

through queries of stochastic gradients at each device and exchange of information between the server and each device.

One common approach for our problem setup is the stochastic gradient descent (SGD) method: For each time step $t$, the server first broadcasts the current iterate $x(t)$ to all devices, and then each device $i$ produces a stochastic gradient $g_i(t)$ at $x(t)$ and uploads it to the server, after which the server updates $x(t+1)=x(t)-\eta\cdot\frac{1}{n}\sum_{i=1}^{n}g_i(t)$ with step size $\eta$. However, as the server needs to collect local stochastic gradients from each device at every iteration, the vanilla SGD may encounter a significant bottleneck imposed by the uploading links if $d$ is very large. This issue may be further exacerbated if the server and the devices are connected via lossy wireless networks of limited bandwidth, which is the case for many IoT applications.
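The vanilla SGD loop described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function name `vanilla_distributed_sgd`, the toy quadratic objectives and the step size are assumptions made for the example.

```python
import numpy as np

def vanilla_distributed_sgd(local_grads, x0, eta, num_steps):
    """Uncompressed baseline: every device uploads its full d-dimensional gradient."""
    x = np.asarray(x0, dtype=float).copy()
    n = len(local_grads)
    for _ in range(num_steps):
        # The server broadcasts x; each device i returns a (stochastic) gradient g_i(x).
        uploads = [g(x) for g in local_grads]  # n uploads of d floats each
        # The server averages the uploads and takes one gradient step.
        x = x - eta * sum(uploads) / n
    return x

# Hypothetical toy problem: f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 8))  # n = 4 devices, d = 8
local_grads = [lambda x, c=c: x - c for c in centers]
x_final = vanilla_distributed_sgd(local_grads, np.zeros(8), eta=0.5, num_steps=60)
```

In this toy case the minimizer of (1) is the mean of the centers. Each round costs one upload of $d$ numbers per device, which is the cost that compression aims to reduce.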

In this work, we investigate the situation where the communication links, particularly the uploading links from each edge device to the server, have limited bandwidth that can significantly slow down the whole optimization procedure; the data transmitted through each uploading link may also be corrupted by noise. Our goal is to develop an SGD-type algorithm for solving (1) that achieves better communication efficiency over the uploading links.

## 3 Algorithm

Our algorithm is outlined in Algorithm 1, which is based on the SGD method with the following major ingredients:

1. Compression of local stochastic gradients using compressed sensing techniques. Here each edge device compresses its local gradient $g_i(t)$ by computing $\Phi g_i(t)$ before uploading it to the server. The matrix $\Phi\in\mathbb{R}^{Q\times d}$ is called the sensing matrix, and its number of rows $Q$ is strictly less than the number of columns $d$. As a result, the communication burden of uploading the local gradient information can be reduced.

We emphasize that Algorithm 1 employs the for-all scheme of compressed sensing, which allows one $\Phi$ to be used for the compression of all local stochastic gradients (see Section 3.1 for more details on the for-each and the for-all schemes).

After collecting the compressed local gradients and obtaining their aggregate (corrupted by communication channel noise), the server recovers a vector $\Delta(t)$ by a compressed sensing algorithm, which will be used for updating $x(t)$.

2. Error feedback of compressed gradients. In general, the compressed sensing reconstruction will introduce a nonzero bias in the SGD iterations that hinders convergence. To handle this bias, we adopt the error feedback method in [12, 15] and modify it in a manner similar to FetchSGD [17]. The resulting error feedback procedure is done purely at the server side without knowing the true aggregated stochastic gradients.

Note that the aggregated vector is corrupted by additive noise from the uploading links. This noise model incorporates a variety of communication schemes, including digital transmission with quantization, and over-the-air transmission for wireless multi-access networks [14].

We now provide more details on our algorithm design.

### 3.1 Preliminaries on Compressed Sensing

Compressed sensing [20] is a technique that allows efficient sensing and reconstruction of an approximately sparse signal. Mathematically, in the sensing step, a signal $x\in\mathbb{R}^d$ is observed through the linear measurement $y=\Phi x+w$, where $\Phi\in\mathbb{R}^{Q\times d}$ is a pre-specified sensing matrix with $Q<d$, and $w\in\mathbb{R}^Q$ is additive noise. Then in the reconstruction step, one recovers the original signal by approximately solving

$$\hat{x}=\operatorname*{arg\,min}_{z}\ \|z\|_0\quad\text{s.t.}\quad y=\Phi z,\qquad (w=0) \tag{2}$$

$$\hat{x}=\operatorname*{arg\,min}_{z}\ \frac{1}{2}\|y-\Phi z\|_2^2\quad\text{s.t.}\quad \|z\|_0\le K,\qquad (w\neq 0) \tag{3}$$

where $K$ restricts the number of nonzero entries in $z$.

Both (2) and (3) are NP-hard nonconvex problems [21, 22], and researchers have proposed various compressed sensing algorithms for obtaining approximate solutions. As discussed below, the reconstruction error will heavily depend on i) the design of the sensing matrix $\Phi$, and ii) whether the signal $x$ can be well approximated by a sparse vector.

Design of the sensing matrix $\Phi$. Compressed sensing algorithms can be categorized into two schemes [23]: i) the for-each scheme, in which a probability distribution over sensing matrices is designed to provide desired reconstruction for a fixed signal, and every time a new signal is to be measured and reconstructed, one needs to randomly generate a new $\Phi$; ii) the for-all scheme, in which a single $\Phi$ is used for the sensing and reconstruction of all possible signals (a more detailed explanation of the two schemes can be found in Appendix D). We mention that count sketch is an example of a for-each scheme algorithm. In this paper, we choose the for-all scheme so that the server does not need to send a new matrix to each device per iteration.

To ensure that the linear measurement can discriminate approximately sparse signals, researchers have proposed the restricted isometry property (RIP) [20] as a condition on $\Phi$:

###### Definition 1.

We say that $\Phi\in\mathbb{R}^{Q\times d}$ satisfies the $(K,\delta)$-restricted isometry property, if $(1-\delta)\|x\|_2^2\le\|\Phi x\|_2^2\le(1+\delta)\|x\|_2^2$ for any $x\in\mathbb{R}^d$ that has at most $K$ nonzero entries.

The restricted isometry property on $\Phi$ is fundamental for analyzing the reconstruction error of many compressed sensing algorithms under the for-all scheme [24].

Metric of sparsity. The classical metric of sparsity is the $\ell_0$ norm, defined as the number of nonzero entries. However, for our setup, the vectors to be compressed can only be approximately sparse in general, which cannot be handled by the $\ell_0$ norm as it is not stable under small perturbations. Here, we adopt the following sparsity metric from [25] (compared to [25], we add the scaling factor $1/d$ so that $\mathrm{sp}(x)\le 1$):

$$\mathrm{sp}(x)\coloneqq\frac{\|x\|_1^2}{\|x\|_2^2\cdot d},\qquad x\in\mathbb{R}^d\setminus\{0\}. \tag{4}$$

The continuity of $\mathrm{sp}$ indicates that it is robust to small perturbations on $x$, and it can be shown that $\mathrm{sp}$ is Schur-concave, meaning that it can characterize approximate sparsity of a signal. $\mathrm{sp}$ has also been used in [25] for performance analysis of compressed sensing algorithms.
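To make the metric (4) and the best-$K$ approximation from the notation paragraph concrete, here is a small NumPy sketch. The helper names `sp` and `best_k_approx` and the test vectors are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def best_k_approx(x, k):
    """Keep the k largest-magnitude entries of x and set the others to zero."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    if k > 0:
        idx = np.argpartition(np.abs(x), -k)[-k:]
        out[idx] = x[idx]
    return out

def sp(x):
    """Sparsity metric (4): ||x||_1^2 / (||x||_2^2 * d), defined for x != 0."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x, 1) ** 2 / (np.linalg.norm(x, 2) ** 2 * x.size)

d = 1000
one_hot = np.zeros(d)
one_hot[0] = 1.0                  # exactly 1-sparse: sp = 1/d
dense = np.ones(d)                # fully dense: sp = 1
perturbed = one_hot + 1e-6        # tiny perturbation makes every entry nonzero

top2 = best_k_approx([3.0, -5.0, 1.0, 0.5], 2)  # keeps 3.0 and -5.0
```

The perturbed vector has $d$ nonzero entries, so its $\ell_0$ norm jumps from 1 to $d$, while `sp(perturbed)` stays within about $10^{-5}$ of `sp(one_hot)`. This is the stability property that motivates using $\mathrm{sp}$.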

### 3.2 Details of Algorithm Design

Generation of $\Phi$. As mentioned before, we choose compressed sensing under the for-all scheme for gradient compression and reconstruction. We require that the sensing matrix $\Phi$ have a low storage cost, since it will be transmitted to and stored at each device; $\Phi$ should also satisfy RIP so that the compressed sensing algorithm has good reconstruction performance. The following proposition suggests a storage-friendly approach for generating matrices satisfying RIP.

###### Proposition 1 ([26]).

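Let $\Phi_0\in\mathbb{R}^{d\times d}$ be an orthogonal matrix with entries of absolute values $O(1/\sqrt{d})$, and let $\delta>0$ be sufficiently small. For some $Q=\tilde{O}(\delta^{-2}K\log^2 K\log d)$ (the $\tilde{O}$ notation hides logarithmic dependence on $1/\delta$), let $\Phi\in\mathbb{R}^{Q\times d}$ be a matrix whose rows are chosen uniformly and independently from the rows of $\Phi_0$, multiplied by $\sqrt{d/Q}$. Then, with high probability, $\Phi$ satisfies the $(K,\delta)$-RIP.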

This proposition indicates that we can choose a “base matrix” satisfying the condition in Proposition 1, and then randomly choose $Q$ of its rows to form $\Phi$. In this way, $\Phi$ can be stored or transmitted by merely the corresponding row indices in the base matrix. Note that Proposition 1 only requires $Q$ to have logarithmic dependence on $d$. Candidates of the base matrix include the discrete cosine transform (DCT) matrix and the Walsh-Hadamard transform (WHT) matrix, as both DCT and WHT and their inverses have fast algorithms of time complexity $O(d\log d)$, implying that multiplication of $\Phi$ or $\Phi^\top$ with any vector can be finished within $O(d\log d)$ time.
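The row-sampling construction can be sketched as below with a WHT base matrix. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, rows are drawn without replacement so that they are distinct, and the dense matrix is built explicitly only for readability (a practical implementation would apply a fast transform instead of materializing it).

```python
import numpy as np

def hadamard_orthonormal(d):
    """Sylvester construction of a d x d Walsh-Hadamard matrix, scaled to be orthogonal.
    Requires d to be a power of two; every entry has absolute value 1/sqrt(d)."""
    if d < 1 or d & (d - 1):
        raise ValueError("d must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def make_sensing_matrix(d, Q, seed):
    """Pick Q rows of the base matrix and rescale them by sqrt(d/Q).
    Assumption: rows are sampled without replacement, so all rows are distinct."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(d, size=Q, replace=False)
    Phi = np.sqrt(d / Q) * hadamard_orthonormal(d)[rows]
    return Phi, rows

d, Q = 64, 16
Phi, rows = make_sensing_matrix(d, Q, seed=0)
```

Only the `Q` row indices (or the seed that generates them) need to be shared with the devices. With distinct rows, the sampled matrix satisfies $\Phi\Phi^\top=(d/Q)I$, which is the identity behind the $\sqrt{Q/d}$ factors that appear in the analysis.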

Choice of the compressed sensing algorithm. We let $\mathcal{A}$ be the Fast Iterative Hard Thresholding (FIHT) algorithm [27] (see also Appendix E for a brief summary). Our experiments suggest that FIHT achieves a good balance between computation efficiency and empirical reconstruction error compared to other algorithms we have tried.

We note that FIHT has a tunable parameter $K$ that controls the number of nonzero entries of $\Delta(t)$. This parameter should accord with the sparsity of the vector to be recovered (see Section 3.3 for theoretical results). In addition, the server can broadcast the sparse vector $\Delta(t)$ instead of the whole $x(t+1)$ for the edge devices to update their local copies of $x(t)$, which saves communication over the broadcasting links.
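For intuition about how a hard-thresholding reconstruction targets problem (3), the sketch below implements plain iterative hard thresholding (IHT). It is a simpler relative of FIHT, not the FIHT algorithm of [27] itself. The Gaussian sensing matrix, the step size $1/\|\Phi\|_2^2$ and all names are assumptions made for the example.

```python
import numpy as np

def hard_threshold(z, K):
    """Keep the K largest-magnitude entries of z and zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argpartition(np.abs(z), -K)[-K:]
    out[idx] = z[idx]
    return out

def iht(y, Phi, K, num_iters=200):
    """Plain IHT for min 0.5 * ||y - Phi z||_2^2 subject to ||z||_0 <= K."""
    # Step size 1/||Phi||_2^2 guarantees the residual norm never increases.
    mu = 1.0 / np.linalg.norm(Phi, 2) ** 2
    z = np.zeros(Phi.shape[1])
    for _ in range(num_iters):
        # Gradient step on the least-squares objective, then projection
        # onto the set of K-sparse vectors.
        z = hard_threshold(z + mu * Phi.T @ (y - Phi @ z), K)
    return z

rng = np.random.default_rng(0)
d, Q, K = 256, 96, 5
Phi = rng.normal(size=(Q, d)) / np.sqrt(Q)       # hypothetical Gaussian sensing matrix
x_true = np.zeros(d)
x_true[rng.choice(d, size=K, replace=False)] = rng.normal(size=K)
y = Phi @ x_true + 0.01 * rng.normal(size=Q)     # noisy linear measurement
x_hat = iht(y, Phi, K)
```

The parameter `K` plays the same role as FIHT's tunable parameter: the output never has more than `K` nonzero entries, so it can be broadcast as a sparse vector.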

Error feedback. We adopt error feedback to facilitate convergence of Algorithm 1. The following lemma verifies that Algorithm 1 indeed incorporates the error feedback steps in [15]; the proof is straightforward and is omitted here.

###### Lemma 1.

Consider Algorithm 1, and suppose is generated according to Proposition 1. Then for each , there exist unique and satisfying , , such that

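$$\begin{aligned}
p(t) &= \eta g(t)+e(t), & x(t+1) &= x(t)-\Delta(t),\\
\Delta(t) &= \mathcal{A}\big(\Phi p(t)+\eta w(t);\Phi\big), & e(t+1) &= p(t)-\Delta(t)+\frac{\eta Q}{d}\Phi^\top w(t),
\end{aligned} \tag{5}$$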

where .

By comparing Lemma 1 with [15, Algorithm 2], we see that the only difference lies in the presence of communication channel noise in our setting. In addition, since error feedback is implemented purely at the server side, the edge devices will be stateless during the whole optimization procedure.
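One possible server-side bookkeeping that is consistent with (5) keeps only the compressed error $s(t)=\Phi e(t)$, so the server never needs the uncompressed $p(t)$ or $e(t)$. The sketch below is an assumption-laden illustration, not a transcription of Algorithm 1: the names are hypothetical, `reconstruct` is a crude placeholder for FIHT, and the sensing matrix is built from distinct Walsh-Hadamard rows so that $\Phi\Phi^\top=(d/Q)I$.

```python
import numpy as np

def sensing_matrix(d, Q, rng):
    """Q distinct Walsh-Hadamard rows scaled so that Phi @ Phi.T = (d/Q) * I."""
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    rows = rng.choice(d, size=Q, replace=False)
    return H[rows] / np.sqrt(Q)

def reconstruct(u, Phi, K):
    """Placeholder for FIHT: hard-threshold the back-projection (Q/d) * Phi^T u."""
    Q, d = Phi.shape
    z = (Q / d) * Phi.T @ u
    out = np.zeros(d)
    idx = np.argpartition(np.abs(z), -K)[-K:]
    out[idx] = z[idx]
    return out

def server_step(x, s, y, Phi, eta, K):
    """One server update. s stores Phi @ e(t); y = Phi @ g(t) + w(t) is the noisy upload."""
    u = eta * y + s                  # equals Phi @ p(t) + eta * w(t)
    delta = reconstruct(u, Phi, K)   # Delta(t)
    x_next = x - delta               # x(t+1) = x(t) - Delta(t)
    s_next = u - Phi @ delta         # equals Phi @ e(t+1) under (5)
    return x_next, s_next, delta

rng = np.random.default_rng(0)
d, Q, K, eta = 64, 32, 4, 0.1
Phi = sensing_matrix(d, Q, rng)
x, s, e = np.zeros(d), np.zeros(Q), np.zeros(d)
max_gap = 0.0
for t in range(20):
    g = rng.normal(size=d) * (rng.random(d) < 0.1)  # hypothetical sparse-ish gradient
    w = 0.01 * rng.normal(size=Q)                   # channel noise
    x, s, delta = server_step(x, s, Phi @ g + w, Phi, eta, K)
    # Reference recursion (5), tracked only to check the compressed bookkeeping.
    e = (eta * g + e) - delta + (eta * Q / d) * Phi.T @ w
    max_gap = max(max_gap, np.linalg.norm(s - Phi @ e))
```

The check relies on $\Phi\Phi^\top=(d/Q)I$: multiplying the last relation in (5) by $\Phi$ turns $\frac{\eta Q}{d}\Phi\Phi^\top w(t)$ into $\eta w(t)$, so the compressed state `s` tracks $\Phi e(t)$ up to rounding error. The devices only ever compute $\Phi g_i(t)$, which is why they can remain stateless.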

### 3.3 Theoretical Analysis

First, we make the following assumptions on the objective function $f$, the stochastic gradients $g_i$ and the communication channel noise $w(t)$:

###### Assumption 1.

The function $f$ is $L$-smooth over $\mathbb{R}^d$, i.e., there exists $L>0$ such that $\|\nabla f(x)-\nabla f(y)\|_2\le L\|x-y\|_2$ for all $x,y\in\mathbb{R}^d$.

###### Assumption 2.

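Denote $g(t)=\frac{1}{n}\sum_{i=1}^{n}g_i(t)$. There exists $G\ge 0$ such that $\mathbb{E}\big[\|g(t)-\nabla f(x(t))\|_2^2\big]\le G^2$ for all $t$.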

###### Assumption 3.

The communication channel noise $w(t)$ satisfies $\mathbb{E}\big[\|w(t)\|_2^2\big]\le\sigma^2$ for each $t$.

Our theoretical analysis will be based on the following result on the reconstruction error of FIHT:

###### Lemma 2 ([27, Corollary I.3]).

Let be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix satisfies -RIP for sufficiently small . Then, for any and ,

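$$\|\mathcal{A}(\Phi x+w;\Phi)-x\|_2\le(C_{\mathcal{A},s}+1)\big\|x-x^{[K]}\big\|_2+\frac{C_{\mathcal{A},s}}{\sqrt{K}}\big\|x-x^{[K]}\big\|_1+C_{\mathcal{A},n}\|w\|_2 \tag{6}$$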

where and are positive constants that depend on .

We are now ready to establish convergence of Algorithm 1.

###### Theorem 1.

Let be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix satisfies -RIP for sufficiently small . Furthermore, assume that

$$\mathrm{sp}(p(t))\le\gamma\cdot\frac{2K/d}{\big[1+C_{\mathcal{A},s}(3-2K/d)\big]^2} \tag{7}$$

for all $t$ for some $\gamma\in(0,1)$, where $p(t)$ is defined in Lemma 1. Then for sufficiently large $T$, by choosing $\eta=1/(L\sqrt{T})$, we have that

In addition, if is convex and has a minimizer , then we further have

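$$\mathbb{E}\big[f(\bar{x}(t))-f^*\big]\le\frac{L\|x(1)-x^*\|_2^2+G^2/L}{\sqrt{T}}+\frac{6}{T}\left[\frac{\gamma(1+\gamma)G^2}{(1-\gamma)^2L}+\frac{2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2}{(1-\gamma)L}\right],$$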

where and .

Proof of Theorem 1 is given in Appendix B.

###### Remark 1.

Theorem 1 requires $\mathrm{sp}(p(t))$ to remain sufficiently low. This condition is hard to check and can be violated in practice (see Section 4). However, our numerical experiments suggest that even if the condition (7) is violated, Algorithm 1 may still exhibit relatively good convergence behavior when the gradient itself has a relatively low sparsity level. Theoretical investigation of these observations is an interesting future direction.

## 4 Numerical Results

### 4.1 Test Case with Synthetic Data

We first conduct numerical experiments on a test case with synthetic data. Here we set the dimension to be and the number of edge devices to be . The local objective functions are of the form , where each is a diagonal matrix, and we denote . We generate such that the diagonal entries of is given by for each while the diagonal of each is dense. We also let give approximately sparse stochastic gradients for every . We refer to Appendix C for details on the test case.

We test three algorithms: the uncompressed vanilla SGD, Algorithm 1, and SGD with count sketch. The SGD with count sketch simply replaces the gradient compression and reconstruction of Algorithm 1 with the count sketch method [18]. We set for both Algorithm 1 and SGD with count sketch. For Algorithm 1, we generate $\Phi$ from the WHT matrix, and use the FFHT library [28] for fast WHT. We set , and the initial point to be the origin for all three algorithms.

Figure 1(a) illustrates the convergence of the three algorithms without communication channel error (i.e., ). For Algorithm 1, we set (the compression rate is ), and for SGD with count sketch we set the sketch size to be (the compression rate is ). We see that Algorithm 1 has better convergence behavior while also achieving a higher compression rate compared to SGD with count sketch. Our numerical experiments suggest that for approximately sparse signals, FIHT can achieve higher reconstruction accuracy and more aggressive compression than count sketch, and for signals that are not very sparse, FIHT also seems more robust.

Figure 1(b) shows the evolution of and for Algorithm 1. We see that is small for the first few iterations, and then increases and stabilizes around , which suggests that the condition (7) is likely to have been violated for large . On the other hand, Figure 1(a) shows that Algorithm 1 can still achieve relatively good convergence behavior. This indicates a gap between the theoretical results in Section 3.3 and the empirical results, and suggests our analysis could be improved. We leave the relevant investigation as future work.

Figure 1(c) illustrates the convergence of Algorithm 1 with different levels of communication channel noise. Here the entries of are i.i.d. sampled from with . We see that the convergence of Algorithm 1 gradually deteriorates as increases, suggesting its robustness against communication channel noise.

### 4.2 Test Case of Federated Learning with CIFAR-10 Dataset

We implement our algorithm on a residual network with 668426 trainable parameters in two different settings. We primarily use TensorFlow and MPI in the implementation of these results (details about the specific experimental setup can be found in Appendix C). In addition, we shall only present upload compression results here; download compression is not considered to be as significant as the upload compression (given that download speeds are generally higher than upload speeds), and in our case the download compression rate is simply given by . For both settings, we use the CIFAR10 dataset (60,000 32×32×3 images of 10 classes) with a 50,000/10,000 train/test split.

In the first setting, we instantiate 100 workers and split the CIFAR10 training dataset such that all local datasets are i.i.d. As seen in Figure 2, our algorithm is able to achieve 2× upload compression with marginal effect on the training and testing accuracy over 50 epochs. As the compression rate increases, the convergence of our method gradually deteriorates (echoing results in the synthetic case). For comparison, we also show the results of using Count Sketch with rows and columns, where is the desired compression rate, in lieu of FIHT. In our setting, while uncompressed (1×) Count Sketch performs well, it is very sensitive to higher compression, diverging for 1.1× and 1.25× compression.

In the second setting, we split the CIFAR10 dataset into 10,000 sets of 5 images of a single class and assign these sets to 10,000 workers. Each epoch consists of 100 rounds with 1% (100) of the workers participating in each round. In Figure 3, similar to the i.i.d. setting, we see that the convergence of our algorithm's training accuracy gradually deteriorates with higher compression. As is typical of the non-i.i.d. setting, the testing accuracy is not as high as that of the i.i.d. setting. However, we note that FIHT is able to achieve 10× compression with negligible effect on the testing accuracy. In addition, Count Sketch with rows and columns diverges even for small compression rates in this problem setting.

## 5 Conclusion

In this paper, we develop a communication-efficient SGD algorithm based on compressed sensing. This algorithm has several direct variants. For example, the momentum method can be directly incorporated. Also, when the number of devices is very large, the server can choose to query compressed stochastic gradients from a randomly chosen subset of workers.

Our convergence guarantees require $\mathrm{sp}(p(t))$ to be persistently low, which is hard to check in practice. The numerical experiments also show that our algorithm can work even if $\mathrm{sp}(p(t))$ grows to a relatively high level. This suggests that our theoretical analysis can be further improved, which is an interesting future direction.

## Appendix A Auxiliary Results

In this section, we provide some auxiliary results for the proof of Theorem 1. We first give an alternative form of the reconstruction error derived from the condition (7) and the performance guarantee (6).

###### Lemma 3.

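Suppose the conditions in Lemma 2 are satisfied. Let $w\in\mathbb{R}^Q$ be arbitrary, and let $x\in\mathbb{R}^d$ satisfy that $\mathrm{sp}(x)$ is upper bounded by the right-hand side of (7). Then

$$\|\mathcal{A}(\Phi x+w;\Phi)-x\|_2\le\sqrt{\frac{\gamma}{2}}\,\|x\|_2+C_{\mathcal{A},n}\|w\|_2.$$

###### Proof.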

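By [29, Lemma 7], we have $\big\|x-x^{[K]}\big\|_2\le\frac{\|x\|_1}{2\sqrt{K}}$. Therefore by Lemma 2,

$$\begin{aligned}
\|\mathcal{A}(\Phi x+w;\Phi)-x\|_2 &\le \frac{C_{\mathcal{A},s}+1}{2\sqrt{K}}\|x\|_1+\frac{C_{\mathcal{A},s}}{\sqrt{K}}\big\|x-x^{[K]}\big\|_1+C_{\mathcal{A},n}\|w\|_2\\
&\le \left[\frac{C_{\mathcal{A},s}+1}{2}+C_{\mathcal{A},s}\Big(1-\frac{K}{d}\Big)\right]\frac{\|x\|_1}{\sqrt{K}}+C_{\mathcal{A},n}\|w\|_2.
\end{aligned}$$

Plugging in the definition and upper bound of $\mathrm{sp}(x)$ leads to the result. ∎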
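For completeness, the final plugging-in step of the proof above can be written out as follows. This is a worked check based only on the definition (4) and the condition (7). By (4) we have $\|x\|_1=\sqrt{\mathrm{sp}(x)\,d}\,\|x\|_2$, and the bracketed coefficient equals $\frac{1}{2}\big[1+C_{\mathcal{A},s}(3-2K/d)\big]$, so that

$$\left[\frac{C_{\mathcal{A},s}+1}{2}+C_{\mathcal{A},s}\Big(1-\frac{K}{d}\Big)\right]\frac{\|x\|_1}{\sqrt{K}}
=\frac{1+C_{\mathcal{A},s}(3-2K/d)}{2}\sqrt{\frac{\mathrm{sp}(x)\,d}{K}}\,\|x\|_2
\le\frac{1}{2}\sqrt{2\gamma}\,\|x\|_2
=\sqrt{\frac{\gamma}{2}}\,\|x\|_2,$$

where the inequality uses $\mathrm{sp}(x)\le\gamma\cdot\frac{2K/d}{[1+C_{\mathcal{A},s}(3-2K/d)]^2}$.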

Next, we derive a bound on the second moment of $e(t)$.

###### Lemma 4.

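We have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|e(t)\|_2^2\big]\le\frac{2\eta^2}{1-\gamma}\left[\frac{\gamma(1+\gamma)}{1-\gamma}G^2+2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2\right]+\frac{2\eta^2\gamma(1+\gamma)}{(1-\gamma)^2}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big].$$

###### Proof.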

By definition, we have

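$$\begin{aligned}
\mathbb{E}\big[\|e(t+1)\|_2^2\big] &\le \mathbb{E}\Big[\Big(\|\Delta(t)-p(t)\|_2+\frac{\eta Q}{d}\big\|\Phi^\top w(t)\big\|_2\Big)^2\Big]\\
&= \mathbb{E}\Big[\Big(\big\|\mathcal{A}(\Phi p(t)+\eta w(t);\Phi)-p(t)\big\|_2+\eta\sqrt{Q/d}\,\|w(t)\|_2\Big)^2\Big]\\
&\le \mathbb{E}\Big[\Big(\sqrt{\gamma/2}\,\|p(t)\|_2+\eta\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)\|w(t)\|_2\Big)^2\Big]\\
&\le \gamma\,\mathbb{E}\big[\|p(t)\|_2^2\big]+2\eta^2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\mathbb{E}\big[\|w(t)\|_2^2\big]\\
&\le \gamma\,\mathbb{E}\big[\|\eta g(t)+e(t)\|_2^2\big]+2\eta^2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2,
\end{aligned}$$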

where the first inequality follows from Lemma 3, and the second inequality follows from the definition of and the assumption that . Notice that

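$$\mathbb{E}\big[\|\eta g(t)+e(t)\|_2^2\big]\le\Big(1+\frac{2\gamma}{1-\gamma}\Big)\mathbb{E}\big[\|\eta g(t)\|_2^2\big]+\Big(1+\frac{1-\gamma}{2\gamma}\Big)\mathbb{E}\big[\|e(t)\|_2^2\big],$$

and therefore

$$\mathbb{E}\big[\|e(t+1)\|_2^2\big]\le\frac{1+\gamma}{2}\mathbb{E}\big[\|e(t)\|_2^2\big]+\frac{\eta^2\gamma(1+\gamma)}{1-\gamma}\mathbb{E}\big[\|g(t)\|_2^2\big]+2\eta^2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2.$$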
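The coefficients in this recursion can be checked directly. Multiplying the two factors from the inequality $\|a+b\|_2^2\le(1+1/c)\|a\|_2^2+(1+c)\|b\|_2^2$, applied with $c=\frac{1-\gamma}{2\gamma}$, by the leading $\gamma$ gives

$$\gamma\Big(1+\frac{2\gamma}{1-\gamma}\Big)=\frac{\gamma(1+\gamma)}{1-\gamma},\qquad
\gamma\Big(1+\frac{1-\gamma}{2\gamma}\Big)=\gamma+\frac{1-\gamma}{2}=\frac{1+\gamma}{2}.$$

Since $\gamma<1$, the factor $\frac{1+\gamma}{2}$ multiplying $\mathbb{E}\big[\|e(t)\|_2^2\big]$ is strictly less than $1$, which is what keeps the accumulated error bounded.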

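By the unbiasedness of the stochastic gradient $g(t)$ and Assumption 2, we have

$$\begin{aligned}
\mathbb{E}\big[\|g(t)\|_2^2\big] &= \mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+\mathbb{E}\big[\|g(t)-\nabla f(x(t))\|_2^2\big]\\
&\le \mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+G^2.
\end{aligned}$$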

Therefore

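$$\mathbb{E}\big[\|e(t+1)\|_2^2\big]\le\frac{1+\gamma}{2}\mathbb{E}\big[\|e(t)\|_2^2\big]+\frac{\eta^2\gamma(1+\gamma)}{1-\gamma}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+\eta^2\left[\frac{\gamma(1+\gamma)}{1-\gamma}G^2+2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2\right].$$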

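By summing over $t=1,\ldots,T$ and noting that $e(1)=0$ and $\|e(T+1)\|_2^2\ge 0$, we get

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|e(t)\|_2^2\big] \le{}& \frac{1+\gamma}{2T}\sum_{t=1}^{T}\mathbb{E}\big[\|e(t)\|_2^2\big]+\frac{\eta^2\gamma(1+\gamma)}{(1-\gamma)T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]\\
&+\eta^2\left[\frac{\gamma(1+\gamma)}{1-\gamma}G^2+2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2\right],
\end{aligned}$$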

which then leads to the desired result. ∎

## Appendix B Proof of Theorem 1

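Convex case: Denote $\tilde{x}(t)=x(t)-e(t)$, and it can be checked that

$$\begin{aligned}
\tilde{x}(t+1) &= x(t+1)-e(t+1)=x(t)-\Delta(t)-\big(p(t)-\Delta(t)\big)=x(t)-p(t)\\
&= \tilde{x}(t)-\eta g(t).
\end{aligned}$$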

We then have

$$\|\tilde{x}(t+1)-x^*\|_2^2=\|\tilde{x}(t)-x^*\|_2^2+\eta^2\|g(t)\|_2^2-2\eta\langle g(t),\tilde{x}(t)-x^*\rangle.$$

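By taking the expectation and noting the unbiasedness of $g(t)$ and Assumption 2, we get

$$\begin{aligned}
\mathbb{E}\big[\|\tilde{x}(t+1)-x^*\|_2^2\big] \le{}& \mathbb{E}\big[\|\tilde{x}(t)-x^*\|_2^2\big]+\eta^2\Big(\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+G^2\Big)\\
&-2\eta\,\mathbb{E}\big[\langle\nabla f(x(t)),x(t)-x^*\rangle\big]+2\eta\,\mathbb{E}\big[\langle\nabla f(x(t)),e(t)\rangle\big],
\end{aligned}$$

which implies

$$\begin{aligned}
\mathbb{E}\big[\langle\nabla f(x(t)),x(t)-x^*\rangle\big] \le{}& \frac{1}{2\eta}\Big(\mathbb{E}\big[\|\tilde{x}(t)-x^*\|_2^2\big]-\mathbb{E}\big[\|\tilde{x}(t+1)-x^*\|_2^2\big]\Big)+\frac{\eta G^2}{2}\\
&+\frac{\eta}{2}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+\mathbb{E}\big[\langle\nabla f(x(t)),e(t)\rangle\big]\\
\le{}& \frac{1}{2\eta}\Big(\mathbb{E}\big[\|\tilde{x}(t)-x^*\|_2^2\big]-\mathbb{E}\big[\|\tilde{x}(t+1)-x^*\|_2^2\big]\Big)+\frac{\eta G^2}{2}\\
&+\frac{\eta+(3L)^{-1}}{2}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+\frac{3L}{2}\mathbb{E}\big[\|e(t)\|_2^2\big],
\end{aligned}$$

where in the second inequality we used $\langle a,b\rangle\le\frac{1}{6L}\|a\|_2^2+\frac{3L}{2}\|b\|_2^2$. Now, we take the average of both sides over $t=1,\ldots,T$ and plug in the bound in Lemma 4 to get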

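$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\langle\nabla f(x(t)),x(t)-x^*\rangle\big] \le{}& \frac{1}{2\eta T}\|x(1)-x^*\|_2^2+\frac{\eta G^2}{2}+\frac{3\eta^2L}{1-\gamma}\left[\frac{\gamma(1+\gamma)}{1-\gamma}G^2+2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2\right]\\
&+\left(\frac{\eta+(3L)^{-1}}{2}+\frac{3\eta^2L\gamma(1+\gamma)}{(1-\gamma)^2}\right)\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big].
\end{aligned}$$

By $\eta=1/(L\sqrt{T})$, we can show that for sufficiently large $T$,

$$\frac{\eta+(3L)^{-1}}{2}+\frac{3\eta^2L\gamma(1+\gamma)}{(1-\gamma)^2}=\frac{1}{L}\left(\frac{1}{6}+\frac{1}{2\sqrt{T}}+\frac{3\gamma(1+\gamma)}{T(1-\gamma)^2}\right)\le\frac{1}{4L}.$$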

Thus

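$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\langle\nabla f(x(t)),x(t)-x^*\rangle\big] \le{}& \frac{1}{2\eta T}\|x(1)-x^*\|_2^2+\frac{\eta G^2}{2}+\frac{3\eta^2L}{1-\gamma}\left[\frac{\gamma(1+\gamma)}{1-\gamma}G^2+2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2\right]\\
&+\frac{1}{4L}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big].
\end{aligned}$$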

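By the convexity of $f$, we have $f(x(t))-f(x^*)\le\langle\nabla f(x(t)),x(t)-x^*\rangle$. Furthermore, since $f$ is $L$-smooth, we see that

$$f(x^*)\le f\Big(x(t)-\frac{1}{L}\nabla f(x(t))\Big)\le f(x(t))-\frac{1}{2L}\|\nabla f(x(t))\|_2^2,$$

which leads to $\|\nabla f(x(t))\|_2^2\le 2L\big(f(x(t))-f(x^*)\big)$. We then get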

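$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[f(x(t))-f(x^*)\big] \le{}& \frac{1}{2T}\sum_{t=1}^{T}\mathbb{E}\big[f(x(t))-f(x^*)\big]+\frac{1}{2\eta T}\|x(1)-x^*\|_2^2+\frac{\eta G^2}{2}\\
&+\frac{3\eta^2L}{1-\gamma}\left[\frac{\gamma(1+\gamma)}{1-\gamma}G^2+2\big(C_{\mathcal{A},n}+\sqrt{Q/d}\big)^2\sigma^2\right].
\end{aligned}$$

By subtracting $\frac{1}{2T}\sum_{t=1}^{T}\mathbb{E}\big[f(x(t))-f(x^*)\big]$ from both sides of the inequality, and using the bound $f\big(\frac{1}{T}\sum_{t=1}^{T}x(t)\big)\le\frac{1}{T}\sum_{t=1}^{T}f(x(t))$ that follows from the convexity of $f$, we get the final bound.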

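Nonconvex case: Denote $\tilde{x}(t)=x(t)-e(t)$, and it can be checked that $\tilde{x}(t+1)=\tilde{x}(t)-\eta g(t)$. Since $f$ is $L$-smooth, we get

$$f(\tilde{x}(t+1))\le f(\tilde{x}(t))-\eta\langle\nabla f(\tilde{x}(t)),g(t)\rangle+\frac{\eta^2L}{2}\|g(t)\|_2^2.$$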

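By taking the expectation and using the unbiasedness of $g(t)$ and Assumption 2, we see that

$$\begin{aligned}
\mathbb{E}\big[f(\tilde{x}(t+1))\big] \le{}& \mathbb{E}\big[f(\tilde{x}(t))\big]-\eta\,\mathbb{E}\big[\langle\nabla f(\tilde{x}(t)),\nabla f(x(t))\rangle\big]+\frac{\eta^2L}{2}\Big(\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+G^2\Big)\\
={}& \mathbb{E}\big[f(\tilde{x}(t))\big]-\eta\,\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]-\eta\,\mathbb{E}\big[\langle\nabla f(\tilde{x}(t))-\nabla f(x(t)),\nabla f(x(t))\rangle\big]\\
&+\frac{\eta^2L}{2}\Big(\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+G^2\Big)\\
\le{}& \mathbb{E}\big[f(\tilde{x}(t))\big]-\frac{\eta(1-\eta L)}{2}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+\frac{\eta}{2}\mathbb{E}\big[\|\nabla f(\tilde{x}(t))-\nabla f(x(t))\|_2^2\big]+\frac{\eta^2L}{2}G^2\\
\le{}& \mathbb{E}\big[f(\tilde{x}(t))\big]-\frac{\eta(1-\eta L)}{2}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]+\frac{\eta L^2}{2}\mathbb{E}\big[\|e(t)\|_2^2\big]+\frac{\eta^2L}{2}G^2,
\end{aligned}$$

where in the second inequality we used $-\langle a,b\rangle\le\frac{1}{2}\|a\|_2^2+\frac{1}{2}\|b\|_2^2$, and in the last inequality we used the $L$-smoothness of $f$. By taking the telescoping sum, we get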

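$$\frac{1-\eta L}{2}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(x(t))\|_2^2\big]\le\frac{f(x(1))-f^*}{\eta T}+\frac{\eta LG^2}{2}+\frac{L^2}{2}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|e(t)\|_2^2\big].$$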

After plugging in the bound in Lemma 4, we get

 1−ηL21TT∑t=1E[∥∇f(x(t))∥22]≤ f(x(1))−f∗η+ηLG22+η2L21−γ[γ(1+γ)1−γG2+2(C