1 Introduction
Large-scale distributed stochastic optimization plays a fundamental role in recent advances in machine learning, allowing models of vast size to be trained on massive datasets by multiple machines. Meanwhile, the past few years have witnessed explosive growth in networks of IoT devices such as smartphones, self-driving cars, robots, unmanned aerial vehicles (UAVs), etc., which are capable of data collection and processing for many learning tasks. In many of these applications, due to privacy concerns, it is preferable that the local edge devices learn the model by cooperating with the central server without sending their own data to it. Moreover, communication between the edge devices and the server often takes place over wireless channels, which are lossy and unreliable in nature and have limited bandwidth, imposing significant challenges, especially for high-dimensional problems.
To address the communication bottlenecks, researchers have investigated communication-efficient distributed optimization methods for large-scale problems, both in the device-server setting [1, 2] and in the peer-to-peer setting [3, 4]. In this paper, we consider the device-server setting, where a group of edge devices is coordinated by a central server.
Most existing techniques for the device-server setting can be classified into two categories. The first category aims to reduce the number of communication rounds, based on the idea that each edge device runs multiple local SGD steps in parallel before sending its local updates to the server for aggregation. This approach has been called FedAvg [1] in federated learning, and its convergence has been studied in [5, 6, 7]. Another line of work investigates lazy/adaptive uploading of information, i.e., local gradients are uploaded only when found to be informative enough [8]. The second category focuses on efficient compression of the gradient information transmitted from the edge devices to the server. Commonly adopted compression techniques include quantization [9, 10, 11] and sparsification [12, 13, 14]. These techniques can be further classified according to whether the gradient compression yields biased [9, 14] or unbiased [10, 13] gradient estimators. To handle the bias and boost convergence, [12, 15] introduced the error feedback method, which accumulates and corrects the error caused by gradient compression at each step. Two recent papers [16, 17] employ sketching methods for gradient compression. Specifically, each device compresses its local stochastic gradient by count sketch [18] via a common sketching operator, and the server recovers the indices and values of the large entries of the aggregated stochastic gradient from the gradient sketches. However, the theoretical guarantees of count sketch were developed for recovering one fixed signal by randomly generating a sketching operator from a given probability distribution. During SGD, the gradient signals are constantly changing, making it impractical to generate a new sketching operator for every SGD iteration. These papers therefore apply a single sketching operator to all gradients throughout the optimization procedure, sacrificing the theoretical guarantees. Further, there is limited understanding of the performance when there is transmission error/noise on the uploading links.
Our Contributions. We propose a distributed SGD-type algorithm that employs compressed sensing for gradient compression. Specifically, we adopt compressed sensing techniques for the compression of the local stochastic gradients at the device side and the reconstruction of the aggregated stochastic gradient at the server side. The use of compressed sensing enables the server to approximately identify the top entries of the aggregated gradient without directly querying each local gradient. Our algorithm also integrates an error feedback strategy at the server side to handle the bias introduced by compression, while keeping the edge devices stateless. We provide a convergence analysis of our algorithm in the presence of additive noise incurred by the uploading communication channels, and conduct numerical experiments that justify its effectiveness.
Besides the related work discussed above, it is worth noting that a recent paper [19] uses compressed sensing for zeroth-order optimization, which exhibits a mathematical structure similar to this study. However, [19] considers the centralized setting and only establishes convergence to a neighborhood of the minimizer.
Notations: For $x \in \mathbb{R}^d$, $\|x\|$ denotes its $\ell_2$ norm, and $x_{[k]}$ denotes its best $k$-sparse approximation, i.e., the vector that keeps the top $k$ entries of $x$ in magnitude with the other entries set to $0$.

2 Problem Setup
Consider a group of $n$ edge devices and a server. Each device $i$ is associated with a differentiable local objective function $f_i : \mathbb{R}^d \to \mathbb{R}$, and is able to query stochastic gradients $g^i(x)$ satisfying $\mathbb{E}[g^i(x)] = \nabla f_i(x)$. Between each device and the server are an uploading communication link and a broadcasting communication link. The goal is to solve

$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \qquad (1)$$
through queries of stochastic gradients at each device and exchange of information between the server and each device.
One common approach for our problem setup is the stochastic gradient descent (SGD) method: at each time step $t$, the server first broadcasts the current iterate $x_t$ to all devices; each device $i$ then produces a stochastic gradient $g_t^i = g^i(x_t)$ and uploads it to the server, after which the server updates $x_{t+1} = x_t - \eta_t \cdot \frac{1}{n} \sum_{i=1}^{n} g_t^i$. However, as the server needs to collect the local stochastic gradients from every device at each iteration, vanilla SGD may encounter a significant bottleneck on the uploading links if the dimension $d$ is very large. This issue may be further exacerbated if the server and the devices are connected via lossy wireless networks of limited bandwidth, which is the case for many IoT applications.
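As a concrete baseline, the vanilla distributed SGD loop just described can be sketched as follows (a minimal NumPy sketch with synthetic quadratic local objectives; the dimensions, step size, and function names are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_devices, eta, T = 20, 4, 0.1, 200

# Hypothetical local objectives f_i(x) = 0.5 * x^T diag(D_i) x.
diags = [rng.uniform(0.5, 1.5, size=d) for _ in range(n_devices)]

def stochastic_grad(i, x):
    """Unbiased stochastic gradient of f_i at x (small additive query noise)."""
    return diags[i] * x + 0.01 * rng.standard_normal(d)

x = rng.standard_normal(d)
for _ in range(T):
    # Server broadcasts x; every device uploads a full d-dimensional gradient,
    # which is exactly the uplink bottleneck discussed above.
    g = np.mean([stochastic_grad(i, x) for i in range(n_devices)], axis=0)
    x = x - eta * g  # server-side SGD update
```

Each round uploads $d$ numbers per device over the uplink, which is what motivates compressing the uploads in the sequel.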
In this work, we investigate the situation where the communication links, particularly the uploading links from each edge device to the server, have limited bandwidth that can significantly slow down the whole optimization procedure, and where the data transmitted through each uploading link may be corrupted by noise. Our goal is to develop an SGD-type algorithm for solving (1) that achieves better communication efficiency over the uploading links.
3 Algorithm
Our algorithm is outlined in Algorithm 1, which is based on the SGD method with the following major ingredients:


Compression of local stochastic gradients using compressed sensing techniques. Here each edge device $i$ compresses its local stochastic gradient $g_t^i$ by $y_t^i = A g_t^i$ before uploading it to the server. The matrix $A \in \mathbb{R}^{m \times d}$ is called the sensing matrix, and its number of rows $m$ is strictly less than the number of columns $d$. As a result, the communication burden of uploading the local gradient information is reduced.
We emphasize that Algorithm 1 employs the for-all scheme of compressed sensing, which allows a single $A$ to be used for the compression of all local stochastic gradients (see Section 3.1 for more details on the for-each and for-all schemes).
After collecting the compressed local gradients and obtaining their noisy aggregate (corrupted by communication channel noise), the server recovers a vector by a compressed sensing algorithm, which is then used to update the iterate.

Error feedback of compressed gradients. In general, the compressed sensing reconstruction introduces a nonzero bias into the SGD iterations that hinders convergence. To handle this bias, we adopt the error feedback method of [12, 15], modified similarly to FetchSGD [17]. The resulting error feedback procedure is carried out purely at the server side, without knowledge of the true aggregated stochastic gradients.
Note that the aggregated vector is corrupted by additive noise from the uploading links. This noise model covers a variety of communication schemes, including digital transmission with quantization and over-the-air transmission for wireless multi-access networks [14].
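The server-side error feedback idea can be illustrated in isolation with a generic biased compressor, in the spirit of [15]; here a top-k operator stands in for the compressed sensing reconstruction, and all names and constants are illustrative:

```python
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(1)
d, eta, k, T = 50, 0.1, 5, 1000
x = rng.standard_normal(d)      # iterate
mem = np.zeros(d)               # server-side error accumulator

for _ in range(T):
    g = x + 0.01 * rng.standard_normal(d)  # stochastic gradient of 0.5*||x||^2
    corrected = eta * g + mem              # add back previously dropped error
    delta = topk(corrected, k)             # biased compression of the update
    mem = corrected - delta                # remember what the compressor dropped
    x = x - delta                          # apply only the compressed update
```

Without the `mem` correction, the bias of top-k compression can stall convergence; with it, the dropped mass is eventually applied.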
We now provide more details on our algorithm design.
3.1 Preliminaries on Compressed Sensing
Compressed sensing [20] is a technique that allows efficient sensing and reconstruction of an approximately sparse signal. Mathematically, in the sensing step, a signal $x \in \mathbb{R}^d$ is observed through linear measurements $y = Ax + e$, where $A \in \mathbb{R}^{m \times d}$ is a pre-specified sensing matrix with $m < d$, and $e$ is additive noise. Then, in the reconstruction step, one recovers the original signal by approximately solving
$$\min_{z \in \mathbb{R}^d} \; \|z\|_0 \quad \text{s.t.} \quad \|y - Az\| \le \epsilon, \qquad (2)$$

$$\min_{z \in \mathbb{R}^d} \; \|y - Az\| \quad \text{s.t.} \quad \|z\|_0 \le s, \qquad (3)$$

where the constraint $\|z\|_0 \le s$ restricts the number of nonzero entries in $z$.
Both (2) and (3) are NP-hard nonconvex problems [21, 22], and researchers have proposed various compressed sensing algorithms for obtaining approximate solutions. As discussed below, the reconstruction error heavily depends on i) the design of the sensing matrix $A$, and ii) whether the signal can be well approximated by a sparse vector.
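As a toy illustration of the sense-then-reconstruct pipeline, the following sketch recovers an exactly sparse signal with plain iterative hard thresholding (IHT), a simpler relative of the FIHT algorithm used later; the Gaussian sensing matrix and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, s = 128, 64, 3

# Ground-truth s-sparse signal with well-separated nonzero magnitudes.
x = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
x[support] = rng.uniform(1.0, 2.0, size=s) * rng.choice([-1.0, 1.0], size=s)

A = rng.standard_normal((m, d)) / np.sqrt(m)  # columns have roughly unit norm
y = A @ x                                     # noiseless linear measurements

def iht(y, A, s, iters=200):
    """Iterative hard thresholding: gradient step, then keep the top-s entries."""
    z = np.zeros(A.shape[1])
    for _ in range(iters):
        z = z + A.T @ (y - A @ z)
        keep = np.argsort(np.abs(z))[-s:]
        pruned = np.zeros_like(z)
        pruned[keep] = z[keep]
        z = pruned
    return z

x_hat = iht(y, A, s)
```

Approximate solvers such as IHT and FIHT sidestep the NP-hardness of (2)-(3) and succeed with high probability when $A$ behaves well on sparse vectors, e.g., when it satisfies the RIP discussed below.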
Design of the sensing matrix $A$. Compressed sensing algorithms can be categorized into two schemes [23]: i) the for-each scheme, in which a probability distribution over sensing matrices is designed to provide the desired reconstruction guarantee for a fixed signal, and every time a new signal is to be measured and reconstructed, one needs to randomly generate a new $A$; ii) the for-all scheme, in which a single $A$ is used for the sensing and reconstruction of all possible signals (a more detailed explanation of the two schemes can be found in Appendix D). We mention that count sketch is an example of a for-each scheme algorithm. In this paper, we choose the for-all scheme so that the server does not need to send a new matrix to each device at every iteration. To ensure that the linear measurements can discriminate between approximately sparse signals, researchers have proposed the restricted isometry property (RIP) [20] as a condition on $A$:
Definition 1.
We say that $A$ satisfies the $(s, \delta)$-restricted isometry property if $(1-\delta)\|x\|^2 \le \|Ax\|^2 \le (1+\delta)\|x\|^2$ for any $x$ that has at most $s$ nonzero entries.
The restricted isometry property of $A$ is fundamental for analyzing the reconstruction error of many compressed sensing algorithms under the for-all scheme [24].
Metric of sparsity. The classical metric of sparsity is the $\ell_0$ norm, defined as the number of nonzero entries. However, in our setup the vectors to be compressed are in general only approximately sparse, which the $\ell_0$ norm cannot handle, as it is not stable under small perturbations. We instead adopt the following sparsity metric from [25], with an added $1/d$ scaling factor so that its value is at most $1$:

$$\mathrm{sp}(x) = \frac{\|x\|_1^2}{d \, \|x\|_2^2}. \qquad (4)$$

The continuity of $\mathrm{sp}$ indicates that it is robust to small perturbations of $x$, and it can be shown that $\mathrm{sp}$ is Schur-concave, meaning that it characterizes the approximate sparsity of a signal. This metric has also been used in [25] for the performance analysis of compressed sensing algorithms.
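As a quick numerical illustration — assuming the metric is the ratio $\|x\|_1^2 / \|x\|_2^2$ of [25] scaled by $1/d$, and with `sp` an illustrative function name — the metric equals $1/d$ for a 1-sparse vector and $1$ for a fully dense constant vector:

```python
import numpy as np

def sp(x):
    """Scaled sparsity metric ||x||_1^2 / (d * ||x||_2^2), taking values in (0, 1]."""
    d = x.size
    return np.linalg.norm(x, 1) ** 2 / (d * np.linalg.norm(x, 2) ** 2)

d = 100
spike = np.zeros(d)
spike[0] = 3.0            # maximally sparse: sp = 1/d = 0.01
flat = np.ones(d)         # maximally dense: sp = 1.0
print(sp(spike), sp(flat))
```

Unlike the $\ell_0$ norm, perturbing every entry of `spike` by a tiny amount changes `sp` only slightly.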
3.2 Details of Algorithm Design
Generation of $A$. As mentioned before, we choose compressed sensing under the for-all scheme for gradient compression and reconstruction. We require that the sensing matrix $A$ have a low storage cost, since it will be transmitted to and stored at each device; $A$ should also satisfy the RIP so that the compressed sensing algorithm has good reconstruction performance. The following proposition suggests a storage-friendly approach for generating matrices satisfying the RIP.
Proposition 1 ([26]).

Let $B$ be an orthogonal matrix with entries of absolute value $O(1/\sqrt{d})$, and let $\delta > 0$ be sufficiently small. For an appropriate number of rows $m$ (whose expression, given in [26], hides logarithmic dependence on $d$), let $A$ be a matrix whose rows are chosen uniformly and independently from the rows of $B$, multiplied by $\sqrt{d/m}$. Then, with high probability, $A$ satisfies the RIP.

This proposition indicates that we can choose a "base matrix" $B$ satisfying the condition in Proposition 1, and then randomly choose $m$ of its rows to form $A$. In this way, $A$ can be stored or transmitted via merely the corresponding row indices in $B$. Note that Proposition 1 only requires $m$ to have logarithmic dependence on $d$. Candidates for the base matrix include the discrete cosine transform (DCT) matrix and the Walsh-Hadamard transform (WHT) matrix, as both the DCT and the WHT and their inverses have fast algorithms of time complexity $O(d \log d)$, implying that multiplication of $B$ or $B^\top$ with any vector can be completed in $O(d \log d)$ time.
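A sketch of this construction with a DCT-II base matrix (the explicit matrix build, the $\sqrt{d/m}$ rescaling, and all sizes here are illustrative assumptions; in practice one would apply a fast transform instead of forming $B$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 16

# Orthogonal DCT-II base matrix B; all entries have magnitude O(1/sqrt(d)).
k, n = np.meshgrid(np.arange(d), np.arange(d), indexing="ij")
B = np.sqrt(2.0 / d) * np.cos(np.pi * (2 * n + 1) * k / (2 * d))
B[0, :] = 1.0 / np.sqrt(d)    # first row gets the special normalization

# Sensing matrix: m uniformly chosen rows of B, rescaled; only `rows`
# (m integers) needs to be stored at or transmitted to each device.
rows = rng.choice(d, size=m, replace=False)
A = np.sqrt(d / m) * B[rows, :]

g = rng.standard_normal(d)    # a local stochastic gradient
y = A @ g                     # compressed upload: m numbers instead of d
```

The device only ever needs `rows` and a fast DCT routine, never the dense matrix.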
Choice of the compressed sensing algorithm. We choose the Fast Iterative Hard Thresholding (FIHT) algorithm [27] for reconstruction (see Appendix E for a brief summary). Our experiments suggest that FIHT achieves a good balance between computational efficiency and empirical reconstruction error compared to the other algorithms we tried.
We note that FIHT has a tunable parameter $s$ that controls the number of nonzero entries of its output. This parameter should accord with the sparsity of the vector to be recovered (see Section 3.3 for theoretical results). In addition, the server can broadcast the sparse reconstructed vector instead of the whole iterate for the edge devices to update their local copies, which saves communication over the broadcasting links.
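Broadcasting only the nonzero entries can be done by sending (index, value) pairs; a minimal sketch with illustrative helper names:

```python
import numpy as np

def encode_sparse(v):
    """Pack a sparse vector as (indices, values) for the broadcast link."""
    idx = np.flatnonzero(v)
    return idx, v[idx]

def decode_sparse(idx, vals, d):
    """Rebuild the dense vector on the device side."""
    v = np.zeros(d)
    v[idx] = vals
    return v

d = 8
delta = np.array([0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0])
idx, vals = encode_sparse(delta)
restored = decode_sparse(idx, vals, d)
```

For an $s$-sparse update this costs roughly $2s$ numbers over the broadcast link instead of $d$.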
Error feedback. We adopt error feedback to facilitate the convergence of Algorithm 1. The following lemma verifies that Algorithm 1 indeed incorporates the error feedback steps of [15]; the proof is straightforward and omitted here.
Lemma 1.
3.3 Theoretical Analysis
First, we make the following assumptions on the objective function $f$, the stochastic gradients, and the communication channel noise:
Assumption 1.
The function $f$ is $L$-smooth over $\mathbb{R}^d$, i.e., there exists $L > 0$ such that $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$.
Assumption 2.
Denote $g_t = \frac{1}{n} \sum_{i=1}^{n} g_t^i$. There exists $\sigma > 0$ such that $\mathbb{E}\big[\|g_t\|^2\big] \le \sigma^2$ for all $t$.
Assumption 3.
The communication channel noise $e_t$ satisfies $\mathbb{E}\big[\|e_t\|^2\big] \le \sigma_e^2$ for each $t$.
Our theoretical analysis will be based on the following result on the reconstruction error of FIHT:
Lemma 2 ([27, Corollary I.3]).
Let $s$ be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix $A$ satisfies the RIP for sufficiently small $\delta$. Then, for any signal $x$ and noise $e$,

$$\big\| \mathrm{FIHT}(Ax + e) - x \big\| \le C_1 \frac{\|x - x_{[s]}\|_1}{\sqrt{s}} + C_2 \|e\|, \qquad (6)$$

where $C_1$ and $C_2$ are positive constants that depend on $\delta$.
We are now ready to establish convergence of Algorithm 1.
Theorem 1.
Let be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix satisfies RIP for sufficiently small . Furthermore, assume that
(7) 
for all for some , where is defined in Lemma 1. Then for sufficiently large , by choosing , we have that
In addition, if is convex and has a minimizer , then we further have
where and .
Remark 1.
Theorem 1 requires the quantity in (7) to remain sufficiently low. This condition is hard to check and can be violated in practice (see Section 4). However, our numerical experiments suggest that even if condition (7) is violated, Algorithm 1 may still exhibit relatively good convergence behavior when the gradient itself has a relatively low sparsity level. Theoretical investigation of these observations is an interesting future direction.
4 Numerical Results
4.1 Test Case with Synthetic Data
We first conduct numerical experiments on a test case with synthetic data. The local objective functions are quadratic, each defined by a diagonal matrix. We generate the local matrices so that the diagonal of their sum is approximately sparse, while the diagonal of each individual matrix is dense; the stochastic gradients are constructed to be approximately sparse at every query. We refer to Appendix C for details on the test case, including the dimension and the number of edge devices.
We test three algorithms: uncompressed vanilla SGD, Algorithm 1, and SGD with count sketch. The latter simply replaces the gradient compression and reconstruction steps of Algorithm 1 with the count sketch method [18]; the sparsity parameter is set identically for Algorithm 1 and SGD with count sketch. For Algorithm 1, we generate the sensing matrix from the WHT matrix and use the FFHT library [28] for fast WHT. The step size is the same for all three algorithms, and the initial point is the origin.
In the figures, shaded areas denote sigma confidence intervals.
Figure 1(a) illustrates the convergence of the three algorithms without communication channel noise. The number of measurements for Algorithm 1 and the sketch size for SGD with count sketch fix their respective compression rates; we see that Algorithm 1 has better convergence behavior while also achieving a higher compression rate than SGD with count sketch. Our numerical experiments suggest that, for approximately sparse signals, FIHT can achieve higher reconstruction accuracy and more aggressive compression than count sketch, and for signals that are not very sparse, FIHT also appears more robust.
Figure 1(b) shows the evolution of the sparsity levels appearing in condition (7) for Algorithm 1. We see that this quantity is small for the first few iterations, then increases and stabilizes, which suggests that condition (7) is likely violated for large $t$. On the other hand, Fig. 1(a) shows that Algorithm 1 still achieves relatively good convergence behavior. This indicates a gap between the theoretical results of Section 3.3 and the empirical results, and suggests that our analysis could be improved. We leave the relevant investigation as future work.
Figure 1(c) illustrates the convergence of Algorithm 1 under different levels of communication channel noise, where the entries of the noise are i.i.d. zero-mean Gaussian. We see that the convergence of Algorithm 1 deteriorates only gradually as the noise level increases, suggesting robustness against communication channel noise.
4.2 Test Case of Federated Learning with CIFAR10 Dataset
We implement our algorithm on a residual network with 668,426 trainable parameters in two different settings. We primarily use TensorFlow and MPI in the implementation (details about the specific experimental setup can be found in Appendix C). In addition, we present only upload compression results here; download compression is not as significant as upload compression (given that download speeds are generally higher than upload speeds), and in our case the download compression rate is fixed by the choice of the sparsity parameter. For both settings, we use the CIFAR10 dataset (60,000 32×32×3 images in 10 classes) with a 50,000/10,000 train/test split.

In the first setting, we instantiate 100 workers and split the CIFAR10 training dataset such that all local datasets are i.i.d. As seen in Figure 2, our algorithm achieves 2× upload compression with marginal effect on the training and testing accuracy over 50 epochs. As the compression rate increases, the convergence of our method gradually deteriorates (echoing the results in the synthetic case). For comparison, we also show the results of using Count Sketch in lieu of FIHT, with the numbers of rows and columns chosen to match the desired compression rate. In our setting, while uncompressed (1×) Count Sketch performs well, it is very sensitive to higher compression, diverging for 1.1× and 1.25× compression.

In the second setting, we split the CIFAR10 dataset into 10,000 sets of 5 images of a single class and assign these sets to 10,000 workers. Each epoch consists of 100 rounds, with 1% (100) of the workers participating in each round. In Figure 3, similar to the i.i.d. setting, we see that the training accuracy of our algorithm deteriorates gradually with higher compression. As is typical of the non-i.i.d. setting, the testing accuracy is not as high as in the i.i.d. setting. However, we note that FIHT achieves 10× compression with negligible effect on the testing accuracy. In addition, Count Sketch diverges even for small compression rates in this problem setting.
5 Conclusion
In this paper, we developed a communication-efficient SGD algorithm based on compressed sensing. The algorithm admits several direct variants: for example, momentum can be directly incorporated, and when the number of devices is very large, the server can query compressed stochastic gradients from a randomly chosen subset of workers.
Our convergence guarantees require the sparsity level in (7) to remain persistently low, which is hard to check in practice. The numerical experiments also show that our algorithm can work even if this quantity grows to a relatively high level. These observations suggest that our theoretical analysis can be further improved, which will be an interesting future direction.
Appendix A Auxiliary Results
In this section, we provide some auxiliary results for the proof of Theorem 1. We first give an alternative form of the reconstruction error bound, derived from condition (7) and the performance guarantee (6).
Lemma 3.
Proof.
Lemma 4.
We have
Proof.
By definition, we have
where the first inequality follows from Lemma 3, and the second inequality follows from the definition of and the assumption that . Notice that
which leads to
By and Assumption 2, we have
Therefore
By summing over and noting that and , we get
which then leads to the desired result. ∎
Appendix B Proof of Theorem 1
Convex case: Denote , and it can be checked that
We then have
By taking the expectation and noting and Assumption 2, we get
which leads to
where in the second inequality we used . Now, we take the average of both sides over and plug in the bound in Lemma 4 to get
By , we can show that for sufficiently large ,
Thus
By the convexity of , we have . Furthermore, since is smooth, we see that
which leads to . We then get
By subtracting from both sides of the inequality, and using the bound that follows from the convexity of , we get the final bound.
Nonconvex case: Denote , and it can be checked that . Since is smooth, we get
By taking the expectation and using and Assumption 2, we see that
where in the second inequality we used , and in the last inequality we used the smoothness of . By taking the telescoping sum, we get
After plugging in the bound in Lemma 4, we get