I Introduction
With the explosive growth in data, the rapid advancement of algorithms (e.g., deep learning), and the step-change improvement of computing resources, artificial intelligence (AI) has achieved breakthroughs in a wide range of applications, including speech processing [64], image classification [94], [9], etc. AI is expected to affect significant segments of many vertical industries and our daily life, such as intelligent vehicles [205] and tactile robots [69]. In addition, it is anticipated that AI could add around 16 percent or about trillion to the global gross domestic product (GDP) by 2030, compared with that of 2018 [18].
The explosive growth of data generated by the massive number of end devices, e.g., smart phones, tablets, and Internet-of-Things (IoT) sensors, brings both opportunities and challenges for providing intelligent services. It is predicted that nearly 85 Zettabytes of usable data will be generated by all people, machines, and things by 2021, which would exceed the cloud data center traffic (21 Zettabytes) by a factor of about 4 [38]. Moreover, delay-sensitive intelligent applications, such as autonomous driving, cyber-physical control systems, and robotics, require fast processing of the incoming data. Such extremely high network bandwidth and low latency requirements would place unprecedented pressure on traditional cloud-based AI, where massive numbers of sensors and embedded devices transfer collected data to the cloud [171], often under varying network quality (e.g., bandwidth and latency). In addition, privacy is a major concern for cloud-based solutions. To address these problems, edge AI [214, 135] has emerged as a promising solution.
Futuristic wireless systems [99] mainly consist of ultra-dense edge nodes, including edge servers at the base stations and wireless access points, and edge devices such as smart phones, smart vehicles, and drones. Edge AI pushes the inference and training processes of AI models to the network edge, in close proximity to data sources. As such, the amount of data transferred to the cloud will be significantly reduced, thus alleviating the network traffic load, latency, and privacy concerns. Although training an AI model (e.g., a deep neural network (DNN)) generally requires intensive computing resources, the rapid development of mobile edge computing can provide cloud-computing capabilities at the edge of the mobile network [77, 117], and make the application of AI at the edge much more efficient. In addition, the computational capabilities of edge servers and edge devices continue to improve. Notable examples include the deployment of neural processing units (NPUs) in the Kirin 970 smart phone chip and Apple's A12 Bionic chip, which substantially accelerate AI computations on edge devices. In a nutshell, the advances of mobile edge computing platforms and the improvement of the computing power of edge nodes make edge AI a feasible solution.
Nevertheless, pushing AI towards the edge is a nontrivial task. The most straightforward way of realizing edge AI without any communication load, i.e., deploying the full AI models on edge devices, is often infeasible when the size of the AI model (e.g., DNNs) is too large or the computational requirement is too high, given the limited hardware resources of edge devices. A promising solution is to incorporate cooperation among edge nodes to accomplish edge AI tasks that require intensive computation and large storage sizes. This can be achieved by exploiting different data storage and processing capabilities for a wider range of intelligent services with distinct latency and bandwidth requirements [217], as shown in Fig. 1. For example, based on federated learning [92], we can use multiple devices to train an AI model collaboratively. Specifically, each device only needs to compute a local model according to its own data samples, before sending the computation results to a fusion center, where the global AI model is aggregated and updated. The new AI model is then transmitted back to each device for training at the next epoch. Such solutions exploit on-device computing power in a collaborative way, which, however, requires significant communication overhead during the model updating process. In addition, some computation-intensive AI inference tasks can only be accomplished by task splitting among edge devices and edge servers [87], which also incurs heavy communication cost. Therefore, the enormous communication overhead presents a major bottleneck for edge AI.
To unleash its full potential, the upcoming edge AI [214, 106] shall rely on advances in various aspects, including the smart design of distributed learning algorithms and system architectures, supported by efficient communication protocols. In this article, we survey the communication challenges in designing and deploying AI models and algorithms in edge AI systems. Specifically, we provide a thorough survey on communication-efficient distributed learning algorithms for training AI models at the edge. In addition, we provide an overview of edge AI system architectures for communication-efficient edge training and edge inference. In the next section, we start with the motivations and identify major communication challenges in edge AI. A paper outline will also be provided.
II Motivations and Challenges
In this section, we present the motivations and identify key communication challenges of edge AI. Interplays among computation mechanisms, learning algorithms, as well as system architectures, are revealed.
II-A Motivations
During the past decades, the thriving mobile internet has enabled various mobile applications such as mobile payment and mobile gaming. These applications in turn led to an upsurge of mobile devices and mobile data, which has fueled the prosperity of AI and greatly facilitated daily life. The 5G rollout has focused on several key services for connected things as its design targets: enhanced mobile broadband (eMBB), ultra-reliable low-latency communications (URLLC), and massive machine-type communications (mMTC). In contrast, futuristic 6G networks [99] will undergo a paradigm shift from connected things to connected intelligence. The network infrastructure of 6G is envisioned to fully exploit the potential of massive distributed devices and the data generated at network edges for supporting intelligent applications [214].
In recent years, a new trend has been to move computation tasks from the cloud center towards network edges, owing to the increasing computing power of edge nodes [117]. In the upcoming 5G wireless systems, there is a growing number of edge nodes, ranging from base stations (BSs) to various edge devices such as mobile phones, tablets, and IoT devices. The computational capabilities of edge mobile devices have seen substantial improvements thanks to the rapid development of mobile chipsets. For example, mobile phones nowadays have computing power comparable to that of computing servers a decade ago. In addition, edge servers have the potential to provide low-latency AI services for mobile users that are infeasible to implement directly on devices. Since edge servers have less powerful computation resources than cloud centers, it is necessary to employ joint design principles across edge servers and edge devices to further reduce execution latency and enhance privacy [117]. The advances in edge computing thus provide opportunities for pushing AI frontiers from the cloud center to network edges, stimulating a new research area known as edge AI, which includes both AI model training and inference procedures.
Training at network edges is challenging and requires coordinating massive numbers of edge nodes to collaboratively build a machine learning model [120]. Each edge node usually has access to only a small subset of the training data, which is the fundamental difference from traditional cloud-based model training [194]. The information exchange across edge nodes in edge training results in high communication cost, especially in wireless environments with limited bandwidth. This is a main bottleneck in edge training. It is interesting to note that a number of works have revisited communication theory to address the communication challenges of edge training. The connection between data aggregation from distributed nodes in edge training and the in-network computation problem [59] in wireless sensor networks was established in [192], which proposed an over-the-air computation approach for fast model aggregation in each round of training for on-device federated learning. In wireless communication systems, limited feedback [113] from a receiver to a transmitter is critical for reducing the number of information bits needed to realize channel-agile techniques that require channel knowledge at the transmitter. A connection between limited feedback in wireless communication and the quantization method was established in [51] for reducing the data transmission cost in edge training, which borrows ideas from the widely adopted Grassmannian quantization approach for limited feedback.
Edge inference, i.e., performing inference of AI models at network edges, enjoys the benefits of low latency and enhanced privacy, which are critical for a wide range of AI applications such as drones and smart vehicles. As such, it has drawn significant attention from both academia and industry. Recently, deep learning models have been actively adopted in a number of applications to provide high-quality services for mobile users. For example, AI technologies have shown promise in healthcare [147], such as the detection of heart failure [35] with recurrent neural networks (RNNs) and decisions about patient treatment [62] with reinforcement learning. However, deep neural network (DNN) models often have a huge number of parameters, which consume considerable storage and computation resources. A typical example is the classic convolutional neural network (CNN) architecture named AlexNet [94], which has over 60 million parameters. Therefore, model compression approaches [72, 31] have attracted much attention for deploying DNN models at network edges. It should also be noted that the power budget of edge devices is limited, which stimulates research on energy-efficient processing of deep neural networks from a signal processing perspective [171]. For IoT devices without enough memory to store the entire model, coding techniques shed light on efficient data shuffling for distributed inference across edge nodes [102, 190].
II-B Performance Measurements and Unique Challenges of Edge AI
The typical procedures for providing an AI service include training a machine learning model from data and performing inference with the trained model. The performance of a machine learning model can be measured by its model accuracy, which can potentially be improved by collecting more training data. However, training a machine learning model from massive data is time-consuming. To train a model efficiently, distributed architectures are often adopted, which introduce additional communication costs for exchanging information across nodes. The computation and communication costs grow extremely high for high-dimensional models such as deep neural networks. In addition, low latency is also critical for inference in applications such as smart vehicles and smart drones. We thus summarize the key performance measurements of edge AI in terms of model accuracy and total latency.
In the cloud center, cloud computing servers are connected by extremely high-bandwidth networks and the training data is available to all nodes. Fundamentally distinct from cloud-based AI, edge AI poses more stringent constraints on the algorithms and system architectures.

Limited resources on edge nodes: Unlike cloud-based AI, whose nodes are servers integrating a large number of powerful GPUs and CPUs, edge devices often have limited computation, storage, and power resources, and the links among the large number of edge devices and the edge servers at base stations and wireless access points have limited bandwidth. For example, the classic AlexNet [94], which is designed for computer vision, has over 60 million parameters. With 512 Volta GPUs interconnected at a rate of 56 Gbps, AlexNet can be trained in a record 2.6 minutes in the data center of SenseTime [169]. As one of the most powerful GPUs in the world, one Volta GPU has 5,120 cores. However, the Mali-G76 GPU on the Huawei Mate 30 Pro, one of the most powerful smart phones, has only 16 cores. The theoretical maximal speed envisioned in 5G is 10 Gbps and the average speed is only 50 Mbps.
Heterogeneous resources across edge nodes: The variability in hardware, network, and power budget of edge nodes implies heterogeneous communication, computation, storage, and power capabilities. The edge servers at base stations have much more computation, storage, and power resources than mobile devices. For example, the Apple Watch Series 5 can only afford up to 10 hours of audio playback (https://www.apple.com/ca/watch/battery/), and users may want to be involved in training tasks only when the devices are charged. To make things worse, users of edge devices that are connected to a metered cellular network are usually unwilling to exchange information with other edge nodes.

Privacy and security constraints: The privacy and security of AI services are increasingly vital, especially for emerging high-stakes applications in intelligent IoT. Operators expect stricter regulations and laws on preserving data privacy for service providers. For example, the General Data Protection Regulation (GDPR) [148] of the European Union grants users the right to have their data deleted or withdrawn. Federated learning [92, 194] has become a particularly relevant research topic for collaboratively building machine learning models while preserving data privacy. Robust algorithms and system designs have also been proposed in [29, 49] to address security concerns about adversarial attacks during distributed edge training.
Enabling efficient edge AI is challenging because edge nodes must be coordinated and scheduled to efficiently perform a training or inference task under various physical and regulatory constraints. To provide efficient AI services, we shall jointly design new distributed paradigms for computing, communications, and learning. Note that the communication cost for cloud-based AI services may be relatively small compared with the computational cost. However, in edge AI systems, the communication cost often becomes a dominating issue due to the stringent constraints. This paper gives a comprehensive survey on edge AI from the perspective of addressing communication challenges at both the algorithm level and the system level.
II-C Communication Challenges of Edge AI
Generally, there are multiple communication rounds between edge nodes for an edge AI task. Let $Z$ denote the total size of information to be exchanged per round, $B$ denote the communication rate, $N$ denote the number of communication rounds, and $T$ denote the total computation time. Then the total latency $L$ in an edge AI system is given by
$$L = \frac{Z}{B} \times N + T. \qquad (1)$$
For model training, iterative algorithms are often adopted which involve multiple communication rounds. The inference process often requires one round of collaborative computations across edge nodes. Therefore, to alleviate the communication overheads under resource and privacy constraints, it is natural to seek methods for reducing the number of communication rounds for training and the communication overhead per round for training and inference, as well as improving the communication rate.
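To make the latency expression (1) concrete, the following minimal sketch evaluates it for a hypothetical setting in which all of the roughly 60 million AlexNet parameters are exchanged in every round. The function name `total_latency`, the 32-bit parameter width, and the choice of a single round with zero computation time are assumptions made only for illustration; the two link rates are the 5G figures quoted earlier in this section.

```python
def total_latency(bits_per_round, rate_bps, num_rounds, compute_seconds):
    """Total latency = rounds * (payload per round / rate) + computation time."""
    if rate_bps <= 0:
        raise ValueError("rate_bps must be positive")
    return num_rounds * bits_per_round / rate_bps + compute_seconds


# Assumption: ~60 million parameters, each sent as a 32-bit float.
payload_bits = 60_000_000 * 32

for label, rate in [("5G average, 50 Mbps", 50e6), ("5G peak, 10 Gbps", 10e9)]:
    seconds = total_latency(payload_bits, rate, num_rounds=1, compute_seconds=0.0)
    print(f"{label}: {seconds:.3f} s of communication per round")
```

Under these assumptions a single uncompressed exchange takes tens of seconds at the average rate, which illustrates why each of the three levers (fewer rounds, smaller payload per round, higher rate) matters.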
From the end-to-end data transmission perspective, the information content of a message is measured by entropy, which characterizes the amount of uncertainty. Based on this measure, the limit of lossless source coding is characterized by Shannon's source coding theorem [162]. It provides a perfect answer to the best we can do if we only focus on “how to transmit” instead of “what to transmit” from one node to another. That is, the fundamental limit of the end-to-end communication problem has already been solved when the edge AI system and algorithm are fixed.
However, communication is not isolated in edge AI. From the learning algorithm perspective, “what to transmit” determines the required communication overhead per round and the number of communication rounds. This learning-level perspective motivates the development of different algorithms to reduce the communication overhead per round and improve the convergence rate. For instance, many gradient-based algorithms have been proposed for accelerating the convergence of distributed training [202, 97]. In addition, lossy compression techniques such as quantization and pruning [72, 108] have drawn much attention recently to reduce the communication overhead per round.
Edge AI system design also has a great influence on the design of the communication paradigm across edge nodes. For instance, the target of communication in each round is to compute a certain function value with respect to the intermediate values at edge devices. In particular, the full gradient can be computed at a centralized node by aggregating the locally computed partial gradients at all local nodes. This problem is therefore better studied from the perspective of in-network computation [59], instead of treating communication and computation separately. For example, an over-the-air computation approach was developed in [192] for fast model aggregation in distributed model training for federated learning. In addition, efficient inference at network edges is closely related to computation offloading in edge computing [117], which is being extensively studied in both the communication and mobile computing communities.
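The idea of computing the aggregate directly "in the network" can be illustrated with a highly idealized simulation. The sketch below assumes unit channel gains, perfect synchronization, and additive Gaussian receiver noise; the function name `over_the_air_average` and all parameters are hypothetical, and the sketch is not the scheme of [192], only an illustration of why the receiver needs the sum rather than each individual message.

```python
import random


def over_the_air_average(local_gradients, noise_std=0.0, rng=None):
    """Idealized over-the-air aggregation.

    All devices transmit at once, the channel adds their signals, and the
    receiver rescales the noisy sum to obtain the averaged gradient.
    """
    rng = rng or random.Random(0)
    num_devices = len(local_gradients)
    dim = len(local_gradients[0])
    received = []
    for j in range(dim):
        # Superposition: the receiver only observes the sum over devices.
        superposed = sum(g[j] for g in local_gradients)
        received.append(superposed + rng.gauss(0.0, noise_std))
    return [r / num_devices for r in received]


# Three hypothetical devices, each holding a 2-dimensional partial gradient.
gradients = [[0.9, -1.2], [1.1, -0.8], [1.0, -1.0]]
print(over_the_air_average(gradients, noise_std=0.01))
```

In this idealization the channel is used once per coordinate regardless of the number of devices, whereas separate transmissions would scale with the number of devices.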
II-D Related Works and Our Contributions
There exist a few survey papers [214, 135, 125, 44] on edge AI. In particular, the early works [214, 135] emphasized the differences between cloud-based AI and edge AI. Zhou et al. [214] surveyed the technologies of training and inference for deep learning models at network edges. Park et al. [135] focused on the opportunities of utilizing edge AI for improving wireless communication, as well as realizing edge AI over wireless channels. Murshed et al. [125] mainly discussed different machine learning models and neural network models, different practical applications such as video analytics and image recognition, as well as various machine learning frameworks for enabling edge AI. Han et al. [44] further considered the convergence of edge computing and deep learning, i.e., deep learning techniques for edge computing, as well as edge computing techniques for deep learning.
Unlike existing survey papers [214, 135, 125, 44], we present comprehensive coverage of the communication challenges in realizing AI at network edges. Edge AI is far from a trivial task of merely adopting the same computation and communication techniques used in the cloud center. It requires a learning-performance-aware joint design of computation and communication. Both distributed learning algorithms and distributed computing system architectures shall be customized according to the considered AI model, data availability, and the heterogeneous resources at edge nodes to reduce communication overhead during training and inference. We summarize the research topics on edge AI as algorithm-level designs and system-level designs, which are listed more specifically as follows:

Algorithm level: At the algorithm level, the number of communication rounds for training a model can be reduced by accelerating convergence, while the communication overhead per round can be reduced by information compression techniques (e.g., sparsification, quantization, etc.). We first survey different types of edge AI algorithms, including zeroth-order, first-order, second-order, and federated optimization algorithms, as well as their applications in edge AI. For example, in the context of reinforcement learning, model-free methods turn the reinforcement learning problem into zeroth-order optimization [146]. Although first-order methods are widely used in DNN training, second-order methods and federated optimization become appealing in edge AI given the growing computational capabilities of devices. As we can see from Fig. 2, an algorithm closer to the right side can potentially achieve better accuracy with fewer communication rounds, at the cost of more computation resources per round. Note that we list federated optimization methods separately due to their unique motivation of protecting private data at each node. For each type of algorithm, there are a number of works focusing on further reducing the communication cost. We give a comprehensive survey at the algorithm level in Section III to address the communication challenges in edge AI.

System level: From the system perspective, data distribution (e.g., distributed across edge devices), model parameters (e.g., partitioned and deployed across edge devices and edge servers), computation (e.g., MapReduce), and communication mechanisms (e.g., aggregation at a central node) can be diverse in different applications. There are two main edge AI system architectures for training, i.e., the data partition system and the model partition system, based on the availability of data and model. After training the AI model, model deployment is critical for achieving low-latency AI services. There are also other general edge computing paradigms in edge AI systems that address the trade-off between computation and communication via coding techniques. Different types of communication problems arise from the deployment of machine learning algorithms on different system architectures, which typically involve the distributed mode and the decentralized mode depending on the existence of a central node. We shall survey various system-level approaches to achieve efficient communications in Section IV.
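The information compression techniques named at the algorithm level above (sparsification and quantization) can be sketched in a few lines. The two helpers below, `top_k_sparsify` and `sign_quantize`, are hypothetical names for generic textbook variants (keeping the largest-magnitude entries, and sending one shared scale plus one sign bit per entry); they are not taken from any specific work cited in this paper.

```python
def top_k_sparsify(vector, k):
    """Keep the k largest-magnitude entries; return (index, value) pairs."""
    by_magnitude = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    return [(i, vector[i]) for i in sorted(by_magnitude[:k])]


def sign_quantize(vector):
    """1-bit quantization: one shared scale plus the sign of each entry."""
    scale = sum(abs(v) for v in vector) / len(vector)
    signs = [1 if v >= 0 else -1 for v in vector]
    return scale, signs


update = [0.1, -2.0, 0.5, 3.0]
print(top_k_sparsify(update, k=2))  # only 2 of the 4 entries are transmitted
print(sign_quantize(update))        # one float plus 4 sign bits are transmitted
```

Both variants are lossy: the receiver reconstructs only an approximation of the update, trading accuracy per round for a smaller payload per round.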
We summarize the main topics and highlighted technologies included in this paper in Table I.
| Category | Topic | Representative Results |
|---|---|---|
| Communication-Efficient Algorithms for Edge AI | Zeroth-Order Methods | Optimal rates for zeroth-order convex optimization [52] |
| | | Distributed zeroth-order algorithms over time-varying networks [201, 153] |
| | First-Order Methods | Variance reduction for minimizing communication rounds [202, 97] |
| | | Gradient reuse for minimizing communication bandwidth [23, 24] |
| | | Relating gradient quantization to limited feedback in wireless communication [51] |
| | | Communicating only important gradients for minimizing communication bandwidth [21, 108] |
| | Second-Order Methods | Stochastic quasi-Newton methods [157, 19, 123] |
| | | Approximate Newton-type methods [161, 208, 183, 53] |
| | Federated Optimization | Federated averaging algorithm [120] and dual coordinate ascent algorithm [81] for minimizing communication rounds |
| | | Handling the system and statistical heterogeneity of distributed learning [155, 166] |
| | | Compressing DNN models with vector quantization [61], binary weights [40, 41, 143], randomized sketching [27, 26, 107], network pruning [96, 75, 73, 72, 176, 112, 2, 85], sparse regularization [95, 186, 132], and structural matrix designing for minimizing communication bandwidth [156, 46, 80, 45, 151, 7, 165, 32, 195] |
| Communication-Efficient Edge AI Systems | Data Partition Based Edge Training Systems | Fast aggregation via over-the-air computation [192, 218, 5, 6] |
| | | Aggregation frequency control with limited bandwidth and computation resources [182, 213, 180] |
| | | Data reshuffling via index coding and pliable index coding for improving training performance [98, 167, 84] |
| | | Straggler mitigation via coded computing [172, 196, 144, 71, 103, 133, 14, 115, 114] |
| | | Training in decentralized system mode [141, 42, 16, 15, 128, 1, 129, 137, 86, 159, 150] |
| | Model Partition Based Edge Training Systems | Model partition across a large number of nodes to balance computation and communication [121, 126, 79] |
| | | Model partition across edge device and edge server to avoid the exposure of users’ data [118, 179] |
| | | Vertical architecture for privacy with vertically partitioned data and model [194, 177, 88, 55, 198, 178, 74] |
| | Computation Offloading Based Edge Inference Systems | Server-based edge inference: |
| | | Device-edge joint inference: |
| | General Edge Computing Systems | Coding techniques for efficient data shuffling [104, 105, 102, 190, 101, 68, 82, 136, 140] |
| | | Coding techniques for straggler mitigation [136, 149, 211, 93] |
III Communication-Efficient Algorithms for Edge AI
Distributed machine learning has mainly been investigated in environments with abundant computing resources, large memory, and high-bandwidth networking, e.g., in cloud data centers. The extension to edge AI systems is highly nontrivial due to the isolated data at distributed mobile devices, limited computing resources, and the heterogeneity of communication links. Communication-efficient methods will be critical to exploit the distributed data samples and utilize the various available computing resources for achieving excellent learning performance. This section introduces communication-efficient approaches for edge AI at the algorithmic level, including zeroth-order methods, first-order methods, second-order methods, and federated optimization. As illustrated in Fig. 2, these methods achieve different trade-offs between the local computation cost and the communication cost. Fig. 3 illustrates the local operations and communication messages of the different methods.
III-A Communication-Efficient Zeroth-Order Methods
Zeroth-order (derivative-free) methods [130] are increasingly adopted in applications where only the function value is available, but the derivative information is computationally difficult to obtain or is not even well defined. For the distributed setting with a central coordinating center shown in Fig. 3(a), only a scalar function value needs to be transmitted to the central node in the uplink transmission. In the field of reinforcement learning, zeroth-order methods have been widely used for policy function learning without ever building a model [146]. Zeroth-order methods have also been adopted for black-box adversarial attacks on deep neural networks (DNNs), since most real-world systems do not release their internal DNN structure and weights [22].
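A minimal sketch of how optimization can proceed from function values alone is given below. It uses a generic two-point random-direction gradient estimator with a Gaussian direction; the function names, the smoothing parameter `mu`, the step size, and the quadratic test function are all assumptions chosen for illustration rather than settings from any cited work.

```python
import random


def two_point_gradient_estimate(f, x, mu=1e-4, rng=None):
    """Estimate the gradient of f at x from two scalar function values."""
    rng = rng or random.Random(0)
    u = [rng.gauss(0.0, 1.0) for _ in x]              # random direction
    x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_shift) - f(x)) / mu                  # only scalars are needed
    return [scale * ui for ui in u]


def zeroth_order_descent(f, x0, step=0.05, iters=1000, seed=0):
    """Gradient descent in which every gradient is replaced by its estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        g = two_point_gradient_estimate(f, x, rng=rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    """Hypothetical loss with minimizer at (1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


print(zeroth_order_descent(quadratic, [0.0, 0.0]))
```

In a distributed realization, a device could report only the scalar function values (together with a shared random seed for the direction), which is what makes the uplink payload so small.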
Fig. 3: $f$ is the loss function. The first-order derivative of function $f$ is denoted as $\nabla f$. a) Zeroth-order method: only the function value can be evaluated during training [130]. b) First-order method: gradient descent. c) Second-order method: DANE [161]. d) Federated optimization: federated averaging algorithm [120].
In zeroth-order optimization algorithms, the full gradients are typically estimated via gradient estimators based only on function values [39]. For instance, we can use the quantity $\frac{f(x + \mu u) - f(x)}{\mu} u$, with a random direction $u$ and a small smoothing parameter $\mu > 0$, to approximate the gradient of function $f$ at point $x$. It was shown in [52] that this kind of derivative-free algorithm only suffers a factor of at most $\sqrt{d}$ in the convergence rate over traditional stochastic gradient methods for $d$-dimensional convex optimization problems. Under time-varying random network topologies, recent studies [201, 153] have investigated distributed zeroth-order optimization algorithms for unconstrained convex optimization in multi-agent systems. Convex optimization with a set of convex constraints has been studied in [200, 134]. Nonconvex multi-agent optimization has been studied in [70] for different types of network topologies, including undirected connected networks and star networks, under the setting where each agent can only access the values of its local function.
To develop communication-efficient distributed zeroth-order optimization methods, a number of works have focused on reducing the number of per-device communications. For instance, it was proposed in [154] that at each iteration each device communicates with its neighbors with some probability that is independent of the other devices and of the past, and that this probability decays to zero at a carefully tuned rate. For such a distributed zeroth-order method, the convergence rate of the mean squared error of the solutions is established in terms of the communication cost, i.e., the number of per-node transmissions to neighboring nodes in the network, instead of the iteration number. The subsequent work [152] improved the convergence rate under additional smoothness assumptions. Quantization techniques have also been adopted to reduce the communication cost per communication round. The paper [47] considered a distributed gradient-free algorithm for multi-agent convex optimization, where the agents can only exchange quantized data due to limited bandwidth. In the extreme case considered in [28], each estimated gradient is further quantized into 1 bit, which enjoys high communication efficiency in distributed optimization scenarios.
III-B Communication-Efficient First-Order Methods
First-order optimization methods are the most commonly used algorithms in machine learning, and are mainly based on gradient descent methods as shown in Fig. 3(b). The idea of gradient descent methods is to iteratively update the variables in the opposite direction of the gradient of the loss function at the current point with an appropriate step size (a.k.a. learning rate). As the computational complexity at each iteration scales with the number of data samples and the dimension of the model parameter, it is generally infeasible to train large machine learning models with a tremendous number of training data samples on a single device. Therefore, distributed training techniques have been proposed to mitigate the computation cost, at the price of additional communication costs. Meanwhile, as training datasets become larger and larger, the stochastic gradient descent (SGD) method emerges as an appealing solution, in which only one training sample is used to compute the gradient at each iteration. In edge AI systems with inherently isolated data, distributed realizations of SGD will play a key role and should be carefully investigated.
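One simple distributed realization of SGD can be sketched as follows: in every round each device computes a stochastic gradient on one local sample, the server averages the received gradients, and the updated model is broadcast. The scalar linear model, the step size, and the toy data are assumptions made purely for illustration; the sketch is not a specific algorithm from the literature surveyed here.

```python
import random


def local_stochastic_gradient(w, samples, rng):
    """Gradient of 0.5 * (w * x - y)**2 at one randomly drawn local sample."""
    x, y = rng.choice(samples)
    return (w * x - y) * x


def distributed_sgd(device_data, step=0.1, rounds=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        # Uplink: every device sends one local stochastic gradient.
        grads = [local_stochastic_gradient(w, data, rng) for data in device_data]
        # Server: average the gradients, update, and broadcast the new model.
        w -= step * sum(grads) / len(grads)
    return w


# Hypothetical noise-free data generated by y = 3 * x, split over 3 devices.
device_data = [
    [(0.5, 1.5), (1.0, 3.0)],
    [(1.5, 4.5), (2.0, 6.0)],
    [(1.0, 3.0), (2.0, 6.0)],
]
print(distributed_sgd(device_data))
```

Note that this scheme communicates once per iteration, so the number of communication rounds equals the number of iterations.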
When applying first-order methods in large-scale distributed edge AI systems, the substantial demand for communication among devices for gradient exchange is one of the main bottlenecks. One way to address this issue is to reduce the number of communication rounds by accelerating the convergence of the learning algorithms. Another approach is to reduce the communication overhead per round, which includes gradient reuse, quantization, sparsification, and sketching-based compression methods. These two approaches are elaborated in the following.
III-B1 Minimizing Communication Round
We first consider an extreme case. The distributed optimization approach with the minimum number of communication rounds, i.e., only one communication round, is for each device to perform independent optimization. For example, each node adopts SGD to compute local model parameters, and a server then averages these model parameters at the end. As shown in [219], the overall run time decreases significantly as the number of devices increases for some learning tasks. Subsequently, it was shown in [207] that this one-round communication approach can achieve the same order-optimal sample complexity, in terms of the mean-squared error of the model parameters, as the centralized setting under a reasonable set of conditions. The order-optimal sample complexity can be obtained by performing a stochastic gradient-based method on each device [207]. However, one-round communication restricts the ability to exchange information during training, which is in general not sufficient for training large models (e.g., DNNs) to achieve the target accuracy in practice.
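The one-round scheme described above can be sketched as follows: each device runs SGD on its own data without any communication, and the server averages the resulting local models once. The scalar linear model, the synthetic data generator, and all parameter values are hypothetical and serve only to illustrate the communication pattern.

```python
import random


def local_sgd(samples, step=0.05, epochs=2, seed=0):
    """Plain SGD on one device's data for the loss 0.5 * (w * x - y)**2."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)
        for x, y in order:
            w -= step * (w * x - y) * x
    return w


def one_shot_average(device_data):
    """Single communication round: each device uploads its final local model."""
    local_models = [local_sgd(data, seed=k) for k, data in enumerate(device_data)]
    return sum(local_models) / len(local_models)


def make_device_data(num_devices=5, samples_per_device=200, true_w=2.0, seed=1):
    """Synthetic data y = true_w * x + noise, split evenly across devices."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_devices):
        xs = [rng.uniform(1.0, 2.0) for _ in range(samples_per_device)]
        data.append([(x, true_w * x + rng.gauss(0.0, 0.1)) for x in xs])
    return data


print(one_shot_average(make_device_data()))
```

On this easy convex toy problem a single averaging step already lands close to the true parameter; for large nonconvex models such as DNNs this generally does not hold.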
In general settings where devices upload their local gradients to a fusion center at each iteration, it is critical to reduce the number of communication rounds by accelerating the convergence of the algorithm. Shamir and Srebro [160] proposed to accelerate minibatch SGD by using the largest possible minibatch size that does not hurt the sample complexity, and showed that the communication cost decreases linearly with the size of the minibatch. Yuan et al. [202] proposed an amortized variance-reduced gradient algorithm for a decentralized setting, where each device collects data that is spatially distributed and all devices are only allowed to communicate with direct neighbors. In addition, a minibatch strategy is adopted in [202] to achieve communication efficiency. However, it has been shown in [89, 197] that minibatch sizes that are too large degrade the generalization of the model. In practice, additional effort is needed to reduce this generalization drop. For instance, it was shown in [63] that training with large minibatch sizes up to images achieves the same accuracy as small minibatch settings by adjusting the learning rate as a function of the minibatch size. This idea was also adopted in [206] to train DNNs for automatic speech recognition tasks using large batch sizes in order to accelerate the total training process.
The statistical heterogeneity of data hinders the fast convergence of first-order algorithms, and considerable effort has been devoted to this issue. Arjevani and Shamir [8] studied the scenario where each device has access to a different subset of data and the goal is to minimize the loss function averaged over all devices. They established a lower bound on the number of communication rounds, which is shown to be achieved by the algorithm of [210] for quadratic and strongly convex functions. How to design algorithms that are optimal in terms of communication efficiency for general functions remains an open problem. By utilizing additional storage space at the devices, Lee et al. [97] proposed to assign two subsets of data to each device: the first comes from a random partition, and the second is randomly sampled with replacement from the overall dataset. Since each device has access to both subsets, the authors proposed a distributed stochastic variance-reduced gradient method to minimize the number of communication rounds, in which the batch gradients are computed in parallel on different devices and the local data sampled with replacement is used to construct an unbiased stochastic gradient in each iterative update. For nonconvex optimization problems, Garber et al. [54] proposed a stochastic distributed algorithm for principal component analysis, which considerably reduces the number of communication rounds compared with previous distributed algorithms.
III-B2 Minimizing Communication Bandwidth
Another series of works focus on reducing the size of local updates from each device, thereby reducing the overall communication cost. In the following, we review three representative techniques, i.e., gradient reuse, gradient quantization, and gradient sparsification.


Gradient reuse: Consider minimizing a sum of smooth loss functions distributed among multiple devices. Observing that the gradients of some devices vary slowly between two consecutive communication rounds, the lazily aggregated gradient (LAG) method was proposed in [23], in which the fusion center reuses the outdated gradients of these devices. Specifically, these devices upload nothing during the current communication round, which significantly reduces the communication overhead per round. Theoretically, it was shown in [23] that LAG achieves the same order of convergence rate as batch gradient descent when the loss functions are strongly convex, convex, or nonconvex smooth. If the distributed datasets are heterogeneous, LAG can reach a target accuracy with considerably lower communication cost, measured as the total number of transmissions over all devices, than batch gradient descent. A similar gradient reuse idea was adopted in distributed reinforcement learning to achieve communication-efficient training [24].

Gradient quantization: To reduce the communication cost of gradient aggregation, scalar quantization methods have been proposed to compress the gradients into a small number of bits instead of using a floating-point representation. To estimate the mean of the gradient vectors collected from devices, Suresh et al. [170] analyzed the mean squared error of several quantization schemes from an information-theoretic perspective, without probabilistic assumptions on the data. From the viewpoint of distributed learning algorithms, Alistarh et al. [3] investigated quantized stochastic gradient descent (QSGD) to study the tradeoff between communication cost and convergence guarantees. Specifically, each device can trade off the number of bits sent per iteration against the variance that quantization adds. As demonstrated in [3], the expected number of bits transmitted by each device per iteration can be made much smaller than that of full-precision SGD while the variance increases only by a bounded factor; the resulting increase in the number of iterations is outweighed by the per-iteration savings, which leads to a net bandwidth saving. For distributed training of the shallowest neural network, consisting of a single rectified linear unit, it was shown in [124] that the quantized stochastic gradient method converges to the global optimum at a linear rate. Seide et al. [158] proposed to quantize the gradient using only one bit, achieving a 10 times speedup in distributed training of speech DNNs with a small accuracy loss. Theoretically, Bernstein et al. [12] provided a rigorous analysis of the sign-based distributed stochastic gradient descent algorithm, where each device sends the signs of its gradients to a fusion center and the sign of the aggregated gradient signs is returned to each device for updating the model parameters. This scheme is shown to achieve the same variance reduction as full-precision distributed SGD and to converge to a stationary point of a general nonconvex function. As pointed out in [173, 199], scalar quantization methods fail in decentralized networks without a central aggregation node. To address this issue, extrapolation compression and difference compression methods were proposed in [173], and a gradient vector quantization technique was proposed in [199] to exploit the correlations between CNN gradients. Vector quantization [56], which jointly quantizes all entries of a vector, can achieve the optimal rate-distortion tradeoff, but at the price of a complexity that increases with the vector length. Interestingly, it was found in [51] that Grassmannian quantization, a vector quantization method that has been widely adopted in wireless communication for limited feedback, can be applied to gradient quantization. Limited feedback studies the efficient feedback of quantized vectors from a receiver to a transmitter for channel-adaptive transmission, since the communication cost of feedback is extremely high in massive MIMO communication systems. This motivated [51] to develop an efficient Grassmannian quantization scheme for high-dimensional gradient compression in distributed learning.
Additionally, Jiang et al. [83] proposed to use the quantile sketch, a non-uniform quantization method, for gradient compression. A sketch approximates input data with a probabilistic data structure. In [83], the gradient values are summarized into a number of buckets, whose indices are further encoded in binary since the number of buckets is relatively small.
Gradient sparsification: The basic idea behind gradient sparsification is to communicate only the important gradients according to some criterion. This is based on the observation that many gradient entries are very small during training. Strom [168] proposed to leave out the gradients below a predefined constant threshold. Chen et al. [21] proposed AdaComp, which is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity. It was demonstrated that AdaComp can achieve high compression ratios for both fully-connected and convolutional layers without noticeable degradation of top-1 accuracy on the ImageNet dataset. Deep gradient compression was proposed in [108] based on a gradient sparsification approach, where only gradients exceeding a threshold are communicated, while the remaining gradients are accumulated locally until they reach the threshold. Several techniques, including momentum correction, local gradient clipping, momentum factor masking, and warm-up training, are adopted to preserve accuracy. This approach is shown to achieve large gradient compression ratios without losing accuracy for a wide range of DNNs and RNNs [108]. In [184], to keep the sparsified gradient unbiased, the authors proposed to randomly drop some coordinates of the stochastic gradient vector and amplify the remaining coordinates appropriately. For both convex and nonconvex smooth objectives, it was shown in [4] that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, thus providing a theoretical foundation for numerous empirical results on training large-scale recurrent neural networks in a wide range of applications.
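Two of the compression ideas reviewed above can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the implementation of any cited work: `stochastic_quantize` is a hypothetical unbiased stochastic quantizer in the spirit of scalar quantization (uniform levels on the normalized magnitudes), and `ThresholdSparsifier` is a hypothetical threshold-based sparsifier that accumulates unsent entries locally. The number of levels and the threshold are arbitrary choices.

```python
import math
import random


def stochastic_quantize(grad, levels=4, rng=random):
    """Quantize |g_i| / ||g||_2 onto `levels` uniform levels, rounding up or
    down at random so that the quantized vector is unbiased in expectation."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    quantized = []
    for g in grad:
        scaled = abs(g) / norm * levels          # value in [0, levels]
        lower = math.floor(scaled)
        # Round up with probability equal to the fractional part.
        level = lower + (1 if rng.random() < scaled - lower else 0)
        quantized.append(math.copysign(norm * level / levels, g))
    return quantized


class ThresholdSparsifier:
    """Send only accumulated entries whose magnitude reaches a threshold;
    everything else stays in a local residual for later rounds."""

    def __init__(self, dim, threshold):
        self.residual = [0.0] * dim
        self.threshold = threshold

    def compress(self, grad):
        sent = {}
        for i, g in enumerate(grad):
            self.residual[i] += g
            if abs(self.residual[i]) >= self.threshold:
                sent[i] = self.residual[i]       # transmit index and value
                self.residual[i] = 0.0           # reset what was transmitted
        return sent


sparsifier = ThresholdSparsifier(dim=3, threshold=1.0)
first = sparsifier.compress([0.6, 2.0, -0.1])    # only index 1 is large enough
second = sparsifier.compress([0.6, 0.0, -0.1])   # index 0 has now accumulated
print(first, second)
```

In the quantizer each entry needs only a sign and a small level index plus one shared norm, while the sparsifier transmits index-value pairs for a small subset of entries; both reduce the payload per round at the cost of added noise or delay in the update.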
III-C Communication-Efficient Second-Order Methods
First-order algorithms only require the computation of gradient-type updates, and thus keep the local computation at each device low. Their main drawback is that the required number of communication rounds is still large due to the slow convergence rate. This motivates exploiting second-order curvature information in distributed learning algorithms to improve the convergence rate of edge training. However, exact second-order methods require the computation, storage, and even communication of a Hessian matrix, which results in tremendous overhead. Therefore, one has to resort to approximate methods such as the one illustrated in Fig. 3(c) [161]. Works on communication-efficient second-order methods fall into two categories: one maintains a global approximation of the inverse Hessian matrix at the central node, and the other solves a second-order approximation problem locally at each device.
A common approach to developing approximate second-order methods builds on the well-known quasi-Newton method, namely limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [109], which avoids the high cost of inverting the Hessian matrix by directly estimating the inverse Hessian. When learning from large amounts of training data, developing a minibatch stochastic quasi-Newton method becomes critical. However, directly extending L-BFGS to a stochastic version does not yield a stable approximation of the inverse Hessian matrix. Schraudolph et al. [157] developed a stochastic L-BFGS for online convex optimization that avoids line search, which is often problematic in a stochastic algorithm, but its Hessian approximation may suffer from a high level of noise. To provide stable and productive Hessian approximations, Byrd et al. [19] developed a stochastic quasi-Newton method that updates the estimated inverse Hessian matrix only periodically, once every fixed number of iterations, using sub-sampled Hessian-vector products. The inverse Hessian matrix maintained at a central node is updated by collecting only a Hessian-vector product from each device. Moritz et al. [123] proposed a linearly convergent stochastic L-BFGS algorithm that obtains a more stable and higher-precision estimate of the inverse Hessian matrix, but it requires higher computation and communication overhead in each round.
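To make concrete how L-BFGS estimates the inverse Hessian without forming it, the following sketch implements the standard two-loop recursion on stored curvature pairs. The function name `lbfgs_direction` and the list-based vector representation are illustrative assumptions; in the stochastic variants discussed above, the differences `y` would typically be replaced by sub-sampled Hessian-vector products, which is not shown here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, s_list, y_list):
    """Return H * grad, where H is the implicit L-BFGS inverse-Hessian estimate.

    s_list[i] is a parameter difference and y_list[i] the matching gradient
    difference, ordered oldest first. Each pair must satisfy dot(s, y) > 0.
    """
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    pairs = list(zip(s_list, y_list, rhos))
    alphas = []
    # First loop: newest pair to oldest pair.
    for s, y, rho in reversed(pairs):
        alpha = rho * dot(s, q)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    if pairs:
        s_new, y_new, _ = pairs[-1]
        gamma = dot(s_new, y_new) / dot(y_new, y_new)
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest pair (alphas were stored newest first).
    for (s, y, rho), alpha in zip(pairs, reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + si * (alpha - beta) for ri, si in zip(r, s)]
    return r


# With stored pairs, the estimate maps the newest y back to the newest s
# (the secant condition), which is a convenient sanity check.
s_pairs = [[1.0, 0.5], [0.2, -0.4]]
y_pairs = [[2.0, 0.25], [0.5, -0.6]]
direction = lbfgs_direction(y_pairs[-1], s_pairs, y_pairs)
print(direction)
```

Only the stored pairs are needed, so memory and communication scale with the number of pairs times the model dimension rather than with the square of the model dimension.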
Another main idea of communication-efficient second-order methods is to solve a second-order approximation problem at each device without maintaining or computing a global Hessian matrix. To reduce the communication overhead in each round, Shamir et al. [161] proposed a distributed approximate Newton-type method named “DANE”, which solves an approximate local Newton system at each device followed by a global aggregation step, and only requires the same communication bandwidth as first-order distributed learning algorithms. Subsequently, the algorithm “DiSCO” proposed in [208] solves a more accurate second-order approximation in each communication round by approximately solving the global Newton system with a distributed preconditioned conjugate gradient method. It reduces the number of communication rounds compared with “DANE”, while the computation cost at the master machine grows roughly cubically with the model dimension. Wang et al. [183] proposed an improved approximate Newton method, “GIANT”, which further reduces the number of communication rounds via conjugate gradient steps at each device and is shown to outperform “DANE” and “DiSCO”. Note that the communication in these approaches involves transmitting a global update to each device and aggregating a local update from each device in every round, both of the same size as the number of model parameters. However, the convergence results of “DANE”, “DiSCO”, and “GIANT” require a high-accuracy solution to the subproblem at each device. An adaptive distributed Newton method was proposed in [53], which additionally transmits a scalar parameter in each round to account for the information loss of the distributed second-order approximation, and outperforms “GIANT” in numerical experiments.
III-D Communication-Efficient Federated Optimization
In an edge training system, the local dataset at each device is usually only a small subset of the overall dataset. Furthermore, the rapid advancement of CPUs and GPUs on mobile devices makes computation essentially free in comparison with communication. Thus, a natural idea is to use additional local computation to decrease the communication cost. Federated optimization [92] is a framework that iteratively performs a local training algorithm (such as multiple steps of SGD, as illustrated in Fig. 3(d)) on the dataset at each device and aggregates the locally updated models, i.e., computes the average (or weighted average) of the locally updated model parameters. This framework provides additional privacy protection for data, and has the potential to reduce the number of communication rounds needed to aggregate updates from a large number of mobile devices. Data privacy and security are becoming a major worldwide concern, especially for emerging high-stakes applications in intelligent IoT, which has prompted governments to enact new regulations such as the General Data Protection Regulation (GDPR) [148]. A line of works studies federated optimization algorithms [81, 120, 166] to reduce the number of communication rounds. In addition, a number of model compression methods have been proposed to reduce the model size, either during local training or by compressing the model parameters after local training, which can further reduce the communication cost of aggregation in federated optimization [91]. These methods are reviewed in this part.
III-D1 Minimizing Communication Round
Jaggi et al. [81] proposed a framework named “CoCoA” that leverages the primal-dual structure of a convex loss function of general linear models with a convex regularization term. In each communication round, each mobile device performs multiple steps of a dual optimization method on its local dataset in exchange for fewer communication rounds, followed by averaging the updated local models. Motivated by [81], the authors of [166] further proposed a communication-efficient federated optimization algorithm called “MOCHA” for multi-task learning. By returning an additional accuracy-level parameter, it is also capable of dealing with straggling devices. However, these algorithms are not suitable for general machine learning problems in which strong duality fails or the dual problem is difficult to obtain.
The Federated Averaging (FedAvg) algorithm [120] is another communication-efficient federated optimization algorithm, which updates the local model at each device with a given number of SGD iterations followed by model averaging. It generalizes the traditional one-shot averaging algorithm [209], which is applicable only when the data samples at each device are drawn from the same distribution. In each round of communication, each device performs a given number of SGD steps with the global model as the initial point, and the aggregated global model is given by the weighted average of all local models. The weights are chosen as the sizes of the local training datasets, and this scheme is shown to be robust to non-independent and identically distributed (non-IID) data and to unbalanced data across mobile devices. Wang and Joshi [181] provided a convergence result showing that FedAvg converges to a stationary point. To reduce the costly communication with the remote cloud, edge-server-assisted hierarchical federated averaging was proposed in [110]. By exploiting the highly efficient local communication with edge servers, it achieves a significant training speedup compared with the cloud-based approach. With infrequent model aggregation at the cloud, it also achieves higher model performance than edge-based training, as data from more users can be accessed.
For the FedAvg algorithm, the number of local SGD steps at each device should be chosen carefully in the presence of statistical heterogeneity, i.e., when the local data across devices are non-IID. If too many SGD steps are performed locally, the learning performance will be degraded. To address this problem, the FedProx algorithm [155] adds a proximal term to the local objective function, which keeps the locally updated model close to the global model throughout local training, rather than relying only on initializing each local model with the updated global model at each communication round. Its convergence guarantees are established by characterizing the heterogeneity through a device dissimilarity assumption. Numerical results demonstrate that FedProx is more robust to the statistical heterogeneity across devices.
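The local-update-then-average pattern described above can be sketched as follows. This is a toy illustration under stated assumptions: a scalar linear model, noise-free synthetic data, and hypothetical function names (`local_update`, `aggregate`, `federated_training`). Setting `mu = 0` gives FedAvg-style rounds, and `mu > 0` adds a proximal term of the form (mu/2)(w - w_global)^2 to the local objective, as in the FedProx idea; the values of `steps`, `lr`, and `mu` are arbitrary.

```python
import random


def local_update(w_global, data, steps, lr, mu=0.0, rng=random):
    """Run `steps` SGD steps on y ~ w * x, starting from the global model.

    When mu > 0, the gradient includes mu * (w - w_global), which pulls the
    local model back toward the global one.
    """
    w = w_global
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2.0 * (w * x - y) * x + mu * (w - w_global)
        w -= lr * grad
    return w


def aggregate(local_models, num_samples):
    """Average local models, weighted by the local dataset sizes."""
    total = sum(num_samples)
    return sum(w * n for w, n in zip(local_models, num_samples)) / total


def federated_training(device_datasets, rounds=30, steps=5, lr=0.1, mu=0.0, seed=0):
    rng = random.Random(seed)
    w_global = 0.0
    sizes = [len(d) for d in device_datasets]
    for _ in range(rounds):
        # One upload per device per round, regardless of the local step count.
        local_models = [
            local_update(w_global, d, steps, lr, mu, rng) for d in device_datasets
        ]
        w_global = aggregate(local_models, sizes)
    return w_global


# Two devices with unbalanced data generated by y = 2 * x.
device_a = [(1.0, 2.0), (2.0, 4.0)]
device_b = [(1.0, 2.0), (0.5, 1.0), (1.5, 3.0)]
w_fedavg = federated_training([device_a, device_b], mu=0.0)
w_fedprox = federated_training([device_a, device_b], mu=0.5)
print(w_fedavg, w_fedprox)
```

Increasing `steps` trades more local computation for fewer communication rounds, and `mu` limits how far a device can drift from the global model when its local data distribution differs from the others.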
III-D2 Minimizing Communication Bandwidth
Transmitting the model parameters per communication round generally results in a huge communication overhead since the number of model parameters can be very large. Therefore, it is important to reduce the model size to alleviate the communication overhead [91]. To this end, model compression is one of the promising approaches. We survey the main techniques adopted in model compression in this subsection.


Quantization: Quantization compresses DNNs by representing the weights with fewer bits instead of the 32-bit floating-point format. The works [61, 188] apply k-means clustering to the weights of a pretrained DNN. At the training stage, it has been shown that DNNs can be trained using only a 16-bit wide fixed-point number representation with stochastic rounding [66], which induces little to no degradation in classification accuracy. In the extreme case, each weight is represented by 1 bit, but the naive approach of directly binarizing a pretrained DNN incurs a significant performance loss. Therefore, the main idea behind binarization is to learn the binary weights or activations during training, which is thoroughly investigated in [40, 41, 143]. This kind of method allows a substantial computational speedup on devices thanks to bitwise operations. It may also significantly reduce the communication cost in federated learning, as each weight is represented by 1 bit.
Sketching: Randomized sketching [139, 36] is a powerful tool for dimensionality reduction, which can be applied to model compression. In [27], HashedNet sketches the weights of neural networks using a hash function, and forces all weights that are mapped to the same hash bucket to share a single parameter value. However, it is only applicable to fully-connected neural networks. The subsequent work [26] extended it to CNNs by first converting the filter weights to the frequency domain and then grouping the corresponding frequency parameters into hash buckets using a low-cost hash function. Theoretically, it was shown in [107] that such hashing-based neural networks have nice properties, i.e., local strong convexity and smoothness around the global minimizer.
Pruning: Network pruning generally compresses DNNs by removing connections, filters, or channels according to some criterion. Early works include Optimal Brain Damage [96] and Optimal Brain Surgeon [75], which proposed to remove connections between neurons based on the Hessian of the loss function of a trained DNN. More recently, a line of research prunes redundant, less important connections in a pretrained DNN. For instance, the work in [73] proposed to prune the unimportant weights of a pretrained network and then retrain the network to fine-tune the weights of the remaining connections, which substantially reduces the number of parameters of AlexNet without harming the accuracy. Deep compression was proposed in [72] to compress DNNs in three stages, i.e., pruning, trained quantization, and Huffman coding, which yields considerably more compact DNNs. For example, the storage size of AlexNet is greatly reduced on the ImageNet dataset without loss of accuracy. From a Bayesian point of view, network pruning was also investigated in [176, 112]. However, such heuristic methods come with no convergence guarantees. Instead, Aghasi et al. [2] proposed to prune the network layer-by-layer via convex programming, and showed that the overall performance drop can be bounded by the sum of the reconstruction errors of the layers. Subsequently, iterative reweighted optimization was adopted to further prune the DNN with convergence guarantees [85].
Sparse regularization: There is growing interest in learning compact DNNs without pretraining, which is achieved by adding regularizers to the loss function during training in order to induce sparsity in DNNs. In [95], the authors proposed a norm-based regularizer that induces group-sparse structured convolution kernels when training CNNs, which leads to computational speedups. To remove trivial filters, channels, and even layers at the training stage, the work in [186] proposed to add structured sparsity regularization to each layer. Theoretically, the convergence behavior of gradient descent algorithms for learning shallow compact neural networks was characterized in [132], which also gives the sample complexity required for efficient learning.

Structural matrix designing: The main idea behind low-rank matrix factorization approaches for compressing DNNs is to apply low-rank factorization to the weight matrices of DNNs. A low-rank matrix $W \in \mathbb{R}^{m \times n}$ with rank $r$ can be represented as $W = UV$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{r \times n}$. This reduces the total number of parameters from $mn$ to $(m+n)r$, which lowers both the computational complexity and the storage. For example, the work in [156] showed that the number of parameters of DNNs can be reduced considerably for large vocabulary continuous speech recognition tasks via low-rank matrix factorization of the final weight layer. In [46], in order to accelerate convolution, each convolutional layer is approximated by a low-rank matrix, and different approximation metrics are studied to improve the performance. The work in [80] proposed to speed up the convolutional layers by constructing a low-rank basis of rank-one filters for a pretrained CNN.
Low-rank methods have also been exploited at the training stage. In [45], low-rank methods were used to reduce the number of network parameters that are learned during training. Low-rank methods have also been adopted to learn separable filters that accelerate convolution in [151, 7], which is achieved by adding regularization that favors low-rank filters.
Besides low-rank matrix factorization, another way to reduce the number of parameters of a weight matrix is to leverage structured matrices, which can be described with far fewer parameters than a dense matrix of the same size. Along this line, Sindhwani et al. [165] proposed to learn structured parameter matrices of DNNs, which also dramatically accelerates inference and training via fast matrix-vector products and gradient computation. The work in [32] proposed to impose a circulant structure on the weight matrices of fully-connected layers to accelerate computation at both the training and inference stages. In [195], the authors presented an adaptive Fastfood transform to reparameterize the matrix-vector multiplication of fully-connected layers, thereby reducing the storage and computational costs.
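The parameter saving of low-rank factorization can be checked with a small sketch. It assumes a weight matrix of size m-by-n factorized as a product of an m-by-r matrix `U` and an r-by-n matrix `V` (symbols chosen here for illustration), and uses plain Python lists with hypothetical helper names; it only demonstrates that the factorized layer computes the same product with (m + n) r stored values instead of m n.

```python
def matvec(matrix, vec):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def matmul(a, b):
    """Dense matrix-matrix product, used only to build the reference matrix."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def low_rank_matvec(u, v, x):
    """Compute (U V) x as U (V x), never forming the full m-by-n matrix."""
    return matvec(u, matvec(v, x))


def param_counts(m, n, r):
    """Number of stored parameters: dense layer vs. rank-r factorization."""
    return m * n, (m + n) * r


# A 3-by-4 weight matrix of rank 1, stored as a 3-by-1 and a 1-by-4 factor.
U = [[1.0], [2.0], [3.0]]
V = [[1.0, 0.0, -1.0, 2.0]]
x = [1.0, 2.0, 3.0, 4.0]

dense_out = matvec(matmul(U, V), x)
factored_out = low_rank_matvec(U, V, x)
print(dense_out, factored_out)
print(param_counts(1024, 1024, 32))
```

For a 1024-by-1024 layer with rank 32, the count drops from 1,048,576 to 65,536 parameters, and the same factor applies to the multiply-accumulate operations of the matrix-vector product.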
IV Communication-Efficient Edge AI Systems
Due to the limited computation, storage, and communication resources of edge nodes, as well as the privacy, security, low-latency, and reliability requirements of AI applications, a variety of edge AI system architectures have been proposed and investigated for efficient training and inference. This section gives a comprehensive survey of different edge AI systems and the topics therein. It starts with a general discussion of the different architectures, and then introduces them one by one.
IV-A Architectures of Edge AI Systems
We summarize the main system architectures of edge AI into four categories. According to the availability of data and model parameters, data partition based edge training systems and model partition based edge training systems are two common architectures for efficient training at the network edge. To achieve low-latency inference, computation offloading based edge inference systems offer a promising approach by offloading all or part of the inference task from resource-limited edge devices to proximate edge servers. There are also edge AI systems defined by general computing paradigms, which we term general edge computing systems.


Data partition based edge training systems: For data partition based edge training systems, the data is massively distributed over a number of edge devices, and each edge device has only a subset of the whole dataset. Then the edge AI model can be trained by pooling the computation capabilities of edge devices. During training, each edge device holds a replica of the complete AI model to compute a local update. This procedure often requires a centralized coordinating center, e.g., an edge server, for scheduling a number of edge devices, aggregating the local updates from edge devices, etc. There are also works considering decentralized systems where edge devices communicate with each other directly. Edge training systems with a central node are usually called distributed system modes, while systems without a central node are called decentralized system modes, as demonstrated in Fig. 4.
Fig. 4: Two different types of edge training systems. 
Model partition based edge training systems: In model partition based edge training systems, no node holds a replica of all the model parameters, i.e., the AI model is partitioned and distributed across multiple nodes. Model partition is needed when very deep machine learning models are applied. Some works proposed to balance the computation and communication overhead via model partition in order to accelerate the training process. Furthermore, model partition based edge training systems have garnered much attention for preserving data privacy during training when each edge node can only access partial data attributes for a common set of user identities, which is often referred to as vertical federated learning [194]. To preserve data privacy, it has been proposed to train a model through the synergy of the edge device and the edge server, by performing simple processing at the device and uploading the intermediate values to a powerful edge server. This is realized by deploying a small part of the model parameters on the device and the remaining part on the edge server, which avoids exposing users’ data.

Computation offloading based edge inference systems: To enable low-latency edge AI services, it is critical to deploy the trained model close to end users. Unfortunately, it is often infeasible to deploy large models, especially DNN models, directly on each device for local inference due to the limited storage, computation, and battery resources. Therefore, a promising solution is to push the AI model and the massive computations to proximate edge servers, which has prompted the recent proposal of computation offloading based edge inference systems [117]. We divide the works on computation offloading based edge inference systems into two classes, i.e., deploying the entire model on an edge server, and partitioning the model and deploying it across the edge device and the edge server.

General edge computing systems: Beyond the systems mentioned above, there are also edge AI systems defined by general computing paradigms, e.g., MapReduce [43]. MapReduce-like frameworks often consider distributed data input and distributed model deployment jointly to accelerate distributed training or inference. In such systems, reducing the communication overhead of data shuffling between multiple nodes becomes a critical task. Interestingly, coding techniques play a critical role in scalable data shuffling [105, 190] as well as in straggler mitigation [136].
In the remaining part of this section, we discuss the important topics and involved techniques that address the communication challenges for these system architectures.
IV-B Data Partition Based Edge Training Systems
In data partition based edge training systems, each device usually has a subset of the training data and a replica of the machine learning model. The training can be accomplished by performing local computation and periodically exchanging local updates from mobile devices. The main advantage of such a system is that it is applicable to most of the model architectures and scales well. The main drawback is that the model size and the operations that are needed to complete the local computation are limited by the storage size and computation capabilities of each device. In the following, we separately discuss distributed and decentralized system modes.
IV-B1 Distributed System Mode
In the distributed system mode, each edge device computes a local update from its local data samples, and the central node periodically aggregates the local updates from the edge devices. The communication bottleneck comes from the aggregation of local updates from mobile devices and from straggler devices. The efforts to address the communication challenges in distributed data partition based training systems are listed as follows:


Fast aggregation via over-the-air computation:
Over-the-air computation is an efficient approach to computing a function of distributed data by exploiting the signal superposition property of the wireless multiple access channel [127]. As shown in Fig. 5, communication and computation can be considered jointly to reduce the communication cost significantly. In particular, a function that is computable via over-the-air computation is called a nomographic function [60]. In distributed machine learning, the local updates (e.g., gradients or model parameters) are first computed at each worker and then aggregated over the wireless channel. For aggregation functions that fall into the class of nomographic functions, the communication efficiency can be improved by exploiting over-the-air computation. It should be noted that digital modulation schemes for over-the-air computation are advocated in [187, 20, 50, 215] because they are easier to implement on existing communication systems and impose less stringent synchronization requirements than analog schemes.
To improve the communication efficiency of federated learning, Yang et al. [192] proposed to adopt over-the-air computation for fast model aggregation instead of the traditional communication-and-computation separation method. This is motivated by the fact that the aggregation function is a linear combination of the updates from distributed mobile devices, which falls into the set of nomographic functions. With a transceiver design that exploits the signal superposition property of a wireless multiple access channel, over-the-air computation can improve the communication efficiency and reduce the required bandwidth. In addition, the joint device selection and beamforming design problem was considered in [192], for which sparse and low-rank optimization methods were proposed and shown to yield strong performance for over-the-air fast model aggregation.
The efficiency of over-the-air computation for fast aggregation in federated edge learning has also been demonstrated in [218], which characterized two tradeoffs between communication and learning performance. The first is the tradeoff between the update quality, measured by the receive SNR, and the truncation ratio of model parameters caused by the proposed truncated-channel-inversion policy for deep fading channels. The second is the tradeoff between the receive SNR and the fraction of exploited data, namely, the fraction of scheduled cell-interior devices if the data is distributed uniformly over devices. In [216], over-the-air computation based on MIMO, i.e., multi-antenna techniques, is further adopted in high-mobility multi-modal sensing for fast aggregation, where the receive beamforming is designed using a differential geometry approach.
Based on over-the-air computation, Amiri and Gunduz [5] proposed a gradient sparsification and random linear projection method to reduce the dimension of gradients under limited channel bandwidth. It was shown that such an approach results in much faster convergence of the learning process compared with approaches based on separate computation and communication. This work was further extended to wireless fading channels in [6].
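The aggregation idea above can be illustrated with a minimal sketch, assuming a real-valued toy channel model that is not taken from any of the cited works. Each device pre-scales its update by the inverse of its channel gain, the multiple access channel adds the transmitted signals, and the receiver normalizes the sum. The function name `over_the_air_average`, the threshold `h_min` (a stand-in for a truncated channel inversion rule), and the noise model are all hypothetical.

```python
import random


def over_the_air_average(updates, channels, noise_std=0.0, h_min=0.1, rng=None):
    """Simulate one analog over-the-air aggregation round (toy real-valued model).

    updates  : list of equal-length lists, one local update per device
    channels : list of real channel gains, one per device (assumed known)
    h_min    : hypothetical truncation threshold; weaker channels stay silent
    """
    rng = rng or random.Random(0)
    dim = len(updates[0])
    # Truncated channel inversion: devices in a deep fade do not transmit.
    active = [k for k, h in enumerate(channels) if abs(h) >= h_min]
    if not active:
        raise ValueError("no device has a usable channel")
    received = [0.0] * dim
    for k in active:
        h = channels[k]
        transmitted = [x / h for x in updates[k]]  # pre-equalization at the device
        for i in range(dim):
            received[i] += h * transmitted[i]      # superposition in the channel
    # Additive receiver noise, then normalization by the number of active devices.
    received = [y + rng.gauss(0.0, noise_std) for y in received]
    return [y / len(active) for y in received], active
```

With `noise_std=0.0` the output equals the exact average of the active devices' updates, obtained in a single channel use per vector entry regardless of the number of devices. Raising `h_min` drops more devices, which is a toy version of the tradeoff between receive SNR and the fraction of exploited data.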

Aggregation frequency control with limited bandwidth and computation resources: The learning process includes the local updates at different devices and the global aggregation at the fusion center. We can aggregate the local updates after every one or multiple local update steps, for example by adopting the federated averaging algorithm [120]. The aggregation frequency should be carefully designed by weighing the limited computation resources available locally at devices against the limited communication bandwidth for global aggregation. To this end, Wang et al. [182] provided a convergence bound for gradient-descent-based federated learning from a theoretical perspective. Based on this convergence result, the authors proposed a control algorithm that learns the data distribution, system dynamics, and model characteristics, and uses them to dynamically determine the frequency of global aggregation in real time so as to minimize the learning loss under a fixed resource budget. Zhou and Cong [213] established convergence results for the distributed stochastic gradient descent algorithm that is averaged after every $K$ steps for non-convex loss functions. The convergence rate in terms of the total run time instead of the number of iterations was investigated in [180], which also proposed an adaptive communication strategy that starts with a low aggregation frequency to save communication costs and then increases the aggregation frequency to achieve a low error floor.
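The effect of the aggregation interval can be seen in a minimal sketch, assuming scalar models and hypothetical quadratic local losses f_k(w) = 0.5 * a_k * (w - c_k)^2. It is an illustration of periodic averaging only, not the control algorithm of [182] or the adaptive strategy of [180]. The name `periodic_averaging` and all parameters are assumptions made for this example.

```python
def periodic_averaging(curvatures, centers, tau, total_local_steps, lr=0.1, w0=0.0):
    """Run local gradient descent with a global average every `tau` local steps.

    Device k minimizes the toy loss 0.5 * a_k * (w - c_k)**2, so its gradient
    is a_k * (w - c_k). Returns the final global model and the number of
    communication rounds that were used.
    """
    w_global = w0
    rounds = 0
    for _ in range(total_local_steps // tau):
        local_models = []
        for a, c in zip(curvatures, centers):
            w = w_global                      # start from the latest global model
            for _ in range(tau):
                w -= lr * a * (w - c)         # local gradient step
            local_models.append(w)
        w_global = sum(local_models) / len(local_models)  # global aggregation
        rounds += 1
    return w_global, rounds
```

For `curvatures=[1.0, 3.0]` and `centers=[0.0, 4.0]` the minimizer of the average loss is 3. With the same budget of 200 local steps, `tau=1` reaches it using 200 rounds, whereas `tau=10` uses only 20 rounds but settles at a biased point. This is the communication versus error-floor tension that motivates adapting the aggregation frequency.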

Data reshuffling via index coding and pliable index coding: Data reshuffling [145, 67] is a recognized approach to improve the statistical performance of machine learning algorithms. Randomly reshuffling the training data at each device makes the distributed learning algorithm go over the data in a different order, which brings statistical gains for non-IID data [10]. However, in edge AI systems, its communication cost is prohibitively expensive. A sequence of works has therefore focused on reducing the communication cost of data reshuffling.
To reduce the communication cost of data reshuffling, Lee et al. [98] proposed a coded shuffling approach based on index coding. This approach assumes that the data placement rules are pre-specified. The statistical learning performance can be improved even when only a small number of new data points are updated at each worker, which motivates the pliable index coding based semi-random data reshuffling approach [167] for designing more efficient coding schemes. In this approach, the new data for each device need not be specified in advance, and each data point is required at no more than $c$ devices (which is called the $c$-constraint). The pliable data reshuffling problem was also considered in wireless networks [84]. It was further observed that it is not necessary to deliver a new data point to every mobile device in each round, and the authors proposed to maximize the number of devices that are refreshed with a new data point. This approach turns out to reduce the communication cost considerably with a slight sacrifice of the learning performance.
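The gain from index coding in reshuffling can be shown with the smallest possible case, sketched below under simplifying assumptions that are not from [98]: two devices, two equal-length data chunks, and an error-free broadcast link. Device 1 already stores chunk A and needs chunk B for the next epoch, while device 2 stores B and needs A. The helper names are hypothetical.

```python
def xor_bytes(x, y):
    """Bitwise XOR of two equal-length byte strings."""
    if len(x) != len(y):
        raise ValueError("chunks must have equal length")
    return bytes(p ^ q for p, q in zip(x, y))


def coded_broadcast(chunk_a, chunk_b):
    """Server side: one coded packet serves both devices."""
    return xor_bytes(chunk_a, chunk_b)


def decode_with_side_information(coded_packet, cached_chunk):
    """Device side: cancel the locally cached chunk to obtain the missing one."""
    return xor_bytes(coded_packet, cached_chunk)
```

A single broadcast of `A XOR B` replaces two uncoded transmissions, because each device uses its cached chunk as side information. Larger systems need a systematic code construction, which is where the placement rules and the pliable formulation discussed above come in.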

Straggler mitigation via coded computing: In practice, some devices may be stragglers during the computation of the gradients, i.e., these devices take more time to finish the computation task. By carefully replicating data sets on devices, Tandon et al. [172] proposed to encode the computed gradients to mitigate stragglers, where the amount of redundant data depends on the number of stragglers in the system. In [196], straggler tolerance and communication cost were considered jointly. Compared with [172], the total runtime of the distributed gradient computation is further reduced by distributing the computations over subsets of gradient vector components in addition to subsets of data sets. Raviv et al. [144] adopted tools from classic coding theory, i.e., cyclic MDS codes, to achieve favorable performance of gradient coding in terms of the applicable range of parameters and the complexity of the coding algorithms. Using Reed-Solomon codes, Halbawi et al. [71] made the learning system more robust to stragglers compared with [172]. The performance with respect to the communication load and computation load required for mitigating the effect of stragglers was further improved in [103]. Most straggler mitigation approaches assume that the straggler devices make no contribution to the learning task. In contrast, it was proposed in [133] to exploit non-persistent stragglers, since they are able to complete a certain portion of their assigned tasks in practice. This is achieved by transmitting multiple local updates from devices to the fusion center per communication round instead of only one local update per round.
In addition, approximate gradient coding was proposed in [144], where the fusion center only requires an approximate computation of the full gradients instead of an exact one, which reduces the computation at the devices significantly while preserving the system tolerance to stragglers. However, this approximate gradient approach typically results in a slower convergence rate of the learning algorithm compared with the exact gradient approach [14]. When the loss function is the squared loss, it was proposed in [115] to encode the second moment of the data matrix with a low-density parity-check (LDPC) code to mitigate the effect of the stragglers. The authors also indicated that the moment encoding based gradient descent algorithm can be viewed as a stochastic gradient descent method, which provides opportunities to obtain convergence guarantees for the proposed approach. Considering general loss functions, it was proposed in [114] to distribute the data to the devices using low-density generator matrix (LDGM) codes. Bitar et al. [14] proposed an approximate gradient coding scheme that distributes data points redundantly to devices based on a pairwise balanced design and simply ignores the stragglers. Convergence guarantees were established, and the convergence rate can be improved with the redundancy of data [14].
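Exact gradient coding can be made concrete with a small, commonly used three-worker construction that tolerates one straggler. The sketch below is a toy instance under stated assumptions (three data partitions, each worker storing two of them, fixed coefficient tables) and is not a general implementation of the schemes in [172], [144], or [71]. The names `ENCODING`, `DECODING`, `worker_message`, and `recover_full_gradient` are hypothetical.

```python
# Row w lists the weights worker w applies to the partial gradients (g1, g2, g3).
# A zero weight means the worker does not store that data partition.
ENCODING = [
    (0.5, 1.0, 0.0),   # worker 0 sends g1/2 + g2
    (0.0, 1.0, -1.0),  # worker 1 sends g2 - g3
    (0.5, 0.0, 1.0),   # worker 2 sends g1/2 + g3
]

# For each pair of responding workers, the weights that yield g1 + g2 + g3.
DECODING = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}


def worker_message(worker, partial_gradients):
    """Linear combination that one worker sends to the fusion center."""
    coeffs = ENCODING[worker]
    dim = len(partial_gradients[0])
    return [sum(c * g[i] for c, g in zip(coeffs, partial_gradients)) for i in range(dim)]


def recover_full_gradient(messages):
    """Recover g1 + g2 + g3 from any two worker messages (dict: worker -> vector)."""
    if len(messages) < 2:
        raise ValueError("need messages from at least two workers")
    first, second = sorted(messages)[:2]
    a, b = DECODING[(first, second)]
    return [a * u + b * v for u, v in zip(messages[first], messages[second])]
```

Whichever single worker is slow, the fusion center obtains the exact full gradient from the other two, at the cost of each worker processing two partitions instead of one. Approximate schemes relax the exact recovery requirement in exchange for less redundancy.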
IV-B2 Decentralized System Mode
In the decentralized mode, a machine learning model is trained by a number of edge devices that exchange information directly without a central node. A well-known decentralized information exchange paradigm is the gossip communication protocol [17], which randomly selects a node to act as a central node that collects updates from neighbor nodes or broadcasts its local update to neighbor nodes. By integrating gossip communication protocols into the learning algorithms, Elastic Gossip [141] and Gossiping SGD [42, 16, 15] were proposed.
One typical network topology for decentralized machine learning is the fully connected network, where each device communicates directly with all other devices. In this scenario, each device maintains a local copy of the model parameters and computes its local gradients, which are sent to all other devices. Each device can average the gradients received from all other devices and then perform local updates. In each iteration, the model parameters will be identical at all devices if each device starts from the same initial point. This process is essentially the same as classical gradient descent at a centralized server, so convergence can be guaranteed as in the centralized setting. However, such a fully connected network suffers from a heavy communication overhead that grows quadratically in the number of devices, whereas the communication overhead is linear in the number of devices for centralized settings. Therefore, network topology design plays a key role in alleviating the communication bottleneck in decentralized scenarios. In addition, the convergence rate of the decentralized algorithm also depends on the network topology [128]. We should note that the decentralized edge AI system suffers from the same issues as the system in distributed mode since each device acts like a fusion center.
Several works have demonstrated that carefully designed network topologies can achieve better performance than the fully connected network. It has been empirically observed in [1] that using an alternative network topology between devices can lead to improved learning performance in several deep reinforcement learning tasks compared with the standard fully-connected communication topology. Specifically, it was observed in [1] that the Erdős-Rényi graph topology with 1000 devices can compete with the standard fully-connected topology with 3000 devices, which shows that learning can be more efficient if the topology is carefully designed. Considering that different devices may require different times to carry out local computation, Neglia et al. [129] analyzed the influence of different network topologies on the total runtime of distributed subgradient methods; this analysis can be used to determine the degree of the topology graph that leads to faster convergence. They also showed that a sparser network can sometimes result in a significant reduction of the convergence time.
One common alternative to the fully connected network topology is to employ a ring topology [137], where each device only communicates with its neighbors, which are arranged in a logical ring. More concretely, each device aggregates and passes its local gradients along the ring such that all devices have a copy of the full gradients at the end. This approach has been adopted in distributed deep learning for model updating [86, 159]. However, algorithms deployed on the ring topology are inherently sensitive to stragglers [150]. To alleviate the effects of stragglers in the ring topology, Reisizadeh et al. [150] proposed to use a logical tree topology for communication, based on which they mitigated stragglers by gradient coding techniques. In the tree topology, there are several layers of devices, where each device communicates only with its parent node. By concurrently transmitting messages from a large number of children nodes to multiple parent nodes, communication with the tree topology can be more efficient than that with the ring topology.
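The ring aggregation described above is often realized as a ring all-reduce with a reduce-scatter stage followed by an all-gather stage. The sketch below is a simplified single-process simulation, assuming one gradient chunk (a single number) per device; it is one common realization and not necessarily the exact procedure used in [137], [86], or [159]. The function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce for n devices holding length-n gradient vectors.

    Device r only ever sends to device (r + 1) % n. Returns the per-device
    results and the total number of chunk transmissions.
    """
    n = len(vectors)
    if any(len(v) != n for v in vectors):
        raise ValueError("this toy version expects one chunk per device")
    data = [list(v) for v in vectors]
    sent = 0
    # Reduce-scatter: after n - 1 steps, device r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        messages = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in messages:      # all sends in a step happen in parallel
            data[(r + 1) % n][chunk] += value
        sent += n
    # All-gather: circulate the completed chunks so every device has all sums.
    for step in range(n - 1):
        messages = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in messages:
            data[(r + 1) % n][chunk] = value
        sent += n
    return data, sent
```

Each device sends 2(n - 1) chunks in total, so the per-device traffic stays roughly constant as the ring grows. Every step waits for all n transfers to finish, which is one way to see why a single slow device delays the whole ring.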
IV-C Model Partition Based Edge Training Systems
While data partition based edge training systems have received much attention in both academia and industry, there is also an important line of work that designs edge AI systems by partitioning a single machine learning model and deploying it in a distributed manner across mobile devices and edge servers. In such systems, each node holds part of the model parameters, and the nodes accomplish the model training task or the inference task collaboratively. One main advantage of model partition in the training process is the small storage size needed at each node. In this system, the machine learning model is deployed across multiple computing nodes, with each node evaluating updates of only a portion of the model's parameters. Such a method is particularly useful in scenarios where the machine learning model is too large to be stored in a single node [119, 189]. Another main concern of model partition during training is data privacy when the data at each node belongs to different parties. However, model training with model partition based architectures also incurs heavy communication overhead between edge devices.

Model partition across a large number of nodes to balance computation and communication: A line of works [121, 126, 79] has considered model partition across edge nodes with heterogeneous hardware and computing power. In [121], a reinforcement learning approach was proposed for deploying the computing graph onto edge computing devices, which, however, is time- and resource-intensive. To avoid the huge computation cost of the reinforcement learning based approach, Harlap et al. [126] proposed the PipeDream system for automatically determining the model partition strategy of DNNs. Furthermore, injecting multiple mini-batches makes the system converge faster than using a single machine or using the data partition approach. While PipeDream stresses the hardware utilization of edge devices, each device has to maintain multiple versions of model parameters to avoid optimization issues caused by the staleness of parameters with asynchronous backward updates. This hinders PipeDream from scaling to much bigger models. To address this problem, the GPipe system was proposed in [79] with novel batch-splitting and re-materialization techniques, which is able to scale to large models with little additional communication overhead.

Model partition across the edge device and edge server to avoid the exposure of users' data: In practice, powerful edge servers are often owned by service providers, but users may be reluctant to expose their data to service providers for model training. The observation that a DNN model can be split between two successive layers motivates researchers to deploy the first few layers locally on the device and the remaining layers on the edge server to avoid the exposure of users' data. Mao et al. [118] proposed a privacy-preserving deep learning architecture where the shallow part of a DNN is deployed on the mobile device and the large part is deployed on the edge server. Gupta and Raskar [65] designed a model partition approach over multiple agents, i.e., multiple data sources and one supercomputing resource, and further extended it to semi-supervised learning cases with few labeled samples. A particular DNN model for face recognition is trained and evaluated on a Huawei Nexus 6P phone with satisfactory performance. In [179], a partition approach named ARDEN was proposed by taking both privacy and performance into consideration. The model parameters at the mobile device are fixed, and a differential privacy mechanism is introduced to guarantee the privacy of the output at the mobile device. Before uploading the local output, deliberate noise is added to improve the robustness of the DNN, which is shown to be beneficial for the inference performance.
Vertical architecture for privacy with vertically partitioned data and model: In most industries, data is often vertically partitioned, i.e., each owner only holds partial data attributes. Data isolation becomes a severe bottleneck for collaboratively building a model due to competition, privacy, and administrative procedures. Therefore, much attention is being paid to privacy-preserving machine learning with vertically partitioned data [194]. During training, the model is also vertically partitioned and each owner holds a part of the model parameters. The vertical architecture of AI has thus been proposed and studied for privacy-preserving machine learning, where each node has access to different features of common data instances and maintains the corresponding subset of model parameters. To make matters worse, the label of each data instance is only available to nodes belonging to one party.
Vaidya and Clifton [177] proposed a privacy-preserving $k$-means algorithm in the vertical architecture with secure multi-party computation. Kantarcioglu and Clifton [88] studied the secure association rule mining problem with vertically partitioned data. A linear regression model was taken into consideration in [55], and multi-party computation protocols were proposed with a semi-trusted third party to achieve secure and scalable training. For privacy-preserving classification with support vector machines (SVM), Yu et al. [198] considered the dual problem of SVM and adopted a random perturbation strategy, which is suitable only when the nodes belong to more than three parties. A privacy-preserving classification approach based on decision trees was proposed in [178], which adopts secure multi-party computation procedures, including commutative encryption to determine if there are any remaining attributes and secure cardinality computation of set intersection. For classification with logistic regression, the problem becomes even more difficult because both the objective function and its gradient are coupled across parties. To address this problem, Hardy et al. [74] proposed to use a Taylor approximation to benefit from the homomorphic encryption protocol without revealing the data at each node.
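The role of the Taylor approximation can be sketched as follows, assuming labels $y \in \{-1, +1\}$, a feature vector $x$, and model parameters $\theta$; this notation is introduced here for illustration and is not taken from [74]. Expanding the logistic loss to second order around $z = 0$, with $z = y\,\theta^{\top}x$, replaces the exponential by a polynomial in the data:

```latex
\begin{align}
  \ell(\theta) &= \log\!\left(1 + e^{-y\,\theta^{\top}x}\right)
  \;\approx\; \log 2 \;-\; \tfrac{1}{2}\,y\,\theta^{\top}x \;+\; \tfrac{1}{8}\,(\theta^{\top}x)^{2}, \\
  \nabla_{\theta}\,\ell(\theta) &\approx \left(\tfrac{1}{4}\,\theta^{\top}x \;-\; \tfrac{1}{2}\,y\right)x ,
\end{align}
```

where $y^{2} = 1$ has been used. The approximate gradient involves only sums and products of quantities held by the different parties, which is the kind of arithmetic that additively homomorphic encryption can support, whereas the exact logistic gradient contains a sigmoid that cannot be evaluated directly on encrypted values.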
IV-D Computation Offloading Based Edge Inference Systems
The advancement of edge computing makes it increasingly attractive to push the AI inference task to the network edge to enable low-latency AI services for mobile users [117]. However, the power consumption and storage requirements of DNN models are often unbearable for mobile devices such as wearable devices. Fortunately, offloading the task from edge devices to powerful edge servers emerges as a remedy [117, 193, 78, 175]. One solution is to offload the entire inference task to an edge server, which is termed server-based edge inference, as shown in Fig. 6(a). It is particularly suitable for resource-limited IoT devices. In this case, the entire AI models are deployed on edge servers, and edge devices upload their input data to edge servers for inference. To address latency and privacy concerns, an alternative is to offload only a part of the task to the edge server, and the edge server computes the inference result based on the intermediate value computed by the edge device. We refer to it as device-edge joint inference, as shown in Fig. 6(b). This edge device and edge server synergy can be achieved by performing simple processing at the device and the remaining part at the edge server.
IV-D1 Server-Based Edge Inference
In the scenario where the models are deployed on the edge servers, the devices send the input data to the edge, the edge servers compute the inference results according to the trained models, and the inference results are then sent back to the devices. The main bottleneck is the limited communication bandwidth for data transmission. To reduce the real-time data transmission overhead of uplink transmission in bandwidth-limited edge AI systems, an effective way is to reduce the volume of data transmitted from devices without hurting the inference accuracy. In addition, cooperative downlink transmission of multiple edge servers has been proposed to enhance the communication efficiency for edge inference.

Partial data transmission: To realize cloud-based visual localization for mobile robots in real time, it is important to control the volume of data sent through the network. Therefore, Ding et al. [48] used a data volume reduction method proposed in [138] for multi-robot communication, which employs sparsification methods to compress the data. In a cloud-based collaborative 3D mapping system, Mohanarajah et al. [122] proposed to reduce bandwidth requirements by sending only the keyframes as opposed to all the frames produced by the sensor, and Chen et al. [25] proposed to determine and offload keyframes for object detection by utilizing heuristics such as frame differences to select the keyframes. These approaches are useful in reducing the communication cost when we are able to exploit the structure of the specific tasks and the associated data.

Raw data encoding: Data encoding has been widely used to compress the data volume. For example, traditional image compression approaches (e.g., JPEG) can compress data aggressively, but they are often optimized for human visual perception, which results in unacceptable performance degradation in DNN applications if a high compression ratio is used. Based on this observation, Liu et al. [111] proposed to optimize the data encoding schemes from the perspective of DNNs based on frequency component analysis and a rectified quantization table, which is able to achieve a higher compression ratio than the traditional JPEG method without degrading the accuracy of image recognition. Instead of using standard video encoding techniques, it was argued in [33] that data collection and transmission schemes should be designed jointly in vision tasks to maximize an end-to-end goal with a pre-trained model. Specifically, the authors proposed to use DNNs to encode the high-dimensional raw data into a sparse, latent representation for efficient transmission, which can be recovered later at the cloud via a decoding DNN. In addition, this coding process is controlled by a reinforcement learning algorithm, which sends action information to devices for encoding in order to maximize the prediction accuracy of the pre-trained model with decoded inputs, while achieving communication-efficient data transmission. This novel data encoding idea is a promising solution for realizing real-time inference in edge AI systems.

Cooperative downlink transmission: Cooperative transmission [57] is known as an effective approach for improving the communication efficiency via proactive interference-aware coordination of multiple base stations. It was proposed in [193] to offload each inference task to multiple edge servers and cooperatively transmit the output results to mobile users in the downlink. A new technology named intelligent reflecting surface (IRS) [142] has emerged as a cost-effective approach to enhance the spectrum efficiency and energy efficiency of wireless communication networks, which is promising for facilitating communication-efficient edge inference [203]. This is achieved by reconfiguring the wireless propagation environment via a planar array that changes the amplitude and/or phase of the signals. To further improve the performance of the cooperative edge inference scheme in [193], Hua et al. [78] proposed an IRS-aided edge inference system and designed a task selection strategy to minimize both the uplink and downlink transmit power consumption, as well as the computation power consumption at edge servers.
IV-D2 Device-Edge Joint Inference
For many types of on-device data, such as healthcare information and users' behaviors, privacy is a primary concern. This motivates the idea of edge device and edge server synergy, termed device-edge joint inference, in which the partitioned DNN model is deployed over the mobile device and the powerful edge server. By deploying the first few layers locally, a mobile device can compute the local output with simple processing, and transmit the local output to a more powerful edge server without revealing any sensitive information.
An early work [76] considered the partition of the image classification pipeline and found that executing feature extraction on devices and offloading the rest to the edge servers achieves the optimal runtime. Recently, Neurosurgeon has been proposed in [87], where a DNN model is automatically split between a device and an edge server according to the network latency for transmission and the mobile device energy consumption at different partition points, in order to minimize the total inference time. Different methods have been developed [116, 212] to partition a pre-trained DNN over several mobile devices in order to accelerate DNN inference on devices. Bhardwaj et al. [13] further considered memory and communication costs in this distributed inference architecture, for which model compression and a network-science-based knowledge partitioning algorithm were proposed. For robotic systems where the model is partitioned between the edge server and the robot, the robot should take both local computation accuracy and offloading latency into account; this offloading problem was formulated in [34] as a sequential decision making problem that is solved by a deep reinforcement learning algorithm. In the following, we review the main methods for further reducing the communication overhead for the model partition based edge inference.
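Before turning to those methods, the latency-driven choice of a partition point described above can be sketched with a simple exhaustive search. This is a minimal illustration under stated assumptions and is not the Neurosurgeon algorithm of [87]: per-layer latencies are assumed to be known in advance, energy is ignored, and returning the result to the device is treated as free. The function name `best_partition_point` and all parameter names and units are hypothetical.

```python
def best_partition_point(device_ms, server_ms, sizes_kb, bandwidth_kb_per_ms):
    """Pick the split index k that minimizes end-to-end inference latency.

    Layers [0, k) run on the device and layers [k, n) run on the edge server.
    device_ms[i], server_ms[i] : latency of layer i on the device / the server
    sizes_kb[k]                : data to upload when splitting at k, i.e. the raw
                                 input for k = 0 and the output of layer k - 1
                                 otherwise (so len(sizes_kb) == n + 1)
    """
    n = len(device_ms)
    if len(server_ms) != n or len(sizes_kb) != n + 1:
        raise ValueError("inconsistent profile lengths")
    best_k, best_latency = None, float("inf")
    for k in range(n + 1):
        upload = sizes_kb[k] / bandwidth_kb_per_ms if k < n else 0.0  # k == n: all local
        latency = sum(device_ms[:k]) + upload + sum(server_ms[k:])
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k, best_latency
```

When an early layer shrinks the data considerably, the search tends to cut right after it; as the available bandwidth grows, the preferred cut moves toward k = 0, which corresponds to full offloading.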

Early exit: Early exit can be used to reduce communication workloads when partitioning DNNs. It was proposed in [174] based on the observation that the features learned at the early layers of the network are often sufficient to produce accurate inference results. Therefore, the inference process can exit early if the data samples can be inferred with high confidence. This technique has been adopted in [175] for distributed DNN inference over the cloud, the edge and devices. With early exit, each device first performs the first few layers of a DNN, and offloads the rest of the computation to the edge or the cloud if the outputs of the device do not meet the accuracy requirements. This approach is able to reduce the communication cost by a factor of over compared with the traditional approach that offloads all raw data to the cloud for inference. More recently, Li et al. [100] proposed an on-demand low-latency inference framework that jointly designs the model partition strategy according to the heterogeneous computation capabilities of a mobile device and edge servers, and the early exit strategy according to the complicated network environment.
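The exit decision can be sketched as a confidence test on the output of the on-device exit branch. The sketch below assumes that confidence is measured by the normalized entropy of the softmax output and that a fixed threshold is used; both choices, as well as the names `infer_with_early_exit`, `offload_fn`, and `threshold`, are assumptions made for illustration and are not specified by the text above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Entropy scaled to [0, 1]; values near 0 mean a confident prediction."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def infer_with_early_exit(local_logits, offload_fn, threshold=0.3):
    """Return (label, where): exit on the device if confident, otherwise offload.

    offload_fn is a hypothetical callable that sends the intermediate features
    to the edge or cloud and returns the label computed there.
    """
    probs = softmax(local_logits)
    if normalized_entropy(probs) <= threshold:
        label = max(range(len(probs)), key=probs.__getitem__)
        return label, "device"
    return offload_fn(), "server"
```

Only the samples that fail the confidence test incur any uplink traffic, so the communication saving depends on how many inputs the early branch can settle on its own.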

Encoded transmission and pruning for compressing the transmitted data: In a hierarchical distributed architecture, the main communication bottleneck is the transmission of intermediate values at the partition point, since the intermediate data can be much larger than the raw data. To reduce the communication overhead of intermediate value transmissions, it was proposed in [90] to partition a network at an intermediate layer, whose features are encoded before wireless transmission to reduce their data size. It was shown that partitioning a CNN at the end of the last convolutional layer, where the data communication requirement is lower, coupled with feature space encoding, enables a significant reduction in communication workloads. Recently, a deep learning based end-to-end architecture named BottleNet++ was proposed in [163]. By jointly considering model partition, feature compression and transmission, BottleNet++ achieves up to 64x bandwidth reduction over the additive white Gaussian noise channel and up to 256x bit compression ratio in the binary erasure channel, with less than reduction in accuracy, compared with merely transmitting intermediate data without feature compression.
Network pruning, as discussed in Section III-D2, has been exploited to reduce the communication overhead of intermediate feature transmissions. For example, a two-step pruning approach was proposed in [164] to reduce the transmission workload at the network partition point by limiting the pruning region. Specifically, the first step aims to reduce the total computation workload of the network, while the second step aims to compress the intermediate data for transmission.

Coded computing for cooperative edge inference: Coding theory can be leveraged to address the communication challenges of distributed inference in edge AI systems. For example, Zhang and Simeone [204] considered distributed linear inference in mobile edge AI systems, where the model is partitioned among several edge devices that compute the inference results cooperatively for each device. It was shown in [204] that coding is efficient in reducing the overall computation-plus-communication latency.
IV-E General Edge Computing System
Beyond the edge AI system architectures mentioned above, there are also edge AI systems based on a general computing paradigm, namely, MapReduce. MapReduce [43] is a general distributed computing framework that is able to achieve parallel speedups on a variety of machine learning problems during training and inference procedures [37]. The MapReduce-like distributed computing framework jointly takes the distributed data input and distributed model deployment into account. In [11], a convolutional neural network was implemented based on the MapReduce framework to accelerate its training process. Ghoting et al. [58] proposed SystemML based on the MapReduce framework to support distributed training for a broad class of supervised and unsupervised machine learning algorithms. The work [190] proposed a communication-efficient wireless data shuffling strategy for supporting MapReduce-based distributed inference tasks.
In the MapReduce-like distributed computing framework shown in Fig. 7, there are generally three phases (i.e., a map phase, a shuffle phase, and a reduce phase) to complete a computational task. In the map phase, every computing node simultaneously computes a map function of its assigned data, generating a number of intermediate values. In the shuffle phase, nodes communicate with each other to obtain the intermediate values needed for computing the output function. Subsequently, in the reduce phase, each node computes its assigned output function according to the available intermediate values. However, there are two main bottlenecks in such a distributed computing framework. One is the heavy communication load in the shuffle phase, and the other is the straggler delay caused by the variability of computation times at different nodes. To address these problems, coding has been proposed as a promising approach that exploits the abundant computing resources at the network edge [104]. In recent years, coding techniques have become an active area of research for reducing the communication cost of data shuffling, as well as for reducing the computing latency by mitigating straggler nodes, as reviewed below.
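The three phases can be illustrated with a single-process word-count sketch. It is only meant to show where the intermediate values are produced, exchanged, and consumed; the function names, the choice of word counting as the task, and the CRC-based assignment of keys to reducers are assumptions made for this example.

```python
import zlib
from collections import defaultdict


def map_phase(splits):
    """Each node maps its own data split to (key, value) intermediate pairs."""
    return [[(word, 1) for word in text.split()] for text in splits]


def shuffle_phase(mapped, num_reducers):
    """Route every intermediate value to the reducer responsible for its key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for node_output in mapped:
        for key, value in node_output:
            reducer = zlib.crc32(key.encode("utf-8")) % num_reducers  # deterministic
            buckets[reducer][key].append(value)
    return buckets


def reduce_phase(buckets):
    """Each reducer computes its output function from the values it received."""
    return [{key: sum(values) for key, values in bucket.items()} for bucket in buckets]


def word_count(splits, num_reducers=2):
    """Run map, shuffle, and reduce, then merge the disjoint reducer outputs."""
    merged = {}
    for output in reduce_phase(shuffle_phase(map_phase(splits), num_reducers)):
        merged.update(output)
    return merged
```

In a real deployment the shuffle step is the only one that crosses the network, and every intermediate pair produced on one node but reduced on another must be transmitted. That traffic is what the coded shuffling schemes aim to reduce.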

Coding techniques for efficient data shuffling: Coding techniques for shuffling data in the MapReduce-like distributed computing framework were first proposed in [105], which considered a wireline scenario where each computing node can obtain the intermediate values from other nodes through a shared link. In [102], the authors extended the work in [105] to a wireless setting, where the computing nodes are able to communicate with each other via an access point. A scalable data shuffling scheme was proposed by utilizing a particular repetitive pattern of placing intermediate values among devices, reducing the communication bandwidth by factors that grow linearly with the number of devices. To improve the wireless communication efficiency (i.e., achieved data rates) in the data shuffle phase, a low-rank optimization model was proposed in [190] by establishing the interference alignment condition. The low-rank model is then solved by an efficient difference-of-convex-functions (DC) algorithm. Both [102] and [190] considered the communication load minimization problem under the wireless communication setting with a central node.
There are also some works considering the problem of reducing the communication load in data shuffling under the wireless communication scenario without a coordinating center. That is, the computing nodes can communicate with each other through a shared wireless interference channel. For example, assuming perfect channel state information, a beamforming strategy was proposed in [101] based on side information cancellation and zero-forcing to trade the abundant computing nodes for a reduced communication load, which outperforms the coded TDMA broadcast scheme based on [105]. This work was further extended in [68] to consider imperfect channel state information. The paper [82] proposed a dataset cache strategy and a coded transmission strategy for the corresponding computing results. The goal is to minimize the communication load characterized by latency (in seconds) instead of channel uses (in bits), which is more practical in wireless networks. In [136], the authors noted that to trade abundant computation for a lower communication load, the computational tasks must be divided into an extremely large number of subtasks, which is impractical. Therefore, they proposed to ameliorate this limitation by node cooperation and designed an efficient scheme for task assignment. Prakash et al. [140] investigated coded computing for distributed graph processing systems, which improves performance significantly compared with the general MapReduce framework by leveraging the structure of graphs.

Coding techniques for straggler mitigation: Another line of work focuses on addressing the straggler problem in distributed computing by coding techniques. Mitigating the effect of stragglers using coding theory was first proposed in [136] for a wired network. The main idea is to leverage redundant computing nodes to perform computational subtasks, so that the computation result can be correctly recovered as long as the local computation results from any desired subset of computing nodes are collected. This work was extended to wireless networks [149], where only one local computing node can send its computation results to the fusion center at a time. The paper [211] proposed a subtask assignment method to minimize the total latency, which is composed of the latency caused by wireless communication between the computing nodes and the fusion center and the latency caused by the variability of computation time across devices. Most of the above works focused on linear computations (e.g., matrix multiplication). However, to realize distributed inference with state-of-the-art machine learning algorithms (e.g., DNNs), nonlinear computation should be taken into consideration. As a result, the work [93] proposed a learning-based approach to design codes that can handle the straggler issue in distributed nonlinear computation problems.
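The recovery property described above can be illustrated with a minimal Python sketch of coded matrix-vector multiplication. It assumes a simple (3, 2) code with a single parity block, so any two of three workers suffice. The function names `matvec`, `encode` and `decode` are hypothetical, and the sketch is not the exact construction of any cited work.

```python
def matvec(rows, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def encode(A):
    """Split A into two row blocks and add one parity block (assumed (3, 2) code)."""
    if len(A) % 2 != 0:
        raise ValueError("this sketch assumes an even number of rows")
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return {0: A1, 1: A2, 2: parity}  # worker id -> assigned block


def decode(results):
    """Recover A @ x from the results of any two of the three workers."""
    if len(results) < 2:
        raise ValueError("need results from at least two workers")
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:
        # Worker 1 straggled: (A1 + A2) x - A1 x = A2 x
        y1 = results[0]
        y2 = [p - a for p, a in zip(results[2], y1)]
    else:
        # Worker 0 straggled: (A1 + A2) x - A2 x = A1 x
        y2 = results[1]
        y1 = [p - b for p, b in zip(results[2], y2)]
    return y1 + y2


# Usage: drop each worker in turn and decode from the remaining two.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
tasks = encode(A)
recovered = [
    decode({w: matvec(block, x) for w, block in tasks.items() if w != straggler})
    for straggler in tasks
]
```

The redundancy costs one extra block of computation, and in exchange the fusion center never has to wait for the slowest worker. Codes with more parity blocks tolerate more stragglers at a higher computation overhead.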
V Conclusions and Future Directions
This paper presented a comprehensive survey of the communication challenges and solutions in edge AI systems, which shall support a plethora of AI-enabled applications at the network edge. Specifically, we first summarized communication-efficient algorithms for distributed training of AI models on edge nodes, including zeroth-order, first-order, second-order, and federated optimization algorithms. We then categorized different system architectures of edge AI systems, including data-partition-based and model-partition-based edge training systems. Next, we revisited works bridging the gap between computation offloading and edge inference. Beyond these system architectures, we also introduced general edge computing defined AI systems. The communication issues and solutions in such architectures were extensively discussed.
The activities and applications of edge AI are growing rapidly, and a number of challenges and future directions are listed below.

Edge AI hardware design: The hardware of edge nodes determines the physical limits of AI systems, and a growing number of efforts are devoted to edge AI hardware design. For example, Google's Edge tensor processing unit (TPU) is designed for high-speed inference at the edge, and NVIDIA has rolled out the Jetson TX2 for power-efficient embedded AI computing. Nevertheless, this hardware mainly focuses on performing the entire task locally, especially edge inference. In the future, a variety of edge AI hardware will be customized for different AI system architectures and applications.

Edge AI software platforms: The past decade has witnessed the flourishing of AI software platforms from top companies for supporting cloud-based AI services. Cloud-based AI service providers are trying to include edge nodes in their platforms, though edge nodes currently serve only as simple extensions of cloud computing nodes. Google Cloud IoT, Microsoft Azure IoT, NVIDIA EGX, and Amazon Web Services (AWS) IoT are able to connect IoT devices to cloud platforms, thereby managing edge devices and processing the data from all kinds of IoT devices.

Edge AI as a service: Different fields and applications come with a variety of additional design targets and constraints, thereby requiring domain-specific edge AI frameworks. Edge AI will be a service infrastructure that integrates the computation, communication, storage and power resources at network edges to enable data-driven intelligent applications. A notable example in the credit industry is FATE [185], an industrial-grade federated learning framework proposed by WeBank. A number of federated learning algorithms [30, 191] were designed to break data isolation among institutions and to preserve data privacy during edge training. Another representative attempt at edge AI for smart healthcare is NVIDIA Clara [131], which delivers AI to healthcare and life sciences with NVIDIA's EGX edge computing platform. Since Clara features federated learning, it supports collaboratively building a healthcare AI model across hospitals and medical institutions while protecting patient data.
Acknowledgement
We sincerely thank Prof. Zhi Ding from the University of California at Davis for insightful and constructive comments to improve the presentation of this work.
References
 [1] (2019) Communication topologies between learning agents in deep reinforcement learning. arXiv preprint arXiv:1902.06740. Cited by: TABLE I, §IVB2.
 [2] (2017) Net-trim: convex pruning of deep neural networks with performance guarantee. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 3180–3189. Cited by: TABLE I, 3rd item.
 [3] (2017) QSGD: communicationefficient SGD via gradient quantization and encoding. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 1709–1720. Cited by: 2nd item.
 [4] (2018) The convergence of sparsified gradient methods. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 5973–5983. Cited by: 3rd item.
 [5] (2019, Jul.) Machine learning at the wireless edge: distributed stochastic gradient descent over-the-air. In Proc. IEEE Int. Symp. Inform. Theory (ISIT). Cited by: TABLE I, 1st item.
 [6] (2019) Federated learning over wireless fading channels. arXiv preprint arXiv:1907.09769. Cited by: TABLE I, 1st item.
 [7] (2015) Learning separable filters. IEEE Trans. Pattern Anal. Mach. Intell. 37 (1), pp. 94–106. Cited by: TABLE I, 5th item.
 [8] (2015) Communication complexity of distributed convex learning and optimization. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 1756–1764. Cited by: §IIIB1.
 [9] (2017, Nov.) Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34 (6), pp. 26–38. Cited by: §I.
 [10] (2019) Near optimal coded data shuffling for distributed learning. IEEE Trans. Inf. Theory 65 (11), pp. 7325–7349. Cited by: 3rd item.
 [11] (2016) Mapreducebased deep learning with handwritten digit recognition case study. In Proc. IEEE Int. Conf. Big Data (Big Data), pp. 1690–1699. Cited by: §IVE.
 [12] (2018) SIGNSGD: compressed optimisation for nonconvex problems. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 559–568. Cited by: 2nd item.
 [13] (2019) Memoryand communicationaware model compression for distributed deep learning inference on iot. arXiv preprint arXiv:1907.11804. Cited by: §IVD2.
 [14] (2019) Stochastic gradient coding for straggler mitigation in distributed learning. arXiv preprint arXiv:1905.05383. Cited by: TABLE I, 4th item.
 [15] (2016) Gossip training for deep learning. In NIPS Optimization for Machine Learning Workshop, Cited by: TABLE I, §IVB2.
 [16] (2019) Distributed optimization for deep learning with gossip exchange. Neurocomputing 330, pp. 287–296. Cited by: TABLE I, §IVB2.
 [17] (2006) Randomized gossip algorithms. IEEE/ACM Trans. Netw. 14 (SI), pp. 2508–2530. Cited by: §IVB2.
 [18] (2018, Sep.) Assessing the economic impact of artificial intelligence. ITU Trends Issue Paper No. 1. Cited by: §I.
 [19] (2016) A stochastic quasinewton method for largescale optimization. SIAM J. Optimization 26 (2), pp. 1008–1031. Cited by: TABLE I, §IIIC.
 [20] (2020) Communication efficient federated learning over multiple access channels. arXiv preprint arXiv:2001.08737. Cited by: 1st item.
 [21] (2018) Adacomp: adaptive residual gradient compression for dataparallel distributed training. In AAAI Conf. Artif. Intell., Cited by: TABLE I, 3rd item.
 [22] (2017) Zoo: zeroth order optimization based blackbox attacks to deep neural networks without training substitute models. In Proc. ACM Workshop Artif. Intell. Security, pp. 15–26. Cited by: §IIIA.
 [23] (2018) LAG: lazily aggregated gradient for communicationefficient distributed learning. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 5050–5060. Cited by: TABLE I, 1st item.
 [24] (2018) Communicationefficient distributed reinforcement learning. arXiv preprint arXiv:1812.03239. Cited by: TABLE I, 1st item.
 [25] (2015) Glimpse: continuous, realtime object recognition on mobile devices. In Proc. ACM Conf. Embedded Netw. Sensor Syst., pp. 155–168. Cited by: 1st item, 1st item.
 [26] (2016) Compressing convolutional neural networks in the frequency domain. In Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), pp. 1475–1484. Cited by: TABLE I, 2nd item.
 [27] (2015) Compressing neural networks with the hashing trick. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 2285–2294. Cited by: TABLE I, 2nd item.
 [28] (2019) SignSGD via zerothorder oracle. In Proc. Int. Conf. Learn. Representations (ICLR), Cited by: §IIIA.
 [29] (2017, Dec.) Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst. 1 (2), pp. 44:1–44:25. ISSN 2476-1249. Cited by: 3rd item.
 [30] (2019) SecureBoost: a lossless federated learning framework. arXiv preprint arXiv:1901.08755. Cited by: 3rd item.
 [31] (2018, Jan.) Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process. Mag. 35 (1), pp. 126–136. Cited by: §IIA.
 [32] (2015) An exploration of parameter redundancy in deep networks with circulant projections. In Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 2857–2865. Cited by: TABLE I, 5th item.
 [33] (2018) Neural networks meet physical networks: distributed inference between edge devices and the cloud. In Proc. ACM Workshop on Hot Topics in Networks, pp. 50–56. Cited by: 2nd item, 2nd item.
 [34] (2019) Network offloading policies for cloud robotics: a learningbased approach. arXiv preprint arXiv:1902.05703. Cited by: §IVD2.
 [35] (2016) Using recurrent neural network models for early detection of heart failure onset. J. Amer. Med. Inf. Assoc. 24 (2), pp. 361–370. Cited by: §IIA.
 [36] (2019) Largescale beamforming for massive mimo via randomized sketching. arXiv preprint arXiv:1903.05904. Cited by: 2nd item.
 [37] (2007) Mapreduce for machine learning on multicore. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 281–288. Cited by: §IVE.
 [38] (2018, Nov.) Cisco global cloud index: forecast and methodology, 2016–2021 white paper. Cited by: §I.
 [39] (2009) Introduction to derivativefree optimization. Vol. 8, SIAM. Cited by: §IIIA.
 [40] (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 3123–3131. Cited by: TABLE I, 1st item.
 [41] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to + 1 or 1. arXiv preprint arXiv:1602.02830. Cited by: TABLE I, 1st item.
 [42] (2018) GossipGraD: scalable deep learning using gossip communication based asynchronous gradient descent. arXiv preprint arXiv:1803.05880. Cited by: TABLE I, §IVB2.
 [43] (2008) MapReduce: simplified data processing on large clusters. Commun. ACM 51 (1), pp. 107–113. Cited by: 4th item, §IVE.
 [44] (2019) Edge intelligence: the confluence of edge computing and artificial intelligence. arXiv preprint arXiv:1909.00560. Cited by: §IID, §IID.
 [45] (2013) Predicting parameters in deep learning. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 2148–2156. Cited by: TABLE I, 5th item.
 [46] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 1269–1277. Cited by: TABLE I, 5th item.
 [47] (2017) Distributed quantized gradientfree algorithm for multiagent convex optimization. In 2017 29th Chinese Control And Decision Conference (CCDC), pp. 6431–6435. Cited by: §IIIA.
 [48] (2019) Communication constrained cloudbased longterm visual localization in real time. arXiv preprint arXiv:1903.03968. Cited by: 1st item, 1st item.
 [49] (2019, Nov.) Secure distributed on-device learning networks with Byzantine adversaries. IEEE Netw. 33 (6), pp. 180–187. Cited by: 3rd item.
 [50] (2020) Distributed sensing with orthogonal multiple access: to code or not to code?. arXiv preprint arXiv:2001.09594. Cited by: 1st item.
 [51] (2019) High-dimensional stochastic gradient quantization for communication-efficient edge learning. Proc. IEEE Global Conf. Signal Inf. Process. (GlobalSIP). Cited by: §IIA, TABLE I, 2nd item.
 [52] (2015, May) Optimal rates for zero-order convex optimization: the power of two function evaluations. IEEE Trans. Inf. Theory 61 (5), pp. 2788–2806. ISSN 0018-9448. Cited by: TABLE I, §IIIA.
 [53] (2018) A distributed secondorder algorithm you can trust. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1358–1366. Cited by: TABLE I, §IIIC.
 [54] (2017) Communicationefficient algorithms for distributed stochastic principal component analysis. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1203–1212. Cited by: §IIIB1.
 [55] (2016) Secure linear regression on vertically partitioned datasets. IACR Cryptology ePrint Archive 2016, pp. 892. Cited by: TABLE I, 3rd item.
 [56] (2012) Vector quantization and signal compression. Vol. 159, Springer Science & Business Media. Cited by: 2nd item.
 [57] (2010, Dec.) Multi-cell MIMO cooperative networks: a new look at interference. IEEE J. Sel. Areas Commun. 28 (9), pp. 1380–1408. Cited by: 3rd item.
 [58] (2011) SystemML: declarative machine learning on mapreduce. In Proc. Int. Conf. Data Eng., pp. 231–242. Cited by: §IVE.
 [59] (2006) Toward a theory of innetwork computation in wireless sensor networks. IEEE Commun. Mag. 44 (4), pp. 98–107. Cited by: §IIA, §IIC.
 [60] (2013, Oct.) Harnessing interference for analog function computation in wireless sensor networks. IEEE Trans. Signal Process. 61 (20), pp. 4893–4906. Cited by: 1st item.
 [61] (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. Cited by: TABLE I, 1st item.
 [62] (2019) Guidelines for reinforcement learning in healthcare. Nat Med 25 (1), pp. 16–18. Cited by: §IIA.
 [63] (2017) Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §IIIB1.
 [64] (2013) Speech recognition with deep recurrent neural networks. In Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 6645–6649. Cited by: §I.
 [65] (2018) Distributed learning of deep neural network over multiple agents. J. Netw. Comput. Appl. 116, pp. 1–8. Cited by: 2nd item.
 [66] (2015) Deep learning with limited numerical precision. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1737–1746. Cited by: 1st item.
 [67] (2015) Why random reshuffling beats stochastic gradient descent. Mathematical Programming, pp. 1–36. Cited by: 3rd item.
 [68] (2019) Wireless mapreduce distributed computing with fullduplex radios and imperfect CSI. pp. 1–5. Cited by: TABLE I, 1st item.
 [69] (2019, Feb.) Tactile robots as a central embodiment of the tactile internet. Proc. IEEE 107 (2), pp. 471–487. ISSN 1558-2256. Cited by: §I.
 [70] (2019) ZONE: zeroth order nonconvex multiagent optimization over networks. IEEE Trans. Autom. Control. Cited by: §IIIA.
 [71] (2018) Improving distributed gradient descent using reedsolomon codes. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 2027–2031. Cited by: TABLE I, 4th item.
 [72] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In Proc. Int. Conf. Learn. Representations (ICLR), Cited by: §IIA, §IIC, TABLE I, 3rd item.
 [73] (2015) Learning both weights and connections for efficient neural network. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 1135–1143. Cited by: TABLE I, 3rd item.
 [74] (2017) Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677. Cited by: TABLE I, 3rd item.
 [75] (1993) Second order derivatives for network pruning: optimal brain surgeon. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 164–171. Cited by: TABLE I, 3rd item.
 [76] (2014) A hybrid approach to offloading mobile image classification. In Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 8375–8379. Cited by: §IVD2.
 [77] (2015) Mobile edge computing—a key technology towards 5G. ETSI white paper 11 (11), pp. 1–16. Cited by: §I.
 [78] (2019) Reconfigurable intelligent surface for green edge inference. arXiv preprint arXiv:1912.00820. Cited by: 3rd item, 3rd item, §IVD.
 [79] (2019) Gpipe: efficient training of giant neural networks using pipeline parallelism. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 103–112. Cited by: TABLE I, 1st item.
 [80] (2014) Speeding up convolutional neural networks with low rank expansions. In Proc. British Mach. Vision Conf. (BMVC), Cited by: TABLE I, 5th item.
 [81] (2014) Communicationefficient distributed dual coordinate ascent. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 3068–3076. Cited by: TABLE I, §IIID1, §IIID.
 [82] (2018) Fundamental limits of wireless distributed computing networks. In Proc. INFOCOM, pp. 2600–2608. Cited by: TABLE I, 1st item.
 [83] (2018) SketchML: accelerating distributed machine learning with data sketches. In Proc. Int. Conf. Management Data, pp. 1269–1284. Cited by: 2nd item.
 [84] (2019, May) Pliable data shuffling for on-device distributed learning. In Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 7460–7464. ISSN 2379-190X. Cited by: TABLE I, 3rd item.
 [85] (2019) Layerwise deep neural network pruning via iteratively reweighted optimization. In Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 5606–5610. Cited by: TABLE I, 3rd item.
 [86] (2016) How to scale distributed deep learning?. arXiv preprint arXiv:1611.04581. Cited by: TABLE I, §IVB2.
 [87] (2017) Neurosurgeon: collaborative intelligence between the cloud and mobile edge. In ACM SIGARCH Computer Architecture News, Vol. 45, pp. 615–629. Cited by: §I, §IVD2.
 [88] (2004) Privacypreserving distributed mining of association rules on horizontally partitioned data. IEEE Trans. Knowl. Data Eng. (9), pp. 1026–1037. Cited by: TABLE I, 3rd item.
 [89] (2017) On largebatch training for deep learning: generalization gap and sharp minima. In Proc. Int. Conf. Learn. Representations (ICLR), Cited by: §IIIB1.
 [90] (2018) Edgehost partitioning of deep neural networks with feature space encoding for resourceconstrained internetofthings platforms. In Proc. IEEE Int. Conf. Advanced Video Signal Based Surveillance (AVSS), pp. 1–6. Cited by: 2nd item, 2nd item.
 [91] (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §IIID2, §IIID.
 [92] (2015) Federated optimization: distributed optimization beyond the datacenter. NIPS Optimization for Machine Learning Workshop. Cited by: §I, 3rd item, §IIID.
 [93] (2018) Learning a code: machine learning for approximate nonlinear coded computation. arXiv preprint arXiv:1806.01259. Cited by: TABLE I, 2nd item.
 [94] (2012) ImageNet classification with deep convolutional neural networks. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 1097–1105. Cited by: §I, 1st item, §IIA.
 [95] (2016) Fast convnets using groupwise brain damage. In Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 2554–2564. Cited by: TABLE I, 4th item.
 [96] (1990) Optimal brain damage. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 598–605. Cited by: TABLE I, 3rd item.
 [97] (2017) Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. J. Mach. Learn. Res. 18 (1), pp. 4404–4446. Cited by: §IIC, TABLE I, §IIIB1.
 [98] (2018, Mar.) Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory 64 (3), pp. 1514–1529. ISSN 0018-9448. Cited by: TABLE I, 3rd item.
 [99] (2019, Aug.) The roadmap to 6G: AI empowered wireless networks. IEEE Commun. Mag. 57 (8), pp. 84–90. Cited by: §I, §IIA.
 [100] (2019) Edge AI: on-demand accelerating deep neural network inference via edge computing. IEEE Trans. Wireless Commun., pp. 1–1. Cited by: 1st item.
 [101] (2018) Wireless mapreduce distributed computing. In Proc. IEEE Int. Symp. Inform. Theory (ISIT), pp. 1286–1290. Cited by: TABLE I, 1st item.
 [102] (2017, Oct.) A scalable framework for wireless distributed computing. IEEE/ACM Trans. Netw. 25 (5), pp. 2643–2654. ISSN 1063-6692. Cited by: §IIA, TABLE I, 1st item.
 [103] (2018) Nearoptimal straggler mitigation for distributed gradient methods. In Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, pp. 857–866. Cited by: TABLE I, 4th item.
 [104] (2017, Apr.) Coding for distributed fog computing. IEEE Commun. Mag. 55 (4), pp. 34–40. Cited by: TABLE I, §IVE.
 [105] (2018, Jan.) A fundamental tradeoff between computation and communication in distributed computing. IEEE Trans. Inf. Theory 64 (1), pp. 109–128. Cited by: TABLE I, 4th item, 1st item, 1st item.
 [106] (2019) Federated learning: challenges, methods, and future directions. arXiv preprint arXiv:1908.07873. Cited by: §I.
 [107] (2019) Towards a theoretical understanding of hashingbased neural nets. In Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 751–760. Cited by: TABLE I, 2nd item.
 [108] (2018) Deep gradient compression: reducing the communication bandwidth for distributed training. In Proc. Int. Conf. Learn. Representations (ICLR). Cited by: §IIC, TABLE I, 3rd item.
 [109] (1989) On the limited memory bfgs method for large scale optimization. Math. Program. 45 (13), pp. 503–528. Cited by: §IIIC.
 [110] (to be published) Clientedgecloud hierarchical federated learning. Proc. IEEE Int. Conf. Commun. (ICC). Cited by: §IIID1.
 [111] (2018) DeepNjpeg: a deep neural network favorable jpegbased image compression framework. In Proc. Annu. Design Autom. Conf., pp. 18. Cited by: 2nd item, 2nd item.
 [112] (2017) Bayesian compression for deep learning. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 3288–3298. Cited by: TABLE I, 3rd item.
 [113] (2008, Oct.) An overview of limited feedback in wireless communication systems. IEEE J. Sel. Areas Commun. 26 (8), pp. 1341–1365. Cited by: §IIA.
 [114] (2019, Jul.) Distributed stochastic gradient descent using LDGM codes. In Proc. IEEE Int. Symp. Inform. Theory (ISIT). Cited by: TABLE I, 4th item.
 [115] (2019, Jul.) Robust gradient descent via moment encoding and LDPC codes. In Proc. IEEE Int. Symp. Inform. Theory (ISIT). Cited by: TABLE I, 4th item.
 [116] (2017) Modnn: local distributed mobile computing system for deep neural network. In Design Autom. Test Europe Conf. Exhibition (DATE), 2017, pp. 1396–1401. Cited by: §IVD2.
 [117] (2017, Fourth quarter) A survey on mobile edge computing: the communication perspective. IEEE Commun. Surveys Tuts. 19 (4), pp. 2322–2358. Cited by: §I, §IIA, §IIC, 3rd item, §IVD.
 [118] (2018) A privacypreserving deep learning approach for face recognition with edge computing. In USENIX Workshop on Hot Topics in Edge Computing (HotEdge), Cited by: TABLE I, 2nd item.
 [119] (2019) Scalable deep learning on distributed infrastructures: challenges, techniques and tools. arXiv preprint arXiv:1903.11314. Cited by: §IVC.
 [120] (2017) Communicationefficient learning of deep networks from decentralized data. In Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1273–1282. Cited by: §IIA, TABLE I, Fig. 3, §IIID1, §IIID, 2nd item.
 [121] (2017) Device placement optimization with reinforcement learning. In Proc. Int. Conf. Mach. Learn. (ICML), Cited by: TABLE I, 1st item.
 [122] (2015) Cloudbased collaborative 3d mapping in realtime with lowcost robots. IEEE Trans. Autom. Sci. Eng. 12 (2), pp. 423–431. Cited by: 1st item, 1st item.
 [123] (2016) A linearlyconvergent stochastic lbfgs algorithm. In Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 249–258. Cited by: TABLE I, §IIIC.
 [124] (2019) Fitting ReLUs via SGD and quantized SGD. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), Cited by: 2nd item.
 [125] (2019) Machine learning at the network edge: a survey. arXiv preprint arXiv:1908.00080. Cited by: §IID, §IID.
 [126] (2019) PipeDream: generalized pipeline parallelism for DNN training. In Proc. ACM Symp. Operating Syst. Principles, pp. 1–15. Cited by: TABLE I, 1st item.
 [127] (2007, Oct.) Computation over multiple-access channels. IEEE Trans. Inf. Theory 53 (10), pp. 3498–3516. Cited by: 1st item.
 [128] (2018) Network topology and communicationcomputation tradeoffs in decentralized optimization. Proc. IEEE 106 (5), pp. 953–976. Cited by: TABLE I, §IVB2.
 [129] (2019) The role of network topology for distributed machine learning. In Proc. INFOCOM, pp. 2350–2358. Cited by: TABLE I, §IVB2.
 [130] (2017) Random gradientfree minimization of convex functions. Found. Comput. Math. 17 (2), pp. 527–566. Cited by: Fig. 3, §IIIA.
 [131] (2019) NVIDIA Clara: an application framework optimized for healthcare and life sciences developers. Note: https://developer.nvidia.com/clara Cited by: 3rd item.
 [132] (2018) Learning compact neural networks with regularization. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 3963–3972. Cited by: TABLE I, 4th item.
 [133] (2019, Jul.) Speeding up distributed gradient descent by utilizing non-persistent stragglers. In Proc. IEEE Int. Symp. Inform. Theory (ISIT). Cited by: TABLE I, 4th item.
 [134] (2018) Exact convergence of gradientfree distributed optimization method in a multiagent system. In Proc. IEEE Conf. Decision Control (CDC), pp. 5728–5733. Cited by: §IIIA.
 [135] (2019, Nov.) Wireless network intelligence at the edge. Proc. IEEE 107 (11), pp. 2204–2239. Cited by: §I, §IID, §IID.
 [136] (2018) Coded distributed computing with node cooperation substantially increases speedup factors. In Proc. IEEE Int. Symp. Inform. Theory (ISIT), pp. 1291–1295. Cited by: TABLE I, 4th item, 1st item, 2nd item.
 [137] (2009) Bandwidth optimal allreduce algorithms for clusters of workstations. J. Parallel and Dist. Comput. 69 (2), pp. 117–124. Cited by: TABLE I, §IVB2.
 [138] (2015) Communicationconstrained multiauv cooperative slam. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 509–516. Cited by: 1st item, 1st item.
 [139] (2016) Iterative hessian sketch: fast and accurate solution approximation for constrained leastsquares. J. Mach. Learn. Res. 17 (1), pp. 1842–1879. Cited by: 2nd item.
 [140] (2018) Coded computing for distributed graph analytics. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 1221–1225. Cited by: TABLE I, 1st item.
 [141] (2018) Elastic gossip: distributing neural network training using gossiplike protocols. arXiv preprint arXiv:1812.02407. Cited by: TABLE I, §IVB2.
 [142] (2019) Towards smart and reconfigurable environment: intelligent reflecting surface aided wireless network. IEEE Commun. Mag. ISSN 1558-1896. Cited by: 3rd item.
 [143] (2016) Xnornet: imagenet classification using binary convolutional neural networks. In Proc. Eur. Conf. on Comput. Vision (ECCV), pp. 525–542. Cited by: TABLE I, 1st item.
 [144] (2018) Gradient coding from cyclic mds codes and expander graphs. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 4302–4310. Cited by: TABLE I, 4th item, 4th item.
 [145] (2013) Parallel stochastic gradient algorithms for largescale matrix completion. Math. Programm. Comput. 5 (2), pp. 201–226. Cited by: 3rd item.
 [146] (2019) A tour of reinforcement learning: the view from continuous control. Annu. Rev. Control Robot. Auton. Syst. 2, pp. 253–279. Cited by: 1st item, §IIIA.
 [147] (2019) Artificial intelligenceenabled healthcare delivery. J. Royal Soc. Med. 112 (1), pp. 22–28. Cited by: §IIA.
 [148] (2016, Apr.) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). OJ L 119, pp. 1–88. Note: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L:2016:119:FULL&from=EN Cited by: 3rd item, §IIID.
 [149] (2017) Latency analysis of coded computation schemes over wireless networks. In Proc. 55th Annu. Allerton Conf. Commun. Control Comput. (Allerton), pp. 1256–1263. Cited by: TABLE I, 2nd item.
 [150] (2019) CodedReduce: a fast and robust framework for gradient aggregation in distributed learning. arXiv preprint arXiv:1902.01981. Cited by: TABLE I, §IVB2.
 [151] (2013) Learning separable filters. In Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 2754–2761. Cited by: TABLE I, 5th item.
 [152] (2018) Communicationefficient distributed strongly convex stochastic optimization: nonasymptotic rates. arXiv preprint arXiv:1809.02920. Cited by: §IIIA.
 [153] (2018) Distributed zeroth order optimization over random networks: a kieferwolfowitz stochastic approximation approach. In Proc. IEEE Conf. Decision Control (CDC), pp. 4951–4958. Cited by: TABLE I, §IIIA.
 [154] (2018) Nonasymptotic rates for communication efficient distributed zeroth order strongly convex optimization. In Proc. IEEE Global Conf. Signal Inf. Process. (GlobalSIP), pp. 628–632. Cited by: §IIIA.
 [155] Federated optimization in heterogeneous networks. In ICML Workshop on Adaptive and Multitask Learning, Cited by: TABLE I, §IIID1.
 [156] (2013) Lowrank matrix factorization for deep neural network training with highdimensional output targets. In Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 6655–6659. Cited by: TABLE I, 5th item.
 [157] (2007) A stochastic quasinewton method for online convex optimization. In Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 436–443. Cited by: TABLE I, §IIIC.
 [158] (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. Fifteenth Annu. Conf. Int. Speech Commun. Assoc. Cited by: 2nd item.
 [159] (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799. Cited by: TABLE I, §IVB2.
 [160] (2014) Distributed stochastic optimization and learning. In Proc. 52nd Annu. Allerton Conf. Commun. Control Comput. (Allerton), pp. 850–857. Cited by: §IIIB1.
 [161] (2014) Communicationefficient distributed optimization using an approximate newtontype method. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1000–1008. Cited by: TABLE I, Fig. 3, §IIIC, §IIIC.
 [162] (1948) A mathematical theory of communication. Bell Syst. Tech. J. 27 (3), pp. 379–423. Cited by: §IIC.
 [163] (2019) BottleNet++: an endtoend approach for feature compression in deviceedge coinference systems. arXiv preprint arXiv:1910.14315. Cited by: 2nd item.
 [164] (2019) Improving deviceedge cooperative inference of deep learning via 2step pruning. arXiv preprint arXiv:1903.03472. Cited by: 2nd item, 2nd item.
 [165] (2015) Structured transforms for smallfootprint deep learning. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 3088–3096. Cited by: TABLE I, 5th item.
 [166] (2017) Federated multitask learning. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 4427–4437. Cited by: TABLE I, §IIID1, §IIID.
 [167] (2017, Jun.) A pliable index coding approach to data shuffling. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 2558–2562. ISSN 2157-8117. Cited by: TABLE I, 3rd item.
 [168] (2015) Scalable distributed dnn training using commodity gpu cloud computing. In Proc. Sixteenth Annu. Conf. Int. Speech Commun. Assoc., Cited by: 3rd item.
 [169] (2019) Optimizing network performance for distributed dnn training on gpu clusters: imagenet/alexnet training in 1.5 minutes. arXiv preprint arXiv:1902.06855. Cited by: 1st item.
 [170] (2017) Distributed mean estimation with limited communication. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 3329–3337. Cited by: 2nd item.
 [171] (2017, Nov.) Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105 (12), pp. 2295–2329. Cited by: §I, §IIA.
 [172] (2017) Gradient coding: avoiding stragglers in distributed learning. In Proc. Int. Conf. Mach. Learn. (ICML), Vol. 70, pp. 3368–3376. Cited by: TABLE I, 4th item.
 [173] (2018) Communication compression for decentralized training. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 7652–7662. Cited by: 2nd item.
 [174] (2016) BranchyNet: fast inference via early exiting from deep neural networks. In Proc. Int. Conf. Pattern Recognition (ICPR), pp. 2464–2469. Cited by: 1st item, 1st item.
 [175] (2017) Distributed deep neural networks over the cloud, the edge and end devices. In Proc. IEEE Int. Conf. Dist. Comput. Syst. (ICDCS), pp. 328–339. Cited by: 1st item, 1st item, §IVD.
 [176] (2017) Soft weight-sharing for neural network compression. In Proc. Int. Conf. Learn. Representations (ICLR). Cited by: TABLE I, 3rd item.
 [177] (2003) Privacy-preserving k-means clustering over vertically partitioned data. In Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), pp. 206–215. Cited by: TABLE I, 3rd item.
 [178] (2005) Privacy-preserving decision trees over vertically partitioned data. In IFIP Annu. Conf. Data Appl. Security Privacy, pp. 139–152. Cited by: TABLE I, 3rd item.
 [179] (2018) Not just privacy: improving performance of private deep learning in mobile cloud. In Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), pp. 2407–2416. Cited by: TABLE I, 2nd item.
 [180] (2018) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv preprint arXiv:1810.08313. Cited by: TABLE I, 2nd item.
 [181] (2018) Cooperative SGD: a unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576. Cited by: §IIID1.
 [182] (2019, Jun.) Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 37 (6), pp. 1205–1221. ISSN 1558-0008. Cited by: TABLE I, 2nd item.
 [183] (2018) GIANT: globally improved approximate Newton method for distributed optimization. pp. 2332–2342. Cited by: TABLE I, §IIIC.
 [184] (2018) Gradient sparsification for communication-efficient distributed optimization. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 1299–1309. Cited by: 3rd item.
 [185] (2018) FATE: an industrial grade federated learning framework. Note: https://fate.fedai.org Cited by: 3rd item.
 [186] (2016) Learning structured sparsity in deep neural networks. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 2074–2082. Cited by: TABLE I, 4th item.
 [187] (2020) NOMA-enhanced computation over multiaccess channels. IEEE Trans. Wireless Commun. Cited by: 1st item.
 [188] (2016) Quantized convolutional neural networks for mobile devices. In Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 4820–4828. Cited by: 1st item.
 [189] (2015) Petuum: a new platform for distributed machine learning on big data. IEEE Trans. Big Data 1 (2), pp. 49–67. Cited by: §IVC.
 [190] (2019, Jun.) Data shuffling in wireless distributed computing via low-rank optimization. IEEE Trans. Signal Process. 67 (12), pp. 3087–3099. Cited by: §IIA, TABLE I, Fig. 7, 4th item, 1st item, §IVE.
 [191] (2019) A quasi-Newton method based vertical federated learning framework for logistic regression. In NeurIPS Workshops on Federated Learning for Data Privacy and Confidentiality. Cited by: 3rd item.
 [192] (2020) Federated learning via over-the-air computation. IEEE Trans. Wireless Commun. Cited by: §IIA, §IIC, TABLE I, 1st item.
 [193] (2019) Energy-efficient processing and robust wireless cooperative transmission for edge inference. arXiv preprint arXiv:1907.12475. Cited by: 3rd item, 3rd item, §IVD.
 [194] (2019) Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. 10 (2), pp. 12. Cited by: 3rd item, §IIA, TABLE I, 2nd item, 3rd item.
 [195] (2015) Deep fried convnets. In Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 1476–1483. Cited by: TABLE I, 5th item.
 [196] (2018) Communication-computation efficient gradient coding. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 5606–5615. Cited by: TABLE I, 4th item.
 [197] (2018) Gradient diversity: a key ingredient for scalable distributed learning. In Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1998–2007. Cited by: §IIIB1.
 [198] (2006) Privacy-preserving SVM classification on vertically partitioned data. In Pacific-Asia Conf. Knowl. Discovery Data Mining, pp. 647–656. Cited by: TABLE I, 3rd item.
 [199] (2018) GradiVeQ: vector quantization for bandwidth-efficient gradient aggregation in distributed CNN training. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 5129–5139. Cited by: 2nd item.
 [200] (2015) Zeroth-order method for distributed optimization with approximate projections. IEEE Trans. Neural Netw. Learn. Syst. 27 (2), pp. 284–294. Cited by: §IIIA.
 [201] (2014) Randomized gradient-free method for multiagent optimization over time-varying networks. IEEE Trans. Neural Netw. Learn. Syst. 26 (6), pp. 1342–1347. Cited by: TABLE I, §IIIA.
 [202] (2018) Variance-reduced stochastic learning by networked agents under random reshuffling. IEEE Trans. Signal Process. 67 (2), pp. 351–366. Cited by: §IIC, TABLE I, §IIIB1.
 [203] (2020) Reconfigurable-intelligent-surface empowered 6G wireless communications: challenges and opportunities. arXiv preprint arXiv:2001.00364. Cited by: 3rd item.
 [204] (2019) On model coding for distributed inference and transmission in mobile edge computing systems. IEEE Commun. Letters. Cited by: 3rd item, 3rd item.
 [205] (2020, Feb.) Mobile edge intelligence and computing for the Internet of Vehicles. Proc. IEEE 108 (2), pp. 246–261. ISSN 1558-2256. Cited by: §I.
 [206] (2019) Distributed deep learning strategies for automatic speech recognition. In Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 5706–5710. Cited by: §IIIB1.
 [207] (2013) Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14 (1), pp. 3321–3363. Cited by: §IIIB1.
 [208] (2015) DiSCO: distributed optimization for self-concordant empirical loss. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 362–370. Cited by: TABLE I, §IIIC.
 [209] (2012) Communication-efficient algorithms for statistical optimization. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 1502–1510. Cited by: §IIID1.
 [210] (2018) Communication-efficient distributed optimization of self-concordant empirical loss. In Large-Scale and Distributed Optimization, pp. 289–341. Cited by: §IIIB1.
 [211] (2019) A node-selection-based subtask assignment method for coded edge computing. IEEE Commun. Letters 23 (5), pp. 797–801. Cited by: TABLE I, 2nd item.
 [212] (2018) DeepThings: distributed adaptive deep learning inference on resource-constrained IoT edge clusters. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 37 (11), pp. 2348–2359. Cited by: §IVD2.
 [213] (2018) On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proc. Int. Joint Conf. Artif. Intell. (IJCAI), pp. 3219–3227. Cited by: TABLE I, 2nd item.
 [214] (2019, Aug.) Edge intelligence: paving the last mile of artificial intelligence with edge computing. Proc. IEEE 107 (8), pp. 1738–1762. ISSN 1558-2256. Cited by: §I, §I, §IIA, §IID, §IID.
 [215] (2020) One-bit over-the-air aggregation for communication-efficient federated edge learning: design and convergence analysis. arXiv preprint arXiv:2001.05713. Cited by: 1st item.
 [216] (2018, to appear) MIMO over-the-air computation for high-mobility multimodal sensing. In IEEE Internet Things J. Cited by: 1st item.
 [217] (to appear) Towards an intelligent edge: wireless communication meets machine learning. IEEE Commun. Mag. Cited by: §I.
 [218] (2019) Low-latency broadband analog aggregation for federated edge learning. IEEE Trans. Wireless Commun. Cited by: TABLE I, 1st item.
 [219] (2010) Parallelized stochastic gradient descent. In Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 2595–2603. Cited by: §IIIB1.