1 Significance and Motivation
The pursuit of extremely stringent latency and reliability guarantees is essential in the fifth generation (5G) communication system and beyond [1, 2]. In a wirelessly automated factory, the remote control of assembly robots should provision the same level of target latency and reliability offered by existing wired factory systems. To this end, for instance, control packets should be delivered within 1 ms with 99.99999% reliability [3, 4, 5]. In the emerging non-terrestrial communication enabled by a massive constellation of low-orbit satellites [6, 7, 8, 9, 10], the orbiting speed is over 8 km per second, under which a single emergency control packet loss may incur collisions with other satellites and space debris. Unfortunately, traditional methods postulate known channel and network topological models while focusing primarily on maximizing data rates. Such model-based and best-effort solutions are far from enough to meet the challenging latency and reliability requirements under limited radio resources and randomness on wireless channels and network topologies in practice.
This pressing concern has recently sparked huge interest in introducing machine learning (ML) based approaches into communication system designs [11, 12, 13, 14]. By leveraging ML at the network edge, each edge node can proactively carry out decision-making based on its local predictions, thereby experiencing zero latency [2, 15]. Furthermore, these ML models are constructed from real data observations, and thus directly reflect the real environment without modeling artifacts. In these respects, one may mistakenly conclude that communication becomes less important in 5G and beyond, where everything is locally predictable. The answer is the opposite, as accurate ML prediction cannot be achieved and sustained without communication.
In particular, observing dispersed data is a prerequisite for training and running data-driven ML models with high prediction or inference accuracy. The observation does not have to be done by directly collecting the raw data from edge nodes, which may violate their data privacy. Alternatively, by leveraging federated learning (FL), it is possible to exchange ML model parameters that reflect the data observed by each ML model without revealing raw data [16]. Similarly, one can exchange ML model outputs [17, 18] or hidden activations [19] for higher communication efficiency while preserving data privacy. Such communication is not a one-time event, since a trained ML model can easily become outdated and should thus be continually re-trained under time-varying data distributions and environments. As a consequence, ML will not only be a key enabler of future communication systems, but also be one major source of data traffic, which warrants taming the new kind of traffic generated by distributed learning. Furthermore, communication environments have a considerable impact on the performance of ML. Indeed, temporal network topology variations and uplink-downlink channel asymmetry determine learning stragglers. In analog transmissions, channel fluctuations directly distort the communicated information [20], affecting the ML accuracy and data privacy. This mandates co-designing distributed ML and communication operations.
Spurred by the aforementioned motivations, this article aims to present communication-efficient and distributed learning frameworks built upon jointly optimizing the types of communication payloads, transmissions, and scheduling as well as ML architectures and algorithms under wireless channel dynamics and network topology variations. To reach the overarching goal, as visualized in Fig. 1, this article is structured as follows. In Sec. 2, major technical challenges are summarized. In Sec. 3, existing communication-efficient and distributed learning frameworks and their limitations are briefly reviewed. To improve these vanilla distributed learning frameworks, several ML and communication design principles are discussed in Sec. 4 and 5. Finally, selected applications of such principles and their effectiveness are elaborated in Sec. 6, followed by concluding remarks in Sec. 7.
2 Key Challenges
Towards understanding the underlying black-box operations of ML, centralized ML architectures have been the prime focus of theoretical studies. With the paradigm shift from cloud-centric to on-device ML, these theoretical analyses are not readily applicable to the current distributed ML architectures. In this view, we identify several key challenges that need to be addressed in designing distributed learning over wireless networks, as discussed next.
Data Shortage. One of the main downsides of shifting from cloud to device in data-driven ML is the limited access to sufficiently large datasets. Devices with low exposure and storage may not be able to accumulate rich datasets, in which case on-device training may generalize poorly and be susceptible to unseen data [21]. Data acquisition within the device could result in higher end-to-end training latencies and/or outdated ML models in the presence of dynamic data. To overcome the data shortage, robust and collaborative ML designs need to be investigated.
Non-IID Data. User-generated data could be highly personalized (e.g., different angles and frame rates of surveillance cameras), imbalanced (e.g., some labels corresponding to extreme events have far fewer samples than other labels), and multi-modal (e.g., temperature and humidity sensors for weather prediction), all of which are described as non-independent and identically distributed (non-IID) data. Under non-IID data, it is common that the accuracy and convergence speed of distributed learning are significantly degraded [22, 23, 24, 25, 26]. Furthermore, the majority of analytical frameworks are devised for training with IID data and cannot be easily extended towards distributed learning over non-IID data [27]. Hence, deriving convergence characteristics as well as reliability and robustness guarantees for on-device learning is a daunting task.
Data Privacy. Data owned by the devices may contain privacy-sensitive information, and thus exchanging ML model parameters instead of data is widely used in distributed learning. Yet, the exposed model parameters could be reversely traced, in which case privacy is only partially preserved [28]. To further enhance privacy, adopting extra coding, introducing noise to shared parameters, and exchanging redundant information are some viable solutions. However, each of the above solutions introduces additional challenges (e.g., increased processing delays with extra coding, loss of inference accuracy due to the excess noise, and extra communication delays with redundant information).
Computing Resource Limitation. Training and operating ML models require substantial processing energy, memory, and high-speed inter-processor communication links, which are commonly not available at battery-limited small edge devices. Hence, deep learning computations are often carried out at a cloud server using high-performance computing (HPC) resources [29] consisting of graphics processing units (GPUs), each of which is equipped with thousands of processing cores (e.g., the NVIDIA GeForce RTX 2080 Ti has 4,352 CUDA cores and 544 Tensor Cores [30]). Such computations cannot be expected to shift to the network edge without reducing their complexity. In addition, due to the limited energy and memory/storage at edge devices, processing small, low-complexity models and tasks is of the utmost importance. In this view, designs of energy-efficient low-precision ML and binary neural networks (NNs) need to be considered with distributed learning [31, 32, 33].
Communication Resource Limitation. Relying on limited wireless resources, which are shared among a multitude of devices and services and are thus susceptible to high interference and intermittent connectivity, can restrain the performance and speed of distributed learning [12]. Mobile service operators are restricted to limited frequency bands as well as bandwidths, in which the difficulties of ensuring reliable and low-latency connectivity for training devices grow exponentially as the network scales. Introducing more bandwidth to the network via the usage of high frequency bands, e.g., millimeter waves (mmWaves), cannot simply resolve the lack of wireless resources due to their inherently unreliable channel conditions (propagation losses, blockages, and fading) [34]. While increasing/optimizing transmit power and adopting encoding-decoding techniques can be beneficial in terms of enhancing reliable connectivity, training devices may not be able to exploit them with the limited power availability [35]. Therefore, communication resource management is a key aspect in realizing distributed learning.
Poor Channel Conditions. Distributed learning over a large number of devices collaborating with one another relies on inter-device communication over wireless links. Under wireless channel dynamics, communication among devices is likely to be affected by poor channel conditions and transmission noise, yielding increased training latencies and losses in both training and inference accuracy [12]. With the limited wireless resources, it is crucial to adopt existing communication techniques (e.g., scheduling, coding, quantizing, relaying, interference management, and millimeter-wave communication) and to extend them considering the aspects of distributed learning (e.g., guarantees on training latency, accuracy, reliability, and robustness).
Time-Varying Network Topology. Mobility is inherent in devices, so distributed learning needs to cope with dynamic network topologies. With time-varying networks, learning agents are affected by loss of connectivity, inconsistent and asynchronous collaboration, frequent model mismatches, and a tendency to have outdated data and models [21]. Developing distributed training mechanisms and analyzing them under such dynamics is extremely difficult. Resorting to predictive/proactive techniques and recasting the interactions among many agents into simplified statistical models are essential for learning over dynamic wireless network topologies.
3 Related Distributed Learning Methods
Distributed ML algorithms are briefly categorized into the methods exchanging model parameters, model outputs, and hidden activations, with or without the aid of a parameter server. In this section, we introduce representative distributed ML methods, followed by identifying the limitations of these vanilla approaches, calling for applying new key principles and developing advanced ML frameworks to be elaborated in the next sections.
3.1 Federated Learning (FL)
FL is a distributed training framework, which has been successfully adopted for Google’s predictive keyboards [36] and many other use cases in the areas of healthcare, intelligent transportation, and industrial automation [37, 27]. In essence, FL is designed to periodically upload workers’ model parameters (e.g., NN weights and/or gradients) during local training to a parameter server that performs model averaging and broadcasts the resultant global model to all workers [38]. Here, avoiding raw data exchanges preserves data privacy, while adjusting the uploading period improves communication efficiency.
Recent studies have investigated different training aspects including personalization (i.e., multi-task learning) [39], robustness guarantees [40, 41], and training over dynamic topologies [42]. One critical issue of FL is that its communication overhead is proportional to the number of model parameters. Consequently, FL struggles with supporting deep NNs over capacity-limited wireless channels.
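The round structure of FL described in this subsection (local training, upload, server-side averaging, broadcast) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure of [38]: the least-squares local objective, the helper names `local_update`, `federated_averaging`, and `make_worker_data`, and the weighting of the average by local data size are all choices made for this example.

```python
import numpy as np

def local_update(weights, x, y, lr=0.1, local_steps=5):
    """Run a few local gradient steps on a least-squares loss (illustrative objective)."""
    w = weights.copy()
    for _ in range(local_steps):
        grad = x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_averaging(global_w, datasets, rounds=50):
    """Each round: workers train locally, then the server averages the uploaded models."""
    sizes = np.array([len(y) for _, y in datasets], dtype=float)
    for _ in range(rounds):
        # Upload: every worker starts from the broadcast global model.
        local_models = [local_update(global_w, x, y) for x, y in datasets]
        # Server-side model averaging, weighted by local data size (an assumption).
        global_w = sum(s * w for s, w in zip(sizes, local_models)) / sizes.sum()
    return global_w

def make_worker_data(rng, w_true, n_samples):
    """Hypothetical noise-free local dataset for one worker."""
    x = rng.normal(size=(n_samples, len(w_true)))
    return x, x @ w_true

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = [make_worker_data(rng, w_true, n) for n in (20, 40, 60)]
w_fl = federated_averaging(np.zeros(3), datasets)
```

Only model vectors cross the network in this sketch, so the per-round payload scales with the number of parameters, which is the overhead issue noted above. Increasing `local_steps` corresponds to lengthening the uploading period.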
3.2 Group ADMM (GADMM)
The parameter server in FL cannot be connected with faraway workers. Furthermore, it is vulnerable to a single point of attack or failure [43]. In this regard, leveraging the alternating direction method of multipliers (ADMM), group ADMM (GADMM) aims to enable distributed learning without any central entity while communicating only with neighboring workers [44]. To this end, GADMM divides the workers into head and tail groups. Each worker from the head or tail group exchanges variables with only two workers from the tail or head group, respectively, forming a chain. At each iteration, every head worker first updates its primal variable (i.e., model) in parallel by minimizing the augmented Lagrangian function defined in ADMM, while utilizing its two neighboring tail workers’ models from the previous iteration. Once the head workers update their models, each head worker transmits its updated model to its two neighbors from the tail group. Then, in the same way, every tail worker updates its model by utilizing its two neighboring head workers’ models received in the current iteration. Finally, the dual variables are updated locally at each worker.
With GADMM, at every communication round, only half of the workers are competing for the limited bandwidth. Moreover, by limiting the communication only to the two neighboring workers, the communication energy can significantly be reduced. Nonetheless, GADMM relies on model parameter exchanges as in FL whose communication payload size increases with the number of parameters, limiting the scalability of GADMM particularly under deep NNs.
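The head/tail alternation described above can be sketched on a toy problem. The example below is a hedged illustration, not the implementation of [44]: it assumes scalar models, quadratic local losses $f_i(\theta) = \frac{1}{2}(\theta - a_i)^2$ (so that each local minimization has a closed form), and a fixed penalty `rho`; the function name `gadmm_chain` is hypothetical.

```python
import numpy as np

def gadmm_chain(a, rho=1.0, iters=2000):
    """Head/tail alternating ADMM on a chain with f_i(theta) = 0.5 * (theta - a_i)^2.

    Workers at even positions act as heads, odd positions as tails.
    lam[i] is the dual variable of the constraint theta[i] = theta[i + 1].
    """
    a = np.asarray(a, dtype=float)
    n = len(a)
    theta = np.zeros(n)
    lam = np.zeros(n - 1)

    def local_update(i):
        # Closed-form minimizer of the augmented Lagrangian w.r.t. theta[i],
        # using only the models of the (at most two) chain neighbors.
        num, den = a[i], 1.0
        if i > 0:
            num += lam[i - 1] + rho * theta[i - 1]
            den += rho
        if i < n - 1:
            num += -lam[i] + rho * theta[i + 1]
            den += rho
        return num / den

    for _ in range(iters):
        for i in range(0, n, 2):      # head group: depends only on tail models
            theta[i] = local_update(i)
        for i in range(1, n, 2):      # tail group: uses the freshly updated heads
            theta[i] = local_update(i)
        lam += rho * (theta[:-1] - theta[1:])   # local dual updates
    return theta

a_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
theta_demo = gadmm_chain(a_demo)
```

Because heads depend only on tails and vice versa, the sequential loops within each group are equivalent to parallel updates. For this toy objective, the consensus solution is the mean of the `a_i` values.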
3.3 Federated Distillation (FD)
Modern deep NN architectures often have a large number of model parameters. For instance, the GPT-3 model is a state-of-the-art NN architecture for natural language processing (NLP) tasks, and has 175 billion parameters, corresponding to hundreds of GB [45]. Exchanging such a sheer amount of deep NN model parameters is costly, hindering frequent communications particularly under limited wireless resources. Alternatively, FD only exchanges the models’ outputs, whose dimensions are much smaller than the model sizes (e.g., 10 classes in the MNIST dataset). To illustrate, in a classification task, each worker runs local iterations while storing the average model output (i.e., logit) per class. At a regular interval, these local average outputs are uploaded to the parameter server, which aggregates and averages the local average outputs across workers per class. The resultant global average outputs are downloaded by each worker. Finally, to transfer the downloaded global knowledge into local models, each worker runs local iterations with its own loss function in addition to a regularizer measuring the gap between its own prediction output for a training sample and the global average output for the given class of the sample. Such a regularization method is called knowledge distillation (KD), which is to be detailed in Sec. 5.2.
FD is not limited to simple classification tasks under a perfectly controlled environment. In [46], FD is extended to a reinforcement learning (RL) application by replacing the aforementioned per-class averaging step of FD with an averaging operation across neighboring states for an RL task. In [47, 48, 18], FD is implemented over a wireless fading channel, demonstrating comparable accuracy under channel fluctuations and outages with much smaller payload sizes compared to FL. Nonetheless, FD is more vulnerable to the problem of non-IID data distributions. Even if a worker obtains the global average outputs for all classes, when the worker lacks the samples in a specific target class, the global knowledge is rarely transferred into the local model of the worker.
3.4 Split Learning (SL)
A large-sized deep NN cannot fit into edge devices’ small memory. Split learning (SL) resolves this problem by dividing a single NN into multiple segments and distributing the lower segments across multiple workers storing raw data [19, 49]. By connecting the lower segments with a shared upper segment stored at a parameter server, each device uploads its NN activations of the cut-layer (i.e., lower segment’s last layer) to the server calculating the loss values, and downloads the gradients to update its lower segment. As done in FL, FD, and GADMM, SL also hides raw data, preserving data privacy. For this reason, SL has recently been adopted in medical applications wherein dispersed private health records should be exploited without sharing raw data [19, 49, 50]. SL has also been known for its robustness against non-IID data distributions, and applied for fusing heterogeneous vision and radio-frequency (RF) modalities to predict millimeter-wave channels [51, 52, 53].
While effective in terms of accuracy, the communication efficiency of SL is still questionable. As opposed to FL, FD, and GADMM that periodically exchange model updates, SL requires exchanging instantaneous model updates in feedforward and backward propagations. For some applications, SL yields less communication overhead compared to the aforementioned periodic-update benchmark schemes by achieving much faster convergence [54], which may not always be feasible under different tasks and datasets. Furthermore, the communication cost of SL depends on the NN architecture and how its NN layers are cut, calling for more investigation on co-designing its communication and NN architectures.
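The per-iteration message exchange of SL can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from [19, 49]: a single device, a two-segment linear network, a mean-squared-error loss, labels held by the server, and hand-written gradients. The counters `uplink_floats` and `downlink_floats` are hypothetical bookkeeping added to make the payload visible.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))                        # raw data, never leaves the device
y = x @ np.array([[1.0], [-1.0], [0.5], [2.0]])     # labels, held by the server in this sketch

w_lower = 0.5 * rng.normal(size=(4, 3))             # device-side lower segment
w_upper = 0.5 * rng.normal(size=(3, 1))             # server-side upper segment
lr = 0.02
losses = []
uplink_floats = 0
downlink_floats = 0

for step in range(300):
    # Device: forward pass up to the cut-layer, then upload the activations.
    h = x @ w_lower
    uplink_floats += h.size

    # Server: finish the forward pass, compute the loss and its gradients.
    err = h @ w_upper - y
    losses.append(float(np.mean(err ** 2)))
    grad_out = 2.0 * err / len(x)
    grad_upper = h.T @ grad_out
    grad_h = grad_out @ w_upper.T                   # gradient w.r.t. cut-layer activations
    w_upper -= lr * grad_upper

    # Server -> device: download the cut-layer gradient, update the lower segment.
    downlink_floats += grad_h.size
    w_lower -= lr * (x.T @ grad_h)
```

Every training step requires one uplink and one downlink message whose size is the batch size times the cut-layer width, independent of the total number of model parameters. This is why the payload depends on where the NN is cut, as discussed above.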
3.5 Multi-Agent Reinforcement Learning (MARL)
Thus far we have implicitly considered that the datasets are fixed and independent across different workers. This isolated and stationary dataset assumption is not feasible when each worker interacts with other workers in a common environment, while making decisions based on its own observation of the environment. Multi-agent reinforcement learning (MARL) is capable of reflecting such worker-to-environment and inter-worker interactions. Depending on the existence of a central controller and the types of interactions, MARL is categorized into centralized/decentralized and cooperative/competitive frameworks, respectively [55].
Centralized MARL frameworks postulate a central controller that learns decision-making policies by collecting all workers’ experiences, which comprise their observed states, taken actions, and received rewards [56]. Exchanging such information may consume huge communication and memory resources while violating data privacy. Decentralized MARL without the central controller does not incur such issues, at the cost of not guaranteeing the equilibrium of the constituted policies of individual workers. Even under cooperative MARL wherein all workers aim to achieve the same goal, it may not guarantee the convergence to equilibrium policies without central coordination [57]. Competitive MARL aggravates the problem, wherein every worker’s goal competes over a shared common environment and resources as a zero-sum game. Guaranteeing the convergence thus requires additional communication, as we shall discuss with a use case in Sec. 6.7. Nonetheless, note that all the rest of the discussions in this work are centered around distributed learning scenarios that are cooperative and NN based, rather than exploiting MARL in depth.
4 Key Communication Principles
Both communication efficiency and accuracy of distributed learning can be improved by leveraging advanced communication principles coping with limited resources and time-varying communication dynamics as discussed in Sec. 2. Towards improving vanilla distributed learning methods presented in Sec. 3, several key communication principles are introduced in this section, and their effectiveness will be elaborated with selected use cases in Sec. 6.
4.1 Link Sparsification
Reducing the number of links can significantly decrease the communication bandwidth and energy of distributed learning. Such link sparsification can be implemented in the temporal and/or spatial domain. Lazily aggregated gradient (LAG) [58] is one such method pursuing temporal link sparsity by enforcing each worker not to share its model update if the difference, measured by the infinity norm, between the current and previous updates does not exceed a certain threshold. Alternatively, to achieve spatial link sparsity, one can enforce a sparse network topology by making each worker communicate only with very few neighbors, as exemplified by decentralized gradient descent (GD), dual averaging [59], and GADMM algorithms [44].
Link sparsification is not always free, but may come at the cost of slower learning convergence and/or lower accuracy. To illustrate, for spatial link sparsification, a very sparse network graph (e.g., ring topology with nearest-neighbor based connectivity) yields high communication efficiency per iteration, but may require more iterations for reaching convergence and/or a target accuracy level, compared to a denser network graph (e.g., fully connected or star topology with the parameter server). Optimizing the sparsity under the trade-off between per-iteration communication cost and convergence speed is thus crucial.
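The temporal sparsification rule can be sketched as a simple upload test. The snippet below is an illustrative simplification rather than the algorithm of [58]: it assumes a fixed threshold and compares each new update with the last update that was actually uploaded (one possible reading of "previous update"); the function name `lazy_upload` is hypothetical.

```python
import numpy as np

def lazy_upload(updates, threshold):
    """Decide, per round, whether a worker uploads its model update.

    A round is skipped when the infinity-norm change relative to the last
    uploaded update does not exceed `threshold`; the receiver would then
    reuse the stale update.
    """
    last_sent = None
    decisions = []
    for u in updates:
        u = np.asarray(u, dtype=float)
        if last_sent is None or np.max(np.abs(u - last_sent)) > threshold:
            decisions.append(True)
            last_sent = u
        else:
            decisions.append(False)
    return decisions

updates = [[1.0, 1.0], [1.05, 0.98], [1.4, 1.0], [1.41, 1.02]]
decisions = lazy_upload(updates, threshold=0.1)
```

A larger threshold skips more rounds and saves more bandwidth, at the risk of the slower convergence discussed above.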
4.2 Quantization
For each communication round, quantization decreases the number of bits used to represent model updates, thereby reducing the communication payload sizes in distributed learning. Due to the reduced arithmetic precision of the model updates, quantization introduces errors, which may hinder the convergence of learning algorithms and/or degrade accuracy. Therefore, a quantizer and its quantization levels should be carefully designed so as to guarantee convergence with high accuracy. To this end, one can quantize each element of a gradient vector [60, 61, 62] or of the gradient difference vector between the current and previous model updates [63, 58]. For gradient quantization, the methods in [60, 61] adjust the quantization levels under the trade-off between per-iteration communication cost and convergence speed. SignSGD [62] considers an extreme case wherein gradients are quantized using only $+1$ and $-1$, and shows its convergence with the aid of a majority vote of the workers. There are many other variants of gradient-quantized distributed learning algorithms, including error compensation [64], variance-reduced quantization [65], and ternary quantization [66].
Quantization can create synergy when integrated with the link sparsification elaborated in Sec. 4.1. The lazily aggregated quantized gradient method (LAQ) is one example that combines gradient update quantization with temporal sparsification, in a way that the number of links is sparsified based on the temporal gradient update difference, and the gradient update difference is adaptively adjusted for reducing the per-link payload size while ensuring convergence [58]. On the other hand, the method in [67] merges stochastic quantization with the spatial sparsification of GADMM [44], in which the weight update difference is rounded up and down with probability $p$ and $1-p$, respectively, while $p$ is adaptively adjusted to minimize the communication cost while preserving the convergence guarantees of vanilla GADMM [44].
The aforementioned methods quantize each element of a model update individually. Alternatively, the model update vector can be quantized altogether by clustering and mapping the updates into the centroids in a multi-dimensional vector space. Leveraging the universal quantization algorithm [68], the work [69] applies universal vector quantization to federated learning, coined UVeQFed, such that the quantization error can be bounded by a term that vanishes as the number of workers grows.
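Two of the element-wise quantizers mentioned in this subsection can be sketched as follows. Both functions are hedged illustrations rather than the exact schemes of [62] or [67]: the stochastic quantizer assumes a uniform grid with spacing `step` and picks the round-up probability as the fractional position within the grid cell, a common choice that makes the quantizer unbiased; the function names are hypothetical.

```python
import numpy as np

def stochastic_quantize(x, step, rng):
    """Round each entry to a multiple of `step`: up with probability p, down with 1 - p.

    Here p is the fractional position of the entry inside its grid cell
    (an assumption of this sketch), so that the expected output equals x.
    """
    x = np.asarray(x, dtype=float)
    low = np.floor(x / step) * step
    p_up = (x - low) / step
    return low + step * (rng.random(x.shape) < p_up)

def sign_majority_vote(worker_grads):
    """signSGD-style aggregation: workers send only signs, the server takes a majority vote."""
    signs = np.sign(np.asarray(worker_grads, dtype=float))
    return np.sign(signs.sum(axis=0))

rng = np.random.default_rng(0)
g = np.array([0.23, -0.71, 0.50])
q_once = stochastic_quantize(g, 0.25, rng)
q_mean = np.mean([stochastic_quantize(g, 0.25, rng) for _ in range(20000)], axis=0)
vote = sign_majority_vote([[0.3, -0.2], [0.1, 0.4], [-0.5, 0.6]])
```

A single quantized vector lies on the coarse grid, while the average over many independent draws approaches the original vector, which is the property that convergence analyses of unbiased quantizers typically rely on.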
4.3 Short Packet Aggregation
Whether a communication packet is long or short has a significant impact on communication data rates. To be specific, in the large packet regime, the data rate can be formalized by the well-known Shannon formula $R = \log_2(1 + \mathrm{SNR})$ per unit bandwidth over the additive white Gaussian noise (AWGN) channel for a given signal-to-noise ratio (SNR). Its derivation relies on assuming an infinite packet length $N \to \infty$ to ensure zero packet error probability [70], and thus becomes a tight approximation for large packets. Since packet lengths are proportional to communication payload sizes, in the distributed learning context, the Shannon formula is suitable for deep NNs with periodic model parameter exchanging methods such as FL.
By contrast, SL exchanges a single NN layer’s instantaneous activations and gradients, whose corresponding packet length can be very short. In this short packet regime with finite $N$ and non-negligible packet error probability $\epsilon$, the data rate can be described using a formula proposed by Y. Polyanskiy et al. [71], given as:
$$R \approx \log_2(1 + \mathrm{SNR}) - \sqrt{\frac{V}{N}}\, Q^{-1}(\epsilon), \tag{1}$$
where $Q^{-1}(\cdot)$ is the inverse of the Gaussian Q function, and $V$ is the term capturing channel dispersion, e.g., under the AWGN, $V = \left(1 - (1 + \mathrm{SNR})^{-2}\right)(\log_2 e)^2$. This formula implies that the short packet length incurs a penalty on the data rate that is proportional to $1/\sqrt{N}$. To alleviate such a penalty, one can aggregate consecutive packets, increasing $N$ [72]. Through the lens of SL, this packet aggregation coincides with increasing the batch size of each worker. A larger batch size often yields faster convergence at the cost of compromising accuracy [73]. Consequently, there exists a trade-off among data rate, batch size, and accuracy in SL, as we shall discuss in Sec. 6.11.
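The rate penalty of short packets can be evaluated numerically with the normal approximation of [71]. The sketch below assumes a complex AWGN channel with dispersion $V = (1 - (1+\mathrm{SNR})^{-2})(\log_2 e)^2$, rates in bits per channel use, and a linear (not dB) SNR; the function names and the sample values of SNR, packet length, and error probability are hypothetical.

```python
import math
from statistics import NormalDist

def shannon_rate(snr):
    """Infinite-packet-length rate in bits per channel use."""
    return math.log2(1.0 + snr)

def short_packet_rate(snr, n, eps):
    """Normal approximation of the achievable rate at packet length n and error probability eps."""
    dispersion = (1.0 - (1.0 + snr) ** -2) * math.log2(math.e) ** 2
    q_inv = NormalDist().inv_cdf(1.0 - eps)      # inverse Gaussian Q function
    return shannon_rate(snr) - math.sqrt(dispersion / n) * q_inv

snr = 10.0
capacity = shannon_rate(snr)
rates = [short_packet_rate(snr, n, eps=1e-5) for n in (100, 1000, 10000)]
penalties = [capacity - r for r in rates]
```

Since the penalty scales as the inverse square root of the packet length, aggregating 100 packets of length 100 into one packet of length 10,000 shrinks the gap to the Shannon rate by a factor of 10 in this approximation.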
4.4 Analog Transmission
The limited communication bandwidth is one key challenge in distributed learning over wireless channels. Wirelessly connected workers using the same channel may interfere with each other during their over-the-air transmissions. Under digital transmissions, it is common to avoid such interference by allocating orthogonal channel bandwidths to different workers [12, 74, 75, 76, 77]. As a result, the workers compete over the limited bandwidth, which is thus not scalable for supporting a large number of workers. Alternatively, motivated by the fact that the parameter server in FL is interested in the aggregated model updates of all workers, i.e., the global model $\Theta = \frac{1}{K}\sum_{k=1}^{K}\theta_k$ with $K$ workers, rather than the individual updates $\theta_k$, several recent works have utilized analog transmissions so as to harness interference without separate channel allocation [78, 79, 80, 81].
Under analog transmissions, each transmitted signal $\theta_k$ from a worker in FL is perturbed by fading, i.e., multiplied by the fading gain $h_k$, and superposed over the air with all other workers’ signals using the same channel. Consequently, $\sum_{k=1}^{K} h_k \theta_k$ is received by the parameter server. The superposition property of analog transmissions is favorable for averaging the model updates using the entire bandwidth for all workers, rather than competing over the limited bandwidth with each other under digital transmissions. By contrast, the fading perturbation may hinder obtaining the received signal in a desired form at the parameter server, e.g., equal or weighted averaging with the weight that is proportional to the ratio of each worker’s data size [82]. One way to cope with the fading perturbation is the channel inversion method. By inversely perturbing the signal before transmission, i.e., multiplying by $1/h_k$, the fading can be canceled out at reception [83]. This channel inversion however consumes transmit power inversely proportional to the channel gain, which is not viable for small $h_k$ under the limited edge device energy budget. For this reason, it is common to allow transmissions only when the channel gains exceed a certain threshold [78, 79, 80]. As discussed in Sec. 4.1, such temporal sparsification may hinder the convergence of learning algorithms.
Alternatively, the method proposed in [84] only utilizes the superposition property of analog transmissions without channel inversion. This is done by reformulating FL and optimizing it directly with perturbed model updates as follows. To be specific, recall the original unconstrained problem of FL, aiming to minimize $\sum_{k=1}^{K} f_k(\Theta)$, where $f_k$ is the local loss function of worker $k$, by locally minimizing $f_k(\theta_k)$ at each worker and globally averaging their model parameters at the parameter server, yielding $\Theta = \frac{1}{K}\sum_{k=1}^{K}\theta_k$. This boils down to the following constrained average consensus problem:
$$\min_{\Theta, \{\theta_k\}_{k=1}^{K}} \ \sum_{k=1}^{K} f_k(\theta_k) \tag{2}$$
$$\text{s.t.} \quad \theta_k = \Theta, \quad \forall k. \tag{3}$$
To incorporate the fading-perturbed model updates in the problem formulation, by multiplying the fading gain $h_k$ on both sides, (3) is recast as its equivalent constraint $h_k \theta_k = h_k \Theta$. This reformulated problem is solved using ADMM while directly incorporating the perturbed model updates, i.e., $h_k \theta_k$, without inverting the fading gain $h_k$. As a consequence of avoiding channel inversion, the convergence becomes less sensitive to the transmit power constraint. Furthermore, thanks to directly exploiting the perturbed model updates, it is more robust against the adversarial or honest-but-curious parameter server, to be further elaborated in Sec. 6.4.
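The channel inversion with a gain threshold discussed in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not the scheme of any specific cited work: fading gains are real-valued and perfectly known at the workers, all workers are synchronized, and the receiver knows how many workers were active; the function name `over_the_air_average` is hypothetical.

```python
import numpy as np

def over_the_air_average(models, gains, threshold, noise_std=0.0, rng=None):
    """Truncated channel inversion followed by superposition over a shared channel.

    Workers whose gain magnitude is below `threshold` stay silent. Active workers
    pre-scale their model by 1 / h_k, the channel multiplies by h_k and sums
    all transmitted signals, and the receiver divides by the number of active workers.
    """
    models = np.asarray(models, dtype=float)
    gains = np.asarray(gains, dtype=float)
    active = np.abs(gains) >= threshold
    if not active.any():
        return None, active
    tx = models[active] / gains[active][:, None]        # pre-inversion at each worker
    rx = (gains[active][:, None] * tx).sum(axis=0)      # superposition over the air
    if noise_std > 0:
        if rng is None:
            rng = np.random.default_rng()
        rx = rx + rng.normal(scale=noise_std, size=rx.shape)
    return rx / active.sum(), active

models = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gains = [0.9, 0.05, 1.5]
avg, active = over_the_air_average(models, gains, threshold=0.1)
```

In this example the second worker is in a deep fade and is excluded, so the server obtains the average of the remaining two models. The pre-scaled signal `tx` grows as the gain shrinks, which reflects the transmit power cost of channel inversion noted above.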
4.5 Scheduling and Offloading
Heterogeneity is prevalent in distributed learning, in terms of the availability of and access to the training data and the resources for communication, computation, and memory. Such heterogeneity results in some workers, referred to as stragglers, having outdated models compared to other workers. Waiting for these stragglers may cause significant delays to the overall training operations, whereas ignoring them may hinder guaranteeing the convergence or achieving high accuracy. Scheduling is effective in balancing and resolving this straggler handling problem. To this end, it is of paramount importance to identify the root cause of each straggler and its contribution to the overall learning performance.
The lack of computing resources can be one major cause of stragglers. It happens when large-sized models and datasets with multiple tasks are processed by on-device and battery-limited workers. In this case, as studied in [85], an effective solution could be scheduling the resultant stragglers while offloading their computationally demanding tasks (or even training data with a loss of privacy) to neighbors or edge servers, a conceptual design known as mobile edge computing (MEC) [86, 87]. Such task offloading in MEC needs to take into account device heterogeneity [88], communication limitations [89, 90], and the demand-supply capabilities of processing power [91], in addition to its impact on the tolerable training latency [87] and target training/inference accuracy [92], while ensuring devices’ privacy [93].
Another source of stragglers is poor channel conditions such as the channels in deep fades and high interference, as well as communication resource limitation such as limited bandwidth and uplink transmit power. To remedy this type of straggler problem, it is useful to utilize advanced multiple access control techniques such as joint scheduling and resource management, interference mitigation and alignment, proactive scheduling via channel prediction, and multi-hop relaying [94, 95, 46, 96]. Reflecting both computing and communication limitations, adjusting the model complexity is also effective in mitigating stragglers [97].
5 Key Machine Learning Principles
Communication efficiency of distributed learning is significantly affected by ML architectures and algorithms. In this section, several machine learning principles are presented for improving vanilla distributed learning methods discussed in Sec. 3, and their effectiveness will be validated by representative use cases in Sec. 6.
5.1 Model Split
Running a large-sized deep NN consumes huge memory that may not fit within edge devices. The energy consumption of such a model is proportional to the model size [98], aggravating the problem under battery-limited edge devices. SL resolves such issues by splitting a single NN model into multiple segments stored and operated by different edge nodes. In essence, this problem is traced back to model parallelism focusing on how to partition and offload NN segments, as opposed to data parallelism considering a large-sized global dataset dispersed across different workers running NN models, each of which is separate but has the same architecture [99].
Traditionally, model parallelism has focused primarily on NN partitioning based on computing latency [100]. For instance, a convolutional NN comprises fully-connected layers and convolutional layers, and the convolutional layers often incur much longer processing delays compared to the fully-connected layers, e.g., in AlexNet [101] and ResNet [102] architectures. Therefore, even if two edge nodes have the same memory size, equally partitioning an NN may not be optimal, incurring imbalanced processing overhead. Beyond this, in the context of SL, communication efficiency and data privacy should also be taken into account. Indeed, cutting an NN at its bottleneck layer having the smallest dimension (e.g., VAE’s bottleneck layer for latent variables [103]) can maximally reduce the SL communication payload sizes. Furthermore, in a classification task, not only unlabeled data samples but also their ground-truth labels can be privacy-sensitive (e.g., unlabeled X-ray images and their ground-truth diagnosis results) [104, 105, 106]. In this case, the input and output layers are linked to the raw samples and ground-truth labels (for training loss calculation), respectively, and an NN should thus be partitioned such that the input and output layers are co-located at the data owner. More discussions on model split are deferred to Sec. 6.10.
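The communication side of the cut-layer choice can be sketched with a short helper. This is a hedged illustration only: it considers payload size alone and ignores the computing-latency balance and label-privacy constraints discussed above; the layer widths, the 4-byte value size, and the function name `cheapest_cut` are assumptions made for the example.

```python
def cheapest_cut(layer_output_dims, batch_size, bytes_per_value=4):
    """Pick the layer with the smallest output dimension as the cut-layer.

    Returns the layer index and the per-batch payload in bytes sent in each
    direction (activations on the uplink, gradients on the downlink).
    """
    cut = min(range(len(layer_output_dims)), key=lambda i: layer_output_dims[i])
    payload_bytes = layer_output_dims[cut] * batch_size * bytes_per_value
    return cut, payload_bytes

dims = [512, 128, 16, 128, 512]       # hypothetical autoencoder-like layer widths
cut, payload = cheapest_cut(dims, batch_size=32)
widest_payload = max(dims) * 32 * 4   # payload if the widest layer were cut instead
```

For these hypothetical widths, cutting at the bottleneck reduces the per-batch payload by a factor of 32 relative to cutting at the widest layer.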
5.2 Knowledge Distillation
Knowledge distillation (KD) aims to imbue an empty student model with a teacher’s knowledge [107]. In a classification task, standard model training attempts to match the student model’s one-hot prediction (e.g., [cat, dog] = [0,1]) of each unlabeled sample with its ground-truth label. In contrast, KD tries to match the student model’s output layer activation, the so-called logit (e.g., [cat, dog] = [0.3, 0.7]), with the teacher’s logit for the same sample. This logit contains more information than the one-hot prediction, so the student model is trained faster than with standard training, using far fewer samples [108].
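A minimal sketch of the matching objective described above is given below. It assumes a cross-entropy between the teacher's and the student's normalized outputs, with a softmax temperature that is a common KD choice rather than something stated in this section; the function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an (assumed) temperature parameter."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Loss that is smallest when the student's output matches the teacher's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return cross_entropy(teacher, student)
```

With a one-hot target the same `cross_entropy` reduces to the standard training loss, which makes the contrast with KD explicit: only the target distribution changes.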
The teacher’s knowledge in KD can be constructed in different ways. Originally, the knowledge is a pre-trained teacher model’s logit, which is transferred to a smaller student model for model compression [107]. The knowledge can also be an ensemble of other student models’ logits [109], in that the ensemble of predictions is often more accurate than individual predictions. Leveraging this to enable distributed learning, the knowledge in FD is constructed as the ensemble of different workers’ predictions, each of which is locally averaged per label in a classification task [104, 110] or across neighboring states in reinforcement learning [111]. The local averaging step removes the need for the student and teacher models (i.e., the ensemble of all student models) to observe the same samples, thereby significantly reducing communication overhead while preserving local data sample privacy. Lastly, for given averaged logits as the teacher’s knowledge, running KD with an empty student model at the parameter server realizes a fast one-shot FL, or equivalently an information type conversion from logits to the parameters of the trained student model, which will be discussed with a use case in Sec. 10.
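The per-label local averaging and the subsequent ensembling in FD can be illustrated as below. This is a simplified sketch: the data layout (a list of `(label, logit_vector)` pairs per worker) and the plain arithmetic mean across workers are assumptions made for illustration.

```python
from collections import defaultdict


def local_average_logits(samples):
    """Average a worker's logit vectors per ground-truth label.

    samples: list of (label, logit_vector) pairs observed locally.
    """
    sums, counts = {}, defaultdict(int)
    for label, logit in samples:
        if label not in sums:
            sums[label] = [0.0] * len(logit)
        sums[label] = [s + v for s, v in zip(sums[label], logit)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def ensemble(workers_avg):
    """Average the per-label logits uploaded by all workers."""
    merged = defaultdict(list)
    for avg in workers_avg:
        for lab, vec in avg.items():
            merged[lab].append(vec)
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in merged.items()}
```

Each worker uploads one vector per label regardless of how many samples it holds, which is why the payload does not grow with the local dataset size.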
5.3 Mixup Augmentation
Mixup is a data augmentation technique generating a synthetic sample by superpositioning two different samples [22]. As an example, in a binary classification task, a sample $x_i$ in the label $i$ is linearly combined with another sample $x_j$ in the label $j$, thereby yielding a synthetic sample $\tilde{x}$ given as:
$$\tilde{x} = \lambda x_i + (1-\lambda) x_j \tag{4}$$
The term $\lambda \in (0,1)$ is the mixing ratio that is randomly sampled from a bathtub-shaped beta distribution such that $\tilde{x}$ resembles a sample in the label either $i$ or $j$ with a slight difference. Manifold Mixup applies the same technique to superposition two different hidden representations, which often achieves similar or even higher accuracy than vanilla Mixup that combines raw samples [112].
Both vanilla and Manifold Mixup are commonly used in standalone training, particularly for adversarial learning that intentionally feeds distorted samples to obtain more generalized models [22, 112]. In distributed learning, these techniques can also be utilized for sharing proxy samples without revealing raw data samples [113]. For example, to rectify non-IID data distributions, each worker can exchange mixed-up samples or manifold mixed-up representations to complement missing samples in some labels [105]. By uploading the mixed-up samples or representations to a parameter server, the workers’ training computation can be offloaded to the server, enabling one-shot FL [114]. The number of these generated proxy samples or representations can be further increased by mixing them across different workers [111] and/or remixing the mixed-up samples or representations [47]. More use cases and the effectiveness of Mixup and Manifold Mixup will be discussed in Sec. 6.8, 6.9, and 6.12.
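The linear superposition at the heart of Mixup can be sketched as follows. The beta parameter `alpha=0.2` is an assumed value (any value below one gives the bathtub shape), and mixing the one-hot labels with the same ratio follows common Mixup practice rather than anything stated in this section.

```python
import random


def mixup(x_i, x_j, y_i, y_j, alpha=0.2, rng=random):
    """Linearly combine two samples and their one-hot labels.

    The mixing ratio is drawn from Beta(alpha, alpha); alpha < 1 makes the
    density bathtub-shaped, so the result stays close to one of the inputs.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam
```

Applying the same function to hidden-layer activations instead of raw inputs gives the Manifold Mixup variant.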
5.4 Gaussian Process Regression
Dynamics in the environment, agents’ hardware, and random choices of training batches and learning parameters cause computing and communication resources and training model parameters to change over the training duration. These dynamics of resource and model states can be viewed as stochastic processes. Placing a Gaussian process prior on these stochastic processes provides a means of analyzing them using Bayesian inference methods [115]. Gaussian process regression (GPR) is the process of determining a set of kernel hyperparameters defining the covariance matrix between all possible observations over time (and space), assuming a zero-mean distribution therein. Using GPR, the posterior mean and variance at unseen observations can be analytically estimated.
By modeling the dynamics of the resource (computation and/or communication) availability as a time series, GPR can be adopted to predict future resources (mean) with uncertainty bounds (variance) [116]. This allows identifying in advance the agents who are likely to be stragglers, so that agents and resources can be proactively scheduled. As a result, the overall training latency can be decreased with minimum loss of training performance, and the overall resource utilization can be improved. Similarly, model parameter dynamics can be analyzed with GPR. Under the communication bottleneck in collaborative learning, agents can utilize the estimated model parameters of others to continue local training, while using the limited resources only when the uncertainties of the model estimations are unacceptable.
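The posterior mean and variance used for such predictions can be computed in closed form. The sketch below assumes a zero-mean prior with a squared-exponential kernel and fixed hyperparameters (`length`, `sigma_f`, `noise` are illustrative values); fitting these hyperparameters, which is part of GPR proper, is omitted for brevity.

```python
import math


def rbf(a, b, length=1.0, sigma_f=1.0):
    """Squared-exponential kernel (an assumed kernel choice)."""
    return sigma_f ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with A = L L^T for a positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution for L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper_t(L, y):
    """Back substitution for L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gpr_predict(t_train, y_train, t_new, noise=1e-2, length=1.0):
    """Posterior mean and variance of a zero-mean GP at time t_new."""
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(t_train)] for i, a in enumerate(t_train)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y_train))
    k_star = [rbf(t_new, a, length) for a in t_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(t_new, t_new, length) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)
```

The returned variance is small near observed time instants and approaches the prior variance far from them, which is the quantity a scheduler could threshold to decide when a fresh measurement is needed.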
5.5 Mean-Field Game (MFG) Learning
Decentralized decision-making of competitive and mutually interactive workers is a challenging task as discussed in Sec. 3.5. Due to these interactions, it is common to determine a single worker’s action by fixing all other workers’ states, and then to iterate this for the next worker until all workers’ actions converge to the Nash equilibrium, a stable state at which no worker gains more reward by changing its action [117]. The complexity of this problem thus increases exponentially with the number of workers, which is unfit for dealing with massive interactive workers. Mean-field game (MFG) is a useful framework to greatly reduce the complexity [118, 119, 120, 121, 122]. At its core, MFG approximates the problem of massive interactive workers as the problem of each single worker interacting with a virtual worker whose state distribution is given by the distribution of the entire population. Then each worker’s decision-making boils down to solving two partial differential equations (PDEs), the Hamilton-Jacobi-Bellman (HJB) equation and the Fokker-Planck-Kolmogorov (FPK) equation [123]. By solving FPK, one can obtain the population state distribution, called the mean-field (MF) distribution. For the given MF distribution, solving HJB results in the optimal action of each worker.
One common limitation of MFG-theoretic approaches is the curse of dimensionality, which is circumvented by the MFG learning framework. To be specific, a PDE is often solved numerically by discretizing the domain so that the derivatives therein can be approximated using finite differences. To guarantee the convergence of such a finite difference method, the discretization step size should decrease with the domain dimension, as dictated by the Courant-Friedrichs-Lewy condition [124]. Consequently, the dimensionality increase in states and actions incurs huge extra computing overhead for solving the FPK and HJB equations, respectively. MFG learning resolves this issue by recasting the problems of solving the HJB and FPK equations as regression tasks. To solve these two regression tasks, a pair of HJB NN and FPK NN are introduced, in that NNs are good at tackling regression problems via simple first-order algorithms such as the gradient descent method. The effectiveness of MFG learning will be corroborated with a massive drone control use case in Sec. 6.7.
6 Use Cases: Communication-Efficient and Distributed Learning Frameworks
By applying the ML and communication principles introduced in Sec. 4 and 5 to the vanilla distributed ML methods in Sec. 3, this section presents communication-efficient and distributed learning frameworks with selected use cases. The mapping between specific principles and use cases is illustrated in Fig. 1.
6.1 QuantizedGADMM (QGADMM)
Utilizing GADMM that exploits sparse connectivity (Sec. 4.1), QGADMM allows each worker to share a quantized version of its model with its neighbors [67]. Using stochastic quantization, one of the key communication principles in Sec. 4.2, with an adjustable quantization range, QGADMM can significantly reduce the communication energy compared to the original GADMM at no cost in terms of convergence speed and accuracy.
The stochastic quantization places each element of the previously quantized model vector at the center of the quantization range, which is equally divided into quantization levels; this determines the quantization step size. Each worker quantizes the difference between the current and the previously quantized models by choosing a rounding probability that yields zero quantization error on average.
Each worker then transmits its quantization parameters and the index of the quantization level of each element to its neighboring workers, from which the receiver can reconstruct the quantized model. Consequently, the payload size of QGADMM scales with the number of quantization bits per element rather than with the full arithmetic precision. Compared to full-precision GADMM, QGADMM can achieve a huge reduction in communication overhead, particularly for large model sizes.
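The unbiased rounding step can be illustrated with the sketch below. It assumes a symmetric range of half-width `radius` centered at the previously quantized element and `2**bits - 1` equal intervals; these names and the exact level count are assumptions for illustration and may differ from the construction in [67].

```python
import math
import random


def quantize(theta, prev_q, radius, bits, rng=random):
    """Stochastically quantize theta around the previously quantized model.

    Returns one level index per element (each fits in `bits` bits) and the
    step size. Rounding up with probability equal to the fractional position
    makes the quantization error zero on average.
    """
    levels = 2 ** bits - 1            # number of intervals (assumed)
    step = 2.0 * radius / levels
    indices = []
    for t, c in zip(theta, prev_q):
        pos = min(max((t - c + radius) / step, 0.0), float(levels))
        low = math.floor(pos)
        indices.append(low + (1 if rng.random() < pos - low else 0))
    return indices, step


def dequantize(indices, prev_q, radius, step):
    """Receiver-side reconstruction from the level indices."""
    return [c - radius + q * step for q, c in zip(indices, prev_q)]
```

Under these assumptions a model with `d` elements costs `bits * d` bits per transmission plus a small overhead for the range and bit width, instead of the full-precision representation of every element.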
Fig. 2 compares QGADMM with GADMM and two PS-based schemes (QGD and ADIANA [125]) in terms of the loss versus the total sum energy for a system of workers. Here, linear regression on the California housing dataset with input features is tested. In the full-precision GADMM, each worker will transmit bits to represent all elements in the model vector. In contrast, each worker of QGADMM only uses bits, with bits to represent each element in the model vector. According to Shannon’s capacity theorem, transmitting more bits consumes more transmission energy for the same bandwidth, transmission duration, and noise spectral density. Fig. 2 exhibits a significant reduction in the total energy consumption, a key challenge discussed in Sec. 2, for QGADMM compared to all baselines, owing to i) the decentralization where workers communicate with only nearby neighbors (Sec. 4.1), ii) the fast convergence inherited from GADMM (Sec. 3.2), and iii) the reduction of transmitted bits at every iteration while ensuring convergence via stochastic quantization (Sec. 4.2).
6.2 Dynamic GADMM (DGADMM)
In practice, due to device mobility, the network topology is time-varying, in which neighboring nodes continuously change over time as highlighted in Sec. 2. Hence, to enable distributed learning over a dynamic network of workers, Dynamic GADMM (DGADMM), which inherits the theoretical convergence guarantees of GADMM, is proposed in [44]. While adapting to network dynamics, DGADMM also improves the convergence speed of GADMM, i.e., random changes in the sparse and logical neighbors (Sec. 4.1) of a static physical topology can significantly accelerate the convergence of GADMM. Although the sparsity of network graphs yields slow convergence speeds [126], this slowdown relative to the standard PS-based ADMM can be compensated by continuously altering neighbors with DGADMM. In addition, with dynamic topology changes, DGADMM exhibits significant communication cost reductions compared to GADMM [44].
Fig. 3 compares DGADMM with both GADMM and standard ADMM. From Fig. 3, it can be seen that DGADMM significantly increases the convergence speed of GADMM and hence reduces the total communication cost, even when the topology is fixed. Therefore, DGADMM can compensate for the decrease in the convergence speed of GADMM relative to PS-based ADMM caused by topology decentralization, while maintaining the low communication cost per iteration gained by GADMM.
6.3 Censored Generalized GADMM (CGGADMM)
In GADMM, every worker has to share its own model with up to two neighboring workers at every iteration. To reduce communication overhead while addressing more general network topologies, we propose censored generalized GADMM (CGGADMM). In CGGADMM, by exploiting temporal sparsity, each worker shares its model only if the difference between the current and the previous models exceeds a certain threshold [58]. Furthermore, each worker in CGGADMM can communicate with an arbitrary number of neighbors in a different group (i.e., under any bipartite graph), which is helpful for addressing time-varying network topologies (Sec. 2). Theoretically, CGGADMM inherits the same performance and convergence guarantees as vanilla GGADMM under a non-increasing and non-negative censoring threshold sequence. Furthermore, by integrating QGADMM and CGGADMM, we propose CQGGADMM that performs the censoring-based link sparsification with payload quantization (Sec. 4.2). Consequently, CQGGADMM decreases both the cost per channel use and the number of channels, thereby significantly reducing the communication energy and the competition for the limited bandwidth.
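The censoring rule can be sketched as a simple test executed before each transmission. The geometric decay with parameters `tau0` and `xi`, and the Euclidean norm as the difference measure, are assumptions chosen here as one example of a non-increasing, non-negative threshold sequence.

```python
import math


def censoring_threshold(k, tau0=1.0, xi=0.9):
    """Non-increasing, non-negative threshold at iteration k (assumed form)."""
    return tau0 * xi ** k


def should_transmit(current, last_sent, k, tau0=1.0, xi=0.9):
    """Share the model only if it moved more than the current threshold."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(current, last_sent)))
    return diff > censoring_threshold(k, tau0, xi)
```

When a transmission is censored, neighbors keep using the last model they received, so the rule trades a bounded staleness for fewer channel uses.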
The benefits of censoring and quantization in terms of reduced energy consumption are illustrated in Fig. 4 using the linear regression problem described in Sec. 6.1. It can be noted that introducing censoring on top of GGADMM provides about a two-fold reduction in the total communication cost. Moreover, implementing both censoring and quantization can further lower the total communication cost.
6.4 Analog Federated ADMM (AFADMM)
In AFADMM [84], each worker transmits an analog signal (Sec. 4.4) that is a function of each element in its model over a channel shared among all workers. All transmitted signals are superpositioned over the air, which hides each private local model in the crowd and thereby preserves privacy (Sec. 2). Consequently, the PS receives the aggregated signals of all individuals perturbed by their complex fading channels. Hence, via analog transmissions, AFADMM aggregates multiple workers’ updates at the PS without competition for the available bandwidth. It was proven in [84] that AFADMM converges to the optimal solution for convex functions and preserves privacy. Moreover, AFADMM copes with the nuisances incurred by analog transmissions, in terms of time-varying channel fading, noise, and transmit power limitation.
Fig. 5(a) compares analog and digital implementations of ADMM on a linear regression task. We plot the loss versus the number of uploads (communication rounds). As observed in Fig. 5(a), AFADMM requires the fewest communication rounds to achieve the target loss. Even with more subcarriers, DFADMM fails to reach the same speed due to the orthogonal subcarrier allocation to each worker under limited bandwidth. However, if one aims to achieve a very low loss, AFADMM suffers from noisy reception, and DFADMM may thus be a better choice, as long as very large bandwidth and/or long uploading time are available.
Fig. 5(b) validates the applicability of the stochastic version of AFADMM (ASFADMM) to the stochastic and non-convex problem of image processing using a DNN. Note that the model size of the tested DNN architecture is several orders of magnitude larger than that of the linear regression problem discussed above (for the simulation details, see [84]). As observed from Fig. 5(b), ASFADMM significantly outperforms the digital implementation (DSFADMM) in terms of the convergence speed while achieving the maximum accuracy. In fact, ASFADMM even outperforms 10xDSFADMM, which has a larger bandwidth (i.e., more subcarriers).
6.5 Quantum Scheduler Aided FL
In modern quantum computing research, the design and implementation of quantum approximate optimization algorithms (QAOA) is of great interest [127, 128]. With QAOA-based methods, many approximation algorithms for NP-hard problems are under development. Among the NP-hard problems, the QAOA-based approximate solution to the max-weight independent set (MWIS) problem is actively under discussion, where the MWIS formulation is widely used for modeling network scheduling, e.g., in device-to-device wireless networks [129]. As studied in [130], scheduling problems in FL over wireless channels are formulated as MWIS, where the scheduling objective is sum-rate maximization.
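To make the MWIS scheduling formulation concrete, the sketch below implements two classical reference points, a greedy heuristic and an exhaustive search, on a toy conflict graph. It does not implement QAOA; the worker names, weights, and conflict pairs are hypothetical.

```python
from itertools import combinations


def is_independent(nodes, conflicts):
    """True if no two selected workers interfere with each other."""
    return all((a, b) not in conflicts and (b, a) not in conflicts
               for a, b in combinations(nodes, 2))


def greedy_mwis(weights, conflicts):
    """Pick workers in descending weight order, skipping conflicting ones."""
    chosen = []
    for node in sorted(weights, key=weights.get, reverse=True):
        if is_independent(chosen + [node], conflicts):
            chosen.append(node)
    return chosen


def exhaustive_mwis(weights, conflicts):
    """Optimal schedule by checking every subset (exponential complexity)."""
    best, best_w = [], 0.0
    nodes = list(weights)
    for r in range(1, len(nodes) + 1):
        for subset in combinations(nodes, r):
            w = sum(weights[n] for n in subset)
            if w > best_w and is_independent(subset, conflicts):
                best, best_w = list(subset), w
    return best


# Toy example: worker "b" conflicts with both "a" and "c".
weights = {"a": 3.0, "b": 4.0, "c": 3.0}
conflicts = {("a", "b"), ("b", "c")}
greedy = greedy_mwis(weights, conflicts)
optimal = exhaustive_mwis(weights, conflicts)
```

In this toy graph the greedy schedule collects a weight of 4 while the optimum is 6, which is the kind of gap a ratio-to-optimum performance metric captures.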
This QAOA-based approach finds proper parameters for the quantum approximation using classical optimization approaches. The use of QAOA is beneficial in terms of computation time and complexity compared to other MWIS solution approaches such as message-passing. Based on the parameters, the approximate solution to the given problem can be obtained from the optimum of the expectation value of the Hamiltonian. As presented in Fig. 6(b), the QAOA-based MWIS schedulers outperform greedy and random scheduling baselines, wherein the performance is measured by the cumulative distribution function (CDF) of the ratio between the QAOA-scheduled workers’ weights and the optimal weights found by an exhaustive search. The same tendency holds for various numbers of alternations of the quantum approximation computation, a parameter that affects the convergence speed and accuracy. Fig. 6(a) illustrates the quantum circuit under study, describing the QAOA quantum gate computing procedures for solving the MWIS problem. The QAOA-based quantum scheduler can be designed and implemented using Cirq and TensorFlow Quantum [131], where Cirq is a Python framework for creating, editing, and invoking noisy intermediate-scale quantum (NISQ) circuits, while TensorFlow Quantum integrates quantum computing algorithms with the logic designed in Cirq.
6.6 GPR Aided FL
Communication plays a key role in the local model aggregation and global model sharing steps of FL (Sec. 3.1) over wireless networks. Poor channel conditions in both uplink and downlink introduce stragglers from the communication point of view (Sec. 2), for which channel measurement or accurate estimation is essential under limited communication resources [27]. Although measuring channels aids agent and resource scheduling, the channel sampling and pilot transmissions therein require highly reliable (possibly dedicated) resources and introduce significant latency to the training process. To overcome these drawbacks of channel measurement, GPR-based channel estimation can be adopted in FL (GPRFL) [95]. By modeling the dynamic channel states as stochastic processes with a Gaussian prior, time series prediction in GPR can be used to estimate the channels and their uncertainty (Sec. 5.4). Using the uncertainty of channels from GPR as a regularizer within the FL loss function, joint channel sampling and allocation for straggler-free scheduling (Sec. 4.5) can be carried out simultaneously to reduce the sampling latency [132].
To illustrate the benefits of GPRFL, we compare the training loss dynamics (relative to the loss of centralized training) of GPRFL under limited wireless resources with three other methods as illustrated in Fig. 7: i) SchedFL: joint agent and resource scheduling towards minimizing the training loss, similar to GPRFL but with channel measurements, ii) PF: proportional fair scheduling in terms of contribution to model aggregation without channel measurements, and iii) IDEAL: FL without communication constraints. Note that a single resource block is dedicated to the channel measurement in SchedFL. Compared to SchedFL, GPRFL reaps the benefit of this additional resource by utilizing it in agent scheduling, yielding a lower loss close to IDEAL. In contrast, PF performs poorly even with the additional resource, due to the absence of the training loss minimization objective within its scheduling policy. It is worth noting that, due to the underlying complexity of GPR and the lack of channel sampling, GPRFL may lose its performance advantage when the total amount of resources is excessive compared to the number of agents. A viable solution is to limit agents’ access to subsets of resources rather than the entire resource pool.
6.7 Federated MFG Learning for Massive UAV Control
By integrating MFG learning with FL, in this use case we study controlling a massive number of unmanned aerial vehicles (UAVs) in a communication-efficient and decentralized way. Following the MFG learning framework elaborated in Sec. 5.5, each UAV is equipped with a pair of HJB and FPK NNs. The HJB NN outputs (i) the UAV’s optimal action (i.e., acceleration) and (ii) the resultant cost functional value, taking as inputs (iii) the UAV’s observed state and (iv) the state distribution of the entire UAV population (i.e., MF distribution). The FPK NN outputs (iv) the MF distribution, taking as inputs (iii) the UAV’s state and (ii) the cost functional value obtained from the HJB NN. While (iii) is fixed, (ii) and (iv) are recursively updated until convergence, at which the optimal action (i) is finally determined [121, 122]. According to the MFG theory [133], the aforementioned optimal control can achieve the epsilon-Nash equilibrium as long as the initial states of all UAVs are exchanged, without any further inter-UAV communication. This holds when the outputs of the HJB and FPK NNs accurately approximate the solutions of the HJB and FPK equations; in other words, when the HJB and FPK NNs are ideally trained, which is not feasible due to the lack of training samples (i.e., observed states).
To accelerate the training of HJB and FPK NNs, following FL, each UAV periodically broadcasts its NN weights to its neighbors, and updates its model by averaging the weights received within a predefined latency deadline. As each UAV has HJB and FPK NNs, there are three possible configurations: exchanging only the HJB NN (MfgFLH), only the FPK NN (MfgFLF), or both HJB and FPK NNs (MfgFLB) at the cost of increased communication payload sizes. With 25 UAVs dispatched from a common source to a destination, Fig. 8 shows that MfgFLB achieves the best trajectory without any collision, while all MfgFL-based methods yield better results than a baseline that only runs the HJB NN while exchanging the raw states of neighboring UAVs. Here, the curve color indicates the value of the swarming term in the cost function, which decreases with the relative velocities and increases with the relative distances of all UAVs. Again, MfgFLB yields the lowest value even at the early stage, supporting the collision-free results. Furthermore, MfgFLB consumes the minimum motion energy until reaching the destination as shown in Fig. 9(a), and is more robust against external disturbances, reflected by the variance of random wind velocity, as observed in Fig. 9(b). In addition, Fig. 9(c) illustrates that MfgFLB exchanges the least amount of packets even though its per-communication payload size is 2x greater than that of MfgFLH or MfgFLF. Lastly, for different communication periods, MfgFLB results in the least energy consumption as seen in Fig. 9(d).
6.8 Downlink FL After Uplink FD
In mobile communication systems, uplink data rates are often much lower than downlink rates due to the limited transmission power of mobile devices [134]. Therefore, FD (Sec. 3.3) is useful in the uplink thanks to its small payload sizes, whereas FL (Sec. 3.1) is preferable in the downlink in that exchanging model parameters commonly achieves higher accuracy than exchanging model outputs [47]. To jointly exploit FD and FL under uplink-downlink asymmetric channels, we present an FL-after-FD algorithm combined with two-way Mixup (Mix2FLD). In Mix2FLD, the model outputs (i.e., logits) are uploaded to a server via FD, and should then be converted into a global model whose parameters can be downloaded by and updated at each device using FL. Such a model output-to-parameter conversion is viable using KD (Sec. 5.2), which updates the global model at the server by minimizing the difference between the uploaded outputs and the outputs of the global model. This requires a handful of seed samples to generate the global model’s outputs, which is a major implementation challenge due to the extra communication overhead and possible data privacy violation as highlighted in Sec. 2.
In a classification task, we resolve the aforementioned problem by applying the Mixup method twice. Precisely, before uploading, each device encodes multiple samples by superpositioning them via Mixup (Sec. 5.3). Then, the server decodes the Mixup-encoded samples uploaded from different devices by additionally superpositioning them, in a way that the decoded samples have one-hot labels. Such decoding commonly improves accuracy, particularly under non-IID data distributions [47, 114]. Note that the encoding not only preserves raw data privacy but also reduces communication overhead (Sec. 2), since the decoding based on the Mixup data augmentation can generate multiple synthetic seed samples by changing the superpositioning combinations.
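One way to obtain one-hot labels from two Mixup-encoded samples is sketched below for the binary case. It assumes that two devices used symmetric mixing ratios (`lam` and `1 - lam`) over the same pair of labels, and the server-side ratio `lam / (2 * lam - 1)` is derived here from that assumption; it is an illustration and not necessarily the exact decoding rule of Mix2FLD.

```python
def mix(x_a, x_b, lam):
    """Linear superposition lam * x_a + (1 - lam) * x_b."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]


def inverse_mix_ratio(lam):
    """Server-side ratio that turns the two soft labels into a one-hot label."""
    if abs(2 * lam - 1) < 1e-9:
        raise ValueError("lam = 0.5 cannot be decoded to a one-hot label")
    return lam / (2 * lam - 1)


lam = 0.8                                            # hypothetical device-side ratio
label_dev1 = mix([1.0, 0.0], [0.0, 1.0], lam)        # soft label from device 1
label_dev2 = mix([1.0, 0.0], [0.0, 1.0], 1 - lam)    # soft label from device 2
mu = inverse_mix_ratio(lam)
decoded_label = mix(label_dev1, label_dev2, mu)      # close to [1, 0]
```

The uploaded samples would be combined with the same ratio `mu` as the labels; since `mu` lies outside the unit interval, the decoding is an extrapolation rather than a convex combination.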
Fig. 10 first verifies our conjecture that FL achieves higher accuracy when the uplink channel capacity is as high as the downlink (Uplink = Downlink). However, when the uplink channel capacity is bottlenecked (Uplink < Downlink), the accuracy of FL is significantly degraded due to its large payload sizes and the resultant frequent uploading failures within a target latency deadline. In this uplink-downlink asymmetric channel, Mix2FLD achieves higher accuracy with less variance than FL and FD.
6.9 One-Shot FL via XOR Mixup
Imbalanced data distributions could significantly degrade FL performance (Sec. 3.1) [12, 22, 47]. For the MNIST and CIFAR-10 datasets wherein each worker has scarce samples of specific labels, the classification accuracy is degraded by up to % and %, respectively, compared to the IID counterparts [135]. To correct such a non-IID data problem, a straightforward solution is to exchange and fill in missing raw samples, which may however violate data privacy. Alternatively, we apply an XOR-based Mixup data augmentation method (XorMixup) that is extended to a novel one-shot FL framework, termed XorMixFL.
XorMixup was inspired by the Mixup data augmentation technique (Vanilla Mixup) producing a synthetic sample by linearly superpositioning two raw samples $x_i$ and $x_j$ (Sec. 5.3) [136]. Similarly, XorMixup combines two samples, not linearly but using the bit-wise XOR operation $\oplus$ that has the following flipping property: $(x_i \oplus x_j) \oplus x_j = x_i$. To preserve the data privacy while generating realistic synthetic samples, (i) each worker encodes two local samples as $\hat{x} = x_i \oplus x_j$, which is exchanged with other devices, and (ii) the received $\hat{x}$ is decoded not using the original $x_j$ but using a sample $x'_j$ stored in a different worker, which has the same label as $x_j$. Consequently, the decoding yields $\hat{x} \oplus x'_j$, which reflects some key features of $x_i$ but is not the same as $x_i$. Owing to the mixing nature, both (i) and (ii) preserve raw data privacy across different workers, while (ii) improves the synthetic sample’s authenticity, increasing one-shot FL accuracy as elaborated next.
As illustrated in Fig. 11(a), by applying XorMixup to a one-shot FL framework having only one communication round [137, 138], each device in XorMixFL uploads its encoded seed samples to a server. The server decodes and augments the seed samples using its own base samples until all the samples are evenly distributed across labels. The server can be treated as one of the devices, or as a parameter server storing an imbalanced dataset. Then, utilizing the reconstructed dataset, the server trains a global model until convergence, which is then downloaded by each device. Under a non-IID MNIST dataset, simulation results in Fig. 11(b) corroborate that XorMixFL achieves up to % and % higher accuracy than standalone ML and Vanilla FL, respectively.
6.10 Tripartite SL for Medical Diagnosis
In this use case, we study a privacy-preserving SL framework (Sec. 5.1) for multiple medical platforms (e.g., hospitals or e-health wearables). These platforms store their own privacy-sensitive medical data, and are willing to cooperatively train a global model with the aid of a server storing a fraction of the model. Specifically, we consider a medical image classification task, in which not only the raw samples (e.g., chest X-ray images) but also their ground-truth labels (e.g., lung cancer diagnosis) are privacy-sensitive. In an NN model, each raw sample is fed to the input layer, and its ground-truth label is compared with the model’s prediction for loss calculation at the output layer. Therefore, to preserve the data privacy of each sample-and-label pair, both input and output layers should be stored by each platform, while the rest of the layers can be offloaded to the server, resulting in tripartite SL. This is in stark contrast to the standard bipartite SL where only the input layer is stored at each worker, while the remaining layers can be offloaded to the server.
Following the aforementioned tripartite SL, as illustrated in Fig. 12(a), we consider a single NN whose input layer and output layer are stored at each platform, while the rest is run at the server. As depicted in Fig. 12(b), for each iteration, in the forward propagation, the activations at the two cut layers are exchanged between a platform and the server without revealing raw samples. After calculating the loss at the output layer stored at the platform, the gradients at the two cut layers are exchanged while hiding the ground-truth labels. While effective in preserving data privacy, the communication efficiency of tripartite SL is questionable due to frequent forward and backward propagations over wireless channels.
To validate its communication efficiency, we compare tripartite SL with the large-scale mini-batch stochastic gradient descent (LSSGD) [139] by measuring their transmitted data until convergence under VGG and ResNet NN model architectures with a medical X-ray dataset, CheXpert [140]. Fig. 12(c) shows that under VGG, tripartite SL yields GB transmitted data to achieve % test accuracy, while LSSGD incurs GB transmitted data with % accuracy. A similar tendency can be observed under ResNet, in which tripartite SL consumes GB transmitted data with % accuracy, whereas LSSGD results in GB transmitted data with % accuracy. This experiment shows that, in spite of more frequent communications due to the layer splits and exchanging instantaneous forward/backward propagations rather than periodically exchanging model parameters, tripartite SL ends up achieving a lower total communication cost until convergence. This is viable thanks to its far fewer communication rounds (i.e., faster convergence) and much smaller communication payload sizes.
6.11 Channel and Packet Adaptive Parallel SL
Vanilla SL is inefficient in terms of communication energy consumption when supporting multiple devices through wireless channels (Sec. 5.1). Consider a server storing a common upper segment of NN layers that is associated with multiple devices storing its lower segments and feeding their own data samples. For these multiple devices, Vanilla SL is often implemented in a sequential manner, preventing multiple devices from simultaneously connecting with the server. Concatenating the output features of multiple devices into a single large vector and feeding it into the server can improve the SL performance [19], at the cost of increased transmission energy consumption in the uplink and back propagation overhead in the downlink. Lastly, since the model structure cannot be dynamically adjusted during training and inference, for a fixed dimension of the server’s input layer, the server needs either to wait until the input layer is entirely filled by a predetermined number of devices, or to pad arbitrary values for devices straggling due to intermittent connectivity under poor channel conditions (Sec. 2), increasing latency or degrading accuracy, respectively.
In this view, a parallel SL architecture can be used that averages features by superpositioning multiple devices’ outputs via Mixup augmentation (Sec. 5.3) [112]. Adopting feature averaging, in contrast to output concatenation, allows the server’s input dimension to remain fixed independently of the number of contributing devices, enabling communication- and energy-efficient scalability with low training latency as illustrated in Fig. 13(a). Additionally, by controlling the batch size, the packet sizes of the devices’ cut-layer activations exchanged with the server can be controlled. With small batch sizes, the accuracy can be improved at the cost of degraded uplink data rates over short packets, which can be resolved by data aggregation as discussed in Sec. 4.3. The tradeoff between test accuracy and training latency based on short packet aggregation for different choices of batch sizes is illustrated in Fig. 13(b).
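The difference between concatenating and averaging the devices' cut-layer outputs can be seen from the resulting input dimension at the server. The sketch below uses equal averaging weights and toy two-element feature vectors, both of which are assumptions for illustration.

```python
def concatenate(features):
    """Server input grows linearly with the number of devices."""
    return [v for f in features for v in f]


def average(features):
    """Server input keeps the per-device dimension regardless of device count."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


# Toy cut-layer outputs from three devices (hypothetical values).
uploaded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
concat_input = concatenate(uploaded)
avg_input = average(uploaded)
```

If one device straggles, averaging can simply proceed over the features that arrived, whereas concatenation would leave part of the fixed-size input unfilled.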
6.12 Heteromodal SL for mmWave Channel Prediction
In this use case, we focus on predicting future mmWave channels by utilizing the preceding history of mmWave received signal strength (RSS) and image frames captured by two RGB depth (RGBD) cameras mounted in different locations. Fusing these multiple modalities is essential for improving the prediction accuracy, since each modality complements the features missing from the other. In particular, camera images contain useful features of blockage mobility patterns that determine sudden line-of-sight (LOS) and NLOS transitions, which are hardly observed from RSS, whereas RSS better describes short-term channel fluctuations for a given LOS or NLOS channel condition. Furthermore, the use of multiple cameras can overcome occlusions and missing frames (Sec. 2) due to the limited fields of view (FoVs) and insufficient frame rates of cameras, respectively. It is however challenging to fuse such multimodal and heterogeneous data. Indeed, these data are non-IID, under which FL and its variants cannot achieve high accuracy as highlighted in Sec. 3.1.
In this regard, we consider a joint design of SL (Sec. 5.1), which is robust against non-IID data distributions [141, 53], and of feature interpolation and averaging via Mixup data augmentation (Sec. 5.3) under heterogeneous FoVs and frame rates, which improves energy efficiency. In the SL design, each camera feeds a sequence of image frames into its convolutional and recurrent layers, whose output is uploaded to a BS’s fully connected layers, at which the BS’s uplink mmWave RSS is fused with the uploaded features from the cameras. The proposed SL framework is validated by simulation with data measured in a real experiment using 60 GHz mmWave signals and two Kinect RGBD cameras [142]. The task is to predict the uplink mmWave RSS 500 ms ahead by observing a sequence of RSS or image frames during 100 ms. To improve accuracy without degrading communication efficiency, for the sequence of image features generated by the camera with the lower frame rate, missing feature elements are interpolated by equally superposing neighboring features via manifold Mixup (Sec. 5.3). Such an interpolation reduces the non-IIDness induced by the heterogeneous frame rates, yielding higher accuracy as shown in Fig. 14(a). Note that this manifold Mixup for feature interpolation is performed within a sequence of features, whereas the aforementioned manifold Mixup for feature averaging is performed across the sequences uploaded from different cameras. Compared to a baseline scheme that directly interpolates missing frames at the cameras before transmission, the aforementioned interpolation is performed at the BS after transmission. It therefore does not increase the communication payload sizes, achieving low transmission latency as observed in Fig. 14(b), while yielding low power consumption at both the cameras and the BS as shown in Figs. 14(c) and (d).
7 Concluding Remarks
Imbuing intelligence into edge devices enables low-latency and scalable decision-making at the network edge in 5G communication systems and beyond. On the other hand, updating outdated edge intelligence mandates communication with federating edge devices, improving the accuracy and reliability of the decision-making at the edge. To create greater synergy, this work has explored communication-efficient and distributed learning frameworks and their use cases by co-designing ML and communication principles under various challenges incurred by communication, computing, energy, and data privacy issues. The overarching goal of this article is to foster more fundamental research in this direction and bridge connections between communication and ML communities.
Jihong Park (S’09–M’16) is a Lecturer (assistant professor) at the School of IT, Deakin University, Australia. He received the B.S. and Ph.D. degrees from Yonsei University, Seoul, Korea, in 2009 and 2016, respectively. He was a Post-Doctoral Researcher with Aalborg University, Denmark, from 2016 to 2017, and with the University of Oulu, Finland, from 2018 to 2019. His recent research focus includes communication-efficient distributed machine learning, distributed control, and distributed ledger technology, as well as their applications for beyond 5G/6G communication systems. He served as a Conference/Workshop Program Committee Member for IEEE GLOBECOM, ICC, and WCNC, as well as NeurIPS, ICML, and IJCAI. He received the IEEE GLOBECOM Student Travel Grant in 2014, the IEEE Seoul Section Student Paper Contest Bronze Prize in 2014, and the 6th IDISETNEWS (The Electronic Times) Paper Contest Award sponsored by the Ministry of Science, ICT, and Future Planning of Korea. Currently, he is an Associate Editor of Frontiers in Data Science for Communications, a Review Editor of Frontiers in Aerial and Space Networks, and a Guest Editor of MDPI Telecom SI on “millimeter wave communications and networking in 5G and beyond.” 
Sumudu Samarakoon (S’08–AM’18) received the B.Sc. degree (Hons.) in Electronic and Telecommunication Engineering from the University of Moratuwa, Sri Lanka, in 2009, the M.Eng. degree from the Asian Institute of Technology, Thailand, in 2011, and the Ph.D. degree in Communication Engineering from the University of Oulu, Finland, in 2017. He is currently working at the Centre for Wireless Communications, University of Oulu, Finland, as a postdoctoral researcher. His main research interests are in heterogeneous networks, small cells, radio resource management, reinforcement learning, and game theory. In 2016, he received the Best Paper Award at the European Wireless Conference and Excellence Awards for innovators and the outstanding doctoral student in the Radio Technology Unit, CWC, University of Oulu. 
Anis Elgabli is a postdoctoral researcher at the Centre for Wireless Communications, University of Oulu. He received the B.Sc. degree in electrical and electronic engineering from the University of Tripoli, Libya, in 2004, the M.Eng. degree from UKM, Malaysia, in 2007, and the M.Sc. and Ph.D. degrees from the Department of Electrical and Computer Engineering, Purdue University, Indiana, USA, in 2015 and 2018, respectively. His main research interests are in heterogeneous networks, radio resource management, vehicular communication, video streaming, and distributed machine learning. He was the recipient of the best paper award in HotSpot workshop, 2018 (Infocom 2018). 
Mehdi Bennis is an Associate Professor at the Centre for Wireless Communications, University of Oulu, Finland, an Academy of Finland Research Fellow and head of the intelligent connectivity and networks/systems group (ICON). His main research interests are in radio resource management, heterogeneous networks, game theory and machine learning in 5G networks and beyond. He has coauthored one book and published more than 200 research papers in international conferences, journals and book chapters. He has been the recipient of several prestigious awards including the 2015 Fred W. Ellersick Prize from the IEEE Communications Society, the 2016 Best Tutorial Prize from the IEEE Communications Society, the 2017 EURASIP Best Paper Award for the Journal of Wireless Communications and Networks, the all-University of Oulu award for research and the 2019 IEEE ComSoc Radio Communications Committee Early Achievement Award. Dr Bennis is an editor of IEEE TCOM. 
Joongheon Kim (M’06–SM’18) is currently an assistant professor of electrical engineering with Korea University, Seoul, Korea. He received his B.S. (2004) and M.S. (2006) in computer science and engineering from Korea University, Seoul, Korea; and his Ph.D. (2014) in computer science from the University of Southern California (USC), Los Angeles, CA, USA. Before joining Korea University as an assistant professor in 2019, he was with LG Electronics Seocho R&D Campus as a research engineer (Seoul, Korea, 2006–2009), InterDigital as an intern (San Diego, CA, USA, 2012), Intel Corporation as a systems engineer (Santa Clara in Silicon Valley Area, CA, USA, 2013–2016), and Chung-Ang University as an assistant professor of computer science and engineering (Seoul, Korea, 2016–2019). He is a senior member of the IEEE. He was a recipient of the Annenberg Graduate Fellowship with his Ph.D. admission from USC (2009), Intel Corporation Next Generation and Standards (NGS) Division Recognition Award (2015), KICS Haedong Young Scholar Award (2018), IEEE Vehicular Technology Society (VTS) Seoul Chapter Award (2019), KICS Outstanding Contribution Award (2019), Gold Prize from IEEE Seoul Section Student Paper Contest (2019), and IEEE Systems Journal Best Paper Award (2020). 
Seong-Lyun Kim is currently a Professor and Head of the School of Electrical & Electronic Engineering, Yonsei University, Seoul, Korea, leading the Robotic & Mobile Networks Laboratory (RAMO) and the Center for Flexible Radio (CFR+). He is co-directing the H2020 EUK PriMO5G project and is the chair of the Smart Factory Committee of 5G Forum, Korea. He was an Assistant Professor of Radio Communication Systems at the Department of Signals, Sensors & Systems, Royal Institute of Technology (KTH), Stockholm, Sweden. He was a Visiting Professor at the Control Engineering Group, Helsinki University of Technology (now Aalto), Finland, the KTH Center for Wireless Systems, and the Graduate School of Informatics, Kyoto University, Japan. He served as a technical committee member or a chair for various conferences, and an editorial board member of IEEE Transactions on Vehicular Technology, IEEE Communications Letters, Elsevier Control Engineering Practice, Elsevier ICT Express, and Journal of Communications and Network. His research interests include radio resource management, information theory in wireless networks, collective intelligence, and robotic networks. 
Mérouane Debbah (S’01–M’04–SM’08–F’15) received the M.Sc. and Ph.D. degrees from the Ecole Normale Supérieure Paris-Saclay, France. He was with Motorola Labs, Saclay, France, from 1999 to 2002, and also with the Vienna Research Center for Telecommunications, Vienna, Austria, until 2003. From 2003 to 2007, he was an Assistant Professor with the Mobile Communications Department, Institut Eurecom, Sophia Antipolis, France. From 2007 to 2014, he was the Director of the Alcatel-Lucent Chair on Flexible Radio. Since 2007, he has been a Full Professor with CentraleSupelec, Gif-sur-Yvette, France. Since 2014, he has been a Vice-President of the Huawei France Research Center and the Director of the Mathematical and Algorithmic Sciences Lab. He has managed 8 EU projects and more than 24 national and international projects. His research interests lie in fundamental mathematics, algorithms, statistics, information, and communication sciences research. He is an IEEE Fellow, a WWRF Fellow, and a Membre émérite SEE. He was a recipient of the ERC Grant MORE (Advanced Mathematical Tools for Complex Network Engineering) from 2012 to 2017. He was a recipient of the Mario Boella Award in 2005, the IEEE Glavieux Prize Award in 2011, and the Qualcomm Innovation Prize Award in 2012. He received 20 best paper awards, among which the 2007 IEEE GLOBECOM Best Paper Award, the WiOpt 2009 Best Paper Award, the 2010 Newcom++ Best Paper Award, the WUN CogCom Best Paper 2012 and 2013 Award, the 2014 WCNC Best Paper Award, the 2015 ICC Best Paper Award, the 2015 IEEE Communications Society Leonard G. Abraham Prize, the 2015 IEEE Communications Society Fred W. Ellersick Prize, the 2016 IEEE Communications Society Best Tutorial Paper Award, the 2016 European Wireless Best Paper Award, the 2017 Eurasip Best Paper Award, the 2018 IEEE Marconi Prize Paper Award, the 2019 IEEE Communications Society Young Author Best Paper Award and the Valuetools 2007, Valuetools 2008, CrownCom 2009, Valuetools 2012, SAM 2014, and 2017 IEEE Sweden VTCOMIT Joint Chapter best student paper awards. He is an Associate Editor-in-Chief of the journal Random Matrix: Theory and Applications. He was an Associate Area Editor and Senior Area Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 2011 to 2013 and from 2013 to 2014, respectively. 
References
 [1] M. Bennis, M. Debbah, and V. Poor, “Ultrareliable and lowlatency wireless communication: Tail, risk and scale,” Proceedings of the IEEE, vol. 106, pp. 1834–1853, Oct. 2018.
 [2] J. Park, S. Samarakoon, H. Shiri, M. K. AbdelAziz, T. Nishio, A. Elgabli, and M. Bennis, “Extreme URLLC: Vision, challenges, and key enablers,” arXiv preprint arXiv:2001.09683, 2020.
 [3] I. WP5D, “Minimum requirements related to technical performance for IMT2020 radio interface(s),” 2017.
 [4] S. R. Pokhrel, J. Ding, J. Park, O.S. Park, and J. Choi, “Towards enabling critical mMTC: A review of URLLC within mMTC,” IEEE Access, vol. 8, pp. 131796–131813, 2020.
 [5] M. Latvaaho and K. Leppänen, “Key drivers and research challenges for 6g ubiquitous wireless intelligence,” in white paper, University of Oulu, 2019.
 [6] Alen Space, “A basic guide to nanosatellites.” [online, Accessed: 20200730]. https://alen.space/basicguidenanosatellites.
 [7] STARLINK, “High speed internet access across the globe.” [online, Accessed: 20200802]. https://www.starlink.com.
 [8] Amazon, “Project Kuiper.” [online, Accessed: 20200802]. https://www.amazon.jobs/en/teams/projectkuiper.
 [9] OneWeb, “How OneWeb is changing global communications.” [online, Accessed: 20200802]. https://www.oneweb.world.
 [10] J.H. Lee, J. Park, M. Bennis, and Y.C. Ko, “Integrating LEO satellite and UAV relaying via reinforcement learning for nonterrestrial networks,” arXiv preprint arXiv:2005.12521, 2020.

 [11] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, “Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3039–3071, 2019.
 [12] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE, vol. 107, pp. 2204–2239, October 2019.
 [13] S. Dörner, S. Cammerer, and J. Hoydis, “Deep learningbased communication over the air,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, pp. 132–143, Feb. 2018.
 [14] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “Inedge AI: Intelligentizing mobile edge computing, caching and communication by federated learning,” ArXiv preprint, vol. abs/1809.07857, Sept. 2018.
 [15] Ericsson Blog, “TinyML as a service and the challenges of machine learning at the edge.” [online, Accessed: 20200730]. https://www.ericsson.com/en/blog/2019/12/tinymlasaservice.
 [16] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated learning: strategies for improving communication efficiency,” in Proc. of NIPS Wksp. PMPML, (Barcelona, Spain), Dec. 2016.
 [17] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton, “Large scale distributed neural network training through online distillation,” ArXiv preprint, vol. abs/1804.03235, Apr. 2018.
 [18] J. Ahn, O. Simeone, and J. Kang, “Wireless federated distillation for distributed edge learning with heterogeneous data,” IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Istanbul, Turkey, Sep. 2019.
 [19] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” Arxiv preprint, vol. abs/1812.00564, Dec. 2018.
 [20] S. Haykin, An introduction to analog and digital communication. John Wiley, 1994.
 [21] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning. MIT press, 2018.
 [22] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated Learning with NonIID Data,” [Online]: arXiv preprint arXiv:1806.00582, 2018.
 [23] M. FridAdar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, “GANbased synthetic medical image augmentation for increased CNN performance in liver lesion classification,” Neurocomputing, vol. 321, pp. 321–331, 2018.
 [24] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V. Hernández, J. Wardlaw, and D. Rueckert, “Gan augmentation: Augmenting training data using generative adversarial networks,” arXiv preprint arXiv:1810.10863, 2018.

 [25] W.N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptation for robust speech recognition via variational autoencoderbased data augmentation,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 16–23, IEEE, 2017.
 [26] H. Nishizaki, “Data augmentation and feature extraction using variational autoencoder for acoustic modeling,” in 2017 AsiaPacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1222–1227, IEEE, 2017.
 [27] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., “Advances and open problems in federated learning,” arXiv preprint arXiv:1912.04977, 2019.
 [28] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333, 2015.
 [29] S. Ahn, J. Kim, E. Lim, W. Choi, A. Mohaisen, and S. Kang, “ShmCaffe: a distributed deep learning platform with shared memory buffer for HPC architecture,” in Proc. 2018 IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 1118–1128, 2018.
 [30] K. Lee, “Nvidia GeForce RTX 2080 Ti review.” [online, Accessed: 20200730]. https://www.techradar.com/reviews/nvidiageforcertx2080tireview.
 [31] N. Wang, C.Y. Chen, and K. Gopalakrishnan, “Ultralowprecision training of deep neural networks.” [online, Accessed: 20200802]. https://www.ibm.com/blogs/research/2019/05/ultralowprecisiontraining/.
 [32] A. Rodriguez, “Lowering numerical precision to increase deep learning performance.” [online, Accessed: 20200802]. https://www.intel.com/content/www/us/en/artificialintelligence/posts/loweringnumericalprecisionincreasedeeplearningperformance.html.
 [33] T. Murovič and A. Trost, “Massively parallel combinational binary neural networks for edge processing,” Elektrotehniski Vestnik, vol. 86, no. 1/2, pp. 47–53, 2019.
 [34] X. Wang, L. Kong, F. Kong, F. Qiu, M. Xia, S. Arnon, and G. Chen, “Millimeter wave communication: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 1616–1653, 2018.
 [35] D. S. Lun, M. Médard, R. Koetter, and M. Effros, “On coding for reliable communication over packet networks,” Physical Communication, vol. 1, no. 1, pp. 3–20, 2008.
 [36] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, and F. Beaufays, “Applied federated learning: Improving google keyboard query suggestions,” arXiv preprint arXiv:1812.02903, 2018.
 [37] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, “Distributed federated learning for ultrareliable lowlatency vehicular communications,” IEEE Transactions on Communications, vol. 68, no. 2, pp. 1146–1159, 2019.
 [38] J. Konečnỳ, B. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv preprint arXiv:1511.03575, 2015.
 [39] V. Smith, C.K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multitask learning,” in Proc. of NIPS (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), (Long Beach, USA), pp. 4424–4434, Curran Associates, Inc., Dec. 2017.
 [40] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with nonIID data,” ArXiv preprint, vol. abs/1806.00582, June 2018.
 [41] F. Sattler, S. Wiedemann, K.R. Müller, and W. Samek, “Robust and communicationefficient federated learning from noniid data,” IEEE transactions on neural networks and learning systems, 2019.
 [42] A. Lalitha, O. C. Kilinc, T. Javidi, and F. Koushanfar, “Peertopeer federated learning on graphs,” arXiv preprint arXiv:1901.11173, 2019.
 [43] H. Kim, J. Park, M. Bennis, and S.L. Kim, “Blockchained ondevice federated learning,” to appear in IEEE Communications Letters [Online]. ArXiv preprint: abs/1808.03949.
 [44] A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal, “GADMM: Fast and communication efficient framework for distributed machine learning,” Journal of Machine Learning Research, vol. 21, no. 76, pp. 1–39, 2020.
 [45] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are fewshot learners,” arXiv preprint arXiv:2005.14165, 2020.
 [46] H. Cha, J. Park, H. Kim, M. Bennis, and S.L. Kim, “Federated reinforcement distillation with proxy experience replay memory,” to appear in IEEE Intelligent Systems.
 [47] S. Oh, J. Park, E. Jeong, H. Kim, M. Bennis, and S.L. Kim, “Mix2FLD: Downlink federated learning after uplink federated distillation with twoway mixup,” to appear in IEEE Communications Letters.
 [48] J. Ahn, O. Simeone, and J. Kang, “Cooperative learning via federated distillation over fading channels,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
 [49] P. Vepakomma, O. G. A. Dubey, and R. Raskar, “Reducing leakage in distributed deep learning for sensitive health data,” in Proc. of ICLR, (New Orleans, USA), May 2019.
 [50] J. Jeon, J. Kim, J. Kim, K. Kim, A. Mohaisen, and J.K. Kim, “Privacypreserving deep learning computation for geodistributed medical bigdata platforms,” in 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks–Supplemental Volume (DSNS), pp. 3–4, IEEE, 2019.
 [51] Y. Koda, J. Park, M. Bennis, K. Yamamoto, T. Nishio, and M. Morikura, “One pixel image and RF signal based split learning for mmWave received power prediction,” in Proceedings of the 15th International Conference on emerging Networking EXperiments and Technologies, pp. 54–56, 2019.
 [52] Y. Koda, J. Park, M. Bennis, K. Yamamoto, T. Nishio, M. Morikura, and K. Nakashima, “Communicationefficient multimodal split learning for mmWave received power prediction,” IEEE Communications Letters, vol. 24, no. 6, pp. 1284–1288, 2020.
 [53] Y. Koda, J. Park, M. Bennis, K. Yamamoto, T. Nishio, and M. Morikura, “Distributed heteromodal split learning for vision aided mmWave received power prediction,” arXiv preprint arXiv:2007.08208, 2020.
 [54] A. Singh, P. Vepakomma, O. Gupta, and R. Raskar, “Detailed comparison of communication efficiency of split learning and federated learning,” arXiv preprint arXiv:1909.09145, 2019.
 [55] K. Zhang, Z. Yang, and T. Başar, “Multiagent reinforcement learning: A selective overview of theories and algorithms,” arXiv preprint arXiv:1911.10635, 2019.
 [56] A. Khan, C. Zhang, D. D. Lee, V. Kumar, and A. Ribeiro, “Scalable centralized deep multiagent reinforcement learning via policy gradients,” arXiv preprint arXiv:1805.08776, 2018.
 [57] X. Wang and T. Sandholm, “Reinforcement learning to play an optimal Nash equilibrium in team Markov games,” in Advances in neural information processing systems, pp. 1603–1610, 2003.
 [58] T. Chen, G. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communicationefficient distributed learning,” in Advances in Neural Information Processing Systems, pp. 5050–5060, 2018.
 [59] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic control, vol. 57, no. 3, pp. 592–606, 2011.
 [60] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communicationefficient sgd via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
 [61] S. Magnússon, H. ShokriGhadikolaei, and N. Li, “On maintaining linear convergence of distributed learning and optimization under limited communication,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 432–436, IEEE, 2019.
 [62] J. Bernstein, Y.X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signsgd: Compressed optimisation for nonconvex problems,” arXiv preprint arXiv:1802.04434, 2018.
 [63] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik, “Distributed learning with compressed gradient differences,” arXiv preprint arXiv:1901.09269, 2019.
 [64] J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quantized sgd and its applications to largescale distributed optimization,” arXiv preprint arXiv:1806.08054, 2018.
 [65] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang, “Zipml: Training linear models with endtoend low precision, and a little bit of deep learning,” in International Conference on Machine Learning, pp. 4035–4043, 2017.
 [66] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in neural information processing systems, pp. 1509–1519, 2017.
 [67] A. Elgabli, J. Park, A. S. Bedi, C. B. Issaid, M. Bennis, and V. Aggarwal, “Qgadmm: Quantized group ADMM for communication efficient decentralized machine learning,” 2019.
 [68] R. Zamir and M. Feder, “On universal quantization by randomized uniform/lattice quantizers,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 428–436, 1992.
 [69] N. Shlezinger, M. Chen, Y. C. Eldar, H. V. Poor, and S. Cui, “Uveqfed: Universal vector quantization for federated learning,” arXiv preprint arXiv:2006.03262, 2020.
 [70] G. Durisi, T. Koch, and P. Popovski, “Toward massive, ultrareliable, and lowlatency wireless communication with short packets,” Proceedings of the IEEE, vol. 104, no. 9, pp. 1711–1726, 2016.
 [71] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
 [72] P. Popovski, J. J. Nielsen, C. Stefanovic, E. De Carvalho, E. Strom, K. F. Trillingsgaard, A.S. Bana, D. M. Kim, R. Kotaba, J. Park, et al., “Wireless access for ultrareliable lowlatency communication: Principles and building blocks,” Ieee Network, vol. 32, no. 2, pp. 16–23, 2018.
 [73] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On largebatch training for deep learning: Generalization gap and sharp minima,” arXiv preprint arXiv:1609.04836, 2016.
 [74] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” In Proc. Int’l Conf. Commun. (ICC), Shanghai, China, May 2019.
 [75] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE Journal on Selected Areas in Communications, vol. 37, pp. 1205–1221, Jun. 2019.
 [76] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” arXiv preprint arXiv: 1908.06287, 2019.
 [77] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” arXiv preprint arXiv: 1909.07972, 2019.
 [78] M. M. Amiri and D. Gunduz, “Overtheair machine learning at the wireless edge,” Proc. IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Cannes, France, July 2019.
 [79] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for lowlatency federated edge learning,” arXiv preprint arXiv: 1812.11494.
 [80] T. Sery and K. Cohen, “On analog gradient descent learning over multiple access fading channels,” arXiv preprint arXiv: 1908.07463.
 [81] G. Zhu, Y. Du, D. Gunduz, and K. Huang, “Onebit overtheair aggregation for communicationefficient federated edge learning: Design and convergence analysis,” arXiv preprint arXiv: 2001.05713.
 [82] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communicationefficient learning of deep networks from decentralized data,” in Proc. of AISTATS, (Fort Lauderdale, FL, USA), Apr. 2017.
 [83] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 3546–3557, 2020.
 [84] A. Elgabli, J. Park, C. B. Issaid, and M. Bennis, “Harnessing wireless channels for scalable and privacypreserving federated learning,” arXiv preprint arXiv:2007.01790, 2020.
 [85] M. M. Amiri and D. Gündüz, “Computation scheduling for distributed machine learning with straggling workers,” IEEE Transactions on Signal Processing, vol. 67, no. 24, pp. 6270–6284, 2019.
 [86] M. S. Elbamby, C. Perfecto, C. Liu, J. Park, S. Samarakoon, X. Chen, and M. Bennis, “Wireless edge computing with latency and reliability guarantees,” Proceedings of the IEEE, vol. 107, pp. 1717–1737, Aug. 2019.
 [87] M. Polese, R. Jana, V. Kounev, K. Zhang, S. Deb, and M. Zorzi, “Machine learning at the edge: A datadriven architecture with applications to 5G cellular networks,” IEEE Transactions on Mobile Computing, 2020.
 [88] Y. Zhang, B. Di, P. Wang, J. Lin, and L. Song, “HetMEC: Heterogeneous multilayer mobile edge computing in the 6G era,” IEEE Transactions on Vehicular Technology, vol. 69, no. 4, pp. 4388–4400, 2020.
 [89] S. Li, M. A. MaddahAli, and A. S. Avestimehr, “Communicationaware computing for edge processing,” in 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2885–2889, IEEE, 2017.
 [90] T. Chanyour, M. El Ghmary, Y. Hmimz, and M. O. Cherkaoui Malki, “Energyefficient and delayaware multitask offloading for mobile edge computing networks,” Transactions on Emerging Telecommunications Technologies, p. e3673, 2019.
 [91] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.
 [92] U. Mohammad and S. Sorour, “Adaptive task allocation for mobile edge learning,” in 2019 IEEE Wireless Communications and Networking Conference Workshop (WCNCW), pp. 1–6, IEEE, 2019.
 [93] R. Xu, B. Palanisamy, and J. Joshi, “QueryGuard: Privacypreserving latencyaware query optimization for edge computing,” in 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pp. 1097–1106, IEEE, 2018.
 [94] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2019.
 [95] M. M. Wadu, S. Samarakoon, and M. Bennis, “Federated learning under channel uncertainty: Joint client scheduling and resource allocation,” in 2020 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6, 2020.
 [96] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8866–8870, IEEE, 2020.
 [97] Y. Jiang, S. Wang, B. J. Ko, W.H. Lee, and L. Tassiulas, “Model pruning enables efficient federated learning on edge devices,” arXiv preprint arXiv:1909.12326, 2019.
 [98] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
 [99] O. Yadan, K. Adams, Y. Taigman, and M. Ranzato, “Multigpu training of convnets,” arXiv preprint arXiv:1312.5853, 2013.
 [100] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.

 [101] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
 [102] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 [103] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” Foundations and Trends in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.
 [104] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim, “Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data,” arXiv preprint arXiv:1811.11479, 2018.
 [105] E. Jeong, S. Oh, J. Park, H. Kim, M. Bennis, and S.-L. Kim, “Multi-hop federated private data augmentation with sample compression,” arXiv preprint arXiv:1907.06426, 2019.
 [106] J. Jeon, J. Kim, J. Kim, K. Kim, A. Mohaisen, and J. Kim, “Privacy-preserving deep learning computation for geo-distributed medical big-data platforms,” in Proc. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 3–4, 2019.
 [107] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. of NIPS Wksp. Deep Learning, (Montréal, Canada), pp. 1–9, Dec. 2014.
 [108] M. Phuong and C. H. Lampert, “Towards understanding knowledge distillation,” in International Conference on Machine Learning (ICML), (Long Beach, CA, USA), June 2019.
 [109] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton, “Large scale distributed neural network training through online distillation,” arXiv preprint arXiv:1804.03235, 2018.
 [110] J.-H. Ahn, O. Simeone, and J. Kang, “Wireless federated distillation for distributed edge learning with heterogeneous data,” in 2019 IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), pp. 1–6, IEEE, 2019.
 [111] H. Cha, J. Park, H. Kim, S.-L. Kim, and M. Bennis, “Federated reinforcement distillation with proxy experience memory,” arXiv preprint arXiv:1907.06536, 2019.
 [112] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio, “Manifold mixup: Better representations by interpolating hidden states,” in International Conference on Machine Learning, pp. 6438–6447, 2019.
 [113] J. Park, S. Wang, A. Elgabli, S. Oh, E. Jeong, H. Cha, H. Kim, S.-L. Kim, and M. Bennis, “Distilling on-device intelligence at the network edge,” [Online]. Arxiv preprint: http://arxiv.org/abs/1907.02745, Dec. 2018.
 [114] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim, “XOR Mixup: Privacy-preserving data augmentation for one-shot federated learning,” presented at 2020 ICML Wksp. Federated Learning for User Privacy and Data Confidentiality (ICMLFL), May 2020.
 [115] C. E. Rasmussen, “Gaussian processes in machine learning,” in Summer School on Machine Learning, pp. 63–71, Springer, 2003.
 [116] A. Girard, C. E. Rasmussen, J. Q. Candela, and R. Murray-Smith, “Gaussian process priors with uncertain inputs - application to multiple-step ahead time series forecasting,” in Advances in neural information processing systems, pp. 545–552, 2003.
 [117] D. M. Kreps, “Nash equilibrium,” in Game Theory, pp. 167–177, Springer, 1989.
 [118] J. Park, S. Jung, S.-L. Kim, M. Bennis, and M. Debbah, “User-centric mobility management in ultra-dense cellular networks under spatio-temporal dynamics,” in Proc. of IEEE GLOBECOM, (Washington, DC, USA), Dec. 2016.
 [119] H. Kim, J. Park, M. Bennis, S. Kim, and M. Debbah, “Ultra-dense edge caching under spatio-temporal demand and network dynamics,” in Proc. of IEEE ICC, (Paris, France), pp. 1–7, May 2017.
 [120] H. Kim, J. Park, M. Bennis, and S.-L. Kim, “Massive UAV-to-ground communication and its stable movement control: A mean-field approach,” in Proc. of IEEE SPAWC, (Kalamata, Greece), June 2018.
 [121] H. Shiri, J. Park, and M. Bennis, “Massive autonomous UAV path planning: A neural network based mean-field game theoretic approach,” in 2019 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, IEEE, 2019.
 [122] H. Shiri, J. Park, and M. Bennis, “Communication-efficient massive UAV online path control: Federated learning meets mean-field game theory,” arXiv preprint arXiv:2003.04451, 2020.
 [123] J.-M. Lasry and P.-L. Lions, “Mean field games,” Japanese journal of mathematics, vol. 2, no. 1, pp. 229–260, 2007.
 [124] R. Courant, K. Friedrichs, and H. Lewy, “On the partial difference equations of mathematical physics,” IBM journal of Research and Development, vol. 11, no. 2, pp. 215–234, 1967.
 [125] Z. Li, D. Kovalev, X. Qian, and P. Richtárik, “Acceleration for compressed gradient descent in distributed and federated optimization,” arXiv preprint arXiv:2002.11364, 2020.
 [126] A. Nedić, A. Olshevsky, and M. G. Rabbat, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
 [127] E. Farhi, J. Goldstone, and S. Gutmann, “A quantum approximate optimization algorithm,” arXiv preprint arXiv:1411.4028, 2014.
 [128] L. Zhou, S.-T. Wang, S. Choi, H. Pichler, and M. D. Lukin, “Quantum approximate optimization algorithm: performance, mechanism, and implementation on near-term devices,” arXiv preprint arXiv:1812.01041, 2018.
 [129] J. Kim, G. Caire, and A. F. Molisch, “Quality-aware streaming and scheduling for device-to-device video delivery,” IEEE/ACM Transactions on Networking, vol. 24, no. 4, pp. 2319–2331, 2016.
 [130] X. Ma, H. Sun, and R. Q. Hu, “Scheduling policy and power allocation for federated learning in NOMA based MEC,” ArXiv, vol. abs/2006.13044, 2020.
 [131] M. Broughton, G. Verdon, T. McCourt, A. J. Martinez, J. H. Yoo, et al., “TensorFlow quantum: A software framework for quantum machine learning,” arXiv:2003.02989, 2020.
 [132] M. Karaca, T. Alpcan, and O. Ercetin, “Smart scheduling and feedback allocation over non-stationary wireless channels,” in 2012 IEEE International Conference on Communications (ICC), pp. 6586–6590, IEEE, 2012.
 [133] A. Bensoussan, J. Frehse, and P. Yam, Mean Field Games and Mean Field Type Control Theory. SpringerBriefs in Mathematics, Springer-Verlag New York, 1 ed., 2013.
 [134] J. Park, S.-L. Kim, and J. Zander, “Tractable resource management with uplink decoupled millimeter-wave overlay in ultra-dense cellular networks,” IEEE Transactions on Wireless Communications, vol. 15, pp. 4362–4379, June 2016.
 [135] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-IID data,” arXiv preprint arXiv:1806.00582, 2018.
 [136] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” in Proc. of 6th International Conference on Learning Representations (ICLR), 2018.
 [137] N. Guha, A. Talwalkar, and V. Smith, “One-shot federated learning,” [Online]. Arxiv preprint: http://arxiv.org/abs/1902.11175.
 [138] N. Yoshida, T. Nishio, M. Morikura, K. Yamamoto, and R. Yonetani, “Hybrid-FL: Cooperative learning mechanism using non-IID data in wireless networks,” arXiv preprint arXiv:1905.07210, 2019.
 [139] J. Chen, R. Monga, S. Bengio, and R. Józefowicz, “Revisiting distributed synchronous SGD,” CoRR, vol. abs/1604.00981, 2016.
 [140] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. L. Ball, K. S. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng, “CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” CoRR, vol. abs/1901.07031, 2019.
 [141] Y. Koda, K. Yamamoto, T. Nishio, and M. Morikura, “Differentially private AirComp federated learning with power adaptation harnessing receiver noise,” [Online]. Arxiv preprint: https://arxiv.org/abs/2004.06337.
 [142] K. Khoshelham and S. Oude Elberink, “Accuracy and resolution of kinect depth data for indoor mapping applications,” Sensors (US), vol. 12, no. 2, pp. 1437–1454, 2012.