Revisiting Analog Over-the-Air Machine Learning: The Blessing and Curse of Interference

We study a distributed machine learning problem carried out by an edge server and multiple agents in a wireless network. The objective is to minimize a global function that is a sum of the agents' local loss functions, and the optimization is conducted via analog over-the-air model training. Specifically, each agent modulates its local gradient onto a set of waveforms and transmits it to the edge server simultaneously with the other agents. From the received analog signal, the edge server extracts a noisy aggregated gradient, which is distorted by channel fading and interference, uses it to update the global model, and feeds the result back to all the agents for another round of local computing. Since electromagnetic interference generally exhibits a heavy-tailed nature, we use the α-stable distribution to model its statistics. Consequently, the global gradient has an infinite variance, which hinders the use of conventional techniques for convergence analysis that rely on the existence of second-order moments. To circumvent this challenge, we take a new route to establish the convergence rate, as well as the generalization error, of the algorithm. Our analyses reveal a two-sided effect of the interference on the overall training procedure. On the negative side, heavy-tailed noise slows down the convergence of the model training: the heavier the tail in the distribution of interference, the slower the algorithm converges. On the positive side, heavy-tailed noise has the potential to increase the generalization power of the trained model: the heavier the tail, the better the model generalizes. This perhaps counterintuitive conclusion implies that the prevailing thinking on interference – that it is only detrimental to the edge learning system – is outdated, and we should seek new techniques that exploit, rather than simply mitigate, the interference for better machine learning in wireless networks.


I Introduction

We consider a distributed machine learning problem conducted in a mobile edge network. Particularly, a group of $K$ agents communicate over the spectrum with an edge server, where each agent $k$ has a local objective function $f_k: \mathbb{R}^d \to \mathbb{R}$, and the goal is to minimize the global loss function:

$$\min_{\boldsymbol{w} \in \mathbb{R}^d} f(\boldsymbol{w}) = \sum_{k=1}^{K} f_k(\boldsymbol{w}). \tag{1}$$

Due to privacy concerns, the agents do not share their data, and the minimization can only be carried out in a decentralized manner. To that end, we adopt analog over-the-air model training [SerCoh:20TSP], which is mainly based on the gradient descent (GD) method, to accomplish this task. Specifically, each agent calculates its local gradient and modulates it onto a set of $d$ orthonormal waveforms, one for each element of the gradient vector. Then, the agents send out their analog signals simultaneously. The edge server receives the superposition of the analog signals, which represents a noisy global gradient distorted by channel fading and interference. Based on this noisy gradient, the edge server updates the global parameter and feeds it back to all the agents for another round of local computation. The procedure repeats until the model converges.

We inspect this algorithm in a more pragmatic and complicated setting where the interference follows an $\alpha$-stable distribution [ClaPedRod:20]. In that context, the variance of the aggregated gradient is infinite, and the effects of such a phenomenon on the training procedure – or even whether the algorithm converges at all – remain unknown. The central thrust of the present article is to fill this research gap.
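To make the heavy-tail behavior concrete, the following minimal Python sketch (our illustration, not part of the paper) draws symmetric $\alpha$-stable samples with the standard Chambers-Mallows-Stuck method; the function name sample_sas and the chosen scales are ours. Running it shows how a smaller tail index occasionally produces enormous interference values, which is precisely what breaks second-moment-based analyses.

```python
import numpy as np

def sample_sas(alpha, scale=1.0, size=1):
    """Draw symmetric alpha-stable (SaS) samples via the standard
    Chambers-Mallows-Stuck method (skewness beta = 0)."""
    v = np.random.uniform(-np.pi / 2, np.pi / 2, size)  # uniform phase
    w = np.random.exponential(1.0, size)                # unit-mean exponential
    x = (np.sin(alpha * v) / np.cos(v) ** (1.0 / alpha)
         * (np.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))
    return scale * x

# A smaller tail index (heavier tail) yields far larger extreme values,
# and the empirical variance keeps growing with the sample size.
for alpha in (1.9, 1.5, 1.1):
    s = sample_sas(alpha, size=100_000)
    print(f"alpha = {alpha}: max |sample| = {np.abs(s).max():.1f}")
```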

I-A Main Contributions

This paper builds upon the model in [SerCoh:20TSP] but differs from it by considering the heavy-tailed nature of electromagnetic interference [Mid:77]. Specifically, we adopt the symmetric $\alpha$-stable distribution – a widely used model in wireless networks [Hae:09, WinPin:09] – to model the statistics of interference. The parameter $\alpha$ is commonly known as the tail index, where a smaller $\alpha$ means a heavier tail in the distribution. Under such a setting, the aggregated global gradient has a diverging variance, and conventional approaches to convergence analysis, which rely heavily on the existence of second moments, fail to function. In that respect, we take a new route toward the convergence analysis and verify that even though the intermediate gradients are severely distorted by channel fading and interference, GD-based methods can ultimately reach the optimal solution.

Our main contributions are summarized as follows:

  • We derive an analytical expression to characterize the convergence rate of the analog over-the-air GD algorithm, which encompasses key system parameters such as the number of agents, channel fading, and interference. Particularly, the algorithm converges at a sublinear rate in the communication round $T$, with an exponent determined by the tail index $\alpha$. This result also implies that heavier-tailed interference leads to slower convergence of the algorithm.

  • We show that analog over-the-air GD can be run in conjunction with momentum. We also derive the convergence rate by taking the momentum weight into account. Our result reveals that the momentum-based model training converges at the same sublinear order in $T$, while the momentum weight affects the multiplicative factor (a rough code sketch of this variant follows this list).

  • We analytically characterize the generalization error of analog over-the-air GD by resorting to a continuous-time proxy of the update trajectory. The analysis shows that heavy-tailed noise can potentially improve the algorithm's generalization capability. More precisely, with a certain probability, the generalization error decreases as the tail index decreases.

  • We conduct extensive simulations on the MNIST dataset to examine the algorithm under different system parameters. The experiments demonstrate that an increase in the number of agents, the learning rate, or the tail index of the interference leads to faster convergence. They also show that a smaller tail index occasionally results in better prediction accuracy of the trained model, which confirms that heavy-tailed noise has the potential to improve the generalization capability.
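As a rough illustration of the momentum variant mentioned in the second bullet above, the snippet below applies a heavy-ball-style update driven by the aggregated noisy gradient; this is only a sketch under our own naming, and the paper's exact momentum formulation may differ.

```python
import numpy as np

def momentum_step(w, m, noisy_grad, eta=0.01, gamma=0.9):
    """One heavy-ball-style update: m accumulates past noisy gradients
    with weight gamma, and the model descends along the momentum."""
    m = gamma * m + noisy_grad
    w = w - eta * m
    return w, m
```

Here gamma plays the role of the momentum weight whose effect on the multiplicative factor of the convergence rate is analyzed in Section III.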

I-B Prior Art

Distributed optimization in wireless networks has garnered considerable attention in recent years, especially with the rise of federated learning [MaMMooRam:17AISTATS, LiSahTal:20SPM, ParSamBen:19]. The typical system consists of an edge server and a number of agents, where the goal is to collaboratively optimize an objective function via orchestration amongst the network elements. Particularly, each agent conducts on-device training based on its local dataset and uploads the intermediate result, e.g., the gradient, to the server for model improvement. Then, the agents download the new model for another round of local computing. This procedure repeats for multiple rounds until the training converges. Upon each global iteration, the transmissions of model parameters need to go over the spectrum, which is resource-limited and unreliable. Recognizing that conventional schemes hinging on the separate communication-and-computation principle can encounter difficulty in accommodating massive access and stringent latency requirements, a recent line of studies [ZhuXuHua:21MAG] proposed utilizing over-the-air computing to enable efficient model aggregation and hence achieve faster machine learning over many devices.

The essence of over-the-air computing is to exploit the waveform superposition property of the multiple access channel, where agents modulate their gradients onto waveforms and use the air as a natural aggregator. In the presence of channel fading, it has been suggested to invert the channel via power control at the end-user devices, where nodes that encounter deep fades suspend their transmissions [ZhuWanHua:19TWC, YanJiaShi:20], and the server adopts better scheduling methods in each communication round to speed up the model training process. To reduce communication overheads, the devices can compress the gradient vectors by sending out a sparse [AmiGun:20TSP], or even a one-bit quantized [ZhuDuGun:21TWC], version, followed by QAM modulation. At the edge server side, the antenna array can be expanded to further mitigate the effects of channel fading, where the fading vanishes as the spatial dimension approaches infinity [AmiDumGun:21TWC]. Furthermore, the authors of [SerShlCoh:20] devise a precoding scheme that gradually amplifies the model updates as the training progresses to handle the performance degradation incurred by the additive noise. With the help of feedback, [GuoLiuLau:20] optimizes the transceiver parameters by jointly accounting for the data and channel states to cope with the non-stationarity of the gradient updates. Inspired by the fact that machine learning algorithms need not operate under impeccably precise parameters, the authors of [SerCoh:20TSP] suggest that the agents directly transmit the analog gradient signals without any power control or beamforming to invert the channel, whilst the server updates the global model based on the noisy gradient; they also show that convergence is guaranteed. This approach substantially simplifies the system design while achieving virtually zero access latency [CaiLau:17]. What is more appealing, data privacy is in fact enhanced by implicitly harnessing the randomness of the wireless medium, and the training procedure can be accelerated by adopting an analog ADMM-type algorithm [ElgParIss:21].

Despite the wealth of work in this area, a significant restriction in almost all previous results lies in the presumption that the interference follows a normal distribution. While convenient, this assumption hardly holds in practice, as the constructive property of electromagnetic waves often results in heavy tails in the distribution of interference [Mid:77, Hae:09, WinPin:09]. In consequence, there is a non-negligible chance that the magnitude of the interference soars to an enormous value in some communication rounds, which wreaks havoc on the global model. Understanding the impact of such a phenomenon on the performance of the learning algorithm is the focus of this work.

The remainder of this paper is organized as follows. We introduce the system model in Section II. In Section III, we derive the convergence rate of analog over-the-air GD. We also present the convergence rate of analog over-the-air GD with momentum. In Section IV, we analyze the generalization error of the analog over-the-air GD algorithm. Then, we show the simulation results in Section V to validate the analyses and obtain design insights. We conclude the paper in Section VI.

Fig. 1: An illustration of the model training process: (A) agents evaluate local gradients based on their own datasets and upload them to the server via analog transmissions, (B) the server receives the aggregated noisy gradient and uses it to update the global model, (C) the new model is sent back to the agents, and the process repeats.

II System Model

II-A Setting

Let us consider an edge learning system consisting of one server and $K$ agents. Each agent $k$ holds a local dataset $\mathcal{D}_k$ with size $m_k$, and we assume the local datasets are statistically independent across the clients. The goal of all the entities in this system is to jointly learn a statistical model constituted from all the data samples of the clients. More precisely, they need to find a vector $\boldsymbol{w} \in \mathbb{R}^d$ that minimizes a global loss given as follows:

$$f(\boldsymbol{w}) = \sum_{k=1}^{K} f_k(\boldsymbol{w}), \tag{2}$$

where $f_k(\boldsymbol{w})$ is the local empirical risk of agent $k$, given by

$$f_k(\boldsymbol{w}) = \frac{1}{m_k} \sum_{i=1}^{m_k} \ell(\boldsymbol{w}; \boldsymbol{x}_{k,i}), \tag{3}$$

in which $\ell(\cdot\,; \cdot)$ denotes the loss function and $\boldsymbol{x}_{k,i}$ the $i$-th sample of $\mathcal{D}_k$.

The solution is commonly known as the empirical risk minimizer, denoted by

$$\boldsymbol{w}^* = \operatorname*{arg\,min}_{\boldsymbol{w} \in \mathbb{R}^d} f(\boldsymbol{w}). \tag{4}$$

Due to privacy concerns, the agents are unwilling to share their local datasets, and hence the minimization of (2) needs to be conducted by means of distributed learning. Particularly, the agents minimize their local losses and upload the intermediate gradients to the server, with which the server conducts a global aggregation to improve the model and feeds it back to the agents for another round of local training. Such interactions between the server and the agents repeat until the model converges. During this process, we consider that the communications between the server and the agents take place over the spectrum, which is by nature resource-limited and unreliable. In light of its efficacy in spectral utilization, we adopt analog over-the-air computing [NazGas:07IT] for the training of the statistical model, which is detailed in the sequel.
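For concreteness, a toy Python sketch of the objectives (2)-(4) is given below; the squared loss, the dataset layout, and all names are placeholders of ours, not the paper's.

```python
import numpy as np

def ell(w, x, y):
    """Per-sample loss; a squared loss is used purely as a placeholder."""
    return 0.5 * (x @ w - y) ** 2

def local_risk(w, X_k, y_k):
    """Local empirical risk f_k(w) of one agent, cf. (3)."""
    return np.mean([ell(w, x, y) for x, y in zip(X_k, y_k)])

def global_loss(w, datasets):
    """Global loss f(w) as the sum of local risks, cf. (2)."""
    return sum(local_risk(w, X_k, y_k) for X_k, y_k in datasets)
```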

II-B Analog Over-the-Air Model Training

Let $\boldsymbol{w}_n$ be the global model broadcast by the server at communication round $n$. Owing to the high transmit power of the edge server, we assume the global model can be successfully received by all the agents. Then, each client $k$ calculates its gradient $\nabla f_k(\boldsymbol{w}_n)$ and constructs the following analog signal:

$$x_k(t) = \langle \nabla f_k(\boldsymbol{w}_n), \boldsymbol{u}(t) \rangle = \sum_{i=1}^{d} [\nabla f_k(\boldsymbol{w}_n)]_i \, u_i(t), \tag{5}$$

where $\langle \boldsymbol{a}, \boldsymbol{b} \rangle$ denotes the inner product between two vectors $\boldsymbol{a}$ and $\boldsymbol{b}$, $\boldsymbol{u}(t) = (u_1(t), \ldots, u_d(t))$, and $\{u_i(t)\}_{i=1}^{d}$ is a set of orthonormal baseband waveforms over the symbol duration $T_s$ that satisfies:

$$\int_0^{T_s} u_i(t) \, u_j(t) \, dt = 0, \quad \forall i \neq j, \tag{6}$$
$$\int_0^{T_s} u_i^2(t) \, dt = 1, \quad \forall i \in \{1, \ldots, d\}. \tag{7}$$

According to (5), the signal $x_k(t)$ is essentially a superposition of the analog waveforms, where the magnitude of $u_i(t)$ equals the $i$-th element of $\nabla f_k(\boldsymbol{w}_n)$.¹

¹Note that the magnitudes of the waveforms can also be set to the quantized values of the gradients to reduce implementation complexity.
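The modulation in (5) and the orthonormality conditions (6)-(7) can be mimicked in discrete time, as in the sketch below (our simplification): the orthonormal waveforms become the rows of an orthonormal matrix, and a matched-filter bank, used by the server later in this subsection, recovers the modulated gradient exactly in the absence of channel distortion.

```python
import numpy as np

d, L = 4, 64                      # gradient dimension, samples per waveform
rng = np.random.default_rng(0)

# Rows of U act as d orthonormal discrete-time waveforms (U @ U.T = I_d),
# a discrete analogue of the conditions (6) and (7).
Q, _ = np.linalg.qr(rng.standard_normal((L, d)))
U = Q.T

grad = rng.standard_normal(d)     # local gradient of one agent
x = grad @ U                      # transmit signal, discrete version of (5)
recovered = U @ x                 # matched-filter bank output
assert np.allclose(recovered, grad)
```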

Once the transmit waveforms have been assembled, the agents send them out concurrently over the spectrum. The signal received at the edge server can be expressed as follows:

$$y(t) = \sum_{k=1}^{K} h_{k,n} P_k \, x_k(t) + \xi(t), \tag{8}$$

where $h_{k,n}$ is the channel fading experienced by agent $k$ at round $n$, $P_k$ stands for the corresponding transmit power, and $\xi(t)$ represents the interference. Without loss of generality, we assume the channel fading is i.i.d. across the agents and communication rounds, with mean $\mu$ and variance $\sigma^2$, and the transmit power is set to compensate for the large-scale path loss. In order to characterize the heavy-tailed nature of wireless interference, we consider that $\xi(t)$ follows a symmetric $\alpha$-stable distribution, whose properties will be elaborated in the next section.

The received signal is then passed through a bank of matched filters, where the $i$-th filter is tuned to $u_i(t)$, and the output is the following vector:

$$\boldsymbol{y}_n = \sum_{k=1}^{K} h_{k,n} \nabla f_k(\boldsymbol{w}_n) + \boldsymbol{\xi}_n, \tag{9}$$

where $\boldsymbol{\xi}_n$ is a $d$-dimensional random vector with i.i.d. entries, each following a symmetric $\alpha$-stable distribution. The server then updates the global parameter as follows:

$$\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta_n \boldsymbol{y}_n, \tag{10}$$

where $\eta_n$ is the learning rate.
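Putting (8)-(10) together, one communication round can be simulated as in the sketch below; it reuses sample_sas from the earlier snippet, and the Rayleigh fading model and noise scale are our own assumptions (the text only requires i.i.d. fading with mean $\mu$ and variance $\sigma^2$).

```python
import numpy as np

def ota_gd_round(w, local_grads, eta, alpha, noise_scale=0.1):
    """One round of analog over-the-air GD: fading-weighted gradient
    superposition plus SaS interference, followed by a descent step."""
    K, d = len(local_grads), len(w)
    h = np.random.rayleigh(scale=1.0, size=K)          # i.i.d. fading gains
    xi = sample_sas(alpha, scale=noise_scale, size=d)  # SaS interference, cf. (9)
    y = sum(h[k] * local_grads[k] for k in range(K)) + xi
    return w - eta * y                                 # update, cf. (10)
```

Note that no per-agent channel inversion appears in the sketch: as stated in Remark 1 below, the agents transmit without correcting the channel gain.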

Remark 1

The analog over-the-air gradient aggregation boasts two unique advantages: (a) high spectral utilization, as the agents do not need to vie for radio access but can simultaneously upload their local parameters to the server, and (b) low hardware cost, as the agents do not need to correct the channel gain and hence can transmit at a relatively constant power level.