We consider a distributed machine learning problem conducted in a mobile edge network. Particularly, a group of $N$ agents communicate over the spectrum to an edge server, where each agent $n$ has a local objective function $f_n$, and the goal is to minimize the global loss function:
\begin{equation}
\min_{w} f(w) = \frac{1}{N} \sum_{n=1}^{N} f_n(w). \tag{1}
\end{equation}
Due to privacy concerns, the agents do not share their data, and the minimization can only be carried out in a decentralized manner. To that end, we adopt analog over-the-air model training [SerCoh:20TSP], which is mainly based on the gradient descent (GD) method, to accomplish this task. Specifically, each agent calculates its local gradient and modulates it onto a set of orthonormal waveforms, one for each element of the gradient vector. Then, the agents send out their analog signals simultaneously. The edge server receives the superposition of the analog signals, which represents a noisy global gradient distorted by the channel fading and interference. Based on this noisy gradient, the edge server updates the global parameter and feeds it back to all the agents for another round of local computation. The procedure repeats until the model converges.
We inspect this algorithm in a more pragmatic and complicated setting where the interference follows an $\alpha$-stable distribution [ClaPedRod:20]. In that context, the variance of the aggregated gradient is infinite, and the effects of this phenomenon on the training procedure, including whether the algorithm converges at all, remain unknown. The central thrust of the present article is to fill this research gap.
I-A Main Contributions
This paper builds upon the model in [SerCoh:20TSP] but differs from it by considering the heavy-tailed nature of the electromagnetic interference [Mid:77]. Specifically, we adopt the symmetric $\alpha$-stable distribution, a widely used model in wireless networks [Hae:09, WinPin:09], to model the statistics of interference. The parameter $\alpha$ is commonly known as the tail index: the smaller the $\alpha$, the heavier the tail of the distribution. Under such a setting, the aggregated global gradient has a diverging variance, and conventional approaches to convergence analysis, which rely heavily on the existence of second moments, no longer apply. We therefore take a new route toward the convergence analysis and verify that, even if the intermediate gradients are severely distorted by channel fading and interference, GD-based methods can ultimately reach the optimal solution.
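As background for the claim about diverging variance, the standard textbook characterization of a symmetric $\alpha$-stable random variable $X$ can be written as follows. The scale symbol $\delta$ and the constant $C_{\alpha}$ are notation assumed only for this illustration.

```latex
\begin{align*}
\mathbb{E}\!\left[ e^{j \omega X} \right] &= \exp\!\left( -\delta^{\alpha} |\omega|^{\alpha} \right), \qquad 0 < \alpha \le 2, \\
\mathbb{P}\left( |X| > x \right) &\sim C_{\alpha}\, \delta^{\alpha} x^{-\alpha} \quad \text{as } x \to \infty, \qquad \alpha < 2, \\
\mathbb{E}\left[ |X|^{p} \right] &< \infty \iff p < \alpha, \qquad \alpha < 2.
\end{align*}
```

In particular, $\alpha = 2$ recovers a Gaussian variable with variance $2\delta^{2}$, whereas any $\alpha < 2$ yields an infinite second moment, which is why arguments built on bounded variance do not carry over.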
Our main contributions are summarized as follows:
We derive an analytical expression to characterize the convergence rate of the analog over-the-air GD algorithm, which encompasses key system parameters such as the number of agents, channel fading, and interference. Particularly, the convergence rate is in the order of , where stands for the communication round. This result also implies that heavier-tailed interference leads to slower convergence of the algorithm.
We show that analog over-the-air GD can be run in conjunction with momentum, and we derive the convergence rate by taking into account the momentum weight. Our result reveals that the momentum-based model training also converges in the order of , while the momentum weight affects the multiplicative factor.
We analytically characterize the generalization error of analog over-the-air GD by resorting to a continuous time proxy of the update trajectory. The analysis shows that heavy tail can potentially improve the algorithm’s generalization capability. More precisely, with a certain probability, the generalization error decreases along with the tail index.
We conduct extensive simulations on the MNIST data set to examine the algorithm under different system parameters. The experiments demonstrate that an increase in the number of agents, learning rate, or tail index of the interference leads to a faster convergence rate. They also show that a smaller tail index occasionally results in better prediction accuracy of the trained model, which confirms that heavy tails have the potential to improve the generalization capability.
I-B Prior Art
Distributed optimization in wireless networks has garnered considerable attention in recent years, especially with the rise of federated learning [MaMMooRam:17AISTATS, LiSahTal:20SPM, ParSamBen:19]. The typical system consists of an edge server and a number of agents, where the goal is to collaboratively optimize an objective function via orchestration amongst the network elements. Particularly, each agent conducts on-device training based on its local dataset and uploads the intermediate result, e.g., the gradient, to the server for model improvement. The agents then download the new model for another round of local computing. This procedure repeats for multiple rounds until the training converges. In each global iteration, the model parameters need to be transmitted over the spectrum, which is resource-limited and unreliable. Recognizing that conventional schemes built on the separated communication-and-computation principle can have difficulty accommodating massive access and stringent latency requirements, a recent line of studies [ZhuXuHua:21MAG] proposed utilizing over-the-air computing to enable efficient model aggregation and hence achieve faster machine learning over many devices.
The essence of over-the-air computing is to exploit the waveform superposition property of the multiple-access channel, where agents modulate their gradients onto waveforms and use the air as an automatic aggregator. In the presence of channel fading, it is suggested to invert the channel via power control at the end-user devices, where the nodes that encounter deep fades suspend their transmissions [ZhuWanHua:19TWC, YanJiaShi:20]. The server can also adopt better scheduling methods in each communication round to speed up the model training process. To reduce communication overheads, the devices can compress the gradient vectors by sending out a sparse [AmiGun:20TSP], or even a one-bit quantized [ZhuDuGun:21TWC], version, followed by QAM modulation. The edge server, in turn, can expand its antenna array to further mitigate the effects of channel fading, where the fading vanishes as the spatial dimension approaches infinity [AmiDumGun:21TWC]. Furthermore, [SerShlCoh:20] devises a precoding scheme that gradually amplifies the model updates as the training progresses to handle the performance degradation incurred by the additive noise. With the help of feedback, [GuoLiuLau:20] optimizes the transceiver parameters by jointly accounting for the data and channel states to cope with the nonstationarity of the gradient updates. Inspired by the fact that machine learning algorithms need not operate under impeccably precise parameters, the authors of [SerCoh:20TSP] suggest that the agents directly transmit the analog gradient signals without any power control or beamforming to invert the channel, while the server updates the global model based on the noisy gradient. They also show that convergence is guaranteed. This approach substantially simplifies the system design while achieving virtually zero access latency [CaiLau:17].
What is more appealing, data privacy is in fact enhanced by implicitly harnessing the randomness of the wireless medium, and the training procedure can be accelerated by adopting an analog ADMM-type algorithm [ElgParIss:21]. Despite the wealth of work in this area, a significant restriction in almost all the previous results lies in the presumption that the interference follows a normal distribution. While convenient, this assumption hardly holds in practice, as the constructive property of electromagnetic waves often results in heavy tails in the distribution of interference [Mid:77, Hae:09, WinPin:09]. Consequently, there is a non-negligible chance that the magnitude of interference surges to an extremely large value in some communication rounds, which wreaks havoc on the global model. Understanding the impact of such a phenomenon on the performance of the learning algorithm is the focus of this work.
The remainder of this paper is organized as follows. We introduce the system model in Section II. In Section III, we derive the convergence rate of analog over-the-air GD. We also present the convergence rate of analog over-the-air GD with momentum. In Section IV, we analyze the generalization error of the analog over-the-air GD algorithm. Then, we show the simulation results in Section V to validate the analyses and obtain design insights. We conclude the paper in Section VI.
II System Model
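Let us consider an edge learning system consisting of one server and $N$ agents. Each agent $n$ holds a local dataset $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{m_n}$ with size $m_n$, and we assume the local datasets are statistically independent across the agents. The goal of all the entities in this system is to jointly learn a statistical model constituted from all the data samples of the agents. More precisely, they need to find a vector $w \in \mathbb{R}^d$ that minimizes a global loss given as follows:
\begin{equation}
f(w) = \frac{1}{N} \sum_{n=1}^{N} f_n(w), \tag{2}
\end{equation}
where $f_n(w)$ is the local empirical risk of agent $n$, given by
\begin{equation}
f_n(w) = \frac{1}{m_n} \sum_{i=1}^{m_n} \ell(w; x_i, y_i), \tag{3}
\end{equation}
with $\ell(w; x_i, y_i)$ denoting the loss of model $w$ on sample $(x_i, y_i)$. The solution is commonly known as the empirical risk minimizer, denoted by
\begin{equation}
w^{*} = \arg\min_{w \in \mathbb{R}^d} f(w). \tag{4}
\end{equation}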
Due to privacy concerns, the agents are unwilling to share their local datasets, and hence the minimization of (2) needs to be conducted by means of distributed learning. Particularly, the agents minimize their local losses and upload the intermediate gradients to the server, with which the server conducts a global aggregation to improve the model and feeds it back to the agents for another round of local training. Such interactions between the server and agents repeat until the model converges. During this process, communications between the server and the agents take place over the spectrum, which is by nature resource-limited and unreliable. In light of its efficacy in spectral utilization, we adopt analog over-the-air computing [NazGas:07IT] for the training of the statistical model, which is detailed in the sequel.
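To make the objective in (2) concrete, the following minimal sketch computes a global loss as the plain average of local empirical risks and runs noise-free gradient descent using only the averaged local gradients. The scalar linear model, the squared loss, the three toy datasets, and the step size 0.05 are all assumptions chosen for illustration.

```python
# Illustrative sketch (assumed setup): scalar model w, squared loss,
# three agents that each keep a small private dataset of (x, y) pairs.

def local_risk(w, data):
    """Empirical risk of one agent: mean squared error over its samples."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def local_gradient(w, data):
    """Gradient of the local empirical risk with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def global_loss(w, datasets):
    """Global loss: unweighted average of the agents' local risks."""
    return sum(local_risk(w, d) for d in datasets) / len(datasets)

def global_gradient(w, datasets):
    """Average of the local gradients; raw data never leaves the agents."""
    return sum(local_gradient(w, d) for d in datasets) / len(datasets)

# Hypothetical local datasets, all generated by y = 2x.
datasets = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(-1.0, -2.0), (0.5, 1.0), (4.0, 8.0)],
]

w = 0.0
for _ in range(200):
    # Noise-free gradient descent on the global loss (assumed step size).
    w -= 0.05 * global_gradient(w, datasets)

print(w, global_loss(w, datasets))
```

Because the server only ever uses `global_gradient`, the example mirrors the distributed setting: each agent contributes a gradient rather than its samples.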
II-B Analog Over-the-Air Model Training
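Let $w_k \in \mathbb{R}^d$ be the global model broadcast by the server at communication round $k$. Owing to the high transmit power of the edge server, we assume the global model can be successfully received by all the agents. Then, each agent $n$ calculates its gradient $\nabla f_n(w_k)$ and constructs the following analog signal:
\begin{equation}
x_n(t) = \langle s(t), \nabla f_n(w_k) \rangle, \tag{5}
\end{equation}
where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors and $s(t) = (s_1(t), \ldots, s_d(t))$, $0 < t < \tau$, is a set of orthonormal baseband waveforms that satisfies:
\begin{equation}
\int_0^{\tau} s_i^2(t)\, dt = 1, \qquad \int_0^{\tau} s_i(t)\, s_j(t)\, dt = 0, \quad i \neq j. \tag{6}
\end{equation}
According to (5), the signal $x_n(t)$ is essentially a superposition of the analog waveforms, where the magnitude of $s_i(t)$ equals the $i$-th element of $\nabla f_n(w_k)$. (Note that the magnitude of the waveforms can also be set at the quantized values of the gradients to reduce implementation complexity.)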
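The role of orthonormality can be illustrated with a small discrete-time sketch: each gradient entry scales one waveform, the channel adds the agents' signals, and correlating the sum with each waveform reads back the summed entry. The sampled cosine basis (DCT-II vectors), the two gradient vectors, and the absence of fading and interference are assumptions made purely for illustration.

```python
import math

def waveforms(d, num_samples):
    """d discrete-time orthonormal waveforms (DCT-II cosines, indices 1..d)."""
    scale = math.sqrt(2.0 / num_samples)
    return [
        [scale * math.cos(math.pi * i * (t + 0.5) / num_samples)
         for t in range(num_samples)]
        for i in range(1, d + 1)
    ]

def inner(u, v):
    """Discrete inner product, standing in for the integral over time."""
    return sum(a * b for a, b in zip(u, v))

def modulate(gradient, basis):
    """Analog signal: superposition of waveforms weighted by gradient entries."""
    num_samples = len(basis[0])
    return [sum(g * s[t] for g, s in zip(gradient, basis))
            for t in range(num_samples)]

def matched_filter(signal, basis):
    """Correlate the signal with each waveform to read back one entry each."""
    return [inner(signal, s) for s in basis]

basis = waveforms(d=3, num_samples=16)
grad_a = [0.5, -1.25, 2.0]   # hypothetical local gradient of agent A
grad_b = [1.5, 0.25, -1.0]   # hypothetical local gradient of agent B

# Simultaneous transmission: the ideal channel adds the two analog signals.
received = [a + b for a, b in zip(modulate(grad_a, basis),
                                  modulate(grad_b, basis))]

# Up to floating-point error, this equals the entry-wise sum of the gradients.
recovered = matched_filter(received, basis)
print(recovered)
```

The aggregation happens in the channel itself: the server never sees `grad_a` or `grad_b` individually, only their sum.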
Once the transmit waveforms have been assembled, the agents send them out concurrently into the spectrum. The signal received at the edge server can be expressed as follows:
\begin{equation}
y(t) = \sum_{n=1}^{N} h_{n,k} P_n x_n(t) + \xi(t), \tag{7}
\end{equation}
where $h_{n,k}$ is the channel fading experienced by agent $n$, $P_n$ stands for the corresponding transmit power, and $\xi(t)$ represents the interference. Without loss of generality, we assume the channel fading is i.i.d. across the agents and communication rounds, with mean $\mu_h$ and variance $\sigma_h^2$. The transmit power $P_n$ is set to compensate for the large-scale path loss. To characterize the heavy-tailed nature of wireless interference, we consider that $\xi(t)$ follows a symmetric $\alpha$-stable distribution, whose properties will be elaborated in the next section.
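To give a feel for how heavy-tailed such interference is, the sketch below draws symmetric $\alpha$-stable samples with the Chambers-Mallows-Stuck method and counts how often they exceed a fixed threshold. The sampler, the unit scale, the threshold of 10, and the chosen tail indices are illustrative assumptions rather than settings taken from this paper.

```python
import math
import random

def sample_sas(alpha, rng, scale=1.0):
    """One symmetric alpha-stable sample via the Chambers-Mallows-Stuck method."""
    u = rng.uniform(-math.pi / 2, math.pi / 2)   # uniform random angle
    w = rng.expovariate(1.0)                     # unit-mean exponential
    if alpha == 1.0:
        return scale * math.tan(u)               # Cauchy special case
    x = (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
         * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
    return scale * x

def tail_fraction(alpha, threshold, num_samples, seed):
    """Empirical fraction of samples whose magnitude exceeds the threshold."""
    rng = random.Random(seed)
    hits = sum(abs(sample_sas(alpha, rng)) > threshold
               for _ in range(num_samples))
    return hits / num_samples

# alpha = 2.0 is the Gaussian case; smaller alpha means a heavier tail.
for alpha in (1.2, 1.5, 2.0):
    print(alpha, tail_fraction(alpha, threshold=10.0,
                               num_samples=20000, seed=0))
```

Large excursions that are essentially never observed in the Gaussian case become routine once the tail index drops below 2, which is the effect the model training has to withstand.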
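The received signal is passed to a bank of matched filters, each tuned to one waveform $s_i(t)$, $i = 1, \ldots, d$. Since $P_n$ offsets the path loss, only the small-scale fading remains, and after normalizing by $N$ the filters output the following vector:
\begin{equation}
g_k = \frac{1}{N} \sum_{n=1}^{N} h_{n,k} \nabla f_n(w_k) + \xi_k, \tag{8}
\end{equation}
where $\xi_k$ is a $d$-dimensional random vector with each entry being i.i.d. and following an $\alpha$-stable distribution. The server then updates the global parameter as follows:
\begin{equation}
w_{k+1} = w_k - \eta_k g_k, \tag{9}
\end{equation}
where $\eta_k$ is the learning rate.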
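The complete round (local gradients, fading, heavy-tailed interference, and the server update) can be simulated in a few lines. This is a minimal sketch under stated assumptions, not the paper's experimental setup: quadratic local losses with hypothetical minimizers `centers`, Rayleigh-distributed fading, a step size of 1/k, and a Chambers-Mallows-Stuck sampler for the interference.

```python
import math
import random

def sample_sas(alpha, rng, scale=1.0):
    """Symmetric alpha-stable sample (Chambers-Mallows-Stuck method)."""
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    if alpha == 1.0:
        return scale * math.tan(u)
    x = (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
         * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
    return scale * x

def over_the_air_gd(centers, alpha, noise_scale, num_rounds, seed=0):
    """Simulate analog over-the-air GD for f_n(w) = 0.5 * ||w - c_n||^2."""
    rng = random.Random(seed)
    num_agents, dim = len(centers), len(centers[0])
    w = [5.0] * dim                                   # assumed starting point
    for k in range(1, num_rounds + 1):
        g = [0.0] * dim
        for c in centers:
            h = math.sqrt(rng.expovariate(1.0))       # assumed Rayleigh fading
            for i in range(dim):
                # Local gradient (w - c_n), scaled by this agent's fading.
                g[i] += h * (w[i] - c[i]) / num_agents
        for i in range(dim):
            g[i] += sample_sas(alpha, rng, noise_scale)   # interference
        eta = 1.0 / k                                 # assumed decaying step
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical local minimizers; the global minimizer is their average.
centers = [(1.0, -1.0), (2.0, 0.0), (0.0, 3.0), (-1.0, 2.0)]
target = [sum(c[i] for c in centers) / len(centers) for i in range(2)]

w_final = over_the_air_gd(centers, alpha=1.5, noise_scale=0.05,
                          num_rounds=2000, seed=0)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_final, target)))
print(w_final, target, error)
```

The server never inverts the channel here: the fading only rescales the average gradient, so the iterates still drift toward the global minimizer, while the decaying step size damps the occasional large interference spike.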
The analog over-the-air gradient aggregation boasts two unique advantages: (a) high spectral utilization, as the agents do not need to vie for radio access but can simultaneously upload their local parameters to the server, and (b) low hardware cost, as the agents do not need to correct the channel gain and hence can transmit at a relatively constant power level.