I Introduction
I-A Motivation
A federated learning (FL) system [MaMMooRam:17AISTATS, LiSahTal:20SPM, ParSamBen:19]
generally consists of an edge server and a group of user equipments (UEs). The entities collaboratively optimize a common loss function. The training process constitutes three steps: (i) each UE conducts local training and uploads the intermediate parameters, e.g., the gradients, to the server, (ii) the server aggregates the gradients to improve the global model, and (iii) the server broadcasts the global parameter back to the UEs for another round of local computing. This procedure repeats until the model converges. However, edge services are not yet widely available in wireless networks, as deploying such computing resources on the access points (APs) is costly to the operators. And even if possible, using the precious edge computing unit to perform global model improvement – which involves only simple additions and/or multiplications – results in significant resource underuse. That leads to a natural question:

Question 1: Can we run FL in a wireless network without an edge server while maintaining scalability and efficiency?
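For concreteness, the three-step loop above can be sketched in a few lines. The quadratic local losses, learning rate, and round count below are illustrative assumptions, not the system studied in this paper:

```python
import numpy as np

# Toy federated-averaging loop for the three-step procedure above.
# Hypothetical quadratic local losses f_n(w) = 0.5 * ||w - c_n||^2,
# whose gradients are (w - c_n); all names and values are illustrative.
rng = np.random.default_rng(0)
num_ues, dim, lr = 10, 5, 0.1
centers = rng.normal(size=(num_ues, dim))  # each UE's local optimum

w = np.zeros(dim)  # global model held at the server
for _ in range(200):
    grads = [w - c for c in centers]    # (i) local gradient computation
    avg_grad = np.mean(grads, axis=0)   # (ii) server-side aggregation
    w = w - lr * avg_grad               # (iii) global update + broadcast

# The average loss is minimized at the mean of the local optima.
assert np.allclose(w, centers.mean(axis=0), atol=1e-6)
```

The sketch makes the server's role explicit: steps (ii) and (iii) amount to a single averaging and a broadcast, which is precisely the computation the remainder of the paper offloads from the edge.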
A possible strategy is to change the network topology from a star connection into a decentralized one [LiCenChe:20AISTATS, ChePooSaa:20CMAG]. In this fashion, every UE only exchanges intermediate parameters with its geographically proximal neighbors in each communication round. If the network is connected, i.e., any pair of UEs can reach each other via a finite number of hops, the training algorithm is able to eventually converge. Nonetheless, such an approach suffers from two critical drawbacks: (i) the communication efficiency is low because UEs’ parameters can only be exchanged within local clusters in each global iteration, which results in a large number of communication rounds before the model can reach a satisfactory performance level; and (ii) the privacy issue is severe, as UEs may send their information to a deceitful neighbor without authentication from a centralized entity. Therefore, completely decentralizing the network is not a desirable solution to the posed question.
Apart from putting the server in petty use, another disadvantage of the conventional FL algorithm is that once the local parameters are uploaded to the edge, UEs need to wait for the results before they can proceed to the next round of local computing.
Since the UEs are obliged to freeze their local training during each global communication, where the latter can be orders of magnitude slower than the former [LanLeeZho:17], the system’s processing power is highly underutilized.
As such, the second question arises:
Question 2: Can UEs continue their local computing during global communication and use these extra calculations to reduce the system run time?
I-B Main Contributions
In light of the above challenges, we propose a new architecture, as well as the model training algorithm, that (i) attains a similar convergence rate to FL under the master-slave framework but without the help of an edge server and (ii) allows local computations to be executed in parallel with global communication, thereby enhancing the system’s tolerance to high network latency. The main contributions of the present paper are summarized as follows:

We develop a distributed learning paradigm that, in each communication round, allows all the UEs to simultaneously upload and aggregate their local parameters at the AP without utilizing an edge server, and later use the global model to rectify and improve the local results. This is accomplished through analog gradient aggregation [SerCoh:20TSP] and replacing the locally accumulated gradients with the globally averaged ones.

We derive the convergence rate for the proposed training algorithm. The result reveals that the convergence rate is primarily dominated by the level of heavy-tailedness of the interference’s statistical distribution. Specifically, if the interference obeys a Gaussian distribution, the proposed algorithm retrieves the convergence rate of a conventional server-based FL. When the interference distribution is heavy-tailed, then the heavier the tail, the slower the algorithm converges.

We improve the developed algorithm by enabling UEs to continue their local computing in concurrence with the global parameter updating. We also derive the convergence rate for the new scheme. The analysis shows that the proposed method is able to reduce the system run time, and the gain is particularly pronounced in the presence of high communication latency.

We carry out extensive simulations on the MNIST and CIFAR-10 datasets to examine the algorithm under different system parameters. The experiments validate that SFWFL achieves a convergence rate similar to, or even better than, that of a server-based FL if the interference follows a Gaussian distribution. They also confirm that the convergence performance of SFWFL is sensitive to the heavy-tailedness of the interference distribution, where the convergence rate deteriorates quickly as the tail index decreases. Yet, as opposed to conventional FL, under the SFWFL framework, an increase in the number of UEs is instrumental in accelerating the convergence. The system run time is also shown to be drastically reduced by pipelining computing with communication.
I-C Outline
The remainder of this paper is organized as follows. We survey the related works in Section II. In Section III, we introduce the system model. We present the design and analysis of a server-free FL paradigm in Section IV. We develop an enhanced version of the training algorithm in Section V, which allows UEs to execute local computations in parallel with global communications. Then, we show the simulation results in Section VI to validate the analyses and obtain design insights. We conclude the paper in Section VII.
In this paper, we use bold lower case letters to denote column vectors. For any column vector $\boldsymbol{x}$, we use $\Vert \boldsymbol{x} \Vert$ and $\boldsymbol{x}^{\mathsf{T}}$ to denote its $\ell_2$ norm and transpose, respectively. The main notations used throughout the paper are summarized in Table I.

II Related Works
The design and analysis of this work stem from two lines of prior art: analog gradient descent and delayed gradient averaging. In the following, we elaborate on the related works along these two aspects.
II-A Analog gradient descent
This method capitalizes on the superposition property of electromagnetic waves for fast and scalable FL tasks [ZhuWanHua:19TWC, GuoLiuLau:20, SerCoh:20TSP, YanCheQue:21JSTSP]: Specifically, during each global iteration, the edge server sends the global parameter to all the UEs. After receiving the global parameter, each UE conducts a round of local computing and, once finished, transmits an analog function of its gradient using a set of common shaping waveforms, one for each element in the gradient vector. The edge server receives a superposition of the analog transmitted signals, representing a distorted version of the global gradient. The server then updates the global model and feeds the update back to all the UEs. This procedure repeats for a sufficient number of rounds until the training converges – the convergence is guaranteed if the loss function has nice structures (i.e., strong convexity and smoothness), even if the aggregated parameters are severely jeopardized by channel fading and interference noise [YanCheQue:21JSTSP]. The main advantage of analog gradient descent is that the bandwidth requirement does not depend on the number of UEs. As a result, the system not only scales easily but also attains significant energy savings [SerCoh:20TSP]. Moreover, the induced interference noise can be harnessed for accelerating convergence [ZhaWanLi:22TWC], enhancing privacy [ElgParIss:21], efficient sampling [LiuSim:21], or improving generalization [YanCheQue:21JSTSP]. In addition to these benefits, the present paper unveils another blessing of analog gradient descent: using this method, we can get rid of the edge server – as the old saying goes, “Render unto Caesar the things which are Caesar’s, and unto God the things that are God’s.”
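As a rough numerical illustration of why superposition helps, the following sketch mimics analog aggregation with unit-mean fading and additive receiver noise. The fading model, noise level, and network size are assumptions for illustration only:

```python
import numpy as np

# Sketch of analog over-the-air aggregation: all UEs transmit their
# gradient entries at once, and the receiver observes the faded, noisy
# superposition. Fading/noise models and sizes are illustrative only.
rng = np.random.default_rng(1)
num_ues, dim = 50, 8
local_grads = rng.normal(size=(num_ues, dim))

fading = 1.0 + 0.1 * rng.normal(size=(num_ues, 1))  # unit-mean fading
noise = 0.01 * rng.normal(size=dim)                  # receiver noise

# One channel use per gradient entry, regardless of the number of UEs.
received = (fading * local_grads).sum(axis=0) + noise
global_grad_estimate = received / num_ues

# The estimate is distorted but close to the true average, since the
# fading has unit mean and the per-UE perturbations average out.
true_avg = local_grads.mean(axis=0)
assert np.abs(global_grad_estimate - true_avg).max() < 0.2
```

Note that the bandwidth cost (one channel use per gradient entry) is fixed as `num_ues` grows, which is the scalability property the paragraph above describes.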
II-B Delayed gradient averaging
On a separate track, delayed gradient averaging [ZhuLinLu:21] is devised by recognizing that the gradient averaging in FL can be postponed to a future iteration without violating the federated computing paradigm. Under delayed gradient averaging, the UEs send their parameters to each other at the end of each computing round and immediately start the next round of local training. The averaging step is deferred to a later iteration when the aggregated result is received, upon which a gradient correction term is adopted to compensate for the staleness. In this manner, the communication can be pipelined with computation, hence endowing the system with a high tolerance to communication latency. However, [ZhuLinLu:21] requires each UE to pass its parameter to every other UE for gradient aggregation, which incurs hefty communication overhead, especially when the network grows in size. Even by adopting a server at the edge to take over the aggregation task, communication efficiency remains a bottleneck for the scheme. Toward this end, we incorporate analog gradient descent to circumvent the communication bottleneck of delayed gradient averaging, and show that such a marriage yields very fruitful outcomes.
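A minimal sketch of the delayed-averaging idea (not the exact algorithm of [ZhuLinLu:21]) is given below, assuming toy quadratic local losses and a fixed delivery delay `D`: local SGD never stalls, and when the global average of a past round arrives, each UE swaps its own stale gradient for that average.

```python
import numpy as np
from collections import deque

# Minimal sketch of delayed gradient averaging: local SGD never waits;
# when the global average of round t arrives D rounds late, each UE
# replaces its own stale gradient with that average. The losses, delay,
# and step size are illustrative assumptions.
rng = np.random.default_rng(2)
num_ues, dim, lr, D = 4, 3, 0.05, 2
centers = rng.normal(size=(num_ues, dim))          # local optima
models = [np.zeros(dim) for _ in range(num_ues)]
in_flight = deque()  # (local_grads, global_avg) pairs awaiting delivery

for t in range(300):
    grads = [models[n] - centers[n] for n in range(num_ues)]
    in_flight.append((grads, np.mean(grads, axis=0)))
    for n in range(num_ues):                       # local step, no waiting
        models[n] = models[n] - lr * grads[n]
    if len(in_flight) > D:                         # delayed average arrives
        stale_grads, stale_avg = in_flight.popleft()
        for n in range(num_ues):                   # correct the staleness
            models[n] += lr * stale_grads[n] - lr * stale_avg

# The average model converges to the global minimizer; each individual
# copy stays within a small neighborhood governed by lr * D.
w_star = centers.mean(axis=0)
assert np.abs(np.mean(models, axis=0) - w_star).max() < 1e-3
assert max(np.abs(m - w_star).max() for m in models) < 0.5
```

The correction term (`+ lr * stale_grads[n] - lr * stale_avg`) is what turns the stale local step into the delayed global step, which is the mechanism that lets communication overlap with computation.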
Notation  Definition

$N$; $\{u_m(s)\}_{m=1}^{d}$  Number of UEs in the network; a set of orthonormal waveforms
$K$; $E$  Number of SGD iterations in one local computing round; number of local computing rounds in a global communication round
$F(\cdot)$; $\nabla F(\cdot)$  Global loss function; and its gradient
$f_n(\cdot)$; $\nabla f_n(\cdot)$  Local loss function of UE $n$; and its gradient
$x_n^t(s)$; $y^t(s)$  Analog signal sent out by UE $n$ in the $t$-th communication round; analog signal received by the AP in the $t$-th communication round
$P_n$; $h_n^t$  Transmit power of UE $n$; channel fading experienced by UE $n$
$\tilde{\boldsymbol{g}}^t$; $\boldsymbol{\xi}^t$  Noisy gradient received at the AP; electromagnetic interference that follows an $\alpha$-stable distribution
$\eta$  Learning rate of the algorithm
$\alpha$  Tail index of the heavy-tailed interference
$\boldsymbol{x}^{\langle p \rangle}$; $\Vert \boldsymbol{x} \Vert_{\alpha}$  Signed power of a vector $\boldsymbol{x}$; $\ell_{\alpha}$ norm of a vector
III System Model
We consider a wireless network consisting of one AP and $N$ UEs, as depicted in Fig. 1. Each UE holds a loss function that is constructed based on its local dataset. The goal of all the UEs is to jointly minimize a global objective function. More formally, they need to cooperatively find a vector $\boldsymbol{w} \in \mathbb{R}^d$ that solves the following:
$\min_{\boldsymbol{w} \in \mathbb{R}^d} F(\boldsymbol{w}) \triangleq \frac{1}{N} \sum_{n=1}^{N} f_n(\boldsymbol{w})$  (1)
The solution to (1) is commonly known as the empirical risk minimizer, denoted by
$\boldsymbol{w}^* = \arg\min_{\boldsymbol{w} \in \mathbb{R}^d} F(\boldsymbol{w})$  (2)
In order to obtain the minimizer, the UEs need to conduct local training and periodically exchange the parameters for a global update. Because the AP is not equipped with a computing unit, conventional FL training schemes that rely on an edge server to perform the intermediate global aggregation and model improvement seem inapplicable in this context. That said, we will show in the sequel that by adopting analog over-the-air computing [NazGas:07IT], one can devise an FL-like model training method that is communication efficient, highly scalable, and has the same convergence rate as the paradigms that have an edge server.
More precisely, denoting by $\boldsymbol{w}_{n,k}^t$ the model parameter of UE $n$ at the $k$-th SGD iteration of the $t$-th communication round, the local and global updates can be written as:

$\boldsymbol{w}_{n,k+1}^t = \boldsymbol{w}_{n,k}^t - \eta \, \hat{\nabla} f_n(\boldsymbol{w}_{n,k}^t), \quad k = 0, 1, \ldots, K-1$  (3)

$\boldsymbol{w}_{n,0}^{t+1} = \boldsymbol{w}_{n,0}^t - \eta \, \tilde{\boldsymbol{g}}^t$  (4)

$\tilde{\boldsymbol{g}}^t = \frac{1}{N} \sum_{n=1}^{N} h_n^t \sum_{k=0}^{K-1} \hat{\nabla} f_n(\boldsymbol{w}_{n,k}^t) + \boldsymbol{\xi}^t$  (5)
IV Server-Free Federated Model Training: Vanilla Version
In this section, we detail the design and analysis of a model training paradigm that achieves similar performance to FL without the help of an edge server. Owing to such a salient feature, we coin this scheme Server-Free Wireless Federated Learning (SFWFL). We summarize the general procedure of SFWFL in Algorithm 1 and elaborate on the major components below.
IV-A Design
Similar to the conventional FL, SFWFL requires local training at the UEs, global communication of intermediate parameters, and feedback of the aggregated results.
IV-A1 Local Training
Before the training commences, all UEs negotiate amongst each other on an initial parameter that is randomly generated. Then, every UE conducts its local SGD iterations based on its own dataset and uploads the locally accumulated gradient to the AP. The AP (automatically) aggregates the UEs’ gradients by means of analog over-the-air computing – which will be elucidated soon – and feeds the resultant parameters back to all the UEs. Upon receiving the globally aggregated gradient, every UE replaces the locally accumulated gradient with this global parameter, as per (5), and proceeds to the next round of local computing in accordance with (4).
It is important to note that by replacing the local gradients with the global one, the UEs’ model parameters are aligned at the beginning of each local computing stage. As such, if the model training converges, every UE will have its parameters approach the same value.
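The rewind-and-replace mechanics can be sketched as follows, assuming an ideal (noise- and fading-free) aggregation channel and toy quadratic losses. The sketch checks the alignment property stated above: replacing the locally accumulated gradient with the global one leaves every UE with identical parameters at the start of each round.

```python
import numpy as np

# Sketch of SFWFL-style rounds over an ideal aggregation channel: each
# UE runs K local SGD steps, the accumulated gradients are averaged,
# and every UE rewinds its local steps and applies the global average
# instead, re-aligning all model copies. The setup is illustrative.
rng = np.random.default_rng(3)
num_ues, dim, lr, K = 5, 4, 0.1, 3
centers = rng.normal(size=(num_ues, dim))          # quadratic-loss optima
models = [np.zeros(dim) for _ in range(num_ues)]

for _ in range(100):
    accumulated = []
    for n in range(num_ues):
        acc = np.zeros(dim)
        for _ in range(K):                         # K local SGD iterations
            g = models[n] - centers[n]
            models[n] = models[n] - lr * g
            acc += g
        accumulated.append(acc)
    global_acc = np.mean(accumulated, axis=0)      # aggregated "over the air"
    for n in range(num_ues):  # replace local accumulation with global one
        models[n] = models[n] + lr * accumulated[n] - lr * global_acc

# All copies are identical after every round, and they converge to the
# minimizer of the average loss.
assert all(np.allclose(m, models[0]) for m in models)
assert np.allclose(models[0], centers.mean(axis=0), atol=1e-6)
```

The replacement step undoes each UE's own accumulated update and reapplies the global average, so all copies collapse to the same point without any per-UE bookkeeping at the AP.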
IV-A2 Global Communication
During the $t$-th round of global communication, UE $n$ gathers the stochastic gradients calculated in the current computing round into the vector $\boldsymbol{g}_n^t$, and constructs the following analog signal:
$x_n^t(s) = \sum_{m=1}^{d} \langle \boldsymbol{g}_n^t, \boldsymbol{e}_m \rangle \, u_m(s)$  (6)

where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors, $\boldsymbol{e}_m$ is the $m$-th canonical basis vector, and $\{u_m(s)\}_{m=1}^{d}$ is a set of orthonormal baseband waveforms that satisfies:

$\int_0^{T} u_m(s) \, u_{m'}(s) \, ds = 0, \quad \forall m \neq m'$  (7)

$\int_0^{T} u_m(s)^2 \, ds = 1, \quad \forall m$  (8)
In essence, operation (6) modulates the amplitude of each waveform according to the corresponding entry of the locally accumulated gradient and superposes the signals into one analog waveform. Once the transmit waveforms have been assembled, the UEs send them out concurrently into the spectrum. We consider that the UEs employ power control to compensate for the large-scale path loss, while the instantaneous channel fading is unknown.
Notably, since the waveform basis is independent of the number of UEs, this architecture is highly scalable. In other words, all the UEs can participate in every round of local training and global communication regardless of how many UEs there are in the network.
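The modulation in (6) and its matched-filter recovery can be illustrated numerically, with a sampled cosine basis standing in for the orthonormal baseband waveforms; the basis choice and dimensions are assumptions for illustration:

```python
import numpy as np

# Sketch of the modulation in (6) and matched filtering: gradient entries
# modulate shared orthonormal waveforms, concurrent transmissions add in
# the air, and projecting the received sum onto each waveform recovers
# the summed entries. A sampled cosine basis is used for illustration.
rng = np.random.default_rng(4)
dim, samples, num_ues = 4, 256, 6
t = np.arange(samples) / samples
basis = np.stack([np.sqrt(2 / samples) * np.cos(2 * np.pi * (m + 1) * t)
                  for m in range(dim)])  # orthonormal under the dot product

grads = rng.normal(size=(num_ues, dim))
# Each UE superposes its gradient entries onto the same basis; since all
# UEs transmit concurrently, the channel adds their waveforms.
waveform = sum(g @ basis for g in grads)

# Matched-filter bank: inner product with each basis waveform.
recovered = basis @ waveform
assert np.allclose(recovered, grads.sum(axis=0), atol=1e-8)
```

Because the same `dim` waveforms serve every UE, the spectral footprint is fixed as `num_ues` grows, which is the scalability claim made above.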
IV-A3 Gradient Aggregation
The analog signals propagate through the wireless medium and accumulate at the AP’s RF front end. The received waveform can be expressed as follows:¹

¹In this paper, we assume the waveforms of different UEs are synchronized. Note that the issue of signal misalignment can be addressed via [ShaGunLie:21TWC].
$y^t(s) = \sum_{n=1}^{N} h_n^t \, x_n^t(s) + \xi^t(s)$  (9)
where $h_n^t$ is the channel fading experienced by UE $n$ and $\xi^t(s)$ stands for the interference. Without loss of generality, we assume the channel fading is independent and identically distributed (i.i.d.) across the agents and communication rounds, with a unit mean and finite variance. Furthermore, we consider that the interference follows a symmetric $\alpha$-stable distribution [ClaPedRod:20], which is widely used in characterizing the statistical properties of interference in wireless networks [Mid:77, WinPin:09, YanPet:03].

The AP passes the analog signal to a bank of matched filters, where each branch is tuned to one element of the waveform basis, and outputs the vector in (5), in which the interference term is a $d$-dimensional random vector with i.i.d. entries, each following an $\alpha$-stable distribution. The AP then broadcasts the result back to all the UEs. Owing to the high transmit power of the AP, we assume the global parameters can be received without error by all the UEs. Then, the UEs move to Step 1) and launch a new round of local computing.
The most remarkable feature of this model training paradigm is that it does not require an edge server to conduct global aggregation and/or model improvement. Instead, the AP exploits the superposition property of wireless signals to achieve fast gradient aggregation through a bank of matched filters. On the UEs’ side, they replace the locally accumulated gradient with the global one at the beginning of each local computing round to align the model parameters. As will be shown next, the training algorithm converges even though the global gradients are highly distorted. Apart from being server-free, it is noteworthy that since the UEs do not need to compensate for the channel fading, they can transmit at a relatively constant power level, which reduces the hardware cost. Additionally, the random perturbation from fading and interference provides inherent privacy protection to the UEs’ gradient information [ElgParIss:21].
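To visualize the heavy-tailedness the subsequent analysis must contend with, symmetric $\alpha$-stable samples can be drawn with the classic Chambers–Mallows–Stuck transform; the tail indices and sample size below are illustrative:

```python
import numpy as np

# Drawing symmetric alpha-stable interference with the classic
# Chambers-Mallows-Stuck transform, to illustrate the heavy tails the
# convergence analysis must cope with. Parameters are illustrative.
def sym_alpha_stable(alpha, size, rng):
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1 - alpha) / alpha))

rng = np.random.default_rng(5)
gauss = sym_alpha_stable(2.0, 100_000, rng)  # alpha = 2 recovers Gaussian
heavy = sym_alpha_stable(1.3, 100_000, rng)  # alpha = 1.3 is heavy-tailed

# Tail mass beyond 10 is essentially zero in the Gaussian case but
# clearly visible once the tail index drops below 2.
assert (np.abs(gauss) > 10).mean() < 1e-3
assert (np.abs(heavy) > 10).mean() > 1e-3
```

The $\alpha = 2$ case has all moments finite, whereas for $\alpha < 2$ the second moment diverges, which is exactly why the analysis below abandons variance-based arguments.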
IV-B Analysis
In this part, we derive the convergence rate to quantify the training efficiency of SFWFL.
IV-B1 Preliminary Assumptions
To facilitate the analysis, we make the following assumptions.
Assumption 1
The objective functions are $\mu$-strongly convex, i.e., for any $\boldsymbol{w}, \boldsymbol{w}' \in \mathbb{R}^d$ it is satisfied:

$f_n(\boldsymbol{w}) \geq f_n(\boldsymbol{w}') + \langle \nabla f_n(\boldsymbol{w}'), \boldsymbol{w} - \boldsymbol{w}' \rangle + \frac{\mu}{2} \Vert \boldsymbol{w} - \boldsymbol{w}' \Vert^2$  (10)
Assumption 2
The objective functions are $L$-smooth, i.e., for any $\boldsymbol{w}, \boldsymbol{w}' \in \mathbb{R}^d$ it is satisfied:

$\Vert \nabla f_n(\boldsymbol{w}) - \nabla f_n(\boldsymbol{w}') \Vert \leq L \, \Vert \boldsymbol{w} - \boldsymbol{w}' \Vert$  (11)
Assumption 3
The stochastic gradients are unbiased and have bounded second moments, i.e., there exists a constant $G > 0$ such that the following holds:

$\mathbb{E}\big[ \hat{\nabla} f_n(\boldsymbol{w}) \big] = \nabla f_n(\boldsymbol{w}), \qquad \mathbb{E}\big[ \Vert \hat{\nabla} f_n(\boldsymbol{w}) \Vert^2 \big] \leq G^2$  (12)
Because the interference follows an $\alpha$-stable distribution, which has finite moments only up to order $\alpha$, the variance of the globally aggregated gradient in (5) may be unbounded. As such, conventional approaches that rely on the existence of second moments cannot be directly applied. In order to establish a universally applicable convergence analysis, we opt for the $\ell_{\alpha}$ norm as an alternative. Based on this metric, we introduce two concepts, i.e., the signed power and the $\alpha$-positive definite matrix [WanGurZhu:21], below.
Definition 1
For a vector $\boldsymbol{x} = (x_1, \ldots, x_d)^{\mathsf{T}}$, we define its signed power as follows:

$\boldsymbol{x}^{\langle p \rangle} \triangleq \big( \mathrm{sign}(x_1) |x_1|^p, \ldots, \mathrm{sign}(x_d) |x_d|^p \big)^{\mathsf{T}}$  (13)

where $\mathrm{sign}(\cdot)$ takes the sign of its argument.
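A minimal numerical sketch of Definition 1, applied elementwise:

```python
import numpy as np

# Elementwise signed power from Definition 1: keep each entry's sign and
# raise its magnitude to the exponent p.
def signed_power(x, p):
    return np.sign(x) * np.abs(x) ** p

x = np.array([-2.0, 0.0, 3.0])
out = signed_power(x, 0.5)
assert np.allclose(out, [-np.sqrt(2.0), 0.0, np.sqrt(3.0)])
```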
Definition 2
A symmetric matrix $\boldsymbol{A} \in \mathbb{R}^{d \times d}$ is said to be $\alpha$-positive definite if $\langle \boldsymbol{A}\boldsymbol{x}, \boldsymbol{x}^{\langle \alpha - 1 \rangle} \rangle > 0$ for all $\boldsymbol{x} \in \mathbb{R}^d$ with $\boldsymbol{x} \neq \boldsymbol{0}$.
Armed with the above definitions, we make another assumption as follows.
Assumption 4
For any given vector $\boldsymbol{w} \in \mathbb{R}^d$, the Hessian matrix of the objective function, i.e., $\nabla^2 F(\boldsymbol{w})$, is $\alpha$-positive definite.
Furthermore, since each element of the interference vector has finite moments of any order below $\alpha$, we consider such moments to be upper bounded by a constant.
IvB2 Convergence Rate of SFWFL
We lay out two technical lemmas that will be used extensively in the derivation.
Lemma 1
Given $\alpha \in (1, 2]$, for any $x, y \in \mathbb{R}$, the following holds:
(14) 
Proof:
Please refer to [Kar:69].
Lemma 2
Let $\boldsymbol{A}$ be an $\alpha$-positive definite matrix. Then, for $\alpha \in (1, 2]$, there exists a positive constant $\delta$ such that:
(15) 
Proof:
Please see Theorem 10 of [WanGurZhu:21].
Since the UEs’ model parameters are aligned at the beginning of each local computing round, we denote this common value by $\boldsymbol{w}^t$ and present the first theoretical finding below.
Theorem 1
Under the employed wireless system, if the learning rate is chosen to decay appropriately with the communication round index, then Algorithm 1 converges as:
(16) 
Proof:
Please see the Appendix.
We highlight a few important observations from this result.
Remark 1
If the interference follows a Gaussian distribution, i.e., $\alpha = 2$, SFWFL converges at the same rate as algorithms run under federated edge learning systems [SerCoh:20TSP]. As such, the proposed model training algorithm can attain the same efficacy as conventional edge learning without the requirement of a server.