Server Free Wireless Federated Learning: Architecture, Algorithm, and Analysis

We demonstrate that analog transmissions and matched filtering alone can realize the function of an edge server in federated learning (FL). Therefore, a network with massively distributed user equipments (UEs) can achieve large-scale FL without an edge server. We also develop a training algorithm that allows UEs to continuously perform local computing without being interrupted by global parameter uploading, which exploits the full potential of the UEs' processing power. We derive convergence rates for the proposed schemes to quantify their training efficiency. The analyses reveal that when the interference obeys a Gaussian distribution, the proposed algorithm recovers the convergence rate of a server-based FL. But if the interference distribution is heavy-tailed, then the heavier the tail, the slower the algorithm converges. Nonetheless, the system run time can be substantially reduced by enabling computation in parallel with communication, and the gain is particularly pronounced when communication latency is high. These findings are corroborated via extensive simulations.

I Introduction

I-A Motivation

A federated learning (FL) system [MaMMooRam:17AISTATS, LiSahTal:20SPM, ParSamBen:19] generally consists of an edge server and a group of user equipments (UEs). The entities collaboratively optimize a common loss function. The training process consists of three steps: (i) each UE conducts local training and uploads the intermediate parameters, e.g., the gradients, to the server, (ii) the server aggregates the gradients to improve the global model, and (iii) the server broadcasts the global parameter back to the UEs for another round of local computing. This procedure repeats until the model converges. However, edge servers are not yet widely available in wireless networks, as deploying such computing resources at the access points (APs) is costly to the operators. And even where deployment is possible, using the precious edge computing unit to perform the global model improvement (which amounts to simple additions and/or multiplications) results in significant resource underuse. This leads to a natural question:

Question-1: Can we run FL in a wireless network without an edge server while maintaining scalability and efficiency?

A possible strategy is to change the network topology from a star connection into a decentralized one [LiCenChe:20AISTATS, ChePooSaa:20CMAG]. In this fashion, every UE only exchanges intermediate parameters with its geographically proximal neighbors in each communication round. If the network is connected, i.e., any pair of UEs can reach each other via a finite number of hops, the training algorithm is able to eventually converge. Nonetheless, such an approach bears two critical setbacks: (i) the communication efficiency is low, because UEs’ parameters can only be exchanged within local clusters in each global iteration, which results in a large number of communication rounds before the model can reach a satisfactory performance level; and (ii) the privacy issue is severe, as UEs may send their information to a deceitful neighbor without authentication from a centralized entity. Therefore, completely decentralizing the network is not a desirable solution to the posed question.

Apart from putting the server to petty use, another disadvantage of the conventional FL algorithm is that once the local parameters are uploaded to the edge, UEs need to wait for the results before they can proceed to the next round of local computing. Since the UEs are obliged to freeze their local training during each global communication, which can be orders of magnitude slower than the local computation [LanLeeZho:17], the system’s processing power is highly underutilized. As such, a second question arises:

Question-2: Can UEs continue their local computing during global communication and use these extra calculations to reduce the system run time?

I-B Main Contributions

In light of the above challenges, we propose a new architecture, as well as the corresponding model training algorithm, that (i) attains a convergence rate similar to that of FL under the master-slave framework but without the help of an edge server, and (ii) allows local computations to be executed in parallel with global communications, thereby enhancing the system’s tolerance to high network latency. The main contributions of the present paper are summarized as follows:

  • We develop a distributed learning paradigm that, in each communication round, allows all the UEs to simultaneously upload their local parameters and aggregate them at the AP without utilizing an edge server, and later use the global model to rectify and improve the local results. This is accomplished through analog gradient aggregation [SerCoh:20TSP] and by replacing the locally accumulated gradients with the globally averaged ones.

  • We derive the convergence rate for the proposed training algorithm. The result reveals that the convergence rate is primarily dominated by the heavy-tailedness of the interference’s statistical distribution. Specifically, if the interference obeys a Gaussian distribution, the proposed algorithm recovers the convergence rate of a conventional server-based FL. If the interference distribution is heavy-tailed, then the heavier the tail, the slower the algorithm converges.

  • We improve the developed algorithm by enabling UEs to continue their local computing in concurrence with the global parameter updating. We also derive the convergence rate for the new scheme. The analysis shows that the proposed method is able to reduce the system run time, and the gain is particularly pronounced in the presence of high communication latency.

  • We carry out extensive simulations on the MNIST and CIFAR-10 datasets to examine the algorithm under different system parameters. The experiments validate that the proposed server free wireless federated learning (SFWFL) scheme achieves a convergence rate similar to, or even better than, that of a server-based FL if the interference follows a Gaussian distribution. They also confirm that the convergence performance of SFWFL is sensitive to the heavy-tailedness of the interference distribution, where the convergence rate deteriorates quickly as the tail index decreases. Yet, as opposed to conventional FL, under the SFWFL framework an increase in the number of UEs is instrumental in accelerating the convergence. Moreover, the system run time is shown to be drastically reduced by pipelining computation with communication.

Fig. 1: Examples of (a) server-based and (b) server free FL systems. In (a), the edge server aggregates the gradients from a portion of the associated UEs to improve the global model. In (b), all the UEs concurrently send analog functions of their gradients to the AP, and the AP passes the received signal through a bank of matched filters to obtain the automatically aggregated (but noisy) gradients.

I-C Outline

The remainder of this paper is organized as follows. We survey the related works in Section II. In Section III, we introduce the system model. We present the design and analysis of a server-free FL paradigm in Section IV. In Section V, we develop an enhanced version of the training algorithm that allows UEs to execute local computations in parallel with global communications. Then, we show the simulation results in Section VI to validate the analyses and obtain design insights. We conclude the paper in Section VII.

In this paper, we use bold lower-case letters to denote column vectors. For any column vector, we use $\| \cdot \|$ and $(\cdot)^{\mathsf{T}}$ to denote its $\ell_2$ norm and transpose, respectively. The main notations used throughout the paper are summarized in Table I.

II Related Works

The design and analysis of this work stem from two lines of prior art: analog gradient descent and delayed gradient averaging. In the following, we elaborate on the related works in these two areas.

II-1 Analog gradient descent

This method capitalizes on the superposition property of electromagnetic waves for fast and scalable FL tasks [ZhuWanHua:19TWC, GuoLiuLau:20, SerCoh:20TSP, YanCheQue:21JSTSP]: Specifically, during each global iteration, the edge server sends the global parameter to all the UEs. After receiving the global parameter, each UE conducts a round of local computing and, once finished, transmits an analog function of its gradient using a set of common shaping waveforms, one for each element in the gradient vector. The edge server receives a superposition of the analog transmitted signals, representing a distorted version of the global gradient. The server then updates the global model and feeds the update back to all the UEs. This procedure repeats for a sufficient number of rounds until the training converges – the convergence is guaranteed if the loss function has nice structures (e.g., strong convexity and smoothness), even if the aggregated parameters are severely jeopardized by channel fading and interference noise [YanCheQue:21JSTSP]. The main advantage of analog gradient descent is that the bandwidth requirement does not depend on the number of UEs. As a result, the system not only scales easily but also attains significant energy savings [SerCoh:20TSP]. Moreover, the induced interference noise can be harnessed for accelerating convergence [ZhaWanLi:22TWC], enhancing privacy [ElgParIss:21], efficient sampling [LiuSim:21], or improving generalization [YanCheQue:21JSTSP]. In addition to these benefits, the present paper unveils another blessing of analog gradient descent: using this method, we can get rid of the edge server – as the old saying goes, “Render unto Caesar the things which are Caesar’s, and unto God the things that are God’s.”
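To make the over-the-air aggregation idea concrete, the following sketch (our own illustration, not code from the cited works; the function name `ota_aggregate` and the Gaussian noise model are assumptions) simulates how simultaneously transmitted gradient-bearing signals superpose in the channel, so the receiver obtains a noisy aggregate without ever decoding individual UEs.

```python
import numpy as np

def ota_aggregate(local_grads, noise_std=0.01, rng=None):
    """Illustrative over-the-air aggregation: the channel adds up the analog
    signals of all UEs, so the receiver only sees one noisy superposition."""
    rng = np.random.default_rng() if rng is None else rng
    superposed = np.sum(local_grads, axis=0)                  # superposition property
    noise = noise_std * rng.standard_normal(superposed.shape)
    return (superposed + noise) / len(local_grads)            # noisy global average

# Toy usage: five UEs, three-dimensional gradients.
grads = [np.full(3, float(n)) for n in range(5)]
print(ota_aggregate(grads))   # roughly [2, 2, 2], the average of 0..4
```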

II-2 Delayed gradient averaging

On a separate track, delayed gradient averaging [ZhuLinLu:21] is devised by recognizing that the gradient averaging in FL can be postponed to a future iteration without violating the federated computing paradigm. Under delayed gradient averaging, the UEs send their parameters to each other at the end of each computing round and immediately start the next round of local training. The averaging step is deferred to a later iteration when the aggregated result is received, upon which a gradient correction term is applied to compensate for the staleness. In this manner, communication can be pipelined with computation, hence endowing the system with a high tolerance to communication latency. However, [ZhuLinLu:21] requires each UE to pass its parameter to every other UE for gradient aggregation, which incurs hefty communication overhead, especially when the network grows in size. Even by adopting a server at the edge to take over the aggregation task, communication efficiency remains a bottleneck for the scheme. Toward this end, we incorporate analog gradient descent to circumvent the communication bottleneck of delayed gradient averaging, and show that such a marriage yields very fruitful outcomes.
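The sketch below (again our own illustration; the function name `delayed_gradient_averaging`, the toy quadratic losses, and the delay model are assumptions, and the correction shown is one plausible form of the compensation described above) captures the essence of delayed gradient averaging: every worker keeps taking local steps and, when the stale global average from a few rounds back finally arrives, it swaps the corresponding old local gradient for that average.

```python
import numpy as np
from collections import deque

def delayed_gradient_averaging(grad_fns, w0, eta=0.1, delay=2, steps=20):
    """Hedged sketch: each worker trains on its own gradient without waiting;
    when the global average from `delay` rounds ago arrives, it applies a
    correction that replaces the stale local gradient with that average."""
    n = len(grad_fns)
    w = [np.array(w0, dtype=float) for _ in range(n)]
    pending = deque()                         # (per-worker local grads, global average)
    for _ in range(steps):
        local = [g(w[i]) for i, g in enumerate(grad_fns)]
        for i in range(n):                    # local step taken immediately
            w[i] = w[i] - eta * local[i]
        pending.append((local, np.mean(local, axis=0)))
        if len(pending) > delay:              # the stale average has now "arrived"
            old_local, old_avg = pending.popleft()
            for i in range(n):                # correction for the stale round
                w[i] = w[i] - eta * (old_avg - old_local[i])
    return w

# Toy quadratic losses f_i(w) = 0.5 * ||w - c_i||^2 with different centers c_i.
centers = (np.zeros(2), np.ones(2), 2 * np.ones(2))
grad_fns = [lambda w, c=c: (w - c) for c in centers]
print(delayed_gradient_averaging(grad_fns, w0=[5.0, -3.0]))
```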

Notation                       Definition
; Number of UEs in the network; a set of orthonormal waveforms
; Number of SGD iterations in one local computing round; number of local computing rounds in one global communication round
; Global loss function; its gradient
; Local loss function of a UE; its gradient
; Analog signal sent out by a UE in a given communication round; analog signal received by the AP in that round
; Transmit power of a UE; channel fading experienced by a UE
; Noisy gradient received at the AP; electromagnetic interference, which follows an α-stable distribution
Learning rate of the algorithm
Tail index of the heavy-tailed interference
; Signed power of a vector; vector norm adopted in the analysis
TABLE I: Notation Summary

III System Model

We consider a wireless network consisting of one AP and $N$ UEs, as depicted in Fig. 1(b). Each UE holds a loss function that is constructed based on its local dataset. The goal of all the UEs is to jointly minimize a global objective function. More formally, they need to cooperatively find a vector that satisfies the following:

$\min_{\boldsymbol{w} \in \mathbb{R}^d} F(\boldsymbol{w}) \triangleq \frac{1}{N} \sum_{n=1}^{N} f_n(\boldsymbol{w})$   (1)

The solution to (1) is commonly known as the empirical risk minimizer, denoted by

$\boldsymbol{w}^{*} = \arg\min_{\boldsymbol{w} \in \mathbb{R}^d} F(\boldsymbol{w})$   (2)

In order to obtain the minimizer, the UEs need to conduct local training and periodically exchange the parameters for a global update. Because the AP is not equipped with a computing unit, conventional FL training schemes that rely on an edge server to perform the intermediate global aggregation and model improvement seem inapplicable in this context. That said, we will show in the sequel that by adopting analog over-the-air computing [NazGas:07IT], one can devise an FL-like model training method that is communication efficient, highly scalable, and has the same convergence rate as the paradigms that have an edge server.

1:  Parameters: number of SGD steps per local computing round; learning rate for each round of stochastic gradient descent.
2:  Initialize: All UEs agree on a common, randomly generated initial parameter vector.
3:  for each communication round do
4:     for each UE in parallel do
5:        Set the initial local parameter as follows:
(3)
in which the quantities involved are, respectively, the iteration index within the round, the parameters received from the AP, and the locally aggregated gradient of the previous round.
6:        for each local SGD iteration do
7:           Sample a training example uniformly at random from the local dataset, and update the local parameter as follows:
(4)
8:        end for
9:        Compute the locally aggregated gradient of the current round, modulate it onto the set of orthonormal waveforms, and send it out to the AP simultaneously with the other UEs.
10:     end for
11:     The AP passes the received signal through a bank of matched filters and arrives at the following:
(5)
where the aggregation is weighted by the channel fading experienced by each UE and corrupted by the electromagnetic interference. The AP then feeds the result back to all the UEs in a broadcast manner.
12:  end for
13:  Output: the final model parameter.
Algorithm 1 Server Free Wireless Federated Learning
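The following sketch (our own reconstruction under the stated system assumptions; the function name `sfwfl`, the uniform fading model, and the Gaussian stand-in for the interference are ours) illustrates the flow of Algorithm 1: every UE runs a batch of local SGD steps, all UEs transmit their accumulated gradients at once, the AP's matched filters output a faded and noisy sum, and each UE re-aligns its model by replacing its locally accumulated gradient with the global one.

```python
import numpy as np

def sfwfl(grad_fns, w0, rounds=50, K=5, eta=0.05, noise_std=0.01, rng=None):
    """Hedged sketch of the SFWFL loop: K local SGD steps per UE, simultaneous
    analog upload, matched-filter aggregation at the AP, and replacement of
    the locally accumulated gradient by the noisy global one."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(grad_fns)
    w = np.array(w0, dtype=float)              # parameters are re-aligned every round
    for _ in range(rounds):
        local_acc = []
        for n in range(N):                      # local computing at each UE
            w_n, acc = w.copy(), np.zeros_like(w)
            for _ in range(K):
                g = grad_fns[n](w_n)
                w_n -= eta * g
                acc += g                        # locally accumulated gradient
            local_acc.append(acc)
        # Analog over-the-air aggregation: fading-weighted sum plus noise.
        fading = rng.uniform(0.5, 1.5, size=N)  # unit-mean fading (illustrative)
        noisy_sum = sum(h * g for h, g in zip(fading, local_acc))
        g_global = noisy_sum / N + noise_std * rng.standard_normal(w.shape)
        # Every UE swaps its local accumulated gradient for the global one,
        # which collapses all local trajectories back onto a common iterate.
        w = w - eta * g_global
    return w

# Toy usage: three UEs with quadratic losses f_n(w) = 0.5 * ||w - c_n||^2.
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
fns = [lambda w, c=c: (w - c) for c in centers]
print(sfwfl(fns, w0=[4.0, 4.0]))
```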

IV Server Free Federated Model Training: Vanilla Version

In this section, we detail the design and analysis of a model training paradigm that achieves similar performance to FL without the help of an edge server. Owing to such a salient feature, we coin this scheme Server Free Wireless Federated Learning (SFWFL). We summarize the general procedures of SFWFL in Algorithm 1 and elaborate on the major components below.

IV-A Design

Similar to conventional FL, SFWFL involves local training at the UEs, global communication of intermediate parameters, and feedback of the aggregated results.

IV-A1 Local Training

Before the training commences, all UEs negotiate amongst each other on an initial parameter that is randomly generated. Then, every UE conducts a number of SGD iterations based on its own dataset and uploads the locally aggregated gradient to the AP. The AP (automatically) aggregates the UEs’ gradients by means of analog over-the-air computing, which will be elucidated soon, and feeds back the resultant parameters to all the UEs. Upon receiving the globally aggregated gradient, every UE replaces the locally aggregated gradient by this global parameter, as per (5), and proceeds to the next round of local computing in accordance with (4).

It is important to note that by replacing the local gradients with the global one, the UEs’ model parameters are aligned at the beginning of each local computing stage. As such, if the model training converges, every UE will have its parameters approach the same value.

IV-A2 Global Communication

During the $t$-th round of global communication, UE $n$ gathers the stochastic gradients calculated in the current computing round into the locally aggregated gradient $\boldsymbol{g}_{n,t} = \big( g_{n,t}^{(1)}, \ldots, g_{n,t}^{(d)} \big)^{\mathsf{T}}$, and constructs the following analog signal:

$x_{n,t}(s) = \big\langle \boldsymbol{g}_{n,t}, \boldsymbol{u}(s) \big\rangle = \sum_{i=1}^{d} g_{n,t}^{(i)}\, u_i(s)$   (6)

where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors, and $\boldsymbol{u}(s) = \big( u_1(s), \ldots, u_d(s) \big)^{\mathsf{T}}$ is a set of orthonormal baseband waveforms that satisfies:

$\int u_i(s)\, u_j(s)\, \mathrm{d}s = 0, \quad \forall\, i \neq j$   (7)
$\int u_i(s)^2\, \mathrm{d}s = 1, \quad \forall\, i \in \{1, \ldots, d\}$   (8)

In essence, operation (6) modulates the amplitude of each waveform $u_i(s)$ according to the $i$-th entry of $\boldsymbol{g}_{n,t}$ and superposes the resulting signals into one analog waveform. Once the transmit waveforms have been assembled, the UEs send them out into the spectrum concurrently. We consider that the UEs employ power control to compensate for the large-scale path loss, while the instantaneous channel fading is unknown.

Notably, since the waveform basis is independent of the number of UEs, this architecture is highly scalable. In other words, all the UEs can participate in every round of local training and global communication regardless of how many UEs there are in the network.
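As a concrete illustration of (6)-(8), the following sketch (our own; a discrete-time orthonormal basis stands in for the continuous-time waveforms $u_i(s)$, and all variable names are ours) shows how the gradient entries amplitude-modulate the basis and how correlating the received waveform against the same basis, i.e., matched filtering, recovers the entries.

```python
import numpy as np

# A random orthogonal basis plays the role of the orthonormal waveforms u_i(s);
# each gradient entry scales one basis "waveform" and the results are superposed.
rng = np.random.default_rng(0)
d, T = 4, 64                                           # gradient dimension, samples per symbol
basis, _ = np.linalg.qr(rng.standard_normal((T, d)))   # columns are orthonormal
grad = rng.standard_normal(d)                          # one UE's accumulated gradient

tx_waveform = basis @ grad          # sum_i grad[i] * u_i  -> the analog transmit signal
recovered = basis.T @ tx_waveform   # matched filtering: correlate with every u_i
print(np.allclose(recovered, grad)) # True: the entries are recovered exactly
```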

IV-A3 Gradient Aggregation

The analog signals travel through the wireless medium and accumulate at the AP’s RF front end. The received waveform can be expressed as follows (we consider the waveforms of different UEs to be synchronized; the issue of signal misalignment can be addressed via [ShaGunLie:21TWC]):

$y_t(s) = \sum_{n=1}^{N} h_{n,t}\, x_{n,t}(s) + \xi_t(s)$   (9)

where $h_{n,t}$ is the channel fading experienced by UE $n$ and $\xi_t(s)$ stands for the interference. Without loss of generality, we assume the channel fading is independent and identically distributed (i.i.d.) across the UEs and communication rounds, with unit mean and finite variance. Furthermore, we consider that the interference follows a symmetric $\alpha$-stable distribution [ClaPedRod:20], which is widely used to characterize the statistical properties of interference in wireless networks [Mid:77, WinPin:09, YanPet:03].

The AP passes the analog signal through a bank of matched filters, where each branch is tuned to one element of the waveform basis, and outputs the vector in (5), in which the interference term is a $d$-dimensional random vector whose entries are i.i.d. and follow an $\alpha$-stable distribution. The AP then broadcasts the result back to all the UEs. Owing to the high transmit power of the AP, we assume the global parameters can be received without error by all the UEs. Then, the UEs return to the local training step and launch a new round of local computing.

The most remarkable feature of this model training paradigm is that it does not require an edge server to conduct global aggregation and/or model improvement. Instead, the AP exploits the superposition property of wireless signals to achieve fast gradient aggregation through a bank of matched filters. At the UE side, the locally aggregated gradient is replaced with the global one at the beginning of each local computing round to align the model parameters. As will be shown next, the training algorithm converges even though the global gradients are highly distorted. Apart from being server free, it is noteworthy that since the UEs do not need to compensate for the channel fading, they can transmit at a relatively constant power level, which saves hardware cost. Additionally, the random perturbation from fading and interference provides inherent privacy protection to the UEs’ gradient information [ElgParIss:21].
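The sketch below (our own illustration; the gamma fading model, the interference scale, and the use of SciPy's `levy_stable` are assumptions, and the exact scaling in (5) may differ) mimics the matched-filter output: a fading-weighted average of the UEs' gradients corrupted by symmetric $\alpha$-stable interference, where $\alpha = 2$ corresponds to the Gaussian case and smaller $\alpha$ gives heavier tails.

```python
import numpy as np
from scipy.stats import levy_stable

def matched_filter_output(local_grads, alpha=1.8, scale=0.05, rng=None):
    """Hedged sketch of the aggregated gradient at the AP: a fading-weighted
    average of the UEs' gradients plus symmetric alpha-stable interference
    in every coordinate (beta = 0 gives the symmetric case)."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = len(local_grads), len(local_grads[0])
    fading = rng.gamma(shape=4.0, scale=0.25, size=N)     # unit-mean fading (illustrative)
    aggregated = sum(h * g for h, g in zip(fading, local_grads)) / N
    interference = scale * levy_stable.rvs(alpha, 0.0, size=d, random_state=rng)
    return aggregated + interference

# With alpha = 2 the interference is Gaussian; smaller alpha means heavier tails.
grads = [np.full(3, v) for v in (1.0, 2.0, 3.0)]
print(matched_filter_output(grads, alpha=2.0))
print(matched_filter_output(grads, alpha=1.2))
```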

IV-B Analysis

In this part, we derive the convergence rate to quantify the training efficiency of SFWFL.

IV-B1 Preliminary Assumptions

To facilitate the analysis, we make the following assumptions.

Assumption 1

The objective functions are $\mu$-strongly convex, i.e., for any $\boldsymbol{w}, \boldsymbol{w}' \in \mathbb{R}^d$, the following is satisfied:

$f_n(\boldsymbol{w}) \geq f_n(\boldsymbol{w}') + \big\langle \nabla f_n(\boldsymbol{w}'), \boldsymbol{w} - \boldsymbol{w}' \big\rangle + \frac{\mu}{2} \| \boldsymbol{w} - \boldsymbol{w}' \|^2$   (10)
Assumption 2

The objective functions are $L$-smooth, i.e., for any $\boldsymbol{w}, \boldsymbol{w}' \in \mathbb{R}^d$, the following is satisfied:

$\| \nabla f_n(\boldsymbol{w}) - \nabla f_n(\boldsymbol{w}') \| \leq L \, \| \boldsymbol{w} - \boldsymbol{w}' \|$   (11)
Assumption 3

The stochastic gradients are unbiased and have bounded second moments, i.e., there exists a constant $G > 0$ such that the following holds:

$\mathbb{E}\big[ \tilde{\nabla} f_n(\boldsymbol{w}) \big] = \nabla f_n(\boldsymbol{w}), \qquad \mathbb{E}\big[ \| \tilde{\nabla} f_n(\boldsymbol{w}) \|^2 \big] \leq G^2$   (12)

Because the interference follows an $\alpha$-stable distribution, which has finite moments only up to order $\alpha$, the variance of the globally aggregated gradient in (5) may be unbounded. As such, conventional approaches that rely on the existence of second moments cannot be directly applied. In order to establish a universally applicable convergence analysis, we opt for a lower-order norm as an alternative. Based on this metric, we introduce two concepts from [WanGurZhu:21], namely the signed power and a generalized notion of positive definiteness, below.
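The following quick experiment (our own illustration; the parameters are arbitrary) shows why the analysis cannot lean on variances: as the tail index $\alpha$ of the symmetric $\alpha$-stable interference drops below 2, the extreme values in a sample grow dramatically, reflecting the fact that moments of order $\alpha$ and above no longer exist.

```python
import numpy as np
from scipy.stats import levy_stable

# Compare extreme values of symmetric alpha-stable samples for several tail
# indices; alpha = 2 is the Gaussian case, smaller alpha means heavier tails.
rng = np.random.default_rng(1)
for alpha in (2.0, 1.6, 1.2):
    xi = levy_stable.rvs(alpha, 0.0, size=100_000, random_state=rng)
    print(f"alpha = {alpha}: max |xi| ~ {np.max(np.abs(xi)):.1f}")
```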

Definition 1

For a vector $\boldsymbol{x} = (x_1, \ldots, x_d)^{\mathsf{T}}$ and an exponent $p > 0$, we define its signed power as follows:

$\boldsymbol{x}^{\langle p \rangle} = \big( \mathrm{sign}(x_1)\, |x_1|^p, \ldots, \mathrm{sign}(x_d)\, |x_d|^p \big)^{\mathsf{T}}$   (13)

where $\mathrm{sign}(\cdot)$ takes the sign of the variable.
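A minimal numerical illustration of Definition 1 (our own; the function name `signed_power` is ours):

```python
import numpy as np

def signed_power(x, p):
    """Element-wise signed power from Definition 1: sign(x_i) * |x_i|**p."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.abs(x) ** p

print(signed_power([-2.0, 0.5, 3.0], 1.5))   # -> [-2.828...,  0.353...,  5.196...]
```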

Definition 2

A symmetric matrix $\boldsymbol{H}$ is said to be $p$-positive definite if $\big\langle \boldsymbol{x}^{\langle p \rangle}, \boldsymbol{H} \boldsymbol{x} \big\rangle > 0$ for all $\boldsymbol{x} \in \mathbb{R}^d$ with $\boldsymbol{x} \neq \boldsymbol{0}$.

Armed with the above definitions, we make another assumption as follows.

Assumption 4

For any given vector $\boldsymbol{w}$, the Hessian matrix of the global loss function $F(\boldsymbol{w})$ is positive definite in the sense of Definition 2.

Furthermore, since each element of the interference vector has finite moments of order below $\alpha$, we consider this moment to be upper bounded by a constant.

IV-B2 Convergence Rate of SFWFL

We lay out two technical lemmas that will be used extensively in the derivation.

Lemma 1

Given , for any , the following holds:

(14)
Proof:

Please refer to [Kar:69].

Lemma 2

Let $\boldsymbol{H}$ be a positive definite matrix in the sense of Definition 2. Then there exists a positive constant such that

(15)
Proof:

Please see Theorem 10 of [WanGurZhu:21].

Since the UEs’ model parameters are aligned at the beginning of each local computing round, we denote this common value as the global model parameter and present the first theoretical finding below.

Theorem 1

Under the employed wireless system, if the learning rate is chosen appropriately, then Algorithm 1 converges as:

(16)
Proof:

See the appendix for the proof.

We highlight a few important observations from this result.

Remark 1

If the interference follows a Gaussian distribution, i.e., $\alpha = 2$, SFWFL converges at the same order as algorithms run under federated edge learning systems [SerCoh:20TSP]. As such, the proposed model training algorithm can attain the same efficacy as conventional edge learning without requiring a server.