Next-generation computing networks will encounter a paradigm shift from the conventional cloud computing setting, which aggregates computational resources in a data center, to edge computing systems, which largely deploy computational power at the network edges to meet the needs of applications that demand very high bandwidth and low latency, and to support resource-constrained nodes reachable only over unreliable network connections [MaoYouZha:17, YanQue:17, DinTanLa:17, LeeLeeQue:19]. Along with the burgeoning development of machine learning, it is expected that by leveraging the computing capability of the edge nodes, usually access points (APs), future networks will be able to utilize local data to conduct intelligent inference and control for many activities, e.g., learning the activities of mobile phone users, predicting health events from wearable devices, or detecting burglaries within smart homes [ZhuLiuDu:18, ParSamBen:18]. Due to the sheer volume of data generated every day, the growing computational capability of end devices, and the increasing concerns about sharing the private data held on them, it becomes more attractive to perform learning directly on the user equipments (UEs) as opposed to sending raw data to an AP. To this end, a new machine learning paradigm has emerged, namely federated learning (FL), which decouples data acquisition from computation at the central unit [KonMcMRam:15, KonMcMBren:16, MaMMooRam:16]. Specifically, as illustrated by Fig. 1, an FL system optimizes a global model by repeating the following steps: i) the UEs perform local computing with their own data to minimize a predefined empirical risk function and upload the trained weights to the AP, ii) the AP collects the updates from the UEs and consults the FL unit to produce an improved global model, and iii) the output of the FL model is redistributed to the UEs, which conduct further local training by using the global model as a reference.
In this fashion, the central unit, i.e., the AP, is able to train a statistical model from the data stored on a swarm of end devices, i.e., the UEs, without sacrificing their privacy. As such, FL promises smarter models, lower latency, and lower power consumption, all while preserving privacy. These properties make FL one of the most promising technologies for future intelligent networks.
Nonetheless, to make FL possible, one needs to tackle new challenges that require a fundamental departure from the standard methods designed for distributed optimization [KonMcMRam:15]. In particular, unlike traditional machine learning systems, where an algorithm runs on a large data set partitioned homogeneously across multiple servers in the cloud, an FL model is usually trained on a large non-i.i.d., and often unbalanced, data set generated by distinct distributions across different UEs. Just as crucial is what happens at the parameter update stage: while the iterative algorithms run in FL require low-latency, high-throughput connections between computing units, the AP generally needs to link a vast number of UEs through a resource-constrained spectrum, and thus can allow only a limited number of UEs to send their trained weights via unreliable channels per global aggregation. These challenges make issues such as stragglers and fault tolerance significantly more prevalent in FL than in conventional training in data centers. A successful deployment of FL therefore requires network operators to adopt new tools and a new way of thinking: model development and training with no direct access to the raw data, and with communication cost as a limiting factor [NisYon:18, WanHanWan:18].
In response, considerable research has been carried out, which can be mainly categorized into two directions: algorithmic and communication. From the algorithmic perspective, the idea is to reduce the overhead in the update uploading phase to make model training communication-efficient. Typical methods include reducing the communication bandwidth by collecting updates only from the UEs with significant training improvement [CheGiaSun:18], compressing the gradient vectors via quantization [AjiHea:17], and adopting the momentum method in sparse updates to accelerate the training process [LinHanMao:18]. Recognizing that the unique properties of the wireless channel had not been fully explored, another line of research has followed from the communication perspective. In particular, when the training time is limited, the proposed solutions adapt the number of local computing steps to the variance of the global gradient [WanTuoSal:18, WanTuoSal:19JSAC, WanHanWan:18], or schedule the maximum number of UEs in a given time frame [NisYon:18]. When spectrum resources become the communication bottleneck, new methods exploit the compute-over-air mechanism and arrive at a joint decode-and-average scheme at the edge computing unit [ZhuWanHua:18, YanJiaShi:18]. Moreover, for the case where perfect channel state information (CSI) is not available at the receiver, the trade-off between delay and the number of users selected for parameter update has also been investigated [HaZhaSim:19]. In these prior works, communication is assumed to take place in a single-cell scenario, where received signals are affected only by additive noise and thus can be correctly decoded upon each global aggregation. However, to fully realize the potential of federated learning, it is necessary to scale up the deployment across a large distributed network. In this context, due to the shared nature of the wireless medium, communications are subject to inter-cell interference and can inevitably fail. Additionally, since the spectrum resource is generally limited, one needs to appropriately schedule the UEs for channel access upon each global update. Therefore, for the successful delivery of FL in large-scale wireless networks, a complete understanding of its performance when operating under different scheduling schemes with unreliable communication links becomes essential.
I-A Approach and Summary of Contributions
In this paper, we develop an analytical framework to study the impact of different scheduling policies on the performance of FL in large-scale wireless networks. Specifically, we model the AP deployment and UE locations as independent Poisson point processes (PPPs), where every UE possesses a private data set and each AP needs to collaboratively learn a statistical model with its associated UEs through FL. Recognizing the potential inefficiency of the conventional FL training approach [WanTuoSal:19JSAC], we leverage methods from distributed coordinate descent [MaKonJag:17] and propose an algorithm that decouples the global averaging at the AP from the local computing at each UE, where the partial solutions from the UEs constitute a proximal step toward the global optimum that implicitly accelerates the convergence. By leveraging tools from optimization theory and stochastic geometry [YanQue:ICC19, YanQue:19, YanGerQue:16], we derive tractable expressions for the FL convergence rate in a general setting that accounts for the employed scheduling policy and the inter-cell interference that affects the data transmission phases. Our main contributions are summarized below.
We propose an algorithm to train an FL model in the context of wireless networks. The algorithm decomposes a global statistical model into a number of local subproblems that can be efficiently solved using only the data set residing on each UE, and the solution of each local problem constitutes a proximal step toward the global optimum, which has the potential to accelerate the convergence rate. Moreover, the learning rate of each UE is set to adapt to changes in the communication environment.
We develop a formal framework to analyze the convergence performance of FL algorithms run in wireless networks. Our analysis provides a tractable expression of the convergence rate, which takes into account the key features of a wireless communication system, including the transmission scheduling policy, small-scale fading, large-scale path loss, and inter-cell interference.
We present the convergence rate of FL under three practical scheduling policies, i.e., random scheduling (RS), round robin (RR), and proportional fair (PF). We also analyze the convergence rate of FL in three special cases where i) only one UE can be scheduled upon each global aggregation, ii) the AP collects more updates by allowing multiple communications before each global aggregation, and iii) all UEs send out the trained weights without scheduling in every communication round.
Through our analysis, we show that under a high SINR threshold, running FL with PF outperforms RS and RR in terms of convergence rate, whereas RR is preferable when the SINR threshold is low. Moreover, for networks operating under a very low SINR threshold, sending trained weights without scheduling can achieve a better FL convergence rate than any of the scheduling methods. The FL convergence rate is shown to decrease rapidly as the SINR threshold increases, thus confirming the importance of compression and quantization of the update parameters.
Our analysis also reveals that under a fixed amount of available spectrum, there exists a trade-off between the number of scheduled UEs and subchannel bandwidth in the optimization of FL convergence rate, which allows further design operations.
The remainder of this paper is organized as follows. We introduce the system model in Section II. In Section III, we detail the local computing and parameter update process to run FL in wireless networks. In Section IV, we analyze the convergence rate of federated learning under various scheduling policies. We show the numerical results in Section V to compare the effectiveness of different scheduling methods and obtain design insights. We conclude the paper in Section VI.
|;||PPP modeling the location of APs; the AP spatial deployment density|
|; ;||Number of associated UEs per AP; number of the subchannels; UE number over subchannel number ratio, i.e.,|
|;||UE transmit power; path loss exponent|
|;||SINR received from UE at communication round ; the SINR decoding threshold|
|;||Instantaneous SNR of UE at communication round ; time average SNR of UE till communication round|
|;||Data set of UE ; size of the data set|
|;||Loss function on data point ; regularization function|
|;||Conjugate function of ; conjugate function of|
|; ;||Smoothness of the loss function ; convexity of the regularizer ; partition difficulty of data set|
|;||The objective function; optimization vector of the primal problem|
|;||The dual form of the objective function; the dual variables|
|;||Local learning rate; error level of the local solution|
|;||Indicator of selection state of UE at communication round , which takes value 1 if the UE is selected and 0 otherwise; parameter update success probability|
II System Model
In this section, we introduce the network topology and propagation model, the generic procedure of FL, and the scheduling policies. The main notations used throughout the paper are summarized in Table I.
II-A Network Structure and Propagation Channel
Let us consider a wireless network that consists of APs and UEs, as depicted in Fig. 1. The locations of APs follow a homogeneous PPP with spatial density . We assume each AP has associated UEs uniformly distributed within its Voronoi cell (Footnote 1: This is equivalent to the maximum average power association rule, and we fix the total number of UEs in each cell to reduce notational complexity. Note that relaxing this assumption does not change the conclusions drawn in this paper.). In this network, a fixed amount of spectrum is equally divided into radio access channels, where . We consider each AP to be equipped with a single antenna and a computing processor. A generic UE is equipped with a single antenna and has a local data set with sample points, where denotes the cardinality of a set. Each UE also has the capability of performing local training.
In this network, all the UEs transmit with a constant power (Footnote 2: We unify the transmit power for notational simplicity. Nonetheless, note that the analysis of this paper can be extended to account for power control in a straightforward way [ElSHos:14].). We adopt a block-fading propagation model, where the channels between any pair of antennas are assumed independent and identically distributed (i.i.d.) and quasi-static, i.e., the channel is constant during one transmission block and varies independently from block to block. We consider all propagation channels to be narrow-band and affected by two attenuation components, namely small-scale Rayleigh fading with unit mean power, and large-scale path loss that follows a power law. Moreover, in the interest of spectral efficiency, we assume the whole spectrum is reused in every cell.
II-B Federated Learning
At each AP, the goal is to learn a statistical model over data that reside on the associated UEs, i.e., the AP needs to fit a vector so as to minimize a particular loss function by using the whole data set from all the UEs under its service. Formally, this task can be expressed as
where is the size of the whole data set, is the regularizing parameter and a deterministic penalty function. Common choices for include the L-2 penalty , the L-1 penalty , or a family of folded concave functions [Zha:10]. The function represents the loss function associated with data point . Several examples of loss functions used in popular machine learning models are summarized in Table II.
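For orientation, a regularized empirical risk minimization problem of the kind described here is commonly written in the form below. This is a sketch under assumptions: the symbols $w$, $x_i$, $n$, $\lambda$, $r$, and $\ell_i$, as well as the linear dependence on $x_i^{\top} w$, are chosen for illustration and may differ from the notation of Table I.

```latex
% Assumed notation: w = model vector, x_i = feature vector of data point i,
% n = total data set size, \lambda = regularizing parameter, r = penalty function.
\min_{w \in \mathbb{R}^d} \; P(w)
  = \frac{1}{n} \sum_{i=1}^{n} \ell_i\!\left(x_i^{\top} w\right) + \lambda\, r(w),
\qquad
r(w) = \tfrac{1}{2}\|w\|_2^2 \ \ \text{(L-2 penalty)},
\quad
r(w) = \|w\|_1 \ \ \text{(L-1 penalty)}.
```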
If the data set is completely available at the AP, problem (1) can be easily solved via a number of machine learning algorithms. However, such a data set is generally unavailable in a real-world setting because i) the amount of data at each UE is usually enormous and uploading it can be daunting due to energy consumption and limited bandwidth, and more importantly, ii) the data at each UE may contain highly sensitive information, e.g., medical records, words typed in messenger apps, or web browsing history, that users are unwilling to share. As such, the FL algorithm has emerged, in which the data collection process is decoupled from the global model training. The general procedure of FL is summarized in Algorithm 1. In particular, each UE downloads a global model from the AP to conduct stochastic gradient descent (SGD) per equation (3), aiming to minimize the objective function by using only information from the globally shared vector and data set (note that such data set is private). The AP periodically collects all the trained parameters from the UEs to produce a global average and then redistributes the improved model back to the UEs. After a sufficient number of training and update exchanges between the AP and its associated UEs, usually termed communication rounds, the objective function (1) is able to converge to the global optimum. When all the updates can be correctly received by the AP in every communication round, the convergence property of FL has been quantitatively demonstrated [KonMcMRam:15]. However, as the FL algorithm is generally run in a wireless setting where updates are sent through a shared spectrum, which is unreliable due to random fading and inter-cell interference, updates from some UEs can be lost during the data transmission phase. Moreover, because the wireless medium is usually resource-constrained, the AP needs to select a subgroup of UEs for parameter updates in each communication round.
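The download, local SGD, and global averaging loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: it assumes a squared loss with an L-2 penalty, error-free uploads from every UE in every round, and averaging weighted by data set size. All function names and parameter values (`local_sgd`, `federated_averaging`, `make_ue_data`, the step size, the number of local steps) are hypothetical.

```python
import random


def local_sgd(w, data, lr=0.05, steps=20, lam=0.01, rng=None):
    """Run a few SGD steps on one UE's private data, starting from the global model w."""
    rng = rng or random.Random(0)
    w = list(w)
    for _ in range(steps):
        x, y = rng.choice(data)
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Gradient of 0.5 * (x.w - y)^2 + 0.5 * lam * ||w||^2 (assumed loss and penalty).
        w = [wi - lr * (err * xi + lam * wi) for wi, xi in zip(w, x)]
    return w


def federated_averaging(ue_data, dim, rounds=50, seed=0):
    """Alternate local training at the UEs with global averaging at the AP."""
    rng = random.Random(seed)
    w_global = [0.0] * dim
    total = sum(len(d) for d in ue_data)
    for _ in range(rounds):
        # Each UE trains locally; only model weights, never raw data, are uploaded.
        local_models = [local_sgd(w_global, d, rng=rng) for d in ue_data]
        # The AP averages the uploaded weights, weighting each UE by its data set size.
        w_global = [
            sum(len(d) * w[j] for d, w in zip(ue_data, local_models)) / total
            for j in range(dim)
        ]
    return w_global


def make_ue_data(num_ues=5, samples=30, seed=1):
    """Synthetic non-i.i.d. data: each UE's first feature is centered differently."""
    rng = random.Random(seed)
    true_w = [2.0, -1.0]
    data = []
    for k in range(num_ues):
        shift = 0.2 * (k - num_ues / 2)
        pts = []
        for _ in range(samples):
            x = [rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)]
            y = sum(a * b for a, b in zip(true_w, x)) + rng.gauss(0.0, 0.01)
            pts.append((x, y))
        data.append(pts)
    return data, true_w
```

Calling `federated_averaging(make_ue_data()[0], dim=2)` returns the averaged model after the default number of rounds. The sketch deliberately omits scheduling and transmission failures, which are the subject of the following sections.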
Apart from the scheduling issue, the vanilla training approach per Algorithm 1 may suffer from slow convergence [ZhaFenYan:19WCM], especially when the loss function or regularizer has a complicated form. Furthermore, the duration of local training in Algorithm 1 needs to be carefully designed so as to ensure the local solutions do not diverge from the global model [WanTuoSal:19JSAC]. As a result, the local training period needs to be kept short, which may incur a large number of communication rounds and is not desirable. In that respect, we propose an algorithm, elaborated in Section III, that presents a more suitable alternative for training FL in a wireless setup.
II-C Scheduling Policies
In many real-world systems, communicating data between machines is several orders of magnitude slower than reading data from main memory and performing local computing [LanLeeZho:17]. Hence, sequentially uploading the trained parameters from all UEs before global aggregation, as proposed in [NisYon:18], can lead to large overhead in communication time and is not desirable. Instead, the AP should select only a subgroup of UEs and let them update their parameters simultaneously so as to keep the communication time within an acceptable range. To this end, the scheduling policy plays a crucial role in assigning the resource-limited radio channels to the appropriate UEs. In the following, we denote the UE number over subchannel number ratio (Footnote 3: For simplicity, we assume is a multiple of . In more general scenarios where is not an integer, we can choose . The denotes the ceiling function.) and consider three practical policies as our scheduling criteria [YanWanQue:18, ChoBah:07]:
Random Scheduling (RS): In each communication round, the AP uniformly selects the associated UEs at random for parameter update; each selected UE is assigned a dedicated subchannel to transmit the trained parameter.
Round Robin (RR): The AP arranges all the UEs into groups and consecutively assigns each group to access the radio channels and update their parameters per communication round.
Proportional Fair (PF): During each communication round, the AP selects out of the associated UEs according to the following policy:
where is a length- vector and represents the indexes of the selected UEs. The quantities and are the instantaneous and time-average signal-to-noise ratios (SNRs) of UE at communication round , respectively [ChoBah:07].
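The three policies above can be sketched as plain selection functions. This is an illustrative sketch only: the function names, the 0-based UE indexing, and the use of a simple top-N rule for PF (picking the UEs with the largest instantaneous-to-average SNR ratio) are assumptions, not taken from the paper.

```python
import random


def random_scheduling(num_ues, num_channels, rng):
    """RS: pick num_channels distinct UEs uniformly at random."""
    return sorted(rng.sample(range(num_ues), num_channels))


def round_robin(num_ues, num_channels, round_idx):
    """RR: split the UEs into groups of num_channels and serve the groups cyclically."""
    num_groups = -(-num_ues // num_channels)  # ceiling division
    g = round_idx % num_groups
    return list(range(g * num_channels, min((g + 1) * num_channels, num_ues)))


def proportional_fair(inst_snr, avg_snr, num_channels):
    """PF: pick the UEs with the largest instantaneous-to-average SNR ratio."""
    ratios = [(inst / avg, k) for k, (inst, avg) in enumerate(zip(inst_snr, avg_snr))]
    ratios.sort(reverse=True)
    return sorted(k for _, k in ratios[:num_channels])
```

For example, with six UEs and two subchannels, `round_robin` serves UEs 0 and 1 in round 0, UEs 2 and 3 in round 1, and UEs 4 and 5 in round 2, before cycling back. PF favors a UE whose channel is currently good relative to its own history, even if its absolute SNR is lower than that of other UEs.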
The following sections are devoted to the design of algorithms to run federated learning in wireless networks, as well as the analysis that quantifies the running time of FL under different scheduling policies.
III Distributed Computing and Parameter Update
In this section, we detail the procedure that decomposes the problem from (1) into a number of subproblems which can be solved by using only the local data at each UE. We also describe how local training and update adapt to the scheduling policy. To facilitate the design and analysis, we make the following assumptions on the loss function and the regularizer throughout this paper.
Function is -strongly convex, i.e., and it holds that
where denotes the gradient of the function (Footnote 4: In this paper, we follow the convention and write the definition of strong convexity using the gradient [Bub:15]. Nevertheless, note that strongly convex functions may not be differentiable, in which case one should replace the gradient by a subgradient [Roc:70].).
Functions are -smooth, i.e., and it holds that
where denotes the gradient of the function .
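For reference, the textbook gradient-based forms of these two conditions are given below for generic functions. The symbols $f$, $h$, $\varsigma$, and $L$ are placeholders chosen here for illustration and are not the paper's notation.

```latex
% f is \varsigma-strongly convex: a quadratic lower bound holds for all u, v.
f(u) \;\ge\; f(v) + \nabla f(v)^{\top} (u - v) + \frac{\varsigma}{2}\, \|u - v\|^2 ,
\qquad \forall\, u, v .

% h is L-smooth: a quadratic upper bound holds for all u, v.
h(u) \;\le\; h(v) + \nabla h(v)^{\top} (u - v) + \frac{L}{2}\, \|u - v\|^2 ,
\qquad \forall\, u, v .
```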
III-A Local Decomposition
First of all, using the Fenchel-Rockafellar duality, we can express the local dual optimization problem of (1) in the following way.
The optimization problem (1) can be rewritten in the following duality form
where represents the set of the dual variables, is the total data set, and are the convex conjugate functions of and , respectively, given as follows
We first denote . By using the Lagrangian, we can write the original problem (1) equivalently as follows
Note that when is chosen so as to maximize (III-A), the value of is equivalent to (1) due to the first-order optimality condition [Roc:70]. As such, the result in (III-A) then follows from maximizing the above problem with respect to .
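The primal-dual relationship can be checked numerically on a small instance. The sketch below assumes a ridge regression problem (squared loss with an L-2 penalty), for which the conjugate functions have closed forms, and runs a plain single-machine dual coordinate ascent. The names `primal`, `dual`, `dual_to_primal`, and `sdca`, and the symbols `alpha` and `lam`, are illustrative assumptions rather than the paper's notation.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primal(w, X, y, lam):
    """P(w) = (1/n) * sum_i 0.5 * (x_i.w - y_i)^2 + (lam/2) * ||w||^2."""
    n = len(y)
    return sum(0.5 * (dot(w, x) - yi) ** 2 for x, yi in zip(X, y)) / n + 0.5 * lam * dot(w, w)


def dual_to_primal(alpha, X, lam):
    """Map dual variables to a primal vector: w(alpha) = (1/(lam*n)) * sum_i alpha_i * x_i."""
    n, d = len(alpha), len(X[0])
    return [sum(a * x[j] for a, x in zip(alpha, X)) / (lam * n) for j in range(d)]


def dual(alpha, X, y, lam):
    """D(alpha) = -(1/n) * sum_i l_i^*(-alpha_i) - (lam/2) * ||w(alpha)||^2."""
    n = len(y)
    w = dual_to_primal(alpha, X, lam)
    # Conjugate of the squared loss: l_i^*(u) = 0.5 * u^2 + u * y_i, evaluated at u = -alpha_i.
    conj = sum(0.5 * a * a - a * yi for a, yi in zip(alpha, y)) / n
    return -conj - 0.5 * lam * dot(w, w)


def sdca(X, y, lam, epochs=300, seed=0):
    """Dual coordinate ascent with the closed-form coordinate step for the squared loss."""
    rng = random.Random(seed)
    n, d = len(y), len(X[0])
    alpha, w = [0.0] * n, [0.0] * d
    for _ in range(epochs):
        for i in rng.sample(range(n), n):
            x = X[i]
            # Exact maximizer of D along coordinate i, keeping w = w(alpha) in sync.
            delta = (y[i] - alpha[i] - dot(w, x)) / (1.0 + dot(x, x) / (lam * n))
            alpha[i] += delta
            w = [wj + delta * xj / (lam * n) for wj, xj in zip(w, x)]
    return alpha, w
```

For any `alpha`, the gap `primal(dual_to_primal(alpha, X, lam), X, y, lam) - dual(alpha, X, y, lam)` is non-negative, and it shrinks toward zero as the coordinate ascent proceeds. This gap gives a computable certificate of how close the current iterate is to the optimum.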
The advantage of using the dual formulation in (6) is that it allows us to better separate the global problem into a number of distributed subproblems solvable via federated computing across different UEs. In particular, we define and first decompose into the following form
where with being the coordinates of vector that corresponds to the data set and the other entries are set to zero. As such, for a randomly initialized vector , varying its value by will result in the following change to (III-A)
Notably, the changes in the second term of the above equation correspond to only the data set of each local UE , while the first term involves all the global variations. Because is -strongly convex, we know that is -smooth [HirJeaLem:12, Theorem 4.2.1] and can thus bound as follows
To this end, if each UE can optimize using its own data set so as to maximize the right hand side (R.H.S.) of (III-A), the resultant improvements can be combined to direct toward the optimal value (Footnote 5: Instead of directly solving the original optimization problem, we solve for an approximated surrogate, which is advantageous due to the savings per communication round and the fact that solutions with extremely high accuracy are not necessary for machine learning in practice.). To be more concrete, during any communication round , the AP produces by using updates received from the last round and broadcasts that to all the UEs. The task at any given UE is to solve for that maximizes the following
and then send the parameter to the AP. The AP then updates the global vector as . As such, by alternately updating and on the global and local sides, respectively, it is expected that the solutions to the dual problem can be improved at every step, which guarantees that the original problem converges to the optimum.
It is important to note that unlike (3), the subproblem (III-A) is simple in the sense that it is always a quadratic objective (apart from the term). The subproblem does not depend on the function itself, but only on its linearization at the shared vector . This property additionally simplifies the task of local solvers, especially when function takes a complicated form. Moreover, if the local problems were solved exactly, this can be interpreted as a data-dependent block separable proximal step, which is known as a method to accelerate the learning process.
The requirement for such a decomposition method to work is that during each global aggregation, the changes in the local variables on each UE and those in the global vector are kept consistent. However, because the wireless channels are generally unreliable, updates can be lost during the data transmission phase, which leads to misalignment between the global and local parameters. In the following, we develop an algorithm that adapts the local training at each UE to the communication condition in the global parameter updating phase.
III-B Parameter Updates
During a typical communication round , in order to update the parameter from a generic UE to the global AP, two conditions need to be simultaneously satisfied: i) the UE is selected by the AP, and ii) the transmitted data is successfully decoded. In that respect, we first introduce as a selection indicator, with specifying the employed scheduling policy, where corresponds to the event that UE is chosen by the AP for transmission and otherwise.
Next, we characterize the transmission quality of the wireless links. Note that although the depicted wireless network contains infinitely many APs, thanks to the stationarity of the PPP, the FL convergence rate of each AP is statistically equivalent. As such, by applying Slivnyak’s theorem to the stationary PPP of APs, it is sufficient to evaluate the SINR of a typical AP at the origin [BacBla:09, Hae:12]. For the signal transmitted from a UE located at , the SINR received at the typical AP takes the following form:
where is the path loss exponent, is the small-scale fading, is the variance of the additive Gaussian noise, and represents the locations of the out-of-cell UEs that interfere with the typical AP. In order for the AP to successfully decode the updates from UE , it is required that the received SINR exceeds a decoding threshold , i.e., . Since the updated parameters from each UE have the same size, we assume the APs adopt a unified SINR decoding threshold in this network.
In any typical communication round, the probability of a generic UE being selected by its tagged AP depends on the scheduling policy employed. On the other hand, since both the signal strength and the interference received at a given AP are governed by a number of stochastic processes, e.g., the random spatial distribution of AP/UE locations and small-scale fading, the resulting SINR is a random variable. As such, we define the following quantity, termed as the parameter update success probability, to characterize the transmission performance in each update
This variable fully captures the key aspects for the successful update of parameters in each UE, and, as we will show later on, plays a critical role in the convergence analysis.
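As a rough illustration of this quantity, the joint probability of being selected and exceeding the SINR threshold can be estimated by Monte Carlo simulation. The sketch below rests on simplifying assumptions that are not part of the paper's model: random scheduling with a selection probability equal to the subchannel-to-UE ratio, a fixed distance between the UE and its AP, and interferers drawn from a PPP restricted to an annulus around the typical AP. All names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for moderate means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def update_success_prob(theta, num_ues=10, num_channels=5, dist=0.2, density=1.0,
                        r_min=0.5, r_max=5.0, alpha=3.8, noise=1e-3,
                        trials=5000, seed=0):
    """Estimate P(UE selected and SINR > theta) under the simplified model above."""
    rng = random.Random(seed)
    area = math.pi * (r_max ** 2 - r_min ** 2)
    successes = 0
    for _ in range(trials):
        # Random scheduling: the UE obtains a subchannel with probability N / K.
        if rng.random() >= num_channels / num_ues:
            continue
        # Rayleigh fading gives an exponentially distributed power gain with unit mean.
        signal = rng.expovariate(1.0) * dist ** (-alpha)
        interference = 0.0
        for _ in range(sample_poisson(density * area, rng)):
            # Interferer distance, uniform by area over the annulus [r_min, r_max].
            r = math.sqrt(r_min ** 2 + rng.random() * (r_max ** 2 - r_min ** 2))
            interference += rng.expovariate(1.0) * r ** (-alpha)
        if signal / (interference + noise) > theta:
            successes += 1
    return successes / trials
```

Comparing `update_success_prob(0.1)` with `update_success_prob(10.0)` shows the estimate dropping as the decoding threshold rises, and it can never exceed the selection probability.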
III-C Federated Learning in Wireless Networks
Armed with the above preparation, we are now ready to present the FL algorithm in a wireless network, which is summarized in Algorithm 2 and illustrated by Fig. 2. We can see that the algorithm mainly consists of two parts:
At the AP side, it selects a subgroup of UEs for update collection, decodes the received packets, and performs a global aggregation according to (20). The new global parameter is redistributed to all the associated UEs using an error-free channel.
Note that there is an incessant alternation between communication and computation during the training stage (cf. Fig. 2). In this regard, retransmissions of failed packets may not be beneficial, because each uplink transmission of local updates is followed by a downlink transmission of the global average, upon whose reception the UEs refresh their reference parameter and start to solve a new subproblem using the local data (Footnote 6: If the transceivers are equipped with full-duplex capability, it is possible to boost the convergence rate, because full duplex can potentially double the efficiency of both communication and computation.).
Note that Algorithm 2 is essentially coordinate ascent working in the wireless setup. The crucial property here is that the optimization algorithm on UE changes only the coordinates of the dual optimization variable corresponding to the data set . Moreover, the factor acts as a time-averaging approach to calculate the parameter update success probability, which steadily learns the quantity through the update status of each transmission. As such, the update in (18) is able to adjust the local training along with the parameter update quality. To be more concrete, under good channel conditions, the updates from the UEs can be successfully received in each communication round, which leads to a high value of the quantity , indicating that the local references can progress more aggressively. On the contrary, when the UEs are in a disadvantageous communication environment, the local learning rate also declines automatically, making the progress of local training more conservative. This is because when communications are not reliable, the AP normally receives only a few updates from the UEs, which results in small changes in the global vector . Correspondingly, the local reference should not change abruptly but rather keep its changes in line with the global ones (Footnote 7: Note that it is possible to guarantee the convergence of Algorithm 2 when is set to different constants. Nevertheless, the value of such quantity affects the ultimate rate of convergence [JagSmiJor:15].).
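The time-averaging idea can be illustrated with a running estimate of the update success rate. The exact form of the averaging factor in Algorithm 2 is not reproduced in this text, so the class below is only an assumed, simplified stand-in: it counts the rounds in which a UE was both selected and successfully decoded, and reports the fraction so far. The class and method names are hypothetical.

```python
class SuccessRateTracker:
    """Running estimate of the parameter update success probability of one UE."""

    def __init__(self):
        self.rounds = 0
        self.successes = 0

    def record(self, selected, decoded):
        """Log one communication round and return the updated estimate."""
        self.rounds += 1
        # An update counts only if the UE was scheduled AND its packet was decoded.
        if selected and decoded:
            self.successes += 1
        return self.estimate()

    def estimate(self):
        """Fraction of rounds with a successful update (0.0 before any round)."""
        return self.successes / self.rounds if self.rounds else 0.0
```

A UE could scale its local step by `estimate()`, so that the local progress slows down automatically when few of its updates reach the AP. How the estimate enters the local update in the actual algorithm is governed by (18).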
The main benefit of Algorithm 2 is threefold: i) it is based on local second-order information and does not require sending gradients and Hessian matrices to the AP, which would be a significant cost in terms of communication, ii) the local subproblems are in the form of a proximal step, which can potentially accelerate the convergence rate, and iii) the local step size adjusts in accordance with the communication environment.