I Introduction
In the pursuit of truly brain-like intelligence, federated edge learning (FEEL) has emerged as a popular framework for collaborative model training over wireless networks [5], which allows full exploitation of the rich distributed data at the edge devices without compromising data privacy. This is achieved by distributing the learning task across edge devices and letting each of them upload its computed learning updates (e.g., local gradients or models), instead of the raw data, to the edge server, where the local updates are aggregated to yield an updated global model that initiates the next round of local model training. When the edge devices participating in the learning process share the same wireless medium to convey local updates to the edge server, the uncertain wireless environment and limited radio resources can cause severe congestion over the air interface, resulting in a communication bottleneck for FEEL. This has prompted an active research area focused on developing novel communication techniques (e.g., resource allocation, multiple access, and source coding) for communication-efficient FEEL [13].
Among others, device scheduling in FEEL is an important issue that has received increasing research interest recently but has yet to be fully addressed. The issue arises because, due to the constrained bandwidth, only a subset of devices can be selected to convey their updates at each communication round in FEEL. Simple uniform random scheduling was first considered in the seminal work [5] to determine the subset of devices for update uploading. Despite its simplicity, uniform random scheduling shows decent convergence performance, as observed empirically in [5]. Subsequently, the authors in [11] compared three heuristic scheduling policies, namely uniform random scheduling, round robin, and proportional fair, and theoretically characterized the convergence behavior of the three policies. Later on, a joint device scheduling and resource allocation policy was proposed in [9] to maximize the model accuracy within a given total training time budget.
Realizing that the device scheduling policy can crucially affect the total communication time of FEEL, recent research in this field has focused on the optimal scheduling policy that minimizes the communication time. However, due to the difficulty in quantifying the exact communication time, prior work can only tackle the problem partially by considering either the number of communication rounds or the per-round latency, whereas the total communication time is determined by both metrics. In terms of communication-round minimization, the authors in [8] proposed an importance-aware scheduling policy, where devices with larger gradient norms are deemed more important and thus assigned a higher probability of being scheduled. Similar policies were also proposed independently in the parallel works [3, 1, 2]. In terms of per-round latency minimization, the authors in [14] proposed a channel-aware scheduling policy, where devices with stronger channels are scheduled more frequently than those with weak channels. Most recently, the authors in [7] proposed an indirect approach to reduce the total communication time by minimizing a heuristic objective function constructed as a weighted sum of the gradient norm and the per-round communication time. The weighting factor therein crucially affects the performance of the derived solution and thus needs to be fine-tuned offline, which may not be feasible in practical FEEL systems.
In this paper, we make the first attempt to directly formulate and solve the communication time minimization problem. We first derive a tight bound to approximate the communication time through a cross-disciplinary effort involving both learning theory for convergence analysis and communication theory for per-round latency analysis. Building on the analytical result, an optimized probabilistic scheduling policy is derived in closed form by solving the approximate problem. It is found that the optimized policy gradually shifts its priority from suppressing the remaining communication rounds to reducing per-round latency as the training process evolves. The effectiveness of the proposed scheme is demonstrated via a use case on collaborative 3D object detection in autonomous driving.
II Learning and Communication Models
We consider a FEEL system consisting of one edge server and $M$ edge devices of different kinds. The device set is denoted by $\mathcal{M}=\{1,\dots,M\}$. The local dataset of device $m$ is denoted by $\mathcal{D}_m$, and its size by $n_m$. We let $n=\sum_{m=1}^{M} n_m$ denote the total dataset size. Due to the constrained bandwidth, only a subset of devices can be scheduled to participate in the global model update in each round. The scheduler that determines the participating devices at each communication round therefore needs to be judiciously designed.
II-A Learning Model
The learning process is to minimize the global loss function in a distributed manner. In particular, the global loss function on the entire distributed dataset is defined as
$$L\big(\mathbf{w}^{(k)}\big) = \frac{1}{n}\sum_{m=1}^{M}\sum_{(\mathbf{x}_j,y_j)\in\mathcal{D}_m} \ell\big(\mathbf{w}^{(k)};\mathbf{x}_j,y_j\big), \tag{1}$$
where $\mathbf{w}^{(k)}$ denotes the global model at round $k$, and $\ell(\mathbf{w}^{(k)};\mathbf{x}_j,y_j)$ denotes the sample loss quantifying the prediction error of $\mathbf{w}^{(k)}$ on the training data $\mathbf{x}_j$ with respect to the true label $y_j$. FEEL repeats the following procedure at each communication round until the global model converges:

Global Model Broadcasting: The edge server broadcasts the current global model $\mathbf{w}^{(k)}$ to each device.

Local Model Training: Each device $m$ runs the stochastic gradient descent (SGD) algorithm using its local dataset $\mathcal{D}_m$ and the latest global model $\mathbf{w}^{(k)}$, and generates a local gradient estimate $\mathbf{g}_m^{(k)}$.

Probabilistic Device Scheduling: Each device is assigned a certain probability of being scheduled for update uploading, denoted by $p_m^{(k)}$ for device $m$ and to be designed in the sequel. The scheduled devices are decided by sampling based on this probability distribution.

Local Gradient Uploading: Each scheduled device transmits a scaled version of its local gradient estimate to the edge server, given by $\frac{n_m}{n\,p_m^{(k)}}\mathbf{g}_m^{(k)}$, where $n_m$ and $n$ are the local and total dataset sizes, respectively. (The scaling factor is needed to ensure an unbiased gradient estimate at the edge server, as explained in [7].)

Global Model Updating: The edge server aggregates the uploaded local gradients into $\hat{\mathbf{g}}^{(k)}$, and then updates the global model by $\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \eta^{(k)}\hat{\mathbf{g}}^{(k)}$, where $\eta^{(k)}$ is the step size.
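The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes that a single device is sampled per round, replaces SGD with a full-batch gradient on a toy least-squares loss, and uses hypothetical names (`local_gradient`, `feel_round`) and values throughout. The last lines illustrate why scaling the upload by $n_m/(n\,p_m)$ gives an unbiased estimate of the global gradient.

```python
import numpy as np

def local_gradient(w, X, y):
    """Full-batch gradient of 0.5 * mean((Xw - y)^2) on one device (toy loss)."""
    return X.T @ (X @ w - y) / len(y)

def feel_round(w, datasets, p, step, rng):
    """One FEEL round: sample a device, upload its scaled gradient, update the model."""
    sizes = np.array([len(y) for _, y in datasets])
    n = sizes.sum()
    m = rng.choice(len(datasets), p=p)        # probabilistic device scheduling
    g_m = local_gradient(w, *datasets[m])     # local gradient of the scheduled device
    g_hat = sizes[m] / (n * p[m]) * g_m       # scaled upload n_m / (n p_m) * g_m
    return w - step * g_hat                   # global model update

rng = np.random.default_rng(0)
# Three hypothetical devices with 20, 50 and 30 samples of 3 features each.
datasets = [(rng.normal(size=(k, 3)), rng.normal(size=k)) for k in (20, 50, 30)]
p = np.array([0.5, 0.3, 0.2])                 # arbitrary scheduling probabilities
w = np.zeros(3)

# Expectation of the scaled upload over the scheduling distribution ...
sizes = np.array([len(y) for _, y in datasets])
n = sizes.sum()
grads = [local_gradient(w, X, y) for X, y in datasets]
expected_upload = sum(p[m] * sizes[m] / (n * p[m]) * grads[m] for m in range(3))
# ... equals the gradient of the global loss, sum_m (n_m / n) * g_m.
global_grad = sum(sizes[m] / n * grads[m] for m in range(3))

w_next = feel_round(w, datasets, p, step=0.1, rng=rng)
```

The equality of `expected_upload` and `global_grad` holds for any strictly positive scheduling distribution, which is what leaves room to optimize the probabilities.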
II-B Communication Model
Each scheduled device is assigned a dedicated subchannel of bandwidth $B$ for update uploading. For device $m$, the uploading time at round $k$ is given by
$$T_m^{U,(k)} = \frac{qS}{B\log_2\big(1+\gamma_m^{(k)}\big)}, \tag{2}$$
where $q$ is the number of bits needed for transmitting one gradient parameter, $S$ is the total number of gradient parameters, and $\gamma_m^{(k)}$ denotes the signal-to-noise ratio (SNR) of the transmitted signal, defined as $\gamma_m^{(k)} = P_m |h_m^{(k)}|^2 / N_0$, with $P_m$ being the transmit power, $N_0$ the noise power, and $|h_m^{(k)}|^2$ the channel gain. The channel coefficients are assumed to be independent and identically distributed (i.i.d.) over time, following the Rayleigh distribution, i.e., $h_m^{(k)} \sim \mathcal{CN}(0,\sigma_m^2)$, where the channel variance $\sigma_m^2$ varies across devices due to the heterogeneous path losses encountered.
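To make the uploading-time model concrete, the sketch below draws Rayleigh-fading channel gains and evaluates the per-device uploading time. It assumes the Shannon-rate form, i.e., the payload in bits divided by bandwidth times $\log_2(1+\text{SNR})$. The function name and all parameter values (bandwidth, powers, model size, channel variances) are hypothetical and are not taken from the experiments.

```python
import numpy as np

def uploading_time(q, S, B, tx_power, noise_power, gain):
    """Seconds needed to upload S parameters of q bits each at the (assumed) Shannon rate."""
    snr = tx_power * gain / noise_power
    rate = B * np.log2(1.0 + snr)          # achievable rate in bit/s
    return q * S / rate

rng = np.random.default_rng(1)
sigma2 = np.array([1.0, 0.5, 0.1])         # per-device channel variance (hypothetical)
# For a Rayleigh-fading coefficient h ~ CN(0, sigma2), the gain |h|^2 is
# exponentially distributed with mean sigma2.
gain = rng.exponential(scale=sigma2)
t_up = uploading_time(q=16, S=1_000_000, B=1e6,
                      tx_power=0.1, noise_power=1e-3, gain=gain)
```

Because the rate grows only logarithmically with the SNR, devices with a small channel variance can see much longer uploading times, which is the heterogeneity the scheduler has to account for.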
III Problem Formulation
Consider the probabilistic scheduling scheme described in Section II-A. For each communication round, a scheduling problem is formulated to optimize the scheduling probability distribution so as to minimize the remaining communication time. The problem instance for round $k$ is given by
(3) 
where and denote the remaining communication time and the remaining number of communication rounds when sitting at round , denotes the one-round communication time at round , with . In particular, the global model at round is deemed converged if accuracy is achieved, as defined below:
(4) 
According to the learning procedure in Section II-A, the communication time at round $k$ consists of two parts, i.e.,
$$T^{(k)} = T^{B,(k)} + T^{U,(k)}, \tag{5}$$
where $T^{B,(k)}$ and $T^{U,(k)}$ represent the global model broadcasting time and the local gradient uploading time at round $k$, respectively. Since $T^{B,(k)}$ is independent of the scheduling decision, the optimization problem in (3) reduces to
(6) 
However, the objective above is not tractable for evaluation in practice, as the involved and require non-causal future information (e.g., the gradient estimates and channel state information at future rounds) for exact calculation. The difficulty can be tackled by formulating an approximate problem using the look-ahead model [6], which is widely used in the stochastic optimization community, as follows
(7) 
where denotes the expected remaining communication rounds after round’s update, and denotes the expected communication time of a future round. Compared with in (6), the one in problem requires only causal information and thus can be practically evaluated, as shown shortly.
IV Scheduling Optimization
In this section, we first derive the three key components in the objective of problem , i.e., , , and , as functions of the scheduling probability distribution . Based on the analytical results, the optimized are then obtained in closed form by solving problem .
IV-A Communication Time Analysis for FEEL
IV-A1 Expected Remaining Communication Rounds
Since the exact is intractable to derive, we resort to deriving an upper bound as an approximation of . To this end, the following standard assumptions on the loss function are made.
Assumption 1.
The loss function is smooth, i.e., .
Assumption 2.
The loss function is strongly convex, i.e., .
Then the upper bound of can be derived as a function of as follows.
Proposition 1.
For FEEL with probabilistic scheduling, the upper bound of remaining communication rounds is given by
(8) 
where a diminishing step size is adopted with constants and ; is a constant term independent of the scheduling probability distribution; and is the maximum expected uploaded gradient norm from the th round to the last round.
Proof: See Appendix A.
Remark 1.
With Proposition 1 at hand, it can be proved by the Cauchy inequality that the scheduling probability distribution minimizing the remaining communication rounds is proportional to the product of the local dataset size and the gradient norm, i.e., $p_m^{(k)} \propto n_m\|\mathbf{g}_m^{(k)}\|$. This aligns with the importance-aware scheduling proposed in [3], where the product $n_m\|\mathbf{g}_m^{(k)}\|$ defines the update importance of device $m$.
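The Cauchy-inequality argument in Remark 1 can be checked numerically. The sketch below assumes that the scheduling-dependent part of the bound is the expected squared norm of the scaled upload, $\sum_m n_m^2\|\mathbf{g}_m\|^2/(n^2 p_m)$, which is what the scaling in Section II-A gives when one device is sampled per round. The exact expression in (8) may contain further terms, and all names and numbers here are hypothetical.

```python
import numpy as np

def expected_upload_sq_norm(p, sizes, grad_norms):
    """E||n_m/(n p_m) g_m||^2 when device m is sampled with probability p_m."""
    n = sizes.sum()
    return float(np.sum((sizes * grad_norms) ** 2 / (n ** 2 * p)))

def importance_aware(sizes, grad_norms):
    """Scheduling probabilities proportional to n_m * ||g_m||."""
    a = sizes * grad_norms
    return a / a.sum()

sizes = np.array([20.0, 50.0, 30.0])        # local dataset sizes n_m
grad_norms = np.array([2.0, 0.5, 1.0])      # local gradient norms ||g_m||
p_star = importance_aware(sizes, grad_norms)
best = expected_upload_sq_norm(p_star, sizes, grad_norms)

# Compare against many random feasible distributions: by the Cauchy inequality,
# sum_m a_m^2 / p_m >= (sum_m a_m)^2 whenever sum_m p_m = 1, with equality at p ∝ a.
rng = np.random.default_rng(2)
candidates = rng.dirichlet(np.ones(3), size=1000)
smallest_gap = min(expected_upload_sq_norm(c, sizes, grad_norms) - best
                   for c in candidates)
```

With these numbers the products $n_m\|\mathbf{g}_m\|$ are 40, 25 and 30, so the minimum value is $(95/100)^2 = 0.9025$, and no sampled distribution does better.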
IV-A2 Uploading Time at Current Round
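Due to the probabilistic scheduling, the uploading time at the current round can only be estimated by its expectation over the scheduling probability distribution, yielding
$$\mathbb{E}\big[T^{U,(k)}\big] = \sum_{m=1}^{M} p_m^{(k)}\,T_m^{U,(k)}, \tag{9}$$
where $p_m^{(k)}$ is the scheduling probability of device $m$, and $T_m^{U,(k)}$, the uploading time of device $m$, has been defined in (2) and can be exactly computed using the channel state information at the current round.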
Proposition 2.
For FEEL with probabilistic scheduling, the uploading time of current round is given by
(10) 
where we have defined .
Remark 2.
It can be observed that, to minimize the current-round uploading time, the optimal solution is to schedule the device with the strongest channel (or largest ) with probability one. This reduces to a deterministic scheduling policy referred to as channel-aware scheduling.
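Remark 2 follows because the expected uploading time is linear in the scheduling probabilities, so its minimum over the probability simplex sits at a vertex. A small sketch, assuming the expectation takes the form $\sum_m p_m T_m$ and using hypothetical per-device uploading times:

```python
import numpy as np

def expected_uploading_time(p, t_up):
    """Expected uploading time sum_m p_m * T_m under scheduling distribution p."""
    return float(np.dot(p, t_up))

def channel_aware(t_up):
    """Deterministic policy: probability one on the device with the shortest upload."""
    p = np.zeros_like(t_up)
    p[np.argmin(t_up)] = 1.0
    return p

t_up = np.array([3.2, 1.1, 4.5])            # hypothetical uploading times in seconds
p_ca = channel_aware(t_up)
t_ca = expected_uploading_time(p_ca, t_up)
t_uniform = expected_uploading_time(np.full(3, 1 / 3), t_up)
```

The same policy ignores gradient importance entirely, which is why it can slow convergence in terms of communication rounds.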
IV-A3 Expected Future Round Communication Time
Due to the absence of the gradient information, channel state information, and scheduling decisions at future rounds, the exact communication time at future rounds is intractable to compute. We thus approximate it by its expectation, taken over both the random scheduling and the random channel distributions. Then, for any future round $i > k$, with $k$ being the index of the current round, we have
(11) 
where the future rounds’ scheduling probability is unknown and assumed to be proportional to the size of the local dataset, i.e., $p_m^{(i)} = n_m/n$ for every device $m$ and every future round $i > k$.
To ensure efficient transmission, our scheduling policy assigns nonzero scheduling probability only to those devices with a channel gain larger than a predefined threshold . Only those with have a chance to participate in FEEL. Then can be computed by
(12) 
Proposition 3.
For FEEL with probabilistic scheduling, the expected time in future rounds is given by
(13) 
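The expectation over the channel distribution can be approximated by Monte Carlo sampling when a closed form is not at hand. The sketch below is only an illustration under stated assumptions: a Shannon-rate uploading time, Rayleigh fading (exponentially distributed channel gain), conditioning on the gain exceeding the threshold, and future scheduling probabilities proportional to dataset size. It is not the closed-form expression of Proposition 3, and all names and values are hypothetical.

```python
import numpy as np

def conditional_upload_time(sigma2, threshold, q, S, B, snr_per_gain, rng, num=100_000):
    """Monte Carlo estimate of E[T | channel gain >= threshold] under Rayleigh fading."""
    gain = rng.exponential(scale=sigma2, size=num)   # |h|^2 ~ Exp(mean sigma2)
    gain = gain[gain >= threshold]                   # keep only eligible channel draws
    rate = B * np.log2(1.0 + snr_per_gain * gain)    # assumed Shannon rate in bit/s
    return float(np.mean(q * S / rate))

def expected_future_time(sizes, sigma2s, threshold, **kw):
    """Dataset-size-weighted average of the per-device conditional uploading times."""
    p = sizes / sizes.sum()                          # p_m = n_m / n for future rounds
    per_device = [conditional_upload_time(s2, threshold, **kw) for s2 in sigma2s]
    return float(np.dot(p, per_device))

common = dict(q=16, S=1_000_000, B=1e6, snr_per_gain=100.0)
sizes = np.array([20.0, 50.0, 30.0])
sigma2s = np.array([1.0, 0.5, 0.1])
t_future = expected_future_time(sizes, sigma2s, threshold=0.05,
                                rng=np.random.default_rng(3), **common)
```

Raising the threshold discards the weakest channel draws, so the conditional expected uploading time can only decrease, at the price of devices being eligible less often.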
IV-B Scheduling Optimization for FEEL
With the derived , , and at hand, the problem can be explicitly written as below, where the constant terms independent of the scheduling probability distribution are dropped to simplify the expression without harming the problem equivalence.
where we have defined . Next, the problem can be solved as follows.
Proposition 4.
For FEEL with probabilistic scheduling, the optimized scheduling probability distribution for minimizing the remaining communication time is given by
where is the Lagrange multiplier satisfying , and can be obtained via bisection search.
Proof: It can be easily proved that is a convex problem with respect to the scheduling probability distribution . Thus, we can optimally solve this problem using the Karush-Kuhn-Tucker (KKT) conditions, yielding
(14) 
Further substituting and derived respectively in (10) and (2), and into (14), gives the desired result.
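The KKT-plus-bisection recipe can be illustrated on a stand-in objective. The sketch below assumes the surrogate $\sum_m p_m T_m + \rho\sum_m a_m^2/p_m$ over the probability simplex, where $T_m$ is the current uploading time, $a_m$ plays the role of the gradient importance, and $\rho > 0$ is a scalar trade-off factor. This form is an assumption chosen to match the structure described in the text, not the paper's exact expression. Stationarity then gives $p_m = a_m\sqrt{\rho/(T_m+\lambda)}$, and the multiplier $\lambda$ is found by bisection so that the probabilities sum to one.

```python
import numpy as np

def objective(p, a, t_up, rho):
    """Assumed surrogate: expected upload time plus rho times an importance penalty."""
    return float(np.dot(p, t_up) + rho * np.sum(a ** 2 / p))

def optimized_schedule(a, t_up, rho, iters=200):
    """Minimize the surrogate subject to sum(p) = 1 via stationarity and bisection.

    Stationarity gives p_m = a_m * sqrt(rho / (t_up_m + lam)), decreasing in lam.
    Assumes a > 0, t_up > 0 and rho > 0.
    """
    def p_of(lam):
        return a * np.sqrt(rho / (t_up + lam))

    lo = -t_up.min() + 1e-12              # just above the pole, where sum(p) is very large
    width = 1.0
    while p_of(lo + width).sum() > 1.0:   # grow the bracket until sum(p) <= 1
        width *= 2.0
    hi = lo + width
    for _ in range(iters):                # bisection on the constraint sum(p) = 1
        lam = 0.5 * (lo + hi)
        if p_of(lam).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    p = p_of(0.5 * (lo + hi))
    return p / p.sum()                    # remove the residual numerical error

a = np.array([0.40, 0.25, 0.30])          # hypothetical importance values
t_up = np.array([3.2, 1.1, 4.5])          # hypothetical uploading times in seconds
p_opt = optimized_schedule(a, t_up, rho=1.0)
```

Under this surrogate, a large $\rho$ drives the solution toward probabilities proportional to $a_m$ (importance-aware), while a small $\rho$ concentrates the probability on the device with the shortest uploading time (channel-aware).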
Remark 3.
As observed in Proposition 4, the optimized scheduling probability is governed by three terms, i.e., the one related to gradient importance , the one related to transmission rate , and a scalar factor regulating the trade-off between them. This suggests that the optimized scheme balances the remaining communication rounds (related to gradient importance) against the uploading time in the current round (related to transmission rate). Since is a decreasing function w.r.t. , it can be concluded that the optimized scheme places more weight on suppressing the remaining communication rounds at the early training stage and gradually shifts toward reducing the one-round uploading time as the training process evolves.
V Experimental Results
Unreal Engine autonomous driving simulator. Recently, there has been strong demand for implementing FEEL in autonomous-driving vehicle perception, to alleviate the communication overhead of raw data transmission and to address data privacy issues. Unfortunately, physical-world implementation of autonomous driving is hindered by infrastructure costs and the difficulty of testing in dangerous scenarios. Car Learning to Act (CARLA) [4] is a widely accepted Unreal Engine-driven benchmark system that provides complex urban driving scenarios and high-quality 3D rendering, such that FEEL-based vehicle perception can be prototyped in virtual reality. In this section, all the training and testing procedures are implemented in CARLA.
Dataset. We employ CARLA to generate vehicles in the “Town02” map, among which, of them are autonomous driving vehicles that can generate point cloud data at a frequency of frames/s. The entire dataset consists of frames in total with frames at each vehicle, where frames are used for training at each vehicle and frames are used for inference and testing. The left-hand side of Fig. 1 illustrates the bird's-eye view of the simulated world and the locations of all vehicles; the right-hand side of Fig. 1 illustrates the non-i.i.d. distributed point-cloud data.
Model. The sparsely embedded convolutional detection (SECOND) neural network proposed in [10] is adopted for 3D object detection. The raw data generated from CARLA is processed into the KITTI format following the procedure in [12]. Each round of local training randomly selects frames, and the fraction decay learning rate is adopted. The federated model training is implemented in PyTorch using Python 3.8 on a Linux server with four NVIDIA RTX 3090 GPUs.
Communication settings. The FEEL between the edge server and the autonomous vehicles is executed during vehicle charging. The distance in km between the edge server and any vehicle is uniformly distributed between and , and the corresponding path loss is given by in dB. The bandwidth, the noise power density, the transmit power of each vehicle, and the number of bits for representing each element in gradients are set to be MHz, dBm/Hz, dBm, and , respectively.
Benchmarks. We compare the proposed communication time minimization (CTM) scheme with three benchmark schemes: 1) the importance-aware (IA) scheme [8]; 2) the channel-aware (CA) scheme, which schedules the vehicles with the strongest channel gain; and 3) the joint importance-and-channel-aware (ICA) scheme [7].
Performance. The average precisions, measured by the intersection over union (IoU) between the prediction and the ground truth, at the communication time of s and s are shown in Fig. 2a and Fig. 2b, respectively. It is observed from all figures that the CA scheme leads to the worst precision at vehicle whose channel fading is the largest, and the IA scheme leads to the worst precision at vehicle whose data is less important than others. The ICA scheme aims to find a balance between CA and IA, but fails to compete with the proposed scheme due to its heuristic nature, as mentioned in the introduction. Furthermore, the proposed CTM achieves slightly better performance at the early training stage, but significantly outperforms all the other schemes after sufficient training. This is due to the superiority of the look-ahead nature inherent in the proposed CTM scheme over the myopic nature of the existing CA, IA, and ICA schemes. The detection result of one data frame is also illustrated in Fig. 3. The red box and the green box denote the ground truth and the predicted result, respectively. It can be seen that CTM successfully detects two objects while the other schemes detect only one of them. Moreover, CTM achieves the largest IoU with the same time budget.
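For readers unfamiliar with the metric, the IoU of two boxes is the area of their overlap divided by the area of their union. The sketch below covers only the simplified case of axis-aligned 2D boxes, whereas the detector in this section works with 3D boxes. The function name and box format are assumptions made for illustration.

```python
def iou_2d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlap extent along each axis, clipped at zero when the boxes are disjoint.
    overlap_x = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_y = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = overlap_x * overlap_y
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

ground_truth = (0.0, 0.0, 2.0, 2.0)
prediction = (1.0, 1.0, 3.0, 3.0)
score = iou_2d(ground_truth, prediction)   # overlap 1, union 7
```

A perfect prediction gives an IoU of 1 and disjoint boxes give 0, so a larger IoU under the same time budget indicates a better-trained detector.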
VI Concluding Remarks
In this paper, we have made the first attempt to directly formulate and solve the optimal device scheduling problem in FEEL for communication time minimization. In contrast with existing solutions, which can only partially tackle the problem by considering either the number of communication rounds or the per-round latency, the proposed solution fully accounts for both metrics in the formulated optimization problem. The derived optimized solution shows that it is desirable to put more weight on suppressing the remaining communication rounds at the early training stage and to gradually shift toward reducing the per-round latency in the later stage. For future work, it is interesting to extend the current work to model-averaging-based federated learning, where model updates instead of gradient updates are transmitted.
VII Acknowledgement
The work was supported in part by the Key Area R&D Program of Guangdong Province with grant No. 2018B030338001, by the National Key R&D Program of China with grant No. 2018YFB1800800, by National Natural Science Foundation of China (No. 62001310), by Shenzhen Outstanding Talents Training Fund, and by Guangdong Research Project No. 2017ZT07X152.
Appendix A Proof of Proposition 1
To start with, the following lemmas are needed.
Lemma 1 ([7]).
With Assumptions 1 and 2, the upper bound of loss function is given by:
(15) 
and .
Lemma 2.
After the round’s update, the expected optimality gap between and is bounded by
(16) 
References
[1] (2020) Convergence of update aware device scheduling for federated learning at the wireless edge. [Online]. Available: https://arxiv.org/abs/2001.10402.
[2] (2020) Convergence time optimization for federated learning over wireless networks. [Online]. Available: https://arxiv.org/abs/2001.07845/.
[3] (2020) Optimal client sampling for federated learning. [Online]. Available: https://arxiv.org/abs/2010.13723/.
[4] (2017) CARLA: an open urban driving simulator. In Conference on robot learning, pp. 1–16.
[5] (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
[6] (2014) Clearing the jungle of stochastic optimization. In Bridging data and decisions, pp. 109–137.
[7] (2020) Scheduling in cellular federated edge learning with importance and channel awareness. [Online]. Available: https://arxiv.org/abs/2004.00490/.
[8] (2020) Optimal importance sampling for federated learning. [Online]. Available: https://arxiv.org/abs/2010.13600/.
[9] (2019) Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1205–1221.
[10] (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
[11] (2019) Scheduling policies for federated learning in wireless networks. IEEE Transactions on Communications 68 (1), pp. 317–333.
[12] (2021) Distributed dynamic map fusion via federated learning for intelligent networked vehicles. [Online]. Available: https://arxiv.org/abs/2103.03786/.
[13] (2020) Toward an intelligent edge: wireless communication meets machine learning. IEEE Communications Magazine 58 (1), pp. 19–25.
[14] (2019) Broadband analog aggregation for low-latency federated edge learning. IEEE Transactions on Wireless Communications 19 (1), pp. 491–506.