I Introduction
Extensive research efforts have been devoted to overcoming the communication bottleneck in federated learning (FL) [1]. Lossy compression techniques, including quantization [2, 3, 4, 5] and sparsification [6, 7, 8, 9], have been developed to reduce the communication cost. However, these approaches treat the communication channel connecting the clients to the parameter server (PS) as an error-free bit pipe, ignoring the wireless channel characteristics. Efficient implementation of FL at the wireless edge, called federated edge learning (FEEL), requires jointly optimizing the learning framework with the underlying communication framework, taking into account channel characteristics and constraints [10, 11].
As is common in wireless networks, different clients may have distinct computational capabilities, channel statistics, power budgets, etc., resulting in the straggler effect in FEEL. The balance between training latency and energy consumption achievable by carefully designing the resource allocation policy has been studied in [12, 13, 14]. By scheduling only a limited number of clients in each round of FEEL, the communication requirement can be reduced, and the straggler effect can also be alleviated [15, 16, 17, 18]. In [15], the authors propose to maximize the number of scheduled clients in each round to speed up training. In [16], a function of the ages of client updates is minimized to schedule all the clients as often as possible. Convergence rates of three different scheduling policies, namely random, round-robin, and proportional fair scheduling, are analyzed in [17]. Different from the aforementioned works, [18] considers 'update-aware' client scheduling in FEEL, which takes the significance of the model updates into account together with their channel states. Alternatively, in [19] clients are clustered around several access points to perform FL hierarchically in order to reduce the communication latency. Another approach to addressing the communication bottleneck is aggregating the local model updates over the air using the superposition property of the wireless multiple access channel [20, 21, 22]. However, this approach requires accurate synchronization among the transmitting agents [23].
Unlike the previous literature on FEEL, we consider both the downlink and uplink channel variations, together with random computation delays. With the exception of [24], works on FEEL ignore the downlink communication channel and the associated latency. Our first contribution is to overlap the downlink and uplink transmissions with local computations at the clients. This is achieved by fountain-coded delivery of the global model update [25], such that clients with better downlink channel conditions can receive the global model quickly, and start computing right away. The clients are then scheduled for the uplink delivery of their model updates to the PS as soon as they complete their local computations. To further minimize the latency, we schedule the client with the minimal uplink latency at any point in time.
While this approach, called MRTP, minimizes the latency, it may result in a loss in test accuracy, as clients close to the PS would dominate the training process thanks to their statistically better channel conditions. We mitigate this bias by introducing both short-term and long-term fairness conditions. In particular, we utilize the age and frequency metrics for short-term and long-term fairness, respectively, where age, in a broad sense, measures the staleness of a client's model update, and frequency measures the long-term participation statistics of the clients. Our numerical results show that the proposed overlapped and fair scheduling policy significantly speeds up FEEL without sacrificing the final test accuracy.
II System Model
II-A FL model
Consider $N$ clients collaboratively training a model parameter vector $\boldsymbol{\theta} \in \mathbb{R}^d$ of dimension $d$, with periodic communication with a PS, to minimize the empirical loss function $F(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} F_n(\boldsymbol{\theta})$, where $F_n(\boldsymbol{\theta})$ is the loss function of client $n$. Let $\mathcal{D}_n$ denote the dataset at client $n$ with $D_n$ samples. The empirical loss function at client $n$ is $F_n(\boldsymbol{\theta}) = \frac{1}{D_n} \sum_{\xi \in \mathcal{D}_n} \ell(\boldsymbol{\theta}, \xi)$, where $\ell(\boldsymbol{\theta}, \xi)$ is the task-dependent loss function measured with the model parameter vector $\boldsymbol{\theta}$ and data sample $\xi$. At each communication round $t$, each participating client performs $H$-step stochastic gradient descent (SGD) on its local dataset to minimize the loss function based on the received global model parameter $\boldsymbol{\theta}_t$. At the $h$-th step of the local SGD of participating client $n$ in communication round $t$, the local model is updated as
$$\boldsymbol{\theta}_{n,t}^{h} = \boldsymbol{\theta}_{n,t}^{h-1} - \eta \nabla F_n(\boldsymbol{\theta}_{n,t}^{h-1}, \xi_{n,t}^{h}), \quad (1)$$
where $\eta$ is the learning rate, and $\nabla F_n(\boldsymbol{\theta}, \xi)$ is the gradient computed with the local model parameter and mini-batch $\xi_{n,t}^{h}$, which is selected uniformly at random from $\mathcal{D}_n$ and satisfies $\mathbb{E}_{\xi}[\nabla F_n(\boldsymbol{\theta}, \xi)] = \nabla F_n(\boldsymbol{\theta})$.
Each participating client then forwards its updated local model to the PS. Denoting the model vector at client $n$ after $H$ local steps by $\boldsymbol{\theta}_{n,t}^{H}$, and the set of participating clients by $\mathcal{S}_t$, the global model is updated by the PS as $\boldsymbol{\theta}_{t+1} = \frac{1}{|\mathcal{S}_t|} \sum_{n \in \mathcal{S}_t} \boldsymbol{\theta}_{n,t}^{H}$.
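As a concrete illustration, the local update rule in (1) and the PS-side averaging can be sketched in Python. The function names, the plain-list model representation, and the toy gradient interface below are our own illustrative choices, not part of the system model:

```python
import random

def local_sgd(theta_global, dataset, grad_fn, eta=0.01, steps=5, batch=32):
    """One client's H-step mini-batch SGD, starting from the received
    global model (cf. Eq. (1)); grad_fn returns the mini-batch gradient."""
    theta = list(theta_global)
    for _ in range(steps):
        minibatch = random.sample(dataset, min(batch, len(dataset)))
        g = grad_fn(theta, minibatch)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

def aggregate(local_models):
    """PS update: coordinate-wise average of the participating local models."""
    k = len(local_models)
    return [sum(m[i] for m in local_models) / k
            for i in range(len(local_models[0]))]
```

For instance, with a one-dimensional least-squares loss on data generated as $y = 2x$, `local_sgd` started from zero drives the parameter toward 2, and `aggregate` averages the models returned by the participating clients.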
II-B Communication Model
A block fading channel model is assumed, where the channels between the PS and the clients remain unchanged in each communication round of the FEEL process.
II-B1 Asynchronous global model transmission
Note that we have a multicast channel when transmitting the global model from the PS to the clients. To speed up the training process, we use fountain-coded multicasting of the global model to the clients [25]. The downlink rate at client $n$ in communication round $t$ is given by $r_{n,t}^{dl} = \log_2\!\left(1 + \frac{P_{ps} |h_{n,t}^{dl}|^2}{\sigma^2}\right)$, $n = 1, \dots, N$, where $h_{n,t}^{dl}$ is the complex-valued downlink channel coefficient between the PS and client $n$ in round $t$, $P_{ps}$ is the transmit power at the PS, and $\sigma^2$ is the noise variance.
With asynchronous global model transmission in the downlink, each client recovers the global model with a latency that depends on its own channel gain, and immediately starts its local computations. Assuming that the global model is compressed into $Q$ bits, it takes $Q / r_{n,t}^{dl}$ seconds for client $n$ to receive the model update. This allows us to parallelize global model transmission and local computations, instead of targeting the worst client so as to guarantee that all the clients receive the global model simultaneously.
In the rest of the paper, to simplify notation we will drop the round index when it is clear from the context.
II-B2 Local model update
In the uplink, where clients upload their model parameters to the PS, we assume a time-division framework; that is, only a single client is scheduled at any point in time. The instantaneous rate of client $n$ is given by $r_n^{ul}(t) = \log_2\!\left(1 + \frac{P_n |h_n^{ul}|^2}{\sigma^2}\right)$, where $P_n$ is the transmit power at client $n$, and $h_n^{ul}$ is the uplink channel coefficient from client $n$ to the PS.
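Both the downlink and uplink rate expressions share the same Shannon-capacity form, which can be sketched as a small helper. Unit bandwidth is assumed, and the function and argument names are our own illustrative choices:

```python
import math

def shannon_rate(tx_power, channel_gain, noise_var):
    """Instantaneous rate, in bits per channel use, of a link with the
    given (possibly complex) fading coefficient: log2(1 + P|h|^2/sigma^2)."""
    return math.log2(1 + tx_power * abs(channel_gain) ** 2 / noise_var)

def delivery_time(num_bits, rate):
    """Time to deliver num_bits at a constant instantaneous rate, e.g. the
    downlink latency Q / r of a client under fountain-coded multicasting."""
    return num_bits / rate
```

For example, `shannon_rate(3.0, 1.0, 1.0)` gives $\log_2 4 = 2$ bits per channel use, and `delivery_time` then converts a model size into the corresponding per-client latency.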
II-C Local computations
We assume that the computation speeds of the clients are also random, following a distribution that is independent across clients and rounds. In each round, we denote by $n_k$ the client with the $k$-th smallest accumulated latency in downlink transmission and local computation, $k = 1, \dots, N$, and by $t_{n_k}$ the corresponding latency since the beginning of the communication round, with $t_{n_1} \le \cdots \le t_{n_N}$.
II-D Client scheduling
With asynchronous downlink transmission and heterogeneous local computation speeds, clients complete their local model updates sequentially. The clients that have completed their local computations, and are thus available to upload their local models to the PS, are referred to as idle clients. We denote the set of idle clients at time $t$ within each round by $\mathcal{I}(t)$. Hence, client $n_k$ is added to the idle set at time $t_{n_k}$, and remains there until it uploads its model to the PS.
Let $s(t)$ denote the index of the client scheduled for transmission at time $t$, where $s(t) = 0$ means that no client is scheduled. We have $s(t) \neq n_k$ if $t < t_{n_k}$; that is, a client cannot be scheduled for upload before it completes its local computations. Let $b_n(t)$ denote the remaining size of the model parameter vector of client $n$ (measured in bits) at time $t$ that has not yet been uploaded to the PS. We have $b_n(0) = Q$, and
$$\frac{d\, b_n(t)}{dt} = -\, r_n^{ul}(t)\, \mathbb{1}\{s(t) = n\}, \quad (2)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function, which is 1 when its argument holds, and 0 otherwise.
Let $t_n^{up}$ denote the time client $n$ completes uploading its model to the PS, i.e., $b_n(t_n^{up}) = 0$. Client $n$ is removed from the idle set at time $t_n^{up}$. Each round continues until $K$ out of the $N$ clients upload their models to the PS; therefore, some clients may never be added to the idle set, or may not have left it by the end of the round. We set $t_n^{up} = \infty$ for those clients. Let $\mathcal{C}(t)$ denote the set of clients that have completed their upload by time $t$ within the round, i.e., $\mathcal{C}(t) = \{n : t_n^{up} \le t\}$. For a specific scheduling policy, the completion time of the round is given by $T = \min\{t : |\mathcal{C}(t)| = K\}$, while the set of clients scheduled in the round is given, with a slight abuse of notation, by $\mathcal{S} = \mathcal{C}(T)$.
III Scheduling Policies
Our goal is to come up with scheduling policies that not only minimize the completion time of each round, but also lead to the fastest convergence in terms of wall-clock time.
III-A Minimum remaining time-based policy (MRTP)
Note that, at any time instant within a round, we can schedule any client from the idle set. However, it is easy to see that there is no loss of optimality in making scheduling decisions only when the idle set is updated, i.e., when a new client becomes idle, or one of the idle clients completes uploading its model.
In MRTP, each time the idle set is updated, we schedule the client with the minimum remaining time to upload its update. Specifically, at time $t$, client $n^*$ is scheduled if
$$n^* = \arg\min_{n \in \mathcal{I}(t)} \frac{b_n(t)}{r_n^{ul}(t)}. \quad (3)$$
Details of the MRTP can be found in Algorithm 1.
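The round dynamics under MRTP can be sketched as an event-driven simulation: scheduling decisions are re-evaluated only when the idle set changes, and, since the channels are block fading, each client's uplink rate is constant within the round. The following is our own minimal Python sketch under these assumptions (function and variable names are illustrative), not the authors' implementation of Algorithm 1:

```python
def mrtp_round(ready_times, rates, model_bits, k):
    """Simulate one MRTP round. ready_times[n] is the time client n finishes
    downlink reception plus local computation; rates[n] is its (constant)
    uplink rate. The round ends once k clients have uploaded model_bits bits
    over the shared uplink. Returns (completion_time, uploaded_clients)."""
    n_clients = len(ready_times)
    remaining = [float(model_bits)] * n_clients        # b_n(t), bits left
    order = sorted(range(n_clients), key=lambda n: ready_times[n])
    idle, done = [], []
    t, i = 0.0, 0
    while len(done) < k:
        # admit clients that have become idle by time t
        while i < n_clients and ready_times[order[i]] <= t:
            idle.append(order[i]); i += 1
        if not idle:                                   # wait for next arrival
            t = ready_times[order[i]]
            continue
        # Eq. (3): idle client with minimum remaining upload time
        s = min(idle, key=lambda n: remaining[n] / rates[n])
        finish = t + remaining[s] / rates[s]
        next_arrival = ready_times[order[i]] if i < n_clients else float("inf")
        if finish <= next_arrival:                     # s completes its upload
            t = finish
            idle.remove(s); done.append(s)
        else:                                          # re-decide when idle set changes
            remaining[s] -= rates[s] * (next_arrival - t)
            t = next_arrival
    return t, done
```

Since per-round rates are constant, preemption can only occur when a newly idle client arrives with a smaller remaining upload time than the client currently transmitting.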
III-B Age-aware MRTP (AMRTP)
While MRTP minimizes the upload time, clients with statistically better channel qualities are more likely to be scheduled, which results in non-uniform sampling of the clients and overfitting due to excessive use of a limited amount of data. This may drastically degrade the performance of FEEL, especially when the data is not i.i.d., which is common in practice. Therefore, to strike a balance between latency and model accuracy, we propose two alternative schemes that take the 'age of update' into consideration.
For short-term fairness, we utilize the age metric, where the age of client $n$ at round $t$, denoted by $a_n(t)$, represents the number of rounds since the last time it was scheduled. The age parameter evolves as follows:
$$a_n(t+1) = \begin{cases} 1, & \text{if } n \in \mathcal{S}_t, \\ a_n(t) + 1, & \text{otherwise.} \end{cases} \quad (4)$$
In AMRTP, a fraction $\lambda$ of the clients scheduled in each round are first selected to upload their models as in MRTP to minimize the latency, where $\lambda$ is a tuning parameter. Then, to promote the selection of clients that are less frequently scheduled, we select the client with the minimum ratio between its remaining upload time and the age of its update, which can be considered as the average latency paid for each timely update in the short term. A balance between training efficiency and fairness is thus achieved by tuning $\lambda$.
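The resulting age-weighted selection rule, together with the age recursion in (4), can be sketched as follows (the names are ours; `remaining[n] / rates[n]` is the remaining upload time of client $n$):

```python
def update_ages(ages, scheduled):
    """Age evolution of Eq. (4): reset to 1 if scheduled this round,
    otherwise increment by 1."""
    return [1 if n in scheduled else a + 1 for n, a in enumerate(ages)]

def amrtp_pick(idle, remaining, rates, ages):
    """AMRTP selection: minimize remaining upload time divided by update age,
    i.e. the latency paid per round of staleness removed."""
    return min(idle, key=lambda n: (remaining[n] / rates[n]) / ages[n])
```

For example, a client with a 4-second remaining upload but an age of 4 rounds is preferred over a client with a 2-second upload that was scheduled last round.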
III-C Opportunistic Fair MRTP (OFMRTP)
The main drawback of AMRTP is that it utilizes only the instantaneous rate for client scheduling. However, when the clients are located at different distances from the PS, their participation frequencies will still depend on their locations. We therefore propose an opportunistic policy that utilizes the relative channel condition, denoted by $\tilde{r}_n(t)$, which measures the ratio of the instantaneous channel rate of client $n$ to its long-term average value $\bar{r}_n$; that is, $\tilde{r}_n(t) = r_n^{ul}(t) / \bar{r}_n$. Hence, instead of scheduling clients based on their instantaneous channel states, we use $\tilde{r}_n(t)$ for scheduling. Further, we consider two metrics to ensure both short-term and long-term fairness among the clients. We use the age metric for short-term fairness, that is, to promote uniform participation of the clients across rounds, and define a frequency metric for long-term fairness, that is, to promote equal participation of the clients over the whole training process. The frequency metric $f_n(t)$ denotes the participation frequency of the $n$-th client at round $t$. We define $f_n(t) = m_n(t)/t$, where $m_n(t)$ denotes the total number of rounds in which client $n$ has participated until round $t$.
In order to introduce long-term fairness, we consider a subset of the idle clients as follows:
$$\mathcal{I}_f(t) = \left\{ n \in \mathcal{I}(t) : f_n(t) \le f_{\max} \right\}, \quad (5)$$
where $f_{\max}$ is a maximum frequency constraint. For the opportunistic policy, we further consider the following subset of clients:
$$\mathcal{I}_{of}(t) = \left\{ n \in \mathcal{I}_f(t) : a_n(t) \ge a_{\min},\ \tilde{r}_n(t) \ge \gamma \right\}, \quad (6)$$
where $a_{\min}$ is the minimum age constraint introduced to ensure short-term fairness, and $\gamma$ is the rate constraint for opportunistic scheduling. If there are multiple clients in $\mathcal{I}_{of}(t)$, then the client with the maximum $\tilde{r}_n(t)$ value, i.e., the one with the best relative channel condition, is scheduled.
Similarly to AMRTP, the proposed opportunistic policy consists of two steps. In the initial step, a $\lambda$ portion of the clients is scheduled from $\mathcal{I}_f(t)$ according to MRTP, while the remaining clients are scheduled from $\mathcal{I}_{of}(t)$ based on $\tilde{r}_n(t)$. However, if $\mathcal{I}_{of}(t) = \emptyset$, then the client is scheduled according to MRTP from $\mathcal{I}_f(t)$. As shown in (5), clients with excessive participation frequency are excluded from scheduling, which increases the participation of less frequently selected clients, and thus improves the fairness in scheduling. Overall, the proposed opportunistic scheduling strategy OFMRTP is defined by four system parameters: $\lambda$, $a_{\min}$, $\gamma$, and $f_{\max}$.
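The two-stage filtering of (5)-(6) and the fallback to MRTP can be sketched as follows (a minimal sketch under the notation above; function and parameter names are our own):

```python
def ofmrtp_candidates(idle, ages, freqs, rel_rates, a_min, gamma, f_max):
    """Eqs. (5)-(6): first drop clients whose long-term participation
    frequency exceeds f_max, then keep those whose age is at least a_min
    and whose relative channel condition is at least gamma."""
    fair = [n for n in idle if freqs[n] <= f_max]                    # Eq. (5)
    opp = [n for n in fair
           if ages[n] >= a_min and rel_rates[n] >= gamma]            # Eq. (6)
    return fair, opp

def ofmrtp_pick(fair, opp, remaining, rates, rel_rates):
    """Schedule the best relative channel in the opportunistic set; if it
    is empty, fall back to MRTP over the frequency-fair set."""
    if opp:
        return max(opp, key=lambda n: rel_rates[n])
    return min(fair, key=lambda n: remaining[n] / rates[n])
```

The frequency filter is applied before the age and rate constraints, so an over-represented client is excluded even when its instantaneous channel is excellent.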
TABLE I: Test accuracy (mean ± std) and average per-round latency (in seconds) of OFMRTP for different values of the age threshold, MRTP fraction, and frequency constraint.
TABLE II: Test accuracy (mean ± std) and average latency (in seconds) of AMRTP for different values of the MRTP fraction.
IV Numerical Results
IV-A Simulation Setup
IV-A1 Objective and network setup
We consider image classification on the CIFAR-10 dataset, which contains 50,000 training and 10,000 test images from 10 classes. We employ a convolutional neural network (CNN) architecture consisting of 4 convolutional layers followed by 4 fully connected layers. We set $H$ local iterations with a batch size of 32. We consider $N = 100$ client devices randomly distributed around the PS, and assume that the training dataset is distributed disjointly among the client devices in a non-i.i.d. manner, such that each client has 500 distinct training images from at most 4 different classes. Finally, we set the participation ratio to $K/N$, which means that, at each round, $K$ devices, out of 100, are scheduled to upload their models to the PS.
IV-A2 Computation latency
To model the computation latency at the clients, we consider the commonly employed shifted exponential distribution [26], where the probability of completing $H$ local updates by time $t$ is given by
$$\Pr[T_{comp} \le t] = 1 - e^{-\frac{\mu}{H}(t - H\alpha)}, \quad t \ge H\alpha, \quad (7)$$
where $\alpha$ is the minimum computation latency to perform one local update, and $1/\mu$ is the average additional delay for one local update, so that the mean computation time for one local update is $\alpha + 1/\mu$. In order to obtain practically relevant values for $\alpha$ and $\mu$, we measured the computation time on a CPU using the time.time() command, with the given batch size and network model, over 100,000 trials.
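A sampler consistent with one common parameterization of (7) — a deterministic minimum of $H\alpha$ plus an exponential tail with mean $H/\mu$ — can be sketched as follows (the parameter names and this exact parameterization are our own reading of the model):

```python
import random

def sample_computation_latency(h_steps, alpha, mu, rng=random):
    """Draw the latency of h_steps local updates from a shifted exponential:
    a deterministic minimum of h_steps * alpha plus an Exp(mu / h_steps)
    tail, so the mean cost per update is alpha + 1/mu."""
    return h_steps * alpha + rng.expovariate(mu / h_steps)
```

Averaging many draws recovers the mean $H\alpha + H/\mu$, and no draw ever falls below the deterministic minimum $H\alpha$.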
IV-A3 Path loss and noise
The clients are uniformly randomly distributed in a region within 500 meters of the PS. The path loss of client $n$ is given by $\beta + 10\kappa \log_{10}(d_n)$ dB, where $d_n$ is the distance of client $n$ to the PS measured in kilometers, and $\beta$ and $\kappa$ are the path-loss intercept and exponent, respectively. The channels of the clients across different time slots are modeled as i.i.d. fading. The standard deviation $\sigma_n$ of the channel gain of client $n$ is determined by its path loss, and the channel coefficient is modeled as $h_n = \sigma_n z_n$, where $z_n$ denotes an i.i.d. random variable accounting for Rayleigh fading of unit power in each communication round. The noise variances at the PS and the clients are set to the same value.

IV-B Simulation Results
We first consider the AMRTP scheme. For the experiments, we consider several values of the MRTP fraction $\lambda$. The average per-round latency and test accuracy results are presented in Table II. As predicted, we observe that by decreasing $\lambda$, that is, by scheduling more clients according to the age metric, both the test accuracy and the average latency increase.
We then consider OFMRTP for different settings of $\lambda$, $a_{\min}$, $\gamma$, and $f_{\max}$. The test accuracy and average per-round latency results are presented in Table I. We note that, if a round-robin scheduler is employed with participation ratio $K/N$, then the maximum age will be $N/K$, and similarly, the maximum participation frequency will be $K/N$. Therefore, to set the parameter values $a_{\min}$ and $f_{\max}$, we consider these values as our reference points. One can easily observe from Table I that with OFMRTP the average latency, as well as the variation of the latency, drops significantly, since a client is scheduled opportunistically only if $\tilde{r}_n(t) \ge \gamma$. Besides, thanks to the control of the participation frequency of the clients, we also observe a significant improvement in the test accuracy. We also remark that for large $\lambda$, we observe similar test accuracy and latency results across the other parameters. The reason is that MRTP is used for client scheduling whenever $\mathcal{I}_{of}(t)$ is empty, and, at the same time, stricter constraints increase the probability of $\mathcal{I}_{of}(t)$ being empty, so that more clients are scheduled according to MRTP. This observation is also backed by the simulation results with smaller $\lambda$, where the impact of these parameters is more visible. Not surprisingly, the minimum latency is achieved when more clients are scheduled according to MRTP and larger participation frequencies are allowed. Interestingly, in this case we do not observe a compromise on the test accuracy.
For comparison, we consider MRTP and random scheduling as benchmarks. Note that these are the optimal strategies from the latency and fairness perspectives, respectively. In Fig. 1, we compare the convergence behaviour of the OFMRTP, AMRTP, and MRTP schemes. As one would expect, the test accuracy increases faster with MRTP; however, due to the non-i.i.d. distribution of the data, it converges to a suboptimal model and eventually diverges (the test accuracy is plotted until the divergence point). While AMRTP eventually reaches an accuracy level above MRTP (see Table II), it introduces significant latency, due to scheduling the clients with poor channel conditions to reduce their age. The convergence behaviour of random scheduling is not included in the figure since its average per-round latency is very high; instead, we compare OFMRTP and random scheduling based on their final test accuracy results, averaged over 10 trials. Comparing the average final test accuracy of random scheduling with the results in Table I, we can conclude that OFMRTP does not compromise the accuracy while significantly reducing the latency.
V Conclusion
We proposed novel global model transmission and client scheduling techniques to speed up the wall-clock training time of FEEL without sacrificing the final test accuracy. In particular, we ensured fair participation of the clients to achieve high test accuracy, and reduced the overall latency, which includes the computation time and the uplink/downlink model transmission latencies. To this end, we first introduced a fountain-coded asynchronous model downlink strategy, which allows clients to start local computations without waiting for the others to download the global model. We then introduced MRTP, which adaptively schedules the client that can upload its local model to the PS in the fastest manner. Together, MRTP and the asynchronous downlink strategy help to overlap computation and communication, thus reducing the overall latency. However, as we show experimentally, client selection that focuses solely on latency may lead to divergence when certain clients participate in the model update more frequently than others. Hence, we further employed the 'update age' and 'update frequency' as fairness metrics, which are used opportunistically to speed up training without sacrificing accuracy. Finally, through extensive simulations, we showed that it is possible to significantly reduce the overall latency without compromising the test accuracy.
References
 [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
 [2] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Conf. of Int’l Speech Comm. Assoc., 2014.
 [3] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” arXiv preprint arXiv:1610.02132, 2016.
 [4] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary gradients to reduce communication in distributed deep learning,” arXiv preprint arXiv:1705.07878, 2017.
 [5] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bit-width convolutional neural networks with low bit-width gradients,” arXiv preprint arXiv:1606.06160, 2016.
 [6] N. Strom, “Scalable distributed DNN training using commodity GPU cloud computing,” in Annual Conf. of Int’l Speech Comm. Assoc., 2015.

 [7] N. Dryden, T. Moon, S. Jacobs, and B. Van Essen, “Communication quantization for data-parallel training of deep neural networks,” in Wrkshp. on Mach. Learn. in HPC Environs. (MLHPC), 2016, pp. 1–8.
 [8] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” arXiv preprint arXiv:1704.05021, 2017.
 [9] H. Wang, S. Sievert, Z. Charles, S. Liu, S. Wright, and D. Papailiopoulos, “Atomo: Communication-efficient learning via atomic sparsification,” arXiv preprint arXiv:1806.04090, 2018.
 [10] D. Gunduz, D. B. Kurka, M. Jankowski, M. M. Amiri, E. Ozfatura, and S. Sreekumar, “Communicate to learn at the edge,” IEEE Commun. Mag., vol. 58, no. 12, pp. 14–19, 2020.
 [11] M. Chen, D. Gunduz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” arXiv cs.LG:2104.02151, 2021.
 [12] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio resource allocation for federated edge learning,” in IEEE Int’l Conf. on Comm. Workshops, 2020, pp. 1–6.
 [13] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. ShikhBahaei, “Energy Efficient Federated Learning Over Wireless Communication Networks,” arXiv, pp. 1–30, 2019.
 [14] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence Time Optimization for Federated Learning over Wireless Networks,” IEEE Trans. Wirel. Commun., pp. 1–30, 2020.
 [15] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in IEEE Int’l Conf. on Communications (ICC), 2019, pp. 1–7.
 [16] H. Yang, A. Arafa, T. Quek, and H. V. Poor, “Age-based scheduling policy for federated learning in mobile edge networks,” IEEE Int. Conf. Acoust. Speech Signal Proc. (ICASSP), pp. 8743–8747, 2020.
 [17] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling Policies for Federated Learning in Wireless Networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, 2020.
 [18] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of update aware device scheduling for federated learning at the wireless edge,” IEEE Trans. Wireless Comm., pp. 1–1, 2021.
 [19] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” in IEEE Int’l Conf. Acous., Speech and Sig. Proc. (ICASSP), 2020, pp. 8866–8870.

 [20] M. M. Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Proc., vol. 68, no. 4, pp. 2155–2169, 2020.
 [21] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wirel. Commun., vol. 19, no. 1, pp. 491–506, 2020.
 [22] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Comm., vol. 19, no. 5, pp. 3546–3557, 2020.
 [23] Y. Shao, D. Gündüz, and S. C. Liew, “Federated edge learning with misaligned over-the-air computation,” arXiv cs.IT:2102.13604, 2021.
 [24] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of federated learning over a noisy downlink,” arXiv 2008.11141, 2020.
 [25] J. Castura and Y. Mao, “Rateless coding over fading channels,” IEEE Communications Letters, vol. 10, no. 1, pp. 46–48, 2006.
 [26] N. Ferdinand and S. C. Draper, “Hierarchical coded computation,” in 2018 IEEE Int. Symp. Inf. Theory (ISIT), June 2018, pp. 1620–1624.