Extensive research efforts have been devoted to overcoming the communication bottleneck in federated learning (FL). Lossy compression techniques, including quantization [2, 3, 4, 5] and sparsification [6, 7, 8, 9], have been developed to reduce the communication cost. However, these approaches treat the communication channel connecting the clients to the parameter server (PS) as an error-free bit pipe, ignoring the wireless channel characteristics. An efficient implementation of FL at the wireless edge, called federated edge learning (FEEL), requires jointly optimizing the learning framework with the underlying communication framework, taking into account channel characteristics and constraints [10, 11].
As is common in wireless networks, different clients may have distinct computational capabilities, channel statistics, power budgets, etc., resulting in the straggler effect in FEEL. The balance between training latency and energy consumption, achieved through careful design of the resource allocation policy, has been studied in [12, 13, 14]. By scheduling only a limited number of clients in each round, the communication requirement can be reduced and the straggler effect alleviated [15, 16, 17, 18]. In [15], the authors propose to maximize the number of scheduled clients in each round to speed up training. In [16], a function of the ages of client updates is minimized to schedule all the clients as often as possible. Convergence rates of three different scheduling policies, namely random, round-robin, and proportional fair scheduling, are analyzed in [17]. Different from the aforementioned works, [18] considers 'update-aware' client scheduling in FEEL, which takes the significance of the model updates into account together with their channel states. Alternatively, in [19] clients are clustered around several access points to perform FL hierarchically in order to reduce the communication latency. Another approach to addressing the communication bottleneck is aggregating the local model updates using the superposition property of the wireless multiple access channel [20, 21, 22]. However, this approach requires accurate synchronization among the transmitting agents [23].
Unlike the previous literature on FEEL, we consider both the downlink and uplink channel variations, together with random computation delays. With the exception of [24], works on FEEL ignore the downlink communication channel and the associated latency. Our first contribution is to overlap the downlink and uplink transmissions with local computations at the clients. This is achieved by fountain-coded delivery of the global model update [25], such that clients with better downlink channel conditions can receive the global model quickly and start computing right away. The clients are then scheduled for the uplink delivery of their model updates to the PS as soon as they complete their local computations. To further minimize the latency, we schedule the client with the minimum uplink latency at any point in time.
While this approach, called MRTP, minimizes the latency, it may result in a loss in test accuracy, as clients close to the PS would dominate the training process thanks to their statistically better channel conditions. We mitigate this bias by introducing both short-term and long-term fairness conditions. In particular, we utilize age and frequency metrics for short- and long-term fairness, respectively, where age, in a broad sense, measures the staleness of a client's model update, and frequency measures the long-term participation statistics of the clients. Our numerical results show that the proposed overlapped and fair scheduling policy significantly speeds up FEEL without sacrificing the final test accuracy.
II. System Model
II-A. FL Model
We consider clients collaboratively training a model parameter vector of dimension , with periodic communication with a PS, to minimize the empirical loss function , where is the loss function of client . Let denote the dataset at client , with samples. The empirical loss function at client is , where is the task-dependent loss function measured with the model parameter vector and data . At each communication round , each participating client performs
-step stochastic gradient descent (SGD) on its local dataset to minimize the loss function, based on the received global model parameters. At the -th step of local SGD at participating client in communication round , the local model is updated as
where is the learning rate, and is the gradient computed with the local model parameter and mini-batch data , which is uniformly randomly selected from and follows .
Each participating client then forwards its updated local model to the PS. Denoting the model vector at client after local steps by , , the global model is updated by the PS as .
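The local SGD steps and the PS-side aggregation described above can be sketched as follows; since the paper's symbols are elided here, the names (`grad_fn`, `lr`, `steps`) are illustrative stand-ins, and the aggregation is shown as a plain FedAvg-style average.

```python
import numpy as np

def local_sgd(theta_global, data, labels, grad_fn, lr=0.01, steps=5, batch=32):
    """Run `steps` mini-batch SGD updates starting from the received global model."""
    theta = theta_global.copy()
    rng = np.random.default_rng(0)
    for _ in range(steps):
        # mini-batch selected uniformly at random from the local dataset
        idx = rng.choice(len(data), size=batch, replace=False)
        theta = theta - lr * grad_fn(theta, data[idx], labels[idx])
    return theta

def aggregate(local_models):
    """PS update: average the local models uploaded in this round."""
    return np.mean(local_models, axis=0)
```

A usage sketch: each scheduled client calls `local_sgd` on its own shard and uploads the result, after which the PS calls `aggregate` over the received models to form the next global model.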
II-B. Communication Model
A block fading channel model is assumed, where the channels between the PS and the clients remain unchanged in each communication round of the FEEL process.
II-B1. Asynchronous global model transmission
Note that we have a multicast channel when transmitting the global model from the PS to the clients. To speed up the training process, we use fountain-coded multicasting of the global model to the clients [25]. The downlink rate at client in communication round is given by , , where is the complex-valued downlink channel coefficient between the PS and client in round , is the transmit power at the PS, and is the noise variance.
With asynchronous global model transmission in the downlink, a client recovers the global model with a latency dependent on its channel gain, and immediately starts local computations. Assuming that the global model is compressed into bits, it will take seconds for client to receive the model update. This will allow us to parallelize global model transmission and computations, instead of targeting the worst client to guarantee all the clients receive the global model simultaneously.
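As a concrete sketch of this asynchronous downlink, the snippet below computes a client's model-download latency from a Shannon-rate model; the bandwidth, power, noise, and model-size values are illustrative assumptions, not the paper's (elided) settings.

```python
import numpy as np

def downlink_latency(h, p_ps=1.0, noise_var=1e-9, bandwidth=1e6, model_bits=8e6):
    """Time for a client with downlink channel coefficient h to recover the
    fountain-coded global model: model size divided by the achievable rate."""
    rate = bandwidth * np.log2(1.0 + p_ps * abs(h) ** 2 / noise_var)  # bits/s
    return model_bits / rate  # seconds

# A client with a stronger channel recovers the model sooner and can start
# local computations immediately, instead of waiting for the worst client.
```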
In the rest of the paper, to simplify notation we will drop the round index when it is clear from the context.
II-B2. Local model update
In the uplink, where clients upload their model parameters to the PS, we assume a time-division framework; that is, only a single client is scheduled at any point in time. The instantaneous rate of each client is given by where is the transmit power at client , and is the uplink channel coefficient from client to the PS.
II-C. Local Computations
We assume that the computation speeds at the clients are also random following a distribution that is independent across clients and rounds. In each round, we denote by the client with the -th smallest accumulated latency in downlink transmission and local computation, , and by the corresponding latency since the beginning of the communication round, with .
II-D. Client Scheduling
With asynchronous downlink transmission and heterogeneous local computation speeds, clients complete their local model updates in a sequential manner. The clients that have completed local computations and are thus available to upload their local models to the PS are referred to as idle clients. We denote the set of idle clients at time within each round by . Hence, client is added to the idle set at time , and remains there until it uploads its model to the PS.
Let denote the index of the client scheduled for transmission at time , where means no client is scheduled. We have if ; that is, a client cannot be scheduled for upload before it completes its local computations. Let denote the remaining size of model parameter vector (measured in bits) at time that has not yet been uploaded to the PS. We have , and
where is the indicator function, which is 1 when holds, and 0 otherwise.
Let denote the time client completes uploading its model to the PS, i.e., . Client is removed from the idle set at time . Each round continues until out of clients upload their models to the PS, and therefore, some clients may never be added to the idle set, or may not leave the set at the end of the round. We have for those clients. Let denote the set of clients that have completed their upload by time within round , i.e., . For a specific scheduling policy, the completion time of round is given by , while the set of clients scheduled in round are given, with slight abuse of notation, by .
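The remaining-size dynamics above reduce to simple bookkeeping: only the currently scheduled client's counter decreases, at its instantaneous uplink rate. A minimal sketch, with dictionaries standing in for the paper's elided notation:

```python
def advance(remaining_bits, scheduled, rates, dt):
    """Advance the upload counters by dt seconds: only the scheduled client
    (if any) reduces its remaining model size; it leaves the idle set once
    its counter reaches 0."""
    if scheduled is not None:
        remaining_bits[scheduled] = max(
            0.0, remaining_bits[scheduled] - rates[scheduled] * dt
        )
    return remaining_bits
```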
III. Scheduling Policies
Our goal is to come up with scheduling policies that would not only minimize the completion time of each round, but also lead to the fastest convergence in terms of the wall-clock time.
III-A. Minimum Remaining Time-based Policy (MRTP)
Note that, at any time instant within a round, we can schedule any client from the idle set. However, it is easy to see that there is no loss of optimality in making scheduling decisions only when the idle set is updated, i.e., when a new client becomes idle or one of the idle clients completes uploading its model.
In MRTP, each time the idle set is updated, we schedule the client with the minimum remaining time to upload its update. Specifically, at time , client is scheduled if
Details of the MRTP can be found in Algorithm 1.
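The MRTP rule amounts to picking, among the idle clients, the one whose remaining upload time (remaining bits over instantaneous uplink rate) is smallest. A minimal sketch, with assumed data structures:

```python
def mrtp_schedule(idle, remaining_bits, rates):
    """MRTP: schedule the idle client with minimum remaining upload time."""
    if not idle:
        return None  # no client has finished local computations yet
    return min(idle, key=lambda k: remaining_bits[k] / rates[k])
```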
III-B. Age-aware MRTP (A-MRTP)
While MRTP minimizes the upload time, clients with statistically better channel qualities are more likely to be scheduled, which results in non-uniform sampling of the clients and over-fitting due to excessive use of a limited amount of data. This may drastically degrade the performance of FEEL, especially when the data is non-i.i.d., which is common in practice. Therefore, to strike a balance between latency and model accuracy, we propose two alternative schemes that take the 'age of update' into consideration.
For short-term fairness, we utilize the age metric, where the age of a client at round , denoted by , represents the number of rounds since the last time it was scheduled. The age parameter evolves as follows:
In A-MRTP, clients are first selected to upload their models as in MRTP to minimize the latency, where is a tuning parameter. Then, to promote the selection of less frequently scheduled clients, we select the client with the minimum ratio between its remaining upload time and the age of its update, which can be considered the average latency per timely update in the short term. A balance between training efficiency and fairness is thus achieved by tuning .
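The A-MRTP criterion can be sketched as below. The age recursion shown is one common convention (reset when scheduled, otherwise increment), since the paper's exact equation is elided here; under this convention ages stay at least 1, so the ratio is well defined.

```python
def update_age(age, scheduled_set):
    """One common age recursion (assumed, as the paper's equation is elided):
    reset to 1 when the client was scheduled this round, else increment."""
    return {k: 1 if k in scheduled_set else a + 1 for k, a in age.items()}

def a_mrtp_schedule(idle, remaining_bits, rates, age):
    """A-MRTP second stage: minimize remaining upload time per unit of age,
    so stale clients are favoured over merely fast ones."""
    if not idle:
        return None
    return min(idle, key=lambda k: (remaining_bits[k] / rates[k]) / age[k])
```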
III-C. Opportunistic Fair MRTP (OF-MRTP)
The main drawback of A-MRTP is that it utilizes only the instantaneous rate for client scheduling. However, when the clients are located at different distances from the PS, their participation frequency will still depend on their locations. We propose an opportunistic policy that utilizes the relative channel condition, denoted by , which measures the ratio of the instantaneous channel rate of client to its long-term average value ; that is, we have . Hence, instead of scheduling clients based on their instantaneous channel states, we use for scheduling. Further, we consider two metrics to ensure both short-term and long-term fairness among the clients: the age metric for short-term fairness, to promote uniform participation of the clients, and a frequency metric for long-term fairness, to promote equal participation of clients. The frequency metric, , denotes the participation frequency of the -th client at round . We define , where denotes the total number of rounds in which client has participated up to round .
In order to introduce long-term fairness, we consider a subset of the idle clients as follows
where is a maximum frequency constraint. For the opportunistic policy, we further consider the following subset of clients
where is the minimum age constraint introduced to ensure short-term fairness, and is the rate constraint for opportunistic scheduling. If there are multiple clients in , the client with the maximum value, i.e., the one with the best relative channel condition, is scheduled.
Similarly to A-MRTP, the proposed opportunistic policy consists of two steps. In the initial step, a portion of the clients is scheduled from according to MRTP, while the remaining clients are scheduled from based on . However, if , then a client is scheduled from according to MRTP. As shown in (5), clients with excessive participation frequency are excluded from scheduling, which increases the participation of less frequently selected clients and thus improves fairness in scheduling. Overall, the proposed opportunistic scheduling strategy OF-MRTP is defined by four system parameters: , , , and .
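The two-level filtering of OF-MRTP can be sketched as below; `f_max`, `a_min`, and `gamma` stand in for the frequency, age, and relative-rate thresholds (the paper's symbols are elided), and the fallback to MRTP when the opportunistic set is empty follows the description above.

```python
def of_mrtp_schedule(idle, remaining_bits, rates, freq, age, rel_rate,
                     f_max, a_min, gamma):
    """OF-MRTP sketch: drop over-participating clients, then schedule the best
    relative channel among stale clients above the rate threshold; if none
    qualifies, fall back to MRTP over the frequency-feasible set."""
    s = [k for k in idle if freq[k] <= f_max]           # long-term fairness
    s_opp = [k for k in s if age[k] >= a_min and rel_rate[k] >= gamma]
    if s_opp:                                           # opportunistic pick
        return max(s_opp, key=lambda k: rel_rate[k])
    if s:                                               # MRTP fallback
        return min(s, key=lambda k: remaining_bits[k] / rates[k])
    return None
```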
TABLE I: OF-MRTP results. | Age threshold | MRTP fraction | Frequency constraint | Accuracy (mean ± std) | Average per-round latency |
TABLE II: A-MRTP results. | MRTP fraction | Accuracy (mean ± std) | Average latency |
IV. Numerical Results
IV-A. Simulation Setup
IV-A1. Objective and network setup
We consider image classification on the CIFAR-10 dataset, which contains 50,000 training and 10,000 test images from 10 classes. We employ a convolutional neural network (CNN) architecture consisting of 4 convolutional layers followed by 4 fully connected layers. We set local iterations with a batch size of 32.
We consider client devices randomly distributed around the PS, and assume that the training dataset is distributed disjointly among the client devices in a non-i.i.d. manner, such that each client has 500 distinct training images from at most 4 different classes. Finally, we set the participation ratio to , which means that at each round, devices out of 100 are scheduled to upload their models to the PS.
IV-A2. Computation latency
To model the computation latency at the clients, we consider the commonly employed shifted exponential distribution, where the probability of completing local updates by time is given by
where is the minimum computation latency to perform one local update, and is the average additional delay for one local update, that is, , where is the mean computation time for one local update. In order to obtain practically relevant values for and , we measured the computation time on a CPU using the time.time() command, for the given batch size and network model, over 100,000 trials.
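Under the shifted-exponential model above, the computation latency for a given number of local updates can be sampled as below; the shift and mean values are placeholders for the (elided) measured constants, not the paper's settings.

```python
import numpy as np

def sample_compute_latency(n_updates, t_min=0.005, t_mean_extra=0.01, rng=None):
    """Shifted exponential: a deterministic minimum of t_min seconds per local
    update, plus an exponential tail with mean t_mean_extra per update."""
    rng = rng if rng is not None else np.random.default_rng()
    return n_updates * t_min + rng.exponential(n_updates * t_mean_extra)
```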
IV-A3. Path loss and noise
Clients are uniformly randomly distributed in the region within 500 meters around the PS. The path loss of client is given by , where is the distance of client to the PS measured in kilometers. We set and . The channels of the clients across different time slots are modeled as i.i.d. fading. The standard deviation of the channel gain of client is , , and the channel coefficient is modeled as , where denotes an i.i.d. random variable accounting for Rayleigh fading of unit power in communication round . The noise variances at the PS and the clients are set to Watts.
IV-B. Simulation Results
We first consider the A-MRTP scheme. For the experiments, we consider . The average per-round latency and test accuracy results are presented in Table II. As predicted, we observe that decreasing , i.e., scheduling more clients according to the age metric, increases both the test accuracy and the average latency.
We then consider OF-MRTP by setting , , , and . The test accuracy and average per-round latency results are presented in Table I. We note that if a round-robin scheduler is employed with participation ratio , then the maximum age will be , and similarly, the maximum participation frequency will be . Therefore, to set up the parameter values and , we use these values as our reference points. One can easily observe from Table I that with OF-MRTP the average latency, as well as the variation in the latency, drops significantly, since a client is scheduled only if . Moreover, thanks to the control of the participation frequency of the clients, we also observe a significant improvement in the test accuracy. We also remark that when , we observe similar test accuracy and latency results across the other parameters. The reason is that MRTP is used for client scheduling when , and a larger also increases the probability of being empty, so more clients are scheduled according to MRTP. This observation is backed by the simulation results with , where the impact of the parameter is more visible. Not surprisingly, the minimum latency is achieved when more clients are scheduled according to MRTP and larger participation frequencies are allowed. Interestingly, in this case, where and , we do not observe a compromise in the test accuracy.
For comparison, we consider MRTP and random scheduling as benchmarks. Note that these are the optimal strategies from the latency and fairness perspectives, respectively. In Fig. 1, we compare the convergence behaviour of the OF-MRTP, A-MRTP, and MRTP schemes. As one would expect, the test accuracy increases faster with MRTP; however, due to the non-i.i.d. distribution of the data, it converges to a sub-optimal model and eventually diverges (the test accuracy is plotted until the divergence point). While A-MRTP eventually reaches an accuracy level above MRTP (see Table II), it introduces significant latency due to scheduling clients with poor channel conditions to reduce their age. The convergence behaviour of random scheduling is not included in the figure since its average per-round latency is very high; instead, we compare OF-MRTP and random scheduling based on the final test accuracy, averaged over 10 trials. We observe that the average final test accuracy with random scheduling is . Comparing with the test accuracy results in Table I, we conclude that OF-MRTP does not compromise the accuracy while significantly reducing the latency.
V. Conclusions
We proposed novel global model transmission and client scheduling techniques to speed up the wall-clock training time of FEEL without sacrificing the final test accuracy. In particular, we ensured fair participation of the clients to achieve high test accuracy, and reduced the overall latency, which includes the computation time and the uplink/downlink model transmission latencies. To this end, we first introduced a fountain-coded asynchronous model downlink strategy to allow clients to start local computations without waiting for others to download the global model. We then introduced MRTP, which adaptively schedules the client that can upload its local model to the PS in the fastest manner. Together, MRTP and the asynchronous downlink strategy help to overlap computation and communication, and thus reduce the overall latency. However, as we show experimentally, client selection that focuses solely on latency may lead to divergence when certain clients participate in the model update more frequently than others. Hence, we further employed the 'update age' and 'update frequency' as fairness metrics, which are used opportunistically to speed up training without sacrificing accuracy. Finally, through extensive simulations, we showed that it is possible to significantly reduce the overall latency without compromising the test accuracy.
-  B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intell. and Stat., PMLR, 2017, pp. 1273–1282.
-  F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Conf. of Int’l Speech Comm. Assoc., 2014.
-  D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” arXiv preprint arXiv:1610.02132, 2016.
-  W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” arXiv preprint arXiv:1705.07878, 2017.
-  S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
-  N. Strom, “Scalable distributed DNN training using commodity GPU cloud computing,” in Annual Conf. of Int’l Speech Comm. Assoc., 2015.
-  N. Dryden, T. Moon, S. Jacobs, and B. Van Essen, “Communication quantization for data-parallel training of deep neural networks,” in Wrkshp. on Mach. Learn. in HPC Environs. (MLHPC), 2016, pp. 1–8.
-  A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” arXiv preprint arXiv:1704.05021, 2017.
-  H. Wang, S. Sievert, Z. Charles, S. Liu, S. Wright, and D. Papailiopoulos, “Atomo: Communication-efficient learning via atomic sparsification,” arXiv preprint arXiv:1806.04090, 2018.
-  D. Gunduz, D. B. Kurka, M. Jankowski, M. M. Amiri, E. Ozfatura, and S. Sreekumar, “Communicate to learn at the edge,” IEEE Commun. Mag., vol. 58, no. 12, pp. 14–19, 2020.
-  M. Chen, D. Gunduz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” arXiv cs.LG:2104.02151, 2021.
-  Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio resource allocation for federated edge learning,” in IEEE Int’l Conf. on Comm. Workshops, 2020, pp. 1–6.
-  Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy Efficient Federated Learning Over Wireless Communication Networks,” arXiv, pp. 1–30, 2019.
-  M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence Time Optimization for Federated Learning over Wireless Networks,” IEEE Trans. Wirel. Commun., pp. 1–30, 2020.
-  T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in IEEE Int’l Conf. on Communications (ICC), 2019, pp. 1–7.
-  H. Yang, A. Arafa, T. Quek, and H. V. Poor, “Age-Based Scheduling Policy for Federated Learning in Mobile Edge Networks,” IEEE Int. Conf. Acoust. Speech Signal Proc. (ICASSP), pp. 8743–8747, 2020.
-  H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling Policies for Federated Learning in Wireless Networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, 2020.
-  M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. Vincent Poor, “Convergence of update aware device scheduling for federated learning at the wireless edge,” IEEE Trans. Wireless Comm., pp. 1–1, 2021.
-  M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” in IEEE Int’l Conf. Acous., Speech and Sig. Proc. (ICASSP), 2020, pp. 8866–8870.
-  M. M. Amiri and D. Gunduz, “Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air,” IEEE Trans. Signal Proc., vol. 68, no. 4, pp. 2155–2169, 2020.
-  G. Zhu, Y. Wang, and K. Huang, “Broadband Analog Aggregation for Low-Latency Federated Edge Learning,” IEEE Trans. Wirel. Commun., vol. 19, no. 1, pp. 491–506, 2020.
-  M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Comm., vol. 19, no. 5, pp. 3546–3557, 2020.
-  Y. Shao, D. Gündüz, and S. C. Liew, “Federated edge learning with misaligned over-the-air computation,” arXiv cs.IT:2102.13604, 2021.
-  M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of federated learning over a noisy downlink,” arXiv 2008.11141, 2020.
-  J. Castura and Y. Mao, “Rateless coding over fading channels,” IEEE Communications Letters, vol. 10, no. 1, pp. 46–48, 2006.
-  N. Ferdinand and S. C. Draper, “Hierarchical coded computation,” in 2018 IEEE Int. Symp. Inf. Theory (ISIT), June 2018, pp. 1620–1624.