I. Introduction
Federated learning (FL) is a decentralized machine learning (ML) framework in which multiple devices collaboratively participate in a common training process over locally distributed data [1]. In contrast to centralized ML, where the entire training data set is stored in a central unit, in an FL system the local training data can be kept private at each device without being uploaded to a cloud/server. Compared to prior distributed optimization frameworks that assume evenly distributed data, FL considers more practical settings where the devices might be massively distributed, with non-IID and unbalanced data.
One common problem in FL systems using the original Federated Averaging (FedAvg) algorithm proposed in [2] and its variations is the straggler issue. This problem originates from the fact that, due to synchronized training and updating, the duration of one communication round is strictly limited by the slowest participating device [3]. In a practical environment with devices that are heterogeneous in their computation capabilities, the straggler issue has a significant impact on the completion time of an FL process. One possible solution is to shift from the synchronous setting of FedAvg to asynchronous training and updating, which avoids waiting for straggling devices before update aggregation. Several deep learning algorithms with asynchronous FL have been studied in the literature [4, 5]. Moreover, various heuristic aggregation policies have been proposed to deal with the increased variation of local updates caused by the asynchronous structure [6, 7].
In a wireless FL system, the uploading of local updates takes place over the wireless uplink, which is the part most affected by the scarcity of wireless resources. To reduce the communication load in this procedure, one solution is to allow only a fraction of the participating devices to upload their local updates in each communication round. Device scheduling and resource allocation have therefore become an important design aspect for FL over wireless networks [8, 9]. In traditional cellular networks, device scheduling usually aims at maximizing spectral efficiency or network throughput. For distributed learning systems such as FL, however, the system objective is to optimize the parameters of the training model. Device scheduling for FL therefore requires learning-oriented rather than rate-oriented design, which makes this problem fundamentally different from existing solutions in conventional cellular networks. Intuitively, a device with a higher potential impact on the learning-related system performance should be given higher priority to be scheduled, and possibly a larger amount of communication resources to convey its information. Several existing works consider different metrics to indicate the significance of local updates, such as the norm of the model updates [10], signal-to-noise ratio and data uncertainty [11], success probability of update transmission [12], and Age-of-Update (AoU) [13]. However, all of them consider synchronous FL based on the original form of the FedAvg algorithm. Few existing works have considered device scheduling and resource allocation for asynchronous FL [14], which makes it a highly underexplored area.
The main purpose of this work is to answer the following questions:

1) For asynchronous FL with heterogeneous devices, what is the most appropriate scheduling policy given limited communication resources?

2) Under a given scheduling policy, how should we design the update aggregation rule?

3) What are the fundamental differences between synchronous and asynchronous FL systems in terms of the joint design of scheduling and aggregation policies?
To answer these questions, we investigate several schemes for device scheduling and model update aggregation in asynchronous FL systems under device heterogeneity in computation capacity and training data distribution. (An analytical evaluation of the convergence of the proposed schemes will be addressed in an extended version of this paper.) The performance of the proposed schemes is evaluated and compared on a classification problem using the MNIST data set [15].

II. System Model
We consider an FL system with $N$ devices participating in the training of a shared global learning model, parameterized by a $d$-dimensional parameter vector $\boldsymbol{w} \in \mathbb{R}^d$. Denote by $\mathcal{K} = \{1, \ldots, N\}$ the set of device indices in the system. Each device $k \in \mathcal{K}$ holds a set of local training data $\mathcal{D}_k$ with size $D_k = |\mathcal{D}_k|$. Let $\mathcal{D} = \bigcup_{k \in \mathcal{K}} \mathcal{D}_k$ represent the entire data set in the system, with size $D = \sum_{k \in \mathcal{K}} D_k$, where $\mathcal{D}_i \cap \mathcal{D}_j = \emptyset$ for all $i \neq j$. The objective of the system is to find the optimal parameter vector $\boldsymbol{w}^*$ that minimizes an empirical loss function defined by
$$F(\boldsymbol{w}) = \frac{1}{D} \sum_{\boldsymbol{x} \in \mathcal{D}} f(\boldsymbol{w}; \boldsymbol{x}) \qquad (1)$$
where $f(\boldsymbol{w}; \boldsymbol{x})$ is the sample-wise loss function computed over the data sample $\boldsymbol{x}$. We define the local loss function at device $k$ as
$$F_k(\boldsymbol{w}) = \frac{1}{D_k} \sum_{\boldsymbol{x} \in \mathcal{D}_k} f(\boldsymbol{w}; \boldsymbol{x}) \qquad (2)$$
which is averaged over the local training data set $\mathcal{D}_k$. Then, we can rewrite $F(\boldsymbol{w})$ as
$$F(\boldsymbol{w}) = \sum_{k \in \mathcal{K}} \frac{D_k}{D} F_k(\boldsymbol{w}) \qquad (3)$$
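The decomposition in (3) can be checked numerically. The sketch below is illustrative only: the squared-error sample loss and the data set sizes are assumptions, not part of the system model above.

```python
import numpy as np

def sample_loss(w, x):
    # assumed sample-wise loss f(w; x): squared error of a linear predictor
    return float((w @ x) ** 2)

def local_loss(w, data_k):
    # F_k(w) in (2): average sample loss over device k's local data set
    return sum(sample_loss(w, x) for x in data_k) / len(data_k)

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
datasets = [rng.standard_normal((n, 5)) for n in (10, 20, 30)]   # unbalanced local data sets
D = sum(len(dset) for dset in datasets)

# F(w) in (1), computed directly over the full data set
direct = sum(sample_loss(w, x) for dset in datasets for x in dset) / D
# F(w) in (3), computed as the D_k/D weighted sum of local losses
weighted = sum(len(dset) / D * local_loss(w, dset) for dset in datasets)
```

The two computations agree for any choice of sample-wise loss, since (3) only regroups the terms of (1) by device.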
II-A. FedAvg with Synchronous Training and Aggregation
In a typical FL system, all devices participate in the global training process following a synchronized procedure. FedAvg [2] is widely considered a representative scheme with this synchronous structure of local training and global aggregation. In the FedAvg algorithm, the entire training process is divided into global iterations (communication rounds); in every global iteration, the server aggregates the stochastic gradient updates received from the participating devices, computed over their locally available data.
We consider a modified version of the FedAvg algorithm, in which an extra step of device scheduling is added after local training, as illustrated in Fig. 1. The main motivation is to reduce communication costs and delay, especially in a wireless network with limited data rates. In the $t$-th global iteration, $t = 0, 1, \ldots$, the following steps are executed:

1) The server broadcasts the current global model $\boldsymbol{w}^t$ to all devices in $\mathcal{K}$.

2) Each device $k$ runs $\tau$ stochastic gradient descent (SGD) iterations, following the update rule
$$\boldsymbol{w}_k^{t,i+1} = \boldsymbol{w}_k^{t,i} - \eta \nabla F_k(\boldsymbol{w}_k^{t,i}; \xi_k^{t,i}) \qquad (4)$$
with $i = 0, \ldots, \tau-1$ being the local iteration index, $\boldsymbol{w}_k^{t,0} = \boldsymbol{w}^t$, $\eta$ representing the learning rate, and $\nabla F_k(\boldsymbol{w}_k^{t,i}; \xi_k^{t,i})$ being the gradient computed based on a randomly selected mini-batch $\xi_k^{t,i}$. After completing the local training, $\boldsymbol{w}_k^{t} = \boldsymbol{w}_k^{t,\tau}$ is the local update from device $k$.

3) The server schedules a subset $\Pi^t \subseteq \mathcal{K}$ of devices for update reporting. After receiving the local updates from the scheduled devices, the server aggregates the received information and updates the global model as
$$\boldsymbol{w}^{t+1} = \sum_{k \in \Pi^t} \frac{D_k}{\sum_{j \in \Pi^t} D_j}\, \boldsymbol{w}_k^t \qquad (5)$$
where the update from each non-scheduled device $k \notin \Pi^t$ is not received by the server and is thus omitted in (5).
This iterative procedure continues until the model converges.
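As a rough illustration of steps 1)–3), the following sketch runs the modified FedAvg loop on a toy quadratic local loss. The exact gradient (used in place of a mini-batch gradient), the random scheduling rule, and all numeric constants are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, tau, eta, d = 8, 4, 5, 0.1, 3      # devices, scheduled per round, local steps, step size, model dim
targets = rng.standard_normal((N, d))    # local optimum of each device's toy loss (non-IID flavor)
D_k = rng.integers(10, 50, size=N)       # local data set sizes D_k

def local_sgd(w_global, k):
    # tau local steps of (4) on the toy local loss F_k(w) = 0.5 * ||w - target_k||^2
    w = w_global.copy()
    for _ in range(tau):
        w -= eta * (w - targets[k])
    return w

w = np.zeros(d)
for t in range(50):
    local_models = [local_sgd(w, k) for k in range(N)]
    sched = rng.choice(N, size=K, replace=False)               # device scheduling after local training
    denom = sum(D_k[k] for k in sched)
    w = sum(D_k[k] / denom * local_models[k] for k in sched)   # aggregation as in (5)
    # over many rounds, w drifts toward a weighted combination of the local optima
```

Note that local training happens at every device in every round; scheduling only decides whose result is uploaded and aggregated.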
II-B. Asynchronous FL with Periodic Aggregation
In an FL system with synchronous training and updating, update aggregation is feasible only after all involved devices finish their local training (SGD computation) step. This implies that devices with inferior computation capability create the straggler issue and slow down the training process. To address this issue, asynchronous FL has been proposed in [5, 6] and shown to be effective in resolving it. However, fully asynchronous FL with sequential updating can suffer from high communication costs caused by frequent model updating and transmission of local updates. To tackle both concerns, we propose an asynchronous FL framework with periodic aggregation. The general idea is to allow asynchronous training at different devices, with the server periodically collecting updates from those devices that have completed their computation, while the rest continue their local training without being interrupted or dropped. Fig. 2 shows an example of the training and updating timeline of the original synchronous FL, fully asynchronous FL [5], and our proposed scheme.
We consider the scenario where the participating devices have different computation capabilities. When a device finishes its local training, it sends a signal to the server indicating its readiness for update reporting. After every time duration $\Delta T$, the server schedules a subset of the ready-to-update devices. The received local updates are aggregated at the server by applying a weighted averaging rule. The updated global model is then distributed to all ready devices, which continue their local SGD steps based on the newly received global model. The key design factors in this asynchronous FL setup are twofold:

1) At each aggregation time, the set of ready-to-update devices might be different. Given the communication resource constraint, how should we schedule the subset of available devices for update reporting?

2) The updated local models from different devices might be obtained from different previously received global models; some are more recent and some are more outdated. How should we design an appropriate aggregation and weighting policy that takes into account the freshness of model updates in this asynchronous setting?
In the remainder of this paper, we investigate several scheduling and aggregation policies for FL with asynchronous training and periodic aggregation. We define $\mathcal{A}^t$ as the set of all ready-to-update devices in the $t$-th global iteration. Let $\Pi^t \subseteq \mathcal{A}^t$ be the set of scheduled devices, with $|\Pi^t| \le K$, indicating that up to $K$ devices are scheduled. (We consider a simplified system model with a limited communication budget and uniform gradient update precision for the scheduled devices. More realistic channel models and corresponding compression schemes will be investigated in extended studies.) For any device $k \in \Pi^t$, its local update is computed according to (4), with the initial model parameter vector in the first local iteration being
$$\boldsymbol{w}_k^{t,0} = \boldsymbol{w}^{\nu_t(k)} \qquad (6)$$
where
$$\nu_t(k) = \max\left\{ t' \le t : \text{device } k \text{ has received the global model } \boldsymbol{w}^{t'} \right\} \qquad (7)$$
indicates the latest global iteration index at which device $k$ has received an updated global model. We adopt a regularization technique proposed in [7] to alleviate potential model imbalance caused by asynchronous training. The gradient in (4) is computed based on the following regularized local loss
$$\tilde{F}_k(\boldsymbol{w}) = F_k(\boldsymbol{w}) + \frac{\mu}{2} \left\| \boldsymbol{w} - \boldsymbol{w}^{\nu_t(k)} \right\|^2 \qquad (8)$$
where $F_k(\boldsymbol{w})$ is the local loss function built from the sample-wise loss introduced in (1), and $\mu$ is the regularization coefficient. Inspired by the concept of Age of Information (AoI) [16], we define the metric "Age of Local Update" (ALU) as
$$a_t(k) = t - \nu_t(k) \qquad (9)$$
which shows the elapsed time (in global iterations) since the last reception of an updated global model. (Note that this age-based definition is different from the Age of Update (AoU) proposed in [13], which measures the elapsed time at each device since its last participation in model aggregation.) After receiving the scheduled updates, the server conducts the model aggregation as
$$\boldsymbol{w}^{t+1} = \sum_{k \in \mathcal{K}} \alpha_k^t\, \boldsymbol{w}_k^t \qquad (10)$$
with $\sum_{k \in \mathcal{K}} \alpha_k^t = 1$. Here, each weight coefficient $\alpha_k^t$ can be related to the training data size $D_k$, to the ALU $a_t(k)$, or to a combination of both. Two different weighting designs may be considered for the non-scheduled devices, depending on the assumption on the statistical distribution of their local updates. For any non-scheduled device $k \notin \Pi^t$:

1) $\alpha_k^t = 0$. This design computes the updated global model purely from the updates received from the scheduled devices.

2) $\alpha_k^t > 0$, with $\boldsymbol{w}_k^t$ taken as the last local model of device $k$. This design considers the non-updated local models from the set of non-scheduled devices in the aggregation step, as in [2].
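The bookkeeping behind (6), (9), and (10) can be sketched as a simple discrete-time simulation. The training durations, the aggregation period, the first-$K$ scheduling rule, and the device count below are illustrative assumptions, and the actual update aggregation is left as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 6, 3
T_k = rng.uniform(1.0, 4.0, size=N)   # heterogeneous local training durations (assumed)
dT = 1.5                              # aggregation period Delta T (assumed)

nu = np.zeros(N, dtype=int)   # nu_t(k): last global iteration whose model device k received
finish = T_k.copy()           # wall-clock time at which each device becomes ready
for t in range(1, 11):
    now = t * dT
    ready = [k for k in range(N) if finish[k] <= now]   # the ready-to-update set A^t
    sched = ready[:K]                                   # placeholder for a scheduling policy
    alu = {k: t - nu[k] for k in sched}                 # Age of Local Update, cf. (9)
    # ... aggregate the scheduled local updates here, e.g. with age-aware weights ...
    for k in ready:       # all ready devices receive the new global model and restart training
        nu[k] = t
        finish[k] = now + T_k[k]
```

Devices that are still computing keep their `finish` time untouched, matching the scheme's property that slow devices are neither interrupted nor dropped.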
III. Scheduling and Aggregation Policies for Asynchronous FL
As mentioned in Section II, in every global iteration, only a subset of all available devices is scheduled for uploading their model updates. Among the scheduled devices in $\Pi^t$, the local updates might have different levels of freshness, since the local gradients are computed based on different previously received global models. Moreover, as a result of device scheduling and asynchronous training, each device might participate in aggregation at a different rate, which leads to unbalanced contributions to the global model. These issues suggest a joint design of device scheduling and aggregation policy that leverages both the freshness and the significance of updates in an asynchronous FL setting.
III-A. Device Scheduling Policies
We consider three different scheduling policies, namely random, significance-based, and frequency-based scheduling, described as follows.
III-A.1 Random Scheduling
Up to $K$ devices are selected uniformly at random, without replacement, from the set $\mathcal{A}^t$ of ready-to-update devices. This often serves as a baseline policy in the FL literature.
III-A.2 Significance-based Scheduling
Giving preference to devices with more influential updates, we first sort the norms of the gradient updates from all available devices,
$$\left\| \boldsymbol{w}_k^t - \boldsymbol{w}^{\nu_t(k)} \right\|_2, \quad k \in \mathcal{A}^t \qquad (11)$$
and then select those with the largest values. Note that this approach requires the quantity in (11) to be shared with the server as side information.
III-A.3 Frequency-based Scheduling
To maintain balance among the devices in terms of their contributions to the global model, this policy gives higher preference to devices that have participated less frequently in aggregation during previous communication rounds. To formalize the idea, we define a counting metric
$$c_k^t = \sum_{t'=0}^{t-1} \mathbb{1}\left\{ k \in \Pi^{t'} \right\} \qquad (12)$$
which counts how many times a device has previously been scheduled for uploading its local updates. After sorting $c_k^t$ over all available devices, those with the smallest values are selected. Ties between devices with the same counting metric are broken by random selection.
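The three policies above can be sketched as a single selection function. The dictionary-based bookkeeping, the random tie-breaking mechanism, and the example data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def schedule(ready, K, policy, norms=None, counts=None):
    # Select up to K devices from the ready-to-update set A^t.
    # norms: update norms as in (11); counts: scheduling counts c_k^t as in (12).
    ready = list(ready)
    if len(ready) <= K:
        return ready
    if policy == "random":
        return list(rng.choice(ready, size=K, replace=False))
    if policy == "significance":
        return sorted(ready, key=lambda k: -norms[k])[:K]      # largest update norms first
    if policy == "frequency":
        # least-scheduled first, ties broken by a random secondary key
        return sorted(ready, key=lambda k: (counts[k], rng.random()))[:K]
    raise ValueError(policy)

ready = range(6)
norms = {k: abs(v) for k, v in enumerate(rng.standard_normal(6))}
counts = {0: 2, 1: 0, 2: 5, 3: 0, 4: 1, 5: 3}
picked = schedule(ready, 2, "frequency", counts=counts)   # devices 1 and 3 (count 0)
```

Significance-based scheduling additionally requires each device to report its norm value in (11), whereas the counting metric in (12) is already known at the server.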
III-B. Update Aggregation Policies
To address model aggregation with asynchronous updates and varying data proportions, we consider two types of aggregation policies, namely equal-weight and age-aware aggregation.
III-B.1 Equal Weight
With this policy, the model updates from all selected devices are aggregated with weights assigned purely based on their data proportions, regardless of which global model each update originated from. Hence, the aggregation weight of device $k \in \Pi^t$ in the $t$-th global iteration is determined by
$$\alpha_k^t = \frac{D_k}{\sum_{j \in \mathcal{S}^t} D_j}, \quad k \in \Pi^t \qquad (13)$$
where $\mathcal{S}^t = \Pi^t$ or $\mathcal{S}^t = \mathcal{K}$, depending on the assumption on the data distributions of the non-scheduled devices, as discussed in Section II-B. Note that this is often considered a baseline aggregation rule for FL.
III-B.2 Age-aware Aggregation
We consider two options for the age-aware aggregation policy. The first option is to assign higher weights to older local updates, i.e., those with larger ALU $a_t(k)$. This choice can balance the participation rates among devices and reduce the risk of the trained model being excessively biased toward devices with stronger computation capacity. However, it also raises the concern of applying outdated updates to an already evolved global model. The second option, on the contrary, assigns higher weights to updates with smaller $a_t(k)$. Favoring fresher local updates may help the global model converge smoothly over time, at the risk of converging to an imbalanced model, especially in scenarios with non-IID data distribution.
The age-aware weighting design is given by
$$\alpha_k^t = \frac{\gamma_k\, \beta^{\,a_t(k)}}{\sum_{j \in \mathcal{S}^t} \gamma_j\, \beta^{\,a_t(j)}}, \quad k \in \Pi^t \qquad (14)$$
where $\gamma_k = D_k$ or $\gamma_k = 1$. Here, $\beta > 0$ is a real-valued constant factor, which distinguishes two cases:

1) $\beta > 1$: the system favors older local updates.

2) $0 < \beta < 1$: the system favors fresher local updates.
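Both weighting rules can be written as one normalization function. In this sketch the normalization set is taken to be the scheduled set (one of the two options for $\mathcal{S}^t$), and the data sizes and ALU values are illustrative assumptions.

```python
def aggregation_weights(sched, D, alu, beta=None):
    # Weights over the scheduled set, normalized to sum to one.
    # beta=None gives the equal-weight rule (13); otherwise the age-aware rule (14),
    # with beta > 1 favoring older updates and 0 < beta < 1 favoring fresher ones.
    raw = {k: D[k] * (1.0 if beta is None else beta ** alu[k]) for k in sched}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

D = {0: 100, 1: 100, 2: 100}      # equal data sizes, to isolate the effect of age
alu = {0: 1, 1: 2, 2: 4}          # ALU values a_t(k)

fresh = aggregation_weights(D, D, alu, beta=0.5)   # beta < 1: fresher updates weighted up
old = aggregation_weights(D, D, alu, beta=2.0)     # beta > 1: older updates weighted up
```

With equal data sizes, the ratio between any two weights reduces to $\beta^{\,a_t(k) - a_t(j)}$, which makes the role of $\beta$ explicit.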
IV. Simulation Results
In this section, we evaluate the performance of different combinations of scheduling and aggregation policies on the handwritten digit classification problem, using the MNIST training data set [15]. The training set contains 60,000 samples, which are allocated evenly among the $N$ devices. The general system setting is as follows.

• The local training duration of each device in every global iteration, denoted by $T_k$, is generated from a uniform distribution $\mathcal{U}(T_{\min}, T_{\max})$, where $T_{\max}$ represents the longest computing time due to straggling. For the asynchronous FL scheme, we choose a fixed period $\Delta T$ as the aggregation interval.

• In the case of IID data distribution, each device possesses an equal number of disjoint samples randomly drawn from $\mathcal{D}$. Under the non-IID setting, the data allocation follows the same method as in [2]; under this setting, each device contains data samples of at most two different digits.

• In every global iteration, up to $K$ devices are scheduled for uploading their local model updates.

• The learning rate $\eta$ in (4) is fixed and identical for all devices.

• The regularization coefficient $\mu$ in (8) is kept constant throughout training.
In the aggregation process, we consider the case with $\alpha_k^t = 0$ for every non-scheduled device $k \notin \Pi^t$, meaning that the updated global model is computed purely from the received updates of the scheduled devices, as explained in Section II-B. The implementation of FedAvg is modified accordingly with the same weighting design.
Figs. 3 and 4 show the test accuracy of the different scheduling and aggregation policies for asynchronous FL under IID and non-IID data distributions, respectively. The performance of FedAvg is also presented as a baseline scheme. Since the aggregation period $\Delta T$ in the asynchronous case equals one quarter of the duration of a FedAvg communication round, model aggregation in asynchronous FL is four times more frequent than with FedAvg. The abbreviations in the legends are summarized in Table I.
From the simulation results, we first observe that the asynchronous FL scheme generally outperforms FedAvg in both the IID and non-IID scenarios. However, this advantage comes at the cost of more frequent uploading and aggregation of local updates, which increases the communication cost. Among the considered scheduling and aggregation policies for asynchronous FL, the combination of random scheduling and age-aware aggregation favoring fresher local updates performs best. In other data distribution scenarios, especially when slower devices possess unique training data, we have observed that favoring older updates can sometimes perform better. Notably, significance-based scheduling leads to fluctuating test performance due to the increased variation in the model aggregation, which suggests that norm-based significance-aware update scheduling might not be an appropriate option for asynchronous FL. Finally, among all scheduling policies, the frequency-based scheme generally performs worst, which shows that imposing equal participation rates among the devices is not an efficient choice for asynchronous FL with heterogeneous devices.
In summary, we conclude that the joint design of scheduling and aggregation for asynchronous FL requires different considerations than in the synchronous case. Furthermore, our proposed asynchronous FL scheme with periodic aggregation provides an efficient and flexible structure for resolving the straggler issue of synchronous FL systems.
V. Conclusions
In this work, we proposed an FL framework with asynchronous local training and periodic update aggregation. Specifically, we considered an asynchronous FL system over a resource-limited network where only a fraction of the devices are allowed to upload their local model updates to the server in every communication round. Several device scheduling and update aggregation policies were investigated and compared through simulations. We observed that random scheduling performs surprisingly better than the alternative options in our proposed asynchronous FL scheme, especially under non-IID data distribution. Due to the different levels of update freshness caused by asynchronous training, an appropriate age-aware model aggregation design can also greatly affect the system performance.
References
 [1] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
 [2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
 [3] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz, "Revisiting distributed synchronous SGD," in International Conference on Learning Representations Workshop Track, 2016.
 [4] W. Zhang, S. Gupta, X. Lian, and J. Liu, "Staleness-aware async-SGD for distributed deep learning," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016.
 [5] C. Xie, S. Koyejo, and I. Gupta, "Asynchronous federated optimization," in NeurIPS Workshop on Optimization for Machine Learning, 2020.
 [6] Y. Chen, Y. Ning, M. Slawski, and H. Rangwala, "Asynchronous online federated learning for edge devices with non-IID data," in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 15–24.
 [7] Z. Chai, Y. Chen, L. Zhao, Y. Cheng, and H. Rangwala, "FedAT: A communication-efficient federated learning method with asynchronous tiers under non-IID data," arXiv preprint arXiv:2010.05958, 2020.
 [8] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, "Federated learning: A signal processing perspective," arXiv preprint arXiv:2103.17150, 2021.
 [9] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Trans. on Commun., vol. 68, no. 1, pp. 317–333, 2020.
 [10] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, "Convergence of update aware device scheduling for federated learning at the wireless edge," IEEE Trans. on Wireless Commun., 2021.
 [11] D. Liu, G. Zhu, J. Zhang, and K. Huang, "Data-importance aware user scheduling for communication-efficient edge machine learning," IEEE Trans. on Cognitive Communications and Networking, 2020.
 [12] M. Salehi and E. Hossain, "Federated learning in unreliable and resource-constrained cellular wireless networks," arXiv preprint arXiv:2012.05137, 2020.
 [13] H. H. Yang, A. Arafa, T. Q. S. Quek, and H. V. Poor, "Age-based scheduling policy for federated learning in mobile edge networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8743–8747.
 [14] H.-S. Lee and J.-W. Lee, "Adaptive transmission scheduling in wireless networks for asynchronous federated learning," arXiv preprint arXiv:2103.01422, 2021.
 [15] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
 [16] A. Kosta, N. Pappas, and V. Angelakis, "Age of information: A new concept, metric, and tool," Foundations and Trends in Networking, vol. 12, no. 3, pp. 162–259, 2017.
 [17] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," in International Conference on Learning Representations, 2020.