Device Scheduling and Update Aggregation Policies for Asynchronous Federated Learning

07/23/2021
by   Chung-Hsuan Hu, et al.
Linköping University

Federated Learning (FL) is a newly emerged decentralized machine learning (ML) framework that combines on-device local training with server-based model synchronization to train a centralized ML model over distributed nodes. In this paper, we propose an asynchronous FL framework with periodic aggregation to eliminate the straggler issue in FL systems. For the proposed model, we investigate several device scheduling and update aggregation policies and compare their performance when the devices have heterogeneous computation capabilities and training data distributions. From the simulation results, we conclude that the scheduling and aggregation design for asynchronous FL can be rather different from the synchronous case. For example, a norm-based significance-aware scheduling policy might not be efficient in an asynchronous FL setting, and an appropriate "age-aware" weighting design for the model aggregation can greatly improve the learning performance of such systems.

I Introduction

Federated learning (FL) is a decentralized machine learning (ML) framework with multiple devices collaboratively participating in a common training process over locally distributed data [1]. In contrast to centralized ML where the entire training data are stored in a central unit, in an FL system, the local training data can be kept private in each device without being uploaded to a cloud/server. Compared to prior distributed optimization frameworks that assume evenly distributed data, FL considers more practical settings where the devices might be massively distributed, with non-IID and unbalanced data.

One common problem in FL systems using the original Federated Averaging (FedAvg) algorithm proposed in [2] and its different variations is the straggler issue. This problem originates from the fact that, due to synchronized training and updating, the time duration of one communication round is strictly limited by the slowest participating device [3]. In a practical environment with heterogeneous devices in terms of their computation capabilities, the straggler issue has a significant impact on the completion time of an FL process. One possible solution to tackle this problem is to shift from the synchronous setting in FedAvg to asynchronous training and updating, to avoid waiting for straggling devices before update aggregation. Several deep learning algorithms with asynchronous FL [4, 5] have been studied in the literature. Moreover, various heuristic aggregation policies have been proposed to deal with the increased variation of local updates caused by the asynchronous structure [6, 7].

In a wireless FL system, the uploading of local updates takes place over the wireless uplink, which will be the part that is most affected by the scarcity of wireless resources. To reduce the communication load in this procedure, one solution is to allow only a fraction of participating devices to upload their local updates in each communication round. Device scheduling and resource allocation have therefore become an important design aspect for FL over wireless networks [8, 9]. In traditional cellular networks, the purpose of device scheduling is usually associated with maximizing spectral efficiency or network throughput. However, for distributed learning systems such as FL, the system objective is to optimize the parameters in the training model. Device scheduling for FL requires learning-oriented instead of rate-oriented design, which makes this problem fundamentally different from the existing solutions in conventional cellular networks. Intuitively, a device with a higher potential impact on the learning-related system performance should be given higher priority to be scheduled, and possibly a larger amount of communication resources to convey their information. Several existing works consider different metrics to indicate the significance of local updates, such as norm of the model updates [10], signal-to-noise-ratio and data uncertainty [11], success probability of update transmission [12], and Age-of-Update (AoU) [13]. However, all of them consider synchronous FL based on the original form of the FedAvg algorithm. Few existing works have considered device scheduling and resource allocation for asynchronous FL [14], which makes it a highly under-explored area.

The main purpose of this work is to answer the following questions:

  • For asynchronous FL with heterogeneous devices, what is the most appropriate scheduling policy given limited communication resources?

  • Under a certain scheduling policy, how should we design the update aggregation rule?

  • What are the fundamental differences between synchronous and asynchronous FL systems in terms of the joint design of scheduling and aggregation policy?

To answer these questions, we investigate several schemes for device scheduling and model update aggregation in asynchronous FL systems under device heterogeneity in their computation capacity and training data distribution. (Footnote 1: Analytical evaluation on convergence of the proposed schemes will be addressed in an extended version of this paper.) The performance of the proposed schemes is evaluated and compared based on a classification problem using the MNIST data set [15].

II System Model

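We consider an FL system with $N$ devices participating in training a shared global learning model, parameterized by a $d$-dimensional parameter vector $\boldsymbol{\theta} \in \mathbb{R}^d$. Denote $\mathcal{N} = \{1, \ldots, N\}$ as the set of device indices in the system. Each device $k \in \mathcal{N}$ holds a set of local training data $\mathcal{S}_k$ with size $|\mathcal{S}_k|$. Let $\mathcal{S} = \bigcup_{k \in \mathcal{N}} \mathcal{S}_k$ represent the entire data set in the system with size $|\mathcal{S}| = \sum_{k \in \mathcal{N}} |\mathcal{S}_k|$, where $\mathcal{S}_i \cap \mathcal{S}_j = \emptyset$, $\forall i \neq j$. The objective of the system is to find the optimal parameter vector $\boldsymbol{\theta}^*$ that minimizes an empirical loss function defined by

$$F(\boldsymbol{\theta}) = \frac{1}{|\mathcal{S}|} \sum_{\mathbf{x} \in \mathcal{S}} \ell(\boldsymbol{\theta}, \mathbf{x}), \tag{1}$$

where $\ell(\boldsymbol{\theta}, \mathbf{x})$ is the sample-wise loss function computed over the data sample $\mathbf{x}$. We define the local loss function at device $k$ as

$$F_k(\boldsymbol{\theta}) = \frac{1}{|\mathcal{S}_k|} \sum_{\mathbf{x} \in \mathcal{S}_k} \ell(\boldsymbol{\theta}, \mathbf{x}), \tag{2}$$

which is averaged over the local training data set. Then, we can rewrite $F(\boldsymbol{\theta})$ as

$$F(\boldsymbol{\theta}) = \sum_{k \in \mathcal{N}} \frac{|\mathcal{S}_k|}{|\mathcal{S}|} F_k(\boldsymbol{\theta}). \tag{3}$$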
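As a quick numerical illustration of the rewriting step above, the following sketch checks that the global empirical loss coincides with the data-size-weighted sum of the local losses, assuming disjoint local data sets. The scalar model, the squared-error loss, and the toy data partition are illustrative assumptions and are not taken from the paper.

```python
# Toy check: global empirical loss vs. data-size-weighted sum of local losses.
# The scalar model, squared-error loss, and partition are illustrative assumptions.

def sample_loss(theta, x):
    """Sample-wise loss for a scalar model: squared error between theta and x."""
    return (theta - x) ** 2

def local_loss(theta, data):
    """Local loss: sample-wise loss averaged over one device's data set."""
    return sum(sample_loss(theta, x) for x in data) / len(data)

def global_loss(theta, partitions):
    """Global empirical loss: sample-wise loss averaged over all samples."""
    all_samples = [x for data in partitions for x in data]
    return sum(sample_loss(theta, x) for x in all_samples) / len(all_samples)

def weighted_local_losses(theta, partitions):
    """Sum of local losses, each weighted by its share of the total data."""
    total = sum(len(data) for data in partitions)
    return sum(len(data) / total * local_loss(theta, data) for data in partitions)

# Three devices with unbalanced, disjoint local data sets.
partitions = [[1.0, 2.0], [3.0], [0.0, 4.0, 5.0]]
theta = 2.5
direct = global_loss(theta, partitions)
decomposed = weighted_local_losses(theta, partitions)
print(direct, decomposed)
```

The two values agree up to floating-point rounding, which is why the server can work with per-device quantities weighted by data proportion.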

II-A FedAvg with Synchronous Training and Aggregation

Fig. 1: The FL process and information exchange between the server and the participating devices.

In a typical FL system, all the devices participate in the global training process following a synchronized procedure. FedAvg [2] is widely considered a representative scheme with this synchronous structure of local training and global aggregation. In the FedAvg algorithm, the entire training process is divided into many global iterations (communication rounds), where during every global iteration, the server aggregates the received stochastic gradient updates from the participating devices, computed over their locally available data.

We consider a modified version of the FedAvg algorithm, where an extra step of device scheduling is added after local training, as illustrated in Fig. 1. The main motivation behind this consideration is to reduce the communication costs and delay, especially in a wireless network with limited data rates. In the -th global iteration with , the following steps are executed:

  1. The server broadcasts the current global model to the device set .

  2. Each device runs times of stochastic gradient descent (SGD) iteration and the update rule follows

    (4)

    with being the local iteration index, , representing the learning rate and being the gradient computed based on a randomly selected mini-batch . After completing the local training, is the local update from device .

  3. Due to limited wireless resources, only a subset of devices is eligible for uploading their local model updates to the server. A similar consideration can be found in [10] and [11]. Note that such update scheduling is not considered in FedAvg, i.e. .

  4. After receiving the local updates from the scheduled devices, the server aggregates the received information and updates the global model as

    (5)

    where the update from each non-scheduled device is , thus is omitted in (5).

This iterative procedure continues until the system converges.
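The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the model is a scalar, the local loss is a toy quadratic, devices are scheduled uniformly at random, and the aggregation weights are the data proportions among the scheduled devices. The function names and default parameter values are hypothetical.

```python
import random

def local_sgd(theta, data, num_steps, lr, batch_size, rng):
    """Run local SGD on the toy loss 0.5 * (theta - x)^2; return the model change."""
    local = theta
    for _ in range(num_steps):
        batch = rng.sample(data, min(batch_size, len(data)))
        grad = sum(local - x for x in batch) / len(batch)  # mini-batch gradient
        local -= lr * grad
    return local - theta  # local update relative to the received global model

def fedavg_round(theta, device_data, max_scheduled, rng,
                 num_steps=5, lr=0.1, batch_size=2):
    """One synchronous round with an extra scheduling step (assumed random)."""
    # Steps 1-2: broadcast the global model, then every device trains locally.
    updates = {k: local_sgd(theta, data, num_steps, lr, batch_size, rng)
               for k, data in device_data.items()}
    # Step 3: only a subset of devices may upload (random choice is an assumption).
    scheduled = rng.sample(sorted(updates), min(max_scheduled, len(updates)))
    # Step 4: aggregate the scheduled updates, weighted by data proportion
    # among the scheduled devices (assumed weighting).
    total = sum(len(device_data[k]) for k in scheduled)
    new_theta = theta + sum(len(device_data[k]) / total * updates[k]
                            for k in scheduled)
    return new_theta, scheduled

rng = random.Random(0)
device_data = {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0], 2: [6.0, 7.0, 8.0, 9.0]}
theta = 0.0
for _ in range(50):
    theta, scheduled = fedavg_round(theta, device_data, max_scheduled=2, rng=rng)
print(theta, scheduled)
```

Because every device must finish its local training before step 3, the duration of each round in this loop would be dictated by the slowest device, which is the straggler issue discussed in the introduction.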

II-B Asynchronous FL with Periodic Aggregation

In an FL system with synchronous training and updating, the update aggregation is feasible only after all the involved devices finish their local training (SGD computation) step. This implies that devices with inferior computation capability become stragglers and slow down the training process. To address this issue, asynchronous FL has been proposed in [5, 6], where it is shown to be effective in mitigating the straggler effect. However, fully asynchronous FL with sequential updating can face the problem of high communication costs caused by frequent model updating and transmission of local updates. To tackle the aforementioned concerns, we propose an asynchronous FL framework with periodic aggregation. The general idea is to allow asynchronous training at different devices, with the server periodically collecting updates from those devices that have completed their computation, while the rest continue their local training without being interrupted or dropped. Fig. 2 shows an example of the training and updating timeline of the original synchronous FL, fully asynchronous FL [5], and our proposed scheme.

Fig. 2: Illustration of the conceptual differences between synchronous FL, fully asynchronous FL in [5], and our proposed asynchronous FL with periodic aggregation. represents the model parameter in the -th global iteration.

We consider the scenario where the participating devices have different computation capabilities. After each device finishes its local training, it sends a signal to the server indicating its readiness for update reporting. After every time duration , the server schedules a subset of ready-to-update devices. The received local updates will be aggregated at the server by applying some weighted averaging rule. The updated global model will again be distributed to all the ready devices, which will then continue their local SGD steps based on the newly received global model. The key design questions in this particular asynchronous FL setup are twofold:

  1. At each aggregation time, the set of ready-to-update devices might be different. Given the communication resource constraint, how should we schedule the subset of available devices for update reporting?

  2. The updated local models from different devices might be obtained from different previously received global models, some more recent and some more outdated. How should we design an appropriate aggregation and weighting policy taking into account the freshness of model updates in this asynchronous setting?
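To make the timeline of periodic aggregation concrete, the sketch below tracks which devices are ready at each aggregation instant and how many aggregation periods have elapsed since each of them last received a global model. It is a simplified illustration: the local training duration of each device is assumed constant, age is counted in aggregation periods, and scheduling is left out, so every ready device receives the new global model and restarts training. The function name and device labels are hypothetical.

```python
def simulate_timeline(train_time, period, num_rounds):
    """Return (round, ready devices, ages) for each periodic aggregation instant.

    train_time[k] is the assumed constant local training duration of device k.
    Ages are counted in aggregation periods since the last received global model.
    """
    last_received = {k: 0 for k in train_time}  # round of last global model reception
    finish_at = dict(train_time)                # finish time of the current local training
    history = []
    for t in range(1, num_rounds + 1):
        now = t * period
        # Devices that completed local training by this aggregation instant.
        ready = sorted(k for k in train_time if finish_at[k] <= now)
        ages = {k: t - last_received[k] for k in ready}
        history.append((t, ready, ages))
        # Ready devices receive the new global model and restart training;
        # the others continue their ongoing local training uninterrupted.
        for k in ready:
            last_received[k] = t
            finish_at[k] = now + train_time[k]
    return history

history = simulate_timeline({"fast": 1.0, "mid": 2.5, "slow": 4.0},
                            period=1.0, num_rounds=4)
for t, ready, ages in history:
    print(t, ready, ages)
```

In this toy run the fast device reports in every period with age 1, while the slow device first reports in the fourth period with age 4, which illustrates both design questions: the ready set changes over time, and the reported updates differ in freshness.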

In the remainder of this paper, we will investigate several scheduling and aggregation policies for FL with asynchronous training and periodic data aggregation. We define as the set of all ready-to-update devices in the -th global iteration. Let be the set of scheduled devices, with , indicating up to devices are scheduled. (Footnote 2: We consider a simplified system model with limited communication budget and uniform gradient update precision for the scheduled devices. More realistic channel models and corresponding compression schemes will be further investigated in extended studies.) For any device , its local update is computed according to (4), with the initial model parameter vector in the first local iteration being

(6)

where

(7)

indicates the latest global iteration index at which device has received an updated global model. We adopt a regularization technique proposed in [7] to alleviate potential model imbalance caused by asynchronous training. In (4), is computed based on the following regularized local loss

(8)

where is introduced in (1) and is the regularization coefficient. Inspired by the concept of Age of Information (AoI) [16], we define a metric "Age of Local Update" (ALU) as

(9)

which shows the elapsed time since the last reception of an updated global model. (Footnote 3: Note that this age-based definition is different from the Age of Update (AoU) proposed in [13], which measures the elapsed time at each device since its last participation in model aggregation.) After receiving the scheduled updates, the model aggregation at the server is conducted as

(10)

with . Here, each weight coefficient can be related to the training data size , or the ALU , or the combination of both. Two different weighting designs for non-scheduled devices may be considered, depending on the assumption on the statistical distribution of local updates: for any non-scheduled device :

  1. . This design inherently assumes that all devices in have identical training data distribution such that averaging over is statistically equal to averaging over . Note that this approach has been considered in [10], [17].

  2. . This design considers the non-updated local models from the set of non-scheduled devices in the aggregation step, as in [2].
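The difference between the two weighting designs can be illustrated with a small sketch. It assumes data-proportion weights and models the second design by letting each non-scheduled device keep its data share while contributing a zero update; both are assumptions made for illustration, and the function and argument names are hypothetical.

```python
def aggregate(global_model, updates, data_sizes, scheduled, include_non_scheduled):
    """Aggregate scheduled local updates into a new global model.

    global_model: list of floats.
    updates: dict mapping each scheduled device to its local update (list of floats).
    data_sizes: dict mapping every device to its local data size.
    include_non_scheduled: if False, non-scheduled devices get zero weight and the
        weights are normalized over the scheduled devices only (first design);
        if True, weights are normalized over all devices and non-scheduled devices
        are assumed to contribute a zero update (second design, as assumed here).
    """
    if include_non_scheduled:
        denom = sum(data_sizes.values())
    else:
        denom = sum(data_sizes[k] for k in scheduled)
    new_model = list(global_model)
    for k in scheduled:
        weight = data_sizes[k] / denom
        for i, delta in enumerate(updates[k]):
            new_model[i] += weight * delta
    return new_model

global_model = [0.0, 0.0]
updates = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
data_sizes = {"a": 1, "b": 1, "c": 2}  # device "c" is not scheduled
design_1 = aggregate(global_model, updates, data_sizes, ["a", "b"], False)
design_2 = aggregate(global_model, updates, data_sizes, ["a", "b"], True)
print(design_1, design_2)
```

Under these assumptions the second design takes a smaller step, since the non-scheduled device retains half of the total weight without contributing any change.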

III Scheduling and Aggregation Policies for Asynchronous FL

As mentioned in Section II, in every global iteration, only a subset of all available devices will be scheduled for uploading their model updates. The local updates of the scheduled devices might have different levels of data freshness, since their local gradients are computed based on different previously received global models. Moreover, as a result of device scheduling and asynchronous training, the aggregation-involving rate might differ across devices, which leads to unbalanced contributions to the global model. These issues suggest a joint consideration of device scheduling and aggregation policy to leverage data freshness and significance in an asynchronous FL setting.

III-A Device Scheduling Policies

We consider three different scheduling policies, namely random, significance-based and frequency-based scheduling, which are described as follows.

III-A1 Random Scheduling

We select devices randomly from the set of ready-to-update devices without replacement. This often serves as a baseline policy in the literature of FL.

III-A2 Significance-based Scheduling

To prioritize devices with more influential updates, we first sort all available devices by the norm of their gradient updates,

(11)

and then select those with the largest values. Note that this approach requires the quantity in (11) to be shared with the server as side information.

III-A3 Frequency-based Scheduling

To maintain the balance among the devices in terms of their contribution to the global model, this policy assigns higher preference to devices with lower aggregation-involving rate during previous communication rounds. To explain the idea, we define a counting metric

(12)

which characterizes how many times a device has been previously scheduled for uploading its local updates. After sorting all available devices by this metric, those with the smallest values are selected. If multiple devices have the same counting metric value, ties are broken by random selection.
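The three scheduling policies described above can be sketched as follows. The function names, the dictionary-based bookkeeping, and the use of the Euclidean norm for the significance metric are assumptions made for illustration.

```python
import math
import random

def l2_norm(vec):
    """Euclidean norm of an update vector (assumed significance metric)."""
    return math.sqrt(sum(v * v for v in vec))

def random_scheduling(ready, max_devices, rng):
    """Pick up to max_devices ready devices uniformly at random, without replacement."""
    return rng.sample(sorted(ready), min(max_devices, len(ready)))

def significance_scheduling(update_norms, max_devices):
    """Pick the ready devices whose updates have the largest norms."""
    return sorted(update_norms, key=update_norms.get, reverse=True)[:max_devices]

def frequency_scheduling(ready, counts, max_devices, rng):
    """Pick the ready devices that have been scheduled least often so far."""
    shuffled = sorted(ready)
    rng.shuffle(shuffled)  # random order first, so the stable sort breaks ties randomly
    return sorted(shuffled, key=lambda k: counts.get(k, 0))[:max_devices]

rng = random.Random(1)
updates = {"a": [3.0, 4.0], "b": [0.1, 0.1], "c": [1.0, 0.0]}
norms = {k: l2_norm(v) for k, v in updates.items()}
counts = {"a": 2, "b": 0, "c": 1}  # how often each device was scheduled before

by_random = random_scheduling(updates.keys(), 2, rng)
by_significance = significance_scheduling(norms, 2)
by_frequency = frequency_scheduling(updates.keys(), counts, 2, rng)
print(by_random, by_significance, by_frequency)
```

In this toy example the significance-based policy selects devices "a" and "c", whereas the frequency-based policy selects "b" and "c", showing that the two criteria can favor different devices from the same ready set.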

III-B Update Aggregation Policies

To address the model aggregation with asynchronous updates and various data proportions, we consider two types of aggregation policies, namely equal weight and age-aware aggregation.

III-B1 Equal Weight

With this policy, the model updates from all selected devices are aggregated with uniform weighting, which means that the weights are assigned purely based on their data proportion, regardless of their last received global model. Hence, the aggregation weight of device in the -th global iteration is determined by

(13)

where or , depending on the assumption on the data distributions of the non-scheduled devices, as discussed in Section II-B. Note that this is often considered as a baseline aggregation rule for FL.

III-B2 Age-aware Aggregation

We consider two options for the age-aware aggregation policy. The first option is to assign a higher weight to the older local updates, i.e., those with larger ALU. This choice might balance the participation rate among different devices and potentially reduce the risk of the model training being excessively biased toward devices with stronger computation capacity. However, it also raises the concern of applying outdated updates to an already evolved global model. In contrast, the second option is to assign higher weights to those with smaller ALU. By favoring fresher local updates, it might help the global model to converge smoothly with time, at the risk of converging to an imbalanced model, especially in the scenario with non-IID data distribution.

The age-aware weighting design is given by

(14)

where or . Here, is a real-valued constant factor, which can be divided into two cases:

  • , the system favors older local updates.

  • , the system favors fresher local updates.
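One way to realize such a weighting is sketched below. The functional form, in which the data size is multiplied by a constant factor raised to the power of the age and the result is normalized over the scheduled devices, is an assumption chosen for illustration and may differ from the design in (14). The function name and argument names are hypothetical.

```python
def age_aware_weights(data_sizes, ages, factor):
    """Normalized aggregation weights for the scheduled devices.

    Assumed form: weight proportional to data_size * factor ** age.
    factor > 1 favors older updates, factor < 1 favors fresher updates,
    and factor == 1 reduces to weighting by data proportion only.
    """
    raw = {k: data_sizes[k] * factor ** ages[k] for k in data_sizes}
    total = sum(raw.values())
    return {k: value / total for k, value in raw.items()}

data_sizes = {"old": 10, "fresh": 10}
ages = {"old": 3, "fresh": 1}

favor_old = age_aware_weights(data_sizes, ages, factor=2.0)
favor_fresh = age_aware_weights(data_sizes, ages, factor=0.5)
baseline = age_aware_weights(data_sizes, ages, factor=1.0)
print(favor_old, favor_fresh, baseline)
```

With equal data sizes, a factor of 2 gives the older update four times the weight of the fresher one, a factor of 0.5 reverses the ratio, and a factor of 1 recovers the equal-weight baseline.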

IV Simulation Results

In this section, we evaluate the performance of the combination of different scheduling and aggregation policies using the MNIST training data set for the hand-written digit classification problem [15]. The data set has samples that are allocated evenly to devices. The general system setting is described as follows.

  • The local training duration of each device in every global iteration, denoted by , is drawn from a uniform distribution , where represents the longest computing time due to the straggler issue.

  • For the asynchronous FL scheme, we choose as the aggregation period.

  • In the case with IID data distribution, each device possesses an equal amount of disjoint samples randomly picked from . Under the non-IID setting, the data allocation is determined by using the same method as in [2]. Under this setting, each device contains data samples of at most two different digits.

  • In every global iteration, up to devices are scheduled for uploading their local model updates.

  • The learning rate is and , .

  • The regularization coefficient is .

In the aggregation process, we consider the case with for , meaning that the updated global model is purely computed based on the received updates from the scheduled devices, as explained in Section II-B. The implementation of FedAvg is also modified accordingly with the same weighting design.

Figs. 3 and 4 show the test accuracy comparison between different scheduling and aggregation policies for asynchronous FL under IID and non-IID data distributions, respectively. The performance of FedAvg is also presented as a baseline scheme. Since the aggregation period in the asynchronous FL case is , the model aggregation in asynchronous FL is four times more frequent than in the case with FedAvg. The abbreviations in the legends are summarized in Table I.

From the simulation results, we first observe that the asynchronous FL scheme generally outperforms FedAvg in both IID and non-IID data scenarios. However, this advantage comes at the cost of more frequent local update uploading and aggregation, which leads to higher communication costs. Besides, among the considered scheduling and aggregation policies for asynchronous FL, the combination of random scheduling and age-aware aggregation favoring fresher local updates shows superior performance compared to the others. We have also observed that in other data distribution scenarios, especially when slower devices possess some unique training data, favoring older updates sometimes performs better. In particular, we observe that significance-based scheduling leads to fluctuating test performance due to the increased variation in the model aggregation. This suggests that for asynchronous FL, norm-based significance-aware update scheduling might not be an appropriate option. Another observation is that, among all the scheduling policies, the frequency-based scheme generally has the worst performance, which shows that imposing an equal participation rate among the devices is not an efficient choice for asynchronous FL with heterogeneous devices.

In summary, we conclude that the joint design of scheduling and aggregation for asynchronous FL requires different considerations than the synchronous FL case. Furthermore, our proposed asynchronous FL scheme with periodic aggregation provides an efficient and flexible structure to resolve the straggler issue in synchronous FL systems.

Scheduling Policy     Legend    Aggregation Policy                     Legend
random                rdm       age-aware, favoring older updates      fOld
significance-based    sgnfc     age-aware, favoring fresher updates    fFresh
frequency-based       freq

TABLE I: Legend description in Figs. 3 and 4.

Fig. 3: Test accuracy under IID data distribution.

Fig. 4: Test accuracy under non-IID data distribution.

V Conclusions

In this work, we proposed an FL framework with asynchronous local training and periodic update aggregation. Specifically, we considered an asynchronous FL system over a resource-limited network where only a fraction of devices are allowed to upload their local model updates to the server in every communication round. Several device scheduling and update aggregation policies were investigated and compared through simulations. We observed that, somewhat surprisingly, random scheduling performs better than the alternative options for our proposed asynchronous FL scheme, especially under non-IID data distribution. Due to different levels of data freshness caused by asynchronous training, an appropriate age-aware model aggregation design can also greatly affect the system performance.

References