Aggregation Delayed Federated Learning

by Ye Xue, et al.
Northwestern University

Federated learning is a distributed machine learning paradigm in which multiple data owners (clients) collaboratively train a machine learning model while keeping data on their own devices. The heterogeneity of client datasets is one of the most important challenges for federated learning algorithms. Studies have found that standard federated algorithms, such as FedAvg, suffer performance degradation on non-IID data. Many existing works on handling non-IID data adopt the same aggregation framework as FedAvg and focus on improving model updates either on the server side or on clients. In this work, we tackle this challenge from a different perspective by introducing redistribution rounds that delay the aggregation. We perform experiments on multiple tasks and show that the proposed framework significantly improves the performance on non-IID data.






1. Introduction

As the amount of data generated by mobile devices increases explosively, accompanied by growing privacy concerns over user data, researchers have started seeking a solution to the dilemma of utilizing a large volume of user data while preserving user privacy. Federated learning is a machine learning paradigm that provides a solution to this dilemma. Under the coordination of a central server, a model is trained collaboratively by clients. To update the model, the server collects only a minimal amount of necessary information from clients, but not their data (McMahan et al., 2017). Federated learning has been drawing increasing interest in recent years and has been applied in many on-device prediction tasks (Hard et al., 2019; Ramaswamy et al., 2019; Bakopoulou et al., 2021; Chen et al., 2019). The privacy promise of federated learning also makes it an appealing choice in healthcare applications (Silva et al., 2019; Liu et al., 2019; Huang et al., 2019, 2020).

In federated learning, a global model is trained collaboratively on clients coordinated by a central server. Each round of training typically consists of four phases: an aggregation phase on the server, a local training phase on clients, and two communication phases (server-to-client and client-to-server). The whole training process starts with a global model initialized on the server side. In the server-to-client communication phase, a group of active clients is selected as training clients based on a certain policy, and the model is sent to them. Then, each client trains the model by calculating updates based on its own data stored on the local device. Stochastic gradient descent (SGD) is typically used to update local models during the local training phase. In the client-to-server communication phase, clients send their updated models back to the server, which then aggregates the local models into a new global model in the aggregation phase.
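The four phases above can be sketched in a few lines of Python. This is an illustrative simulation only, with models represented as plain dicts of parameters and `local_update` standing in for client-side SGD; none of these names come from the paper's implementation.

```python
import copy
import random

def run_round(global_model, clients, frac, local_update):
    """One FedAvg-style round: select clients, broadcast, train locally, aggregate.

    `global_model` is a dict of parameters; each client is a dict with a "data"
    list. `local_update(model, data)` is a stand-in for local SGD.
    """
    m = max(int(frac * len(clients)), 1)
    selected = random.sample(clients, m)                 # server-side client selection
    local_models = [local_update(copy.deepcopy(global_model), c["data"])
                    for c in selected]                   # local training phase
    total = sum(len(c["data"]) for c in selected)
    # aggregation phase: size-weighted average of parameters
    agg = {}
    for key in global_model:
        agg[key] = sum(len(c["data"]) / total * w[key]
                       for c, w in zip(selected, local_models))
    return agg
```

In a real system the broadcast and upload steps are network calls; here they are just the function arguments and return values.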

Different from traditional distributed learning, in federated learning, client’s raw data are never collected by the central server. Instead, clients only send the updated model parameters to the server. In order to prevent potential information leakage from the local updated models, many privacy-preserving techniques, such as differential privacy, have been studied and applied to ensure privacy of model parameters (Geyer et al., 2018; Liu et al., 2020; Ghosh et al., 2019).

Since the server has no control over the clients' data, several challenges arise. First, data on a particular client are generated by a particular user; therefore, client data are most likely not distributed in a balanced and IID manner, which is a usual assumption in distributed learning. Second, the number of clients can be much larger than the number of samples on each client, which aggravates the difficulty of aggregating non-IID clients. Third, clients are not always able to participate in training, as user devices can frequently be offline or slow to communicate. These challenges demand new methods beyond existing algorithms designed for traditional distributed learning.

Our work focuses on mitigating the impact of non-IID client data distributions. Many existing works (Zhao et al., 2018; Li et al., 2019; Xie et al., 2020; Li et al., 2020; Sattler et al., 2019) adopt the FedAvg (McMahan et al., 2017) framework and apply various strategies to handle non-IID data. In this work, we propose a new framework of federated learning with delayed aggregations: we delay the aggregation of local models on the server by redistributing local models to clients multiple times. Compared with several state-of-the-art federated learning algorithms that handle non-IID data distributions, our framework demonstrates a good ability to mitigate the impact of non-IID data distributions and yields the best performance on multiple datasets. We also propose an algorithm that selects clients using importance sampling, which further improves performance when incorporated. We implement our framework in Ray (Moritz et al., 2018) and make the code publicly available. Our algorithm outperforms the best benchmark algorithm by 1.56% on average across 9 non-IID datasets, and on multiple datasets with a more challenging task it demonstrates a substantial improvement over the best comparison algorithm.

In order to better evaluate our algorithm's performance under non-IID settings, we propose a new method to generate non-IID data. Different from many existing sampling methods that focus only on sampling non-IID class distributions, our method can sample non-IID client sizes, class distributions and even feature distributions. Most importantly, it allows us to control the non-IID level of each of the three attributes separately. With this method, we can simulate different and specific non-IID settings.

Along the way, we study the impact of localized and global data standardization. In global standardization, clients receive the mean and standard deviation of the global data distribution from the server and standardize local data using these statistics. Localized standardization is a procedure where each client standardizes its local data using its own statistics. Although global standardization is commonly used in federated learning studies, localized standardization is the only realistic choice when global statistics are not available. As expected, we observe a performance regression in federated algorithms when clients perform localized instead of global standardization. The proposed algorithms are robust in localized standardization scenarios, where we observe a larger improvement over the comparison models than under global standardization.
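The two standardization procedures differ only in whose statistics are used. A minimal NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

def global_standardize(local_X, global_mean, global_std):
    # server shares global statistics; every client uses the same ones
    return (local_X - global_mean) / global_std

def localized_standardize(local_X):
    # each client uses only its own statistics; no global information needed
    mu = local_X.mean(axis=0)
    sigma = local_X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (local_X - mu) / sigma
```

With localized standardization, the same raw value can map to different standardized values on different clients, which is one source of the performance regression discussed above.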

In Section 2, we discuss related work. Proposed algorithms are described in Section 3. The experimental setup, including dataset collection and generation, is described in Section 4. Section 5 discusses the computational results and the conclusions are drawn in Section 6.

2. Related Work

Much work has been done to tackle the aforementioned challenges. Improving communication efficiency (Konečnỳ et al., 2017; McMahan et al., 2017; Luping et al., 2019) is one of the most important topics in federated learning, as client devices are usually on slow and expensive connections. Performing sketched updates is a popular strategy. Konečnỳ et al. (Konečnỳ et al., 2017) applied quantization and subsampling to compress the model update before sending it back to the server. Wang et al. (Luping et al., 2019) reduce communication by avoiding irrelevant updates from clients: each client determines whether its update is relevant enough by checking whether its local update aligns with the global tendency.

Despite the great success of FedAvg, researchers showed that its performance drops significantly when local data are non-IID (Zhao et al., 2018). Zhao et al. also proposed a strategy to mitigate non-IID data by sharing a subset of data between clients; the idea is to make the training data more IID through sharing. Many studies focus on handling the non-IID issue in this direction (Jeong et al., 2018; Yoshida et al., 2020; Huang et al., 2020). Instead of sharing raw data, a generative adversarial network (GAN) was trained in (Jeong et al., 2018) to reproduce client data, which preserves privacy as no real client data are shared.

Another category of studies improving federated learning on non-IID data adds constraints when updating the model, either on clients or on the server side. Sahu et al. proposed FedProx (Li et al., 2020), which modifies the loss function on the client side by adding a penalty on the weight difference between the local model and the global model. Xie et al. added a similar penalty in their asynchronous federated learning algorithm (Xie et al., 2020). Sattler et al. proposed a communication-efficient federated learning framework that reduces communication costs by applying Top-k sparsification (Sattler et al., 2019). The sparsification restricts changes to a small subset of the model's parameters and is shown to suffer the least from non-IID data among existing model compression methods.

On the server side, Li et al. (Li et al., 2019) applied momentum uniformly to the gradients of all clients to stabilize the training process under a non-IID scenario. However, collecting gradients from clients might require more frequent communication than collecting models. Xie et al. (Xie et al., 2020) proposed updating the global model by weighted averaging between new local updates and the old global model. Reddi et al. (Reddi et al., 2021) proposed adaptive federated learning algorithms, which treat the difference between a client's local update and the global model as a pseudo-gradient and apply adaptive gradient descent algorithms to update the global model.

Our work focuses on handling non-IID data. Similar to (Li et al., 2020; Reddi et al., 2021), we modify the FedAvg algorithm to make it more robust on non-IID data. Different from existing works, we change the aggregation logic by introducing redistribution rounds which delay the aggregation. We also improve the client sampling process by incorporating the idea of importance sampling.

3. Methodology

One of the most common approaches to solving the optimization problem in federated learning is FedAvg (McMahan et al., 2017). In each training round, the server sends the global model to a subset of randomly selected clients. The clients update their local models using SGD on their own data in parallel and send the updated models back to the server. The server then updates the global model by averaging the local updates. Consider a subset S_t of training clients; the aggregation at the t-th round is written as

w_{t+1} = Σ_{k ∈ S_t} (n_k / n) · w_{t+1}^k,

where w_{t+1}^k is the updated model on client k, n_k is the size of client k, and n = Σ_{k ∈ S_t} n_k is the total size of the participating clients. When data are identically distributed across clients, this aggregation works well, since each local model is trained on a subset of data that is representative of the global distribution; it is akin to updating the global model in a centralized way. In non-IID cases, however, client data can be highly skewed, and it may be harmful to average model weights trained on a highly skewed client with those from less skewed ones. The weighted averaging makes the aggregation even worse if a highly skewed client has a large number of samples, since client size is taken into account and a larger client has a bigger impact on the aggregate.
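In code, the size-weighted aggregation and its sensitivity to a large skewed client look as follows (a toy NumPy sketch, not the paper's implementation):

```python
import numpy as np

def fedavg_aggregate(local_weights, client_sizes):
    """Size-weighted FedAvg aggregation: w = sum_k (n_k / n) * w_k."""
    n = sum(client_sizes)
    return sum((n_k / n) * w_k for n_k, w_k in zip(client_sizes, local_weights))

# A large, skewed client dominates the average:
w_small, w_large = np.array([0.0]), np.array([10.0])
dominated = fedavg_aggregate([w_small, w_large], [10, 990])  # -> [9.9]
```

With sizes 10 and 990, the large client contributes 99% of the aggregate, which illustrates why skew plus size weighting can hurt.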

3.1. Delayed Aggregation

In order to make this aggregation work better in the non-IID setting, we have to answer the question: can each local model be trained on data that are representative of the global distribution at the time of aggregation?

Input: N clients participate in training; C is the fraction of clients participating in each training round; T is the number of training iterations and R is the number of redistributing iterations.
Server executes:
initialize global model w_0 and let m ← max(⌊C · N⌋, 1)
for each round t = 1, 2, …, T do
       w_k ← w_{t−1}, for k = 1, …, m
       for each redistributing iteration r = 1, 2, …, R do
             S ← uniformly sample m training clients
             for k = 1, …, m in parallel do
                   w_k ← ClientUpdate(S_k, w_k)
             end for
       end for
       w_t ← (1/m) · Σ_{k=1}^{m} w_k
end for
Algorithm 1 RADFed

One of the core promises federated learning makes is that no client data are collected by the server, so we cannot make the data on each client representative of the global distribution by rearranging client data. However, we can rearrange local models. If we trained a model on all clients one by one, we would end up with a model trained on all the data. First, this would be similar to standard epoch-based training and thus very slow. Second, it would assume that each client is active when needed. Alternatively, we can select only a subset of clients to perform this strategy. Because each client's data can be skewed, the model may be trained on consecutive skewed mini-batches and thus might not be as good as one trained in the centralized fashion, where data can be properly shuffled. Despite this, if each local model is trained on the same data points by the time of aggregation, regardless of the order of samples, we can still expect a much more reliable aggregated model compared with the case where each local model is trained only on the data of a single client.

Following this idea, we propose the Randomized Aggregation Delayed Federated learning algorithm (RADFed). We delay the aggregation by adding inner training rounds to FedAvg. As shown in Algorithm 1, in the inner rounds the server randomly sends local models back to clients without performing aggregation; the server aggregates local models only at the end of the inner rounds. We call these inner training rounds the redistributing rounds. ClientUpdate(k, w) trains the model on client k with initial weights w.

With enough redistributions, all local models are expected to be trained on a similar number of samples. Therefore, we remove the sample-size factor during aggregation and perform plain averaging over local models with equal weights. Because of this, the algorithm has another appealing privacy-preserving property: clients do not have to expose the size of their data. In many cases, data size can itself be sensitive information, and exposing it may cause privacy leakage. For example, a heavier user of a health-tracking app is more likely to have a health problem.

In practice, it is possible that the number of active clients is different in each round. To apply our framework, a small subset of active clients can be selected to make sure the number of active clients is the same across redistributing rounds. In some extreme cases where too few clients are active during redistribution, there are multiple strategies to make the framework work, e.g., reducing the number of redistributing models accordingly, or counting the number of times a local model has been redistributed and scheduling the redistributing process to make sure local models are redistributed a similar number of times before aggregation.
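Putting Section 3.1 together, a simplified simulation of one outer RADFed round might look like this (models as parameter dicts, `local_update` standing in for ClientUpdate; all names are ours, and the active-client bookkeeping discussed above is omitted):

```python
import copy
import random

def radfed_round(global_model, clients, m, R, local_update):
    """One outer RADFed round (a sketch of Algorithm 1, not the released code).

    m local models are redistributed to freshly sampled clients for R inner
    rounds before a single, equally weighted aggregation.
    """
    local_models = [copy.deepcopy(global_model) for _ in range(m)]
    for _ in range(R):                        # redistributing rounds: no aggregation
        selected = random.sample(clients, m)  # each model hops to a new client
        local_models = [local_update(w, c["data"])
                        for w, c in zip(local_models, selected)]
    # delayed aggregation: plain average with equal weights, no size factor
    return {k: sum(w[k] for w in local_models) / m for k in global_model}
```

Note that the aggregation divides by m rather than weighting by client sizes, matching the plain-averaging argument above.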

3.2. Importance Sampling

Not all samples are equally important, and neither are all clients, especially in federated learning where client data are usually non-IID. If data are not identically distributed across clients, why should we select training clients through simple uniform random sampling? We hypothesize that focusing computation on good clients can help improve federated learning algorithms. Inspired by (Katharopoulos and Fleuret, 2018), we propose RADFed-IS, which incorporates the idea of importance sampling into our aggregation-delayed framework. In (Katharopoulos and Fleuret, 2018), it is established that the optimal sampling probability is proportional to the square of the norm of the gradients.

The idea of importance sampling is to find a good mini-batch to train the model on in the next training step. A straightforward way to adopt this idea in our framework is to score the importance of all clients with respect to the current global model right after each aggregation and select the next set of participating clients based on this score. However, collecting scores from all clients is usually not feasible in federated learning under the assumption that clients are not always active. Besides, it may largely increase the training time by adding an extra communication round after each aggregation.

Instead, we score each client during its local training. After local training, each client calculates the average squared gradient norm over all mini-batches as its importance score and sends it back to the server along with the updated local model. The advantage of this strategy is that it adds almost no communication overhead; compared with the model itself, the size of an importance score is negligible. However, a score calculated this way is no longer a good indicator of the importance of the client's data to the global model, as each score is associated with a particular local model. In addition, because of redistribution, a local model is unlikely to be trained on the same client in the next round. Therefore, selecting clients based directly on this score might not be a good idea.

In order to solve this issue, we accumulate the importance scores for each client by averaging the scores calculated on all local models that have been trained on its local data. We expect that the accumulated score of a client becomes a good indicator of the importance of this client’s data to all local models after accumulating over multiple rounds.

The server accumulates the importance score I_k of client k by taking a weighted average between the old score and the new one as

I_k ← λ · I_k + (1 − λ) · I_k^{new},

with a mixing hyper-parameter λ ∈ [0, 1]. The detailed algorithm is shown in Algorithm 2.
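A minimal sketch of the score update and the induced sampling probabilities. The exact mixing form and the symbol λ are our reading of the text, and normalizing the accumulated scores into sampling probabilities (`sampling_probs`) is our assumption:

```python
def update_importance(old_score, new_score, lam=0.9):
    """Exponential moving average of a client's importance score.

    lam plays the role of the mixing hyper-parameter (0.8-0.9 in Table 3);
    the precise update rule is an assumption, not the paper's exact formula.
    """
    return lam * old_score + (1.0 - lam) * new_score

def sampling_probs(scores):
    """Turn accumulated scores into client-selection probabilities (assumed)."""
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}
```

The server would call `update_importance` whenever a score arrives with a local model, and `sampling_probs` when sampling the next set of clients.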

Server executes:
initialize importance score I_k for each training client k
for each round t = 1, 2, …, T do
       w_k ← w_{t−1}, for k = 1, …, m
       for each redistributing iteration r = 1, 2, …, R do
             S ← m clients sampled with probabilities p_k ∝ I_k
             for k = 1, …, m in parallel do
                   w_k, s_k ← ClientUpdate(S_k, w_k)
                   I_{S_k} ← λ · I_{S_k} + (1 − λ) · s_k
             end for
       end for
       w_t ← (1/m) · Σ_{k=1}^{m} w_k
end for
ClientUpdate(k, w):
       update local model w on the local data of client k
       s ← average squared gradient norm over the local mini-batches
       return w, s to the server
Algorithm 2 Importance Sampling in Federated Learning

4. Experimental Setup

In this work, we focus on evaluating the performance of federated learning algorithms in non-IID settings. Although a real-world non-IID dataset is ideal, datasets with an artificial partition are also very helpful for simulating different non-IID settings. Many studies create heterogeneous clients by manually sampling data on clients so that the class distribution differs across clients. In existing sampling methods, the sizes of clients are usually determined by the class sampling. To the best of our knowledge, feature imbalance has not been considered in prior works.

In order to simulate non-IID settings with more control over the distributions of sizes, classes and features, we propose a sampling method where each is sampled independently with its own Dirichlet prior. It is not always possible to draw a desired number of samples satisfying all of these independently sampled distributions at the same time. Consider sampling non-IID sample sizes and classes as an example. A sampling solution for K clients and C classes is a K × C matrix where each entry denotes the number of samples of a given class on a given client. By sampling sizes and classes separately, we effectively specify every entry of this matrix, yet the per-class sample counts of the dataset leave fewer degrees of freedom than that, so the independently sampled targets are generally infeasible. Therefore, we propose a Quadratic Programming (QP) method to find a random feasible sampling solution.
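As a sketch of why a QP is needed: sampling client sizes and per-client class distributions from independent Dirichlet priors yields a target matrix whose column sums generally disagree with the dataset's actual class counts. The function below is illustrative only (names and parameters are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_targets(n_total, n_clients, n_classes, alpha_size, alpha_class):
    """Draw target client sizes and per-client class distributions independently.

    Smaller alpha values yield more heterogeneity. The returned targets are
    real-valued and typically violate the dataset's per-class counts, which
    is what the QP in the text repairs.
    """
    sizes = rng.dirichlet([alpha_size] * n_clients) * n_total          # target client sizes
    class_dists = rng.dirichlet([alpha_class] * n_classes, n_clients)  # one distribution per client
    return sizes[:, None] * class_dists  # desired samples per (client, class)
```

The row sums of the result match the sampled client sizes by construction, but nothing forces the column sums to match the class counts actually present in the dataset.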

4.1. Partitioning of Heterogeneous Data

4.1.1. Non-IID over classes and sizes

Let n_c be the number of samples of class c and n be the total number of samples; clearly, Σ_c n_c = n. Let s_k be the target size of client k, with Σ_k s_k = n, and p_k = (p_{k,1}, …, p_{k,C}) be the class distribution of client k. Let x_{k,c} be the number of samples of class c on client k. We want x_{k,c} = s_k · p_{k,c}, given s_k and p_{k,c}. However, the dataset requires Σ_k x_{k,c} = n_c and Σ_c x_{k,c} = s_k, which might not hold for the independently sampled targets. Therefore, we find a feasible solution for x by solving

(1)   minimize Σ_{k,c} (x_{k,c} − s_k · p_{k,c})²

subject to:

      Σ_k x_{k,c} = n_c for all c,   Σ_c x_{k,c} = s_k for all k,   x_{k,c} ≥ 0,

which is a convex QP.
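For intuition, if we keep only the per-class equality constraints and drop the client-size and nonnegativity constraints, the projection of the targets onto the feasible set has a closed form: each column is shifted uniformly so that its sum matches the dataset's class count. This is a simplified sketch of the idea, not the paper's full QP:

```python
import numpy as np

def feasible_counts(targets, class_totals):
    """Project sampled targets onto the per-class count constraints.

    Minimizes ||x - t||^2 subject to column sums equaling `class_totals`
    (a relaxation of the full QP: the client-size and x >= 0 constraints
    are omitted for brevity). The minimizer shifts each column uniformly.
    """
    K = targets.shape[0]  # number of clients
    shift = (class_totals - targets.sum(axis=0)) / K
    return targets + shift[None, :]
```

With both families of equality constraints and nonnegativity, a general-purpose convex QP solver would be needed instead.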

4.1.2. Non-IID over features, classes and sizes

Using the same idea as for sampling classes and sizes, we also sample categorical features in a non-IID manner. For each feature j with m_j categories, we sample a category distribution q_{k,j} ~ Dir(β) on client k. We treat the class as one more feature that is sampled separately. Let F be the set of all possible combinations of categories in the dataset and n_f be the number of samples that fall into configuration f ∈ F; the first element of f corresponds to the class. Now let y_{k,f} be the number of samples on client k with configuration f. We then find a feasible solution through

(2)   minimize Σ_{k,f} (y_{k,f} − t_{k,f})²

subject to:

      Σ_k y_{k,f} = n_f for all f,   Σ_f y_{k,f} = s_k for all k,   y_{k,f} ≥ 0,

where the target t_{k,f} combines the sampled client size with the independently sampled class and feature distributions. If a feature is non-categorical by nature, we can create buckets that correspond to categories.

4.1.3. A random solution

The above QPs may have many optimal solutions but we want a random one. We generate a random solution by modifying values at the “4 vertices of a random rectangle,” in a way that the modified values still satisfy our constraints in (1) or (2), see details in Algorithm 3. A step size is used to control the modification. The algorithm has two phases. In the first phase, we find a suboptimal solution by randomly modifying values. Then, in the second phase, starting from the suboptimal solution, we continue modifying and record the best random solution we find.

Input: a feasible solution X from the QP
for i = 1, 2, …, B do
       X ← RandomizeSolution(X) // a burn-in period
end for
for i = 1, 2, …, M do
       X ← RandomizeSolution(X)
       calculate the loss from (1) or (2)
       if the loss is the lowest seen so far then record X as the current random solution
       end if
end for
RandomizeSolution(X):
       select two randomly chosen entries of X, whose rows and columns define the 4 vertices of a random rectangle
       for each of the 4 positions do
             shift the entry by the step size, alternating signs so that the constraints in (1) or (2) remain satisfied
       end for
Algorithm 3 Random QP solution
Dataset Min Max Mean Stdev C-score
Cifar10 2 2,850 600 605.85 1.286
Shakespeare 3 41,305 3,616 6,808.44 0.266
COVCLS 110 33,300 4,920 5,110.28 0.794
COVFEAT 372 17,328 4,920 3,237.06 0.682
MNIST 3 3,365 700 667.12 0.696
MNIST 11 3,327 700 658.89 1.293
eICU 108 5,683 901 925.50 0.060
Table 1. Statistics of datasets (number of samples of clients)
Figure 1. Test performance comparison on the Covertype and Shakespeare datasets. Multiple runs are performed with different seeds on the most representative fold, defined as the one with the closest performance gap to the average of all folds. The performance gap is the difference in the test performance between RADFed and FedAvg.
Dataset FedAvg FedProx FedAsync FedAdapt RADFed RADFed-IS
Cifar10 2.07 1.24 (84.63) 1.69 0.61 0.90
Shakespeare 0.75 0.42 (52.10) 0.21 3.26 3.72
COVFEATG 0.34 0.02 (87.91) 1.79 2.88 2.24
COVFEATL 0.92 0.36 (79.52) 2.40 4.05 3.32
COVCLSG 0.20 0.23 (93.68) 0.05 0.36 0.44
COVCLSL 0.06 0.32 (90.67) 0.29 2.22 0.03
MNIST 0.05 0.19 (97.24) 0.15 0.22 0.28
MNIST 0.17 0.32 (96.79) 0.27 0.40 0.45
eICU 0.11 0.11 (92.31) 0.05 0.03
Table 2. Average test performance of 5-fold cross-validation: % accuracy for the MNIST, Cifar10 and Shakespeare datasets; F1 score (×100) for all Covertype datasets; and Area Under the Receiver Operating Characteristic Curve (AUC) (×100) for the eICU dataset. The best values are shown in bold. Absolute scores are reported for FedAsync and the % relative performance difference against FedAsync is shown for the other algorithms.

4.2. Datasets and Models

For all datasets that are partitioned by (1) or (2), the Dirichlet priors are set so that the resulting clients have reasonably large variations in size, see Table 1. We use the same class concentration parameter for all clients on all datasets but MNIST, where we experiment with different values. For feature sampling, we use the same concentration parameter for all clients and features. In the QP, we use fixed values for the step size and the iteration counts of Algorithm 3. The impact of C, B (the mini-batch size) and E (the number of local training epochs) is well studied and thus we do not experiment with various settings of these variables. We choose a value of C that is shown to be a generally good setting balancing performance and convergence speed (McMahan et al., 2017). The mini-batch size B is set to 10 and 16 for MNIST and Cifar10, respectively, considering that clients on these datasets do not have many samples. On the other datasets, B is set to 256. For MNIST, we choose a setting that makes the task more challenging than for the other datasets. Besides these general federated learning hyper-parameters, each algorithm has its own. RADFed has one more hyper-parameter than FedAvg, the number of redistribution rounds R. RADFed-IS adds another hyper-parameter, the mixing weight λ. We tune the hyper-parameters specific to each federated learning algorithm using grid search on validation clients. Table 3 lists the hyper-parameter values of the proposed algorithms used in our experiments.

param COVCLS (-L & -G) COVFEAT (-L & -G) MNIST MNIST Cifar10 Shakespeare eICU
R (RADFed) 22 20 22 15 15 15 80
R (RADFed-IS) 22 20 22 15 100 100 80
λ (RADFed-IS) 0.9 0.9 0.9 0.9 0.9 0.8 0.9
Table 3. Hyper-parameters in proposed algorithms

4.2.1. Covertype

Covertype is a large structured dataset for forest cover type prediction from the UCI KDD archive (Hettich and Bay, 1999). It consists of 10 numerical features and 2 categorical features with 7 imbalanced classes. Since our goal is to evaluate our method's performance on non-IID data, we do not want to simultaneously consider other data quality problems such as high class imbalance. Therefore, in our experiments, we focus only on predicting the two largest classes. We partition the data into 100 clients: 60 training clients, 20 validation clients and 20 test clients. The splitting of clients is discussed in Section 4.3. The same sizes are used for the MNIST and Cifar10 datasets. We train a fully connected neural network with 2 hidden layers of 64 neurons each.

Figure 2. Validation performance comparison on the Covertype and MNIST datasets. The F1 score is used on all Covertype datasets and accuracy is reported on the MNIST datasets. Curves are smoothed by taking the average over evenly spaced intervals for better visualization. The intervals are chosen differently considering that validation frequencies are different. The intervals are set to 100 for the Covertype datasets and 5 for the MNIST datasets.
Figure 3. Validation performance comparison. Accuracy is reported on the Cifar10 and Shakespeare datasets. AUC is used on the eICU dataset. Similar to Figure 2, curves are smoothed with intervals 1, 2 and 50 for the Cifar10, Shakespeare and eICU datasets, respectively.

We create two types of datasets, one (COVCLS) with classes and client sizes sampled non-identically based on (1) and the other (COVFEAT) with features also sampled non-identically, thus using (2). For both datasets, the same Dirichlet parameters are used for all clients, with an additional feature-distribution parameter for the COVFEAT dataset. All the datasets that follow are created based on (1).

On the Covertype datasets, we also study the impact of localized and global data standardization. The difference is whether to use global statistics of all clients’ data to standardize client local data or to let each client perform standardization with its own statistics. On COVFEAT-G and COVCLS-G, we perform global standardization, while on COVFEAT-L and COVCLS-L, localized standardization is used. When comparing our algorithm with benchmarks on other datasets, we use global standardization to be consistent with the original papers.

4.2.2. MNIST

MNIST (LeCun et al., 2010) consists of images of digits with 10 classes. We sample 100 clients with classes and sizes non-identically distributed. We study how data heterogeneity impacts the performance of federated learning algorithms by creating two datasets with different concentration values; the dataset generated with the larger value has lower heterogeneity in class distributions. We build a fully connected neural network, the same as in (McMahan et al., 2017).

4.2.3. Cifar10

Cifar10 (Krizhevsky, 2009) images are partitioned into 100 clients with classes and sizes non-identically distributed. We use a pre-trained MobileNetV2 (Sandler et al., 2018) as the model and train a subset of layers from the last bottleneck convolution layer to the classification layer.

4.2.4. Shakespeare

This dataset is a language modeling dataset built from The Complete Works of William Shakespeare (McMahan et al., 2017). We use the same data as (Li et al., 2020) but partition samples by speaking roles; each speaking role corresponds to one client. In total, the dataset consists of 143 clients, split into training, validation and test groups as described in Section 4.3. The task is to predict the next character given a sequence of 80 characters. We train a 2-layer long short-term memory (LSTM) classifier with an 8-dimensional embedding layer.

4.2.5. eICU

eICU is a large multi-center critical care database made available by Philips Healthcare (Pollard et al., 2018). We predict in-hospital mortality using the variables underlying the Acute Physiology Age Chronic Health Evaluation (APACHE) predictions (the full variable list and descriptions are available online). To avoid a potential sampling bias, we focus on mid-to-large hospitals with more than 100 admissions and exclude those with a high mortality rate (greater than 20%). Each hospital corresponds to a client. The dataset contains 164 clients, split into training, validation and test groups as described in Section 4.3. We train a logistic regression model with L2 regularization.

Figure 4. Test performance comparison on datasets with different levels of heterogeneity

4.3. Evaluation Setup

We compare the performance of our methods (RADFed and RADFed-IS) with FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020), the adaptive federated optimization method (FedAdapt) (Reddi et al., 2021) and the asynchronous federated optimization method (FedAsync) (Xie et al., 2020) on multiple tasks. FedAvg is probably the most popular and commonly used federated algorithm, and the others are state-of-the-art federated learning algorithms that handle non-IID data distributions.

Different from synchronous methods, FedAsync has to deal with the staleness of updates from clients. The staleness of a client’s update is defined as the timestamp difference between a client’s update and the server’s model. The performance of FedAsync suffers from large staleness. In order to mitigate the impact of staleness on training, the new global model is updated as a weighted average between the old global model and the client’s local update. In addition, the authors show that decaying the mixing weights as a function of staleness helps to fight against large staleness. Despite these efforts, the impact of staleness on FedAsync’s performance is not completely eliminated.

In order to make a fairer comparison between asynchronous and synchronous methods, we have to choose a reasonable value for the staleness. We simulate FedAsync's training procedure and find the maximum staleness at which the average number of clients running in parallel per round is the same as in the synchronous methods. In other words, we compare FedAsync and the synchronous methods under the same average level of parallelism.

To the best of our knowledge, there is no gold standard for evaluating federated algorithms. Generally, there are three ways to split the data into training and test sets: splitting all data globally (Sattler et al., 2019; Li et al., 2019; Huang et al., 2019), splitting each client’s local data (Ghosh et al., 2019; Reddi et al., 2021; Bakopoulou et al., 2021), and splitting clients into training/test groups (Huang et al., 2020; Hard et al., 2019; Ramaswamy et al., 2019). In this work, we adopt the last strategy, assuming that the server cannot collect or manipulate clients’ local data. Additionally, we perform 5-fold cross-validation with the by-client splits to reduce selection bias, which may be aggravated by the non-IID client distributions. We split all clients into 5 sets. One by one, a set is selected as the test set; among the remaining sets, one by one, a set is selected as the validation set, with the others used as the training set.
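The nested by-client split described above can be sketched as follows; client IDs are arbitrary placeholders.

```python
import random

def client_cv_splits(clients, n_folds=5, seed=0):
    """Yield (train, val, test) client groups for nested by-client CV.

    Clients (not samples) are divided into n_folds folds; each fold
    serves once as the test set, and within the remaining folds each
    serves once as the validation set.
    """
    rng = random.Random(seed)
    shuffled = clients[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for t in range(n_folds):
        test = folds[t]
        rest = [folds[i] for i in range(n_folds) if i != t]
        for v in range(len(rest)):
            val = rest[v]
            train = [c for i, f in enumerate(rest) if i != v for c in f]
            yield train, val, test

# 164 clients as in the eICU task: 5 test folds x 4 validation choices
splits = list(client_cv_splits(list(range(164))))
print(len(splits))
```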

5. Results

We run each algorithm 3 times with different seeds on each of the 5 folds and report the average performance over the 15 runs in Table 2. On average, RADFed and RADFed-IS improve over the best benchmark, FedAsync, by 1.56% and 1.26%, respectively. On the MNIST datasets and the eICU dataset, all algorithms achieve close performance. On the other datasets, the best of RADFed and RADFed-IS is significantly better than FedAsync (p < 0.05 under the Wilcoxon signed-rank test (Wilcoxon, 1992)). Under some difficult settings, which we discuss later, our framework offers a substantial improvement over all comparison algorithms; the best of RADFed and RADFed-IS outperforms the best comparison algorithm on Shakespeare, COVFEAT-G, COVFEAT-L and COVCLS-L. Figure 1 shows that RADFed is quite stable across different seeds and confirms its significant improvement on these datasets.

In Figures 2 and 3, we compare the validation curves. With delayed aggregations, RADFed and RADFed-IS stabilize training, showing a smaller variation in validation scores than the algorithms that adopt the FedAvg framework. In general, our algorithms reach their maximum validation score after a similar number of training rounds as the other algorithms. On Shakespeare, our algorithms peak much later than FedAvg. This is due to the large learning rate used in FedAvg, where a relatively larger learning rate yields a better result although the model overfits more quickly than with a lower one; it does not imply that delaying aggregations also delays convergence. In fact, on Shakespeare, aggregations in RADFed are delayed by 15 redistribution rounds and in RADFed-IS by 100, yet we observe similar convergence behavior.

5.1. Heterogeneity

We create datasets with various levels of heterogeneity to evaluate whether our model is effective and robust under different heterogeneous settings. In order to compare manually partitioned datasets with naturally partitioned ones, we introduce the class non-IID score (C-score), defined as

$C = \frac{1}{|K|} \sum_{k \in K} \sum_{c} \left| p_{k,c} - p_{c} \right|,$

where $p_{k,c}$ is the ratio of class $c$ on client $k$ and $p_{c}$ is the ratio of class $c$ in all data. This score measures the difference between each client’s class ratios and the global class ratios. The C-score of each dataset is shown in Table 1.
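Following the textual definition above, a minimal sketch of the C-score computation (the exact normalization in the paper may differ):

```python
import numpy as np

def c_score(client_labels, num_classes):
    """Class non-IID score: average over clients of the L1 distance
    between each client's class ratios and the global class ratios."""
    all_labels = np.concatenate(client_labels)
    global_ratio = np.bincount(all_labels, minlength=num_classes) / len(all_labels)
    dists = []
    for labels in client_labels:
        ratio = np.bincount(labels, minlength=num_classes) / len(labels)
        dists.append(np.abs(ratio - global_ratio).sum())
    return float(np.mean(dists))

# Identical class mix on every client -> score 0 (IID)
iid = [np.array([0, 1] * 10) for _ in range(4)]
# One class per client -> maximal class skew
skew = [np.zeros(20, dtype=int), np.ones(20, dtype=int)]
print(c_score(iid, 2), c_score(skew, 2))
```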

The two MNIST datasets are partitioned with different parameter values of our sampling algorithm. The one partitioned with the smaller value is expected to have higher heterogeneity of class distributions, and its C-score is indeed higher. As shown in Figure 4, all algorithms perform worse on the more heterogeneous MNIST dataset. The RADFed-IS algorithm performs best on both datasets and shows the smallest performance regression as class heterogeneity increases.

The COVCLS and COVFEAT datasets are partitioned with the same parameter values, so they have a similar level of heterogeneity with respect to client sizes and classes; their C-scores are also similar. However, since we additionally introduce heterogeneity in the feature distributions of the COVFEAT datasets, they pose a more severe non-IID problem than the COVCLS datasets. Accordingly, as shown in Figure 4, all algorithms perform worse on the COVFEAT datasets regardless of the standardization method. As with the MNIST datasets, our algorithms suffer the least when data heterogeneity increases.

5.2. Standardization

With global standardization, RADFed-IS achieves performance close to that of RADFed on both the COVFEAT-G and COVCLS-G datasets. RADFed outperforms FedAvg by 0.17% and 3.24% on COVCLS-G and COVFEAT-G, respectively.

With localized standardization, all federated learning algorithms show a performance regression, as shown in Figure 4. However, RADFed handles localized standardization well, offering a larger performance improvement over FedAvg on both COVCLS-L (2.3%) and COVFEAT-L (5.0%).

Interestingly, RADFed outperforms RADFed-IS on both COVCLS-L and COVFEAT-L, which implies that it is more challenging for RADFed-IS to determine which clients are better under localized standardization. Therefore, we recommend RADFed-IS when it is possible to perform global standardization and RADFed under localized standardization.

5.3. Model Complexity

On all Covertype and MNIST datasets, we train fully connected neural networks. On the Cifar10 and Shakespeare datasets, we test our algorithms with deeper networks (convolutional and recurrent neural networks, respectively) on unstructured tasks. RADFed-IS performs best on both datasets. On Cifar10, it improves on RADFed by 0.28% and outperforms the best comparison algorithm and FedAvg by 0.90% and 3.03%, respectively. Unlike the Cifar10, Covertype and MNIST datasets, the Shakespeare dataset is not manually partitioned through our sampling algorithm; instead, the data are partitioned naturally by speaking role. As a result, the Shakespeare dataset has a lower C-score than these three datasets. On Shakespeare, RADFed-IS improves on RADFed by 0.45% and surpasses all other comparison methods by at least 3.26%.

Besides deep models, we train a logistic regression model on the eICU dataset. As shown in Table 2, all federated algorithms achieve similar AUC scores, all close to the centralized model’s AUC of 0.924. The eICU dataset is also not partitioned through the sampling algorithm, and its C-score is much smaller than those of all other datasets. Despite the small heterogeneity of the dataset, RADFed still offers a modest but statistically significant improvement over FedAvg (Wilcoxon signed-rank test (Wilcoxon, 1992) over the 15 runs).

5.4. Divergence on Delayed Updates

5.4.1. Divergence from the centralized model

In studying the non-IID challenge in federated learning, weight divergence has been used to explain the performance reduction, which, as shown in (Zhao et al., 2018), can be attributed to this divergence. The weight Divergence between the Centralized and federated models (WDC) measures the difference of the global weights of federated training relative to those of centralized training. It is defined as

$\mathrm{WDC} = \frac{\lVert \mathbf{w}^{f}_{t} - \mathbf{w}^{c}_{t} \rVert}{\lVert \mathbf{w}^{c}_{t} \rVert},$

where $\mathbf{w}^{f}_{t}$ are the weights of the global model in federated training at the $t$-th round and $\mathbf{w}^{c}_{t}$ are the centralized weights.
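The divergence from the centralized model, following the relative-norm form of Zhao et al. (2018), can be computed as a one-liner; the flattening of weight tensors is an implementation assumption.

```python
import numpy as np

def wdc(w_fed, w_cen):
    """Weight divergence between federated and centralized models:
    ||w_fed - w_cen|| / ||w_cen||, with all weights flattened."""
    w_fed = np.ravel(np.asarray(w_fed, dtype=float))
    w_cen = np.ravel(np.asarray(w_cen, dtype=float))
    return float(np.linalg.norm(w_fed - w_cen) / np.linalg.norm(w_cen))

print(wdc([1.0, 1.0], [1.0, 1.0]))  # identical models: divergence 0
print(wdc([2.0, 0.0], [1.0, 0.0]))  # federated weights twice as large
```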

Figure 5. Weight divergence

To visualize this weight divergence, we train a centralized model and a federated model side by side. Both models start from the same weight initialization, and in each round the same data are used in training; the difference is that in centralized training we collect data from the clients and update the model on the combined data. Some divergence from the centralized model is expected due to the distance between the client data distributions and the population distribution. As shown in Figures 4(b) and 4(d), the RADFed algorithm demonstrates a smaller weight divergence than FedAvg, indicating that its aggregated weights are less impacted by the skewness of the data and are closer to the weights trained on data drawn from the population distribution.

5.4.2. Divergence between local models

Another type of weight divergence is the Divergence between clients’ Local updates (WDL) before each aggregation. For a set of clients $S$, the divergence in the $t$-th round is defined as

$\mathrm{WDL} = \frac{2}{|S|(|S|-1)} \sum_{i,j \in S,\; i < j} \lVert \mathbf{w}^{i}_{t} - \mathbf{w}^{j}_{t} \rVert,$

where $\mathbf{w}^{i}_{t}$ is the local update from client $i$ in the $t$-th round. A positive correlation between this divergence and federated learning performance is observed in (Kim et al., 2020). That study is based on the FedAvg framework, which differs from ours; although the same correlation might not hold when comparing different frameworks, the divergence helps visualize how our algorithm behaves.
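One concrete way to compute such a divergence between local updates is the average pairwise L2 distance; this is a sketch, and the paper's exact normalization may differ.

```python
import numpy as np
from itertools import combinations

def wdl(local_updates):
    """Average pairwise L2 distance between clients' local updates
    (each update given as a flattened weight vector)."""
    pairs = combinations(range(len(local_updates)), 2)
    return float(np.mean([np.linalg.norm(local_updates[i] - local_updates[j])
                          for i, j in pairs]))

same = [np.ones(4) for _ in range(3)]
print(wdl(same))  # identical local models: divergence 0
```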

During training of our algorithm, we observe a periodic trajectory of the divergence between local updates (Figure 5). In the first round after each aggregation, the divergence is smallest. As the aggregation is delayed for more rounds, the divergence keeps increasing until the next aggregation. The divergence in FedAvg fluctuates around the lowest values of our algorithm. Figures 4(a) and 4(c) show the weight divergence of local updates on the Shakespeare and COVFEAT-L datasets.

The increasing divergence does not indicate any deficiency of our framework; it is likely due to the nature of redistributing local models. For example, in a non-IID setting where each client holds data of a single class, training may start on clients with different classes and yield a large divergence between local models. In the next redistribution round, the divergent local models are again trained on client data of different classes, so the divergence accumulates as redistribution continues. FedAvg, in contrast, results in a smaller divergence because it aggregates after each local training round, so divergence does not accumulate.

6. Conclusion

In this work, we propose a new training framework with delayed aggregation to handle the well-known non-IID issue in federated learning. We demonstrate that our framework offers a substantial improvement over the FedAvg framework and outperforms several state-of-the-art federated learning algorithms. Moreover, we incorporate importance sampling in our framework and further improve the framework on multiple datasets.

Along the way, we also discuss the following topics in federated learning: the splitting of training and test sets, localized and global standardization, and weight divergence on different frameworks. Experiments show that federated learning algorithms suffer from localized standardization. The proposed framework demonstrates a good ability in handling localized standardization. However, the importance sampling version does not offer further improvement under localized standardization.

In addition, we propose a sampling algorithm to generate non-IID datasets. It offers the choice of a desired non-IID level for client sizes, classes and features separately, giving researchers more flexibility and control over simulating different non-IID settings. We also introduce the C-score to quantify the level of heterogeneity of non-IID datasets and demonstrate the robustness of the proposed algorithms on datasets with various C-scores.


  • E. Bakopoulou, B. Tillman, and A. Markopoulou (2021) FedPacket: a federated learning approach to mobile packet classification. IEEE Transactions on Mobile Computing. Cited by: §1, §4.3.
  • M. Chen, R. Mathews, T. Ouyang, and F. Beaufays (2019) Federated learning of out-of-vocabulary words. External Links: 1903.10635 Cited by: §1.
  • R. C. Geyer, T. Klein, and M. Nabi (2018) Differentially private federated learning: a client level perspective. External Links: 1712.07557 Cited by: §1.
  • A. Ghosh, J. Hong, D. Yin, and K. Ramchandran (2019) Robust federated learning in a heterogeneous environment. External Links: 1906.06629 Cited by: §1, §4.3.
  • A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage (2019) Federated learning for mobile keyboard prediction. External Links: 1811.03604 Cited by: §1, §4.3.
  • S. Hettich and S. D. Bay (1999) UCI KDD archive. Cited by: §4.2.1.
  • L. Huang, A. L. Shea, H. Qian, A. Masurkar, H. Deng, and D. Liu (2019) Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. Journal of biomedical informatics 99, pp. 103291. Cited by: §1, §4.3.
  • L. Huang, Y. Yin, Z. Fu, S. Zhang, H. Deng, and D. Liu (2020) LoAdaBoost: loss-based adaboost federated machine learning with reduced computational complexity on iid and non-iid intensive care data. Plos one 15 (4), pp. e0230706. Cited by: §1, §2, §4.3.
  • E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S. Kim (2018) Communication-efficient on-device machine learning: federated distillation and augmentation under non-iid private data. External Links: 1811.11479 Cited by: §2.
  • A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. In International Conference on Machine Learning, pp. 2525–2534. Cited by: §3.2.
  • H. Kim, T. Kim, and C. Youn (2020) On federated learning of deep networks from non-iid data: parameter divergence and the effects of hyperparametric methods. External Links: Link Cited by: §5.4.2.
  • J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2017) Federated learning: strategies for improving communication efficiency. External Links: 1610.05492 Cited by: §2.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §4.2.3.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Cited by: §4.2.2.
  • C. Li, R. Li, H. Wang, Y. Li, P. Zhou, S. Guo, and K. Li (2019) Gradient scheduling with global momentum for non-iid data distributed asynchronous training. External Links: 1902.07848 Cited by: §1, §2, §4.3.
  • T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, Cited by: §1, §2, §2, §4.2.4, §4.3.
  • D. Liu, D. Dligach, and T. A. Miller (2019) Two-stage federated phenotyping and patient representation learning. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 283–291. Cited by: §1.
  • Y. Liu, Z. Ma, X. Liu, S. Ma, S. Nepal, and R. Deng (2020) Boosting privately: privacy-preserving federated extreme boosting for mobile crowdsensing. External Links: 1907.10218 Cited by: §1.
  • W. Luping, W. Wei, and L. Bo (2019) CMFL: mitigating communication overhead for federated learning. In 2019 IEEE 39th International Conference on Distributed Computing Systems, pp. 954–964. Cited by: §2.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §1, §1, §2, §3, §4.2.2, §4.2.4, §4.2, §4.3.
  • P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica (2018) Ray: a distributed framework for emerging ai applications. In 13th USENIX Symposium on Operating Systems Design and Implementation, pp. 561–577. Cited by: §1.
  • T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi (2018) The eicu collaborative research database, a freely available multi-center database for critical care research. Scientific data 5 (1), pp. 1–13. Cited by: §4.2.5.
  • S. Ramaswamy, R. Mathews, K. Rao, and F. Beaufays (2019) Federated learning for emoji prediction in a mobile keyboard. External Links: 1906.04329 Cited by: §1, §4.3.
  • S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan (2021) Adaptive federated optimization. In 9th International Conference on Learning Representations, Cited by: §2, §2, §4.3, §4.3.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §4.2.3.
  • F. Sattler, S. Wiedemann, K. Müller, and W. Samek (2019) Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems 31 (9), pp. 3400–3413. Cited by: §1, §2, §4.3.
  • S. Silva, B. A. Gutman, E. Romero, P. M. Thompson, A. Altmann, and M. Lorenzi (2019) Federated learning in distributed medical databases: meta-analysis of large-scale subcortical brain data. In 2019 IEEE 16th international symposium on biomedical imaging, pp. 270–274. Cited by: §1.
  • F. Wilcoxon (1992) Individual comparisons by ranking methods. In Breakthroughs in statistics, pp. 196–202. Cited by: §5.3, §5.
  • C. Xie, S. Koyejo, and I. Gupta (2020) Asynchronous federated optimization. External Links: 1903.03934 Cited by: §1, §2, §2, §4.3.
  • N. Yoshida, T. Nishio, M. Morikura, K. Yamamoto, and R. Yonetani (2020) Hybrid-fl for wireless networks: cooperative learning mechanism using non-iid data. In 2020 IEEE International Conference on Communications, pp. 1–7. Cited by: §2.
  • Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. External Links: 1806.00582 Cited by: §1, §2, §5.4.1.