CSAFL: A Clustered Semi-Asynchronous Federated Learning Framework

by   Yu Zhang, et al.

Federated learning (FL) is an emerging distributed machine learning paradigm that protects privacy and tackles the problem of isolated data islands. At present, there are two main communication strategies of FL: synchronous FL and asynchronous FL. The advantages of synchronous FL are that the model has high precision and fast convergence speed. However, this synchronous communication strategy has the risk that the central server waits too long for the devices, namely, the straggler effect which has a negative impact on some time-critical applications. Asynchronous FL has a natural advantage in mitigating the straggler effect, but there are threats of model quality degradation and server crash. Therefore, we combine the advantages of these two strategies to propose a clustered semi-asynchronous federated learning (CSAFL) framework. We evaluate CSAFL based on four imbalanced federated datasets in a non-IID setting and compare CSAFL to the baseline methods. The experimental results show that CSAFL significantly improves test accuracy by more than +5 compared to TA-FedAvg. In particular, CSAFL improves absolute test accuracy by +34.4





I Introduction

Federated learning (FL) [23, 34, 17] is an emerging machine learning framework that utilizes multiple edge devices to jointly train a global model under the coordination of a central server. The training process of FL is divided into many communication rounds. In each communication round, the edge devices download the global model parameters from the central server and optimize them on local data to obtain local model parameters. Each edge device then uploads its local model parameters to the central server, which aggregates them into new global model parameters. Throughout the training process of the federated network, the training data of each client is kept locally and is never transmitted, which protects data privacy [14, 11, 24]. In practice, the FL framework plays a crucial role in supporting privacy-sensitive applications on edge devices [12].

Fig. 1: The training procedures of synchronous FL and asynchronous FL

There are two model updating mechanisms for federated learning: synchronous and asynchronous. Fig. 1 shows the training process of the two update strategies. Because of their different update mechanisms, the two FL frameworks have their own merits and drawbacks:

  • Synchronous FL: In synchronous FL, all clients download the global model parameters at a unified point in time, and the central server waits for all clients to complete their training tasks. The merits of this synchronous updating strategy are that the model converges quickly and reaches high precision. The drawback is the straggler effect [32], caused by limited network resources or poor hardware resources, which leaves the server idle while it waits.

  • Asynchronous FL: In asynchronous FL, the central server cooperates with the clients that have completed their training tasks rather than waiting for all clients, and each client can ask the central server for the new global model parameters as soon as its training procedure is completed. Although this model updating strategy keeps the server from falling into the idle state, it has some shortcomings. On the one hand, it increases the amount of data transmission, which may cause the server to crash [31]. On the other hand, the gradient divergence caused by asynchronous updating further degrades the performance of the model.

There have been many efforts on synchronous and asynchronous FL training algorithms. McMahan et al. propose the FL framework Federated Averaging (FedAvg) [23], and Li et al. prove experimentally and theoretically that FedAvg converges with a decayed learning rate under a statistically heterogeneous setting [19]. To relieve the effect of imbalanced data on accuracy, Duan et al. propose a self-balancing FL framework, Astraea [10]. FedAvg and Astraea are both synchronous FL frameworks. However, these algorithms do not consider stragglers, which makes them unfriendly to time-critical applications. Asynchronous FL frameworks mitigate the straggler effect. Xie et al. propose the FedAsync [8] framework with a staleness function. Chen et al. propose an asynchronous algorithm to tackle the challenges associated with heterogeneous edge devices, such as computational loads and stragglers [6]. However, these frameworks ignore the possibility that the server may crash due to the continuous processing of requests from the clients [31].

In this paper, we propose a novel clustered semi-asynchronous federated learning (CSAFL) framework, which mitigates the straggler effect and controls model staleness in the asynchronous update to improve accuracy. CSAFL leverages a spectral clustering algorithm, which groups clients according to an affinity matrix constructed from the delay and direction of the clients’ model updates. In each communication round with a fixed time budget, the selected clients perform synchronous or asynchronous updates independently and contribute their updated parameters to the model of the group they belong to. The main contributions of this paper are summarized as follows:

  • We propose a novel clustered semi-asynchronous federated learning (CSAFL) framework. As far as we know, this is the first framework that combines synchronous and asynchronous update mechanisms.

  • We design two strategies to alleviate the model staleness problem caused by asynchronous updating. The first is to leverage a spectral clustering algorithm to divide clients with different learning objectives into multiple groups. The second is to limit the model delay.

  • We evaluate CSAFL on four federated datasets and show its effectiveness in mitigating the straggler effect compared to T-FedAvg. Compared to TA-FedAvg, CSAFL improves test accuracy by more than 5 percentage points on all four datasets; the largest improvement, 34.4 percentage points, is on the FEMNIST dataset.

The rest of this paper is organized as follows. Section II outlines the background of synchronous FL and asynchronous FL. Section III shows the motivation of the CSAFL framework. Section IV details the design of the CSAFL framework. Evaluation results are presented and analyzed in Section V. Section VI concludes the paper.

II Background and Related Work

II-A Federated Learning

In this section, we first present the vanilla synchronous FL framework FedAvg. Then we briefly introduce the asynchronous FL framework based on FedAvg.

McMahan et al. first propose the concept of federated learning, and later propose a widely used federated learning framework FedAvg [23], which involves solving a machine learning problem by loosely combining multiple devices under the coordination of a central server. Unlike the traditional distributed machine learning method, the computing nodes in the FL framework keep the training data locally and do not exchange or transfer the training data in the federated network. Instead, only the local updates of each computing node are transmitted, which reduces privacy risks. Nevertheless, the central server has no control over computing nodes. For example, computing nodes can join or drop out of the federated network at any time.

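1:procedure FL Server Training
2:     Initialize global model $w_0$.
3:     for each communication round $t$ do
4:         $S_t \leftarrow$ random subset of the $K$ clients.
5:         Server broadcasts $w_t$ to the selected clients.
6:         for each client $k \in S_t$ parallelly do
7:             $w_{t+1}^k \leftarrow$ ClientUpdate($k$, $w_t$).
8:         $w_{t+1} \leftarrow \sum_{k \in S_t} \frac{n_k}{n} w_{t+1}^k$.
10:function ClientUpdate($k$, $w$)
11:     $\mathcal{B} \leftarrow$ split the local data of client $k$ into batches of size $B$.
12:     for each local epoch from $1$ to $E$ do
13:         for each local batch $b \in \mathcal{B}$ do
14:             $w \leftarrow w - \eta \nabla \ell(w; b)$.
15:     return $w$
Algorithm 1 Federated Averaging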

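In particular, the optimization goal of general FL is:

$$\min_{w} f(w) = \sum_{k=1}^{N} p_k F_k(w), \tag{1}$$

which means finding the model parameters $w$ for which the value of $f(w)$ is the smallest. In (1), $N$ is the number of clients participating in training, $p_k$ refers to the participation weight of the $k$-th device, with $p_k \ge 0$ and $\sum_{k=1}^{N} p_k = 1$, and $F_k(w)$ is the local objective function of the $k$-th device, which we define as:

$$F_k(w) = \frac{1}{n_k} \sum_{j=1}^{n_k} \ell(w; x_{k,j}), \tag{2}$$

where $n_k$ is the training data size of the $k$-th client, $x_{k,j}$ is the $j$-th sample of the $k$-th client, which obeys the data distribution $\mathcal{D}_k$, and $\ell(w; x_{k,j})$ is the prediction loss function on $x_{k,j}$.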


Specifically, synchronous FL based on FedAvg involves a central server that maintains the global model and multiple participants that communicate with the server through network connections. In each communication round $t$, the server first selects a subset of all clients to participate in the training tasks and then broadcasts the global model $w_t$ to the selected clients through the network. Each selected client $k$ updates its local model $w_{t+1}^k$ based on $w_t$. When the client completes the local training procedure, it transmits $w_{t+1}^k$ to the server through the network. The server waits for all selected clients to complete the training task. Finally, the server aggregates the new global model $w_{t+1}$ by averaging the local solutions of the clients. To reach the target accuracy, an FL task usually needs hundreds of communication rounds.

The pseudo-code of FedAvg is shown in Algorithm 1, where $S_t$ is the random subset of clients selected in each communication round. The minibatch size $B$ is the number of training samples per batch, and $\mathcal{B}$ refers to the set of batches obtained by splitting the local training data with batch size $B$. The number of local epochs $E$ is the number of local training passes. $n_k$ is the data size of client $k$ and $n$ is the total data size of the selected clients. $\eta$ is the learning rate of the local solver.
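The round structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the one-parameter least-squares model, the toy clients, and the function names `client_update` and `fedavg_round` are assumptions made only for illustration. What it is meant to show is the data-size-weighted averaging of local solutions.

```python
import random

def client_update(w, data, epochs, batch_size, lr):
    """Run local minibatch SGD on a toy 1-D model y = w * x and return the new weight."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean of 0.5 * (w*x - y)^2 over the batch.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

def fedavg_round(w_global, clients, num_selected, epochs, batch_size, lr, rng):
    """One synchronous round: sample clients, train locally, average by data size."""
    selected = rng.sample(sorted(clients), num_selected)
    n_total = sum(len(clients[k]) for k in selected)
    w_new = 0.0
    for k in selected:
        w_k = client_update(w_global, clients[k], epochs, batch_size, lr)
        w_new += len(clients[k]) / n_total * w_k  # weight n_k / n
    return w_new

# Hypothetical toy federation: three clients with unequal data sizes, all from y = 2x.
rng = random.Random(0)
clients = {}
for name, size in (("a", 4), ("b", 8), ("c", 16)):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    clients[name] = [(x, 2.0 * x) for x in xs]

w = 0.0
for _ in range(50):
    w = fedavg_round(w, clients, num_selected=2, epochs=2, batch_size=4, lr=0.5, rng=rng)
print(w)  # should end up close to the true slope 2.0
```

Because the server only averages after every selected client has returned, the duration of a round in this sketch would be dictated by the slowest selected client, which is the straggler effect discussed in the text.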

As shown in Algorithm 1, in a synchronous FL system the server needs to wait for all selected clients to complete their training tasks, which is harmful for some time-critical applications. Therefore, the asynchronous FL framework is proposed. We briefly introduce the asynchronous FL framework based on FedAvg, where each client uploads its local model to the global model independently. Whenever the server receives a local update from a client, it refreshes the global model. Therefore, the server does not need to wait for stragglers before aggregation.

II-B Related Work

Federated Learning (FL), first proposed by Google, is a new approach to fitting machine learning into the edge. Existing FL frameworks can be classified into synchronous FL and asynchronous FL according to how the model is updated. Most studies on synchronous FL do not consider the issue of stragglers caused by device heterogeneity and unstable network conditions [23] [26] [29]. To address statistical heterogeneity, Li et al. [18] propose a proximal term, prove experimentally that this term improves the stability of the framework, and provide convergence guarantees in theory. However, the server still needs to wait for the local updates of stragglers before aggregation. To address the issue that all clients have to wait for the slowest one, Li et al. [20] propose a strategy in which the central server only accepts the responses of the first k clients, up to a threshold, and regards the rest of the clients as stragglers. This method simply drops the stragglers and ignores the possibility that they hold valuable data. In addition, to mitigate the impact of stragglers, Dhakal et al. [9] develop a coded computing technique for FL in which the server compensates for the gradients of the stragglers. However, the server computes this compensation from the clients’ parity data, so there is a risk of privacy leakage.

Asynchronous FL has a natural advantage over synchronous FL in solving the straggler effect [33] [5] [22], because the server can aggregate without waiting for stragglers. Chen et al. propose the ASO-Fed framework [6], which updates the global model in an asynchronous manner to tackle stragglers. However, gradient staleness is not considered, which may threaten the convergence of the model. Xie et al. develop the FedAsync [8] algorithm, which combines a staleness function with an asynchronous update protocol. However, the clients continue to transmit a large amount of data to the server, which may cause the server to crash. To reduce data transmission, Wu et al. propose the SAFA [33] protocol, which divides clients into three types; asynchronous clients continuously perform local updates until the difference between the local update version and the global model version reaches a tolerance. Although SAFA considers model staleness, the server still needs to wait for the asynchronous clients.

III Preliminary Case Studies and Motivation

In this section, we illustrate our preliminary case studies, which guide the motivation for a new grouping model update strategy in FL.

Fig. 2: The straggler effect in the synchronous FL framework can be observed in the figure on the left, where the frequency distribution of the clients’ idle time is unbalanced. The figure on the right shows the accuracy comparison of asynchronous FL and synchronous FL during model training.

Asynchronous FL has a natural advantage in solving the straggler effect in FL [33], but the model delay caused by asynchronous updates may affect the accuracy of the model. To investigate the influence of different model updating strategies on the accuracy of the model, we study the training process of synchronous FL and asynchronous FL. More specifically, we set up this test on the 62-class Federated Extended MNIST (FEMNIST) dataset using a multinomial logistic regression model. The statistical information of the FEMNIST dataset is shown in Table I. In order to show the straggler effect clearly, we set a fixed time budget hyper-parameter for each communication round, so within one communication round a client may perform multiple synchronous or asynchronous updates. Under a fixed time budget, we set up two groups of experiments based on FedAvg: T-FedAvg, which updates synchronously, and TA-FedAvg, an asynchronous update algorithm based on FedAvg. In this FL test, we adopt the same notation as [23]: the number of all clients K = 200, the local minibatch size $B$, the number of local epochs $E$, the number of rounds, the learning rate $\eta$, and the number of clients selected per round. We additionally set a time budget for each round.

Task                  Dataset             Model  Devices  Samples
Image Classification  MNIST               MCLR   1000     69035
Image Classification  FEMNIST             MCLR   200      18345
Simulation Test       Synthetic(0.8,0.5)  MCLR   100      75349
Sentiment Analysis    Sent140             LSTM   772      40783
TABLE I: The statistics of the four federated datasets

The top-1 test accuracy of the two model update strategies on the FEMNIST dataset is shown in Fig. 2. The TA-FedAvg algorithm shows a reduction in top-1 test accuracy compared to the T-FedAvg algorithm. This test shows that the asynchronous update strategy harms the accuracy of the model. In the case of synchronous model update, all clients download the global model parameters from the server at the same time, so no client trains with old model parameters within a communication round. The model delay of each client is therefore 0. In the case of asynchronous model update, however, by the time the server receives the local model parameters uploaded by one client, the global model on the server may already have been updated many times by other clients. The model delay of each client is therefore not steady. Due to this instability of the model delay, the accuracy of asynchronous updates decreases significantly.
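To make the notion of model delay concrete, the following sketch counts, for each arriving update, how many times the global model was refreshed since that client last downloaded it. The bookkeeping with integer version numbers and the function name `async_model_delays` are assumptions made for illustration, not details taken from the experiments.

```python
def async_model_delays(arrivals):
    """Return (client, delay) for each update, in arrival order.

    arrivals lists client ids in the order their updates reach the server.
    Every client is assumed to start from version 0 and to download the
    newest global model immediately after its own update is applied.
    """
    version = 0       # current version of the global model
    downloaded = {}   # version each client last downloaded
    delays = []
    for client in arrivals:
        base = downloaded.get(client, 0)
        delays.append((client, version - base))
        version += 1                   # the server refreshes the global model
        downloaded[client] = version   # the client fetches the new model
    return delays

# A fast client "A" updates often; slow clients "B" and "C" arrive with stale models.
delays = async_model_delays(["A", "A", "B", "A", "C"])
print(delays)
```

In this toy trace the fast client sees little delay, while the slowest client arrives several versions behind, which matches the unsteady model delay described above.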

Fig. 2 also shows the distribution of the clients’ idle time, which makes clear that in synchronous FL the server has to wait too long and the idle time is unbalanced. In asynchronous FL, however, the server does not need to wait for all clients to complete the training tasks of each round. That is, when a client finishes uploading its local model parameters, the server immediately refreshes the global model, so the idle time of each client is 0. We assume that the communication time for the client to download model parameters from the server is negligible.

The preliminary case study shows that there is a straggler effect in the synchronous model update. In addition, the asynchronous model update strategy suffers from precision degradation caused by model delay. We are also inspired by the Iterative Federated Clustering Algorithm (IFCA) framework proposed by Ghosh et al. [13], which assumes that different groups of users have their own learning objectives. Therefore, we propose a novel clustered semi-asynchronous federated learning (CSAFL) framework.

IV CSAFL

IV-A Framework Overview

To tackle the straggler effect in synchronous FL and model staleness in asynchronous FL, we propose a novel clustered semi-asynchronous federated learning (CSAFL) framework, which leverages a clustering algorithm based on similarity metrics to group clients. CSAFL combines synchronous FL and asynchronous FL to drive clients to update local model parameters to the same group’s group model. As far as we know, this is the first paper to combine synchronous and asynchronous update mechanisms.

Our model architecture is shown in Fig. 3. CSAFL includes several groups, which can be deployed on the central server or on devices in the middle layer, such as edge servers. The groups maintain the group models. Within a group, the group model plays the role of the global model and is always the latest model. In this paper, we assume that all groups are deployed on a central server. CSAFL also contains a number of clients, which can be mobile phones, IoT devices, etc. There is one-to-one communication between a client and its group. Due to the asynchronous update mechanism, the model maintained on a client is not necessarily the latest.

The training process of CSAFL is shown in Algorithm 2. CSAFL first initializes the group models. In each communication round, the selected clients update the group model of their own group, as shown in Fig. 3. Consider one group and the clients selected from it in one communication round. The update process within the group is divided into the following five steps:

  1. The group broadcasts the initial model to the selected clients and passes along the version number of the group model.

  2. After the clients receive the latest group model, each client independently updates the version number of its local model.

  3. Clients asynchronously update the group model according to their computing capacities and communication conditions. Each asynchronous update increases the version number of the group model. Before an update operation, the difference between the version number of the group model and that of the client's local model is calculated. If the difference is greater than the tolerance of gradient staleness, the clients whose time budgets are not exhausted are forced to update synchronously.

  4. After each synchronous update, all clients download the same group model.

  5. Repeat steps 3 and 4 until the time budget of the communication round is exhausted.

Similarly, when other clients are selected, the training process within the group is the same. Each group completes its own intra group update without interference. Each group has the same number of communication rounds.
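The five steps above can be sketched as a small event-driven simulation of one group during one communication round. This is only an illustration under stated assumptions: each client has a fixed latency per local update, only version numbers are tracked (no model parameters), a forced synchronization waits for every client that can still finish within the budget, and the names `run_group_round`, `latencies`, and `staleness_tolerance` are hypothetical.

```python
import heapq

def run_group_round(latencies, time_budget, staleness_tolerance):
    """Simulate one round inside one group; return (time, kind, clients) events."""
    group_version = 0
    # Steps 1-2: every client receives the group model and records its version.
    local_version = {k: group_version for k in latencies}
    pending = [(lat, k) for k, lat in latencies.items()]  # (finish time, client)
    heapq.heapify(pending)
    events = []
    while pending:
        finish, k = heapq.heappop(pending)
        if finish > time_budget:
            break  # the earliest finisher is too late, so everyone else is too
        staleness = group_version - local_version[k]
        if staleness > staleness_tolerance:
            # Step 3, forced synchronization: wait for all clients that can
            # still finish within the budget, then aggregate once.
            waiting = [(f, c) for f, c in pending if f <= time_budget]
            sync_time = max([finish] + [f for f, _ in waiting])
            members = sorted([k] + [c for _, c in waiting])
            group_version += 1
            events.append((sync_time, "sync", members))
            # Step 4: all participants download the same group model and restart.
            pending = []
            for c in members:
                local_version[c] = group_version
                heapq.heappush(pending, (sync_time + latencies[c], c))
        else:
            # Step 3, asynchronous update: the client refreshes the group model alone.
            group_version += 1
            local_version[k] = group_version
            events.append((finish, "async", [k]))
            heapq.heappush(pending, (finish + latencies[k], k))
    return events

events = run_group_round({"A": 1, "B": 2, "C": 7}, time_budget=20, staleness_tolerance=2)
for event in events:
    print(event)
```

With a very large tolerance the loop degenerates into purely asynchronous updates, and with a negative tolerance every update becomes a synchronous one, so the tolerance acts as the knob between the two regimes.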

We cover more details of CSAFL in the following four sections.

Fig. 3: The framework of CSAFL. In a group training process, group 1 randomly selects three clients, A, B, C. The up arrow indicates that the client updates the local model to the group model, and the down arrow indicates that the client downloads the parameters of the group model

IV-B Clustering Strategy

Clustering Algorithm

As shown in Algorithm 2, we apply spectral clustering in this paper. Spectral clustering is well suited to our framework for the following reasons:

  • It is more general than the K-Means clustering algorithm [25]; concretely, it is suitable for both convex and non-convex sets.

  • It is not very sensitive to stragglers.

  • It uses the Laplacian Eigenmaps for high-dimensional data, which reduces the load of calculation.

Similarity Metrics

The traditional clustered federated learning (CFL) framework is based on the similarity of sub-optimization tasks and only considers the similarity of gradient direction [2] [28]. In FL, however, this is not comprehensive. For example, in asynchronous FL, assume that the network conditions are the same and that the transmission capacities of the devices are fixed. If a device with high computing power and a device with poor computing power are placed in the same group, the faster device will constantly refresh the global model with its own local updates, and the result of model training will be biased towards the faster device. In synchronous FL, this grouping strategy makes model training inefficient. Similarly, differences in the amount of data on the devices and in network conditions make this grouping strategy harmful to the performance of the model. Therefore, the amount of data, the computing power of the device, and the network condition between the device and the server may also be indicators of the devices’ similarity. These three indicators are all reflected in the local model update latency of the client, so we use two metrics to measure the clients’ similarity: the gradient direction and the latency of the model update. The local update latency of a client can be divided into two parts, computation latency and communication latency, and we define the following three delay models:

IV-B1 Computation Latency

To quantify the randomness of the client’s computing capabilities, we use a shifted exponential distribution:

$$P\left[t_k^{cmp} < t\right] = \begin{cases} 1 - e^{-\frac{\mu_k}{n_k}\left(t - a_k n_k\right)}, & t \ge a_k n_k, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $a_k$ and $\mu_k$ are parameters related to the computing capabilities of client $k$; they represent the maximum and the fluctuation of the client's computation capabilities, respectively. $n_k$ is the size of the dataset on client $k$, and $t_k^{cmp}$ is the computation latency of client $k$. Because the computing capability of the physical facility where the group is located is much greater than that of the client, we ignore the delay due to group model aggregation. Based on (3), we can get the mathematical expectation of the computation latency of client $k$ as follows:

$$\mathbb{E}\left[t_k^{cmp}\right] = n_k \left(a_k + \frac{1}{\mu_k}\right). \tag{4}$$
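As a numerical illustration of a shifted exponential latency model, the sketch below draws computation latencies and compares their empirical mean with the closed-form expectation. The parameterization (a deterministic shift `a_k * n_k` plus an exponential tail with rate `mu_k / n_k`) is one common form and is an assumption here, as are the function names and the example parameter values.

```python
import random

def sample_computation_latency(a_k, mu_k, n_k, rng):
    """Draw one latency: shift a_k * n_k plus an exponential tail with rate mu_k / n_k."""
    return a_k * n_k + rng.expovariate(mu_k / n_k)

def expected_computation_latency(a_k, mu_k, n_k):
    """Mean of the shifted exponential above: shift plus the tail mean n_k / mu_k."""
    return n_k * (a_k + 1.0 / mu_k)

# Hypothetical client: 100 samples, illustrative capability parameters.
rng = random.Random(1)
samples = [sample_computation_latency(0.01, 2.0, 100, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
print(empirical_mean, expected_computation_latency(0.01, 2.0, 100))
```

Using the expectation rather than a single random draw gives each client one stable latency value, which is convenient when the value is used as a clustering feature.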

1:number of all clients , number of selected clients per round , set of groups , learning rate , local minibatch size , number of local epochs ,global initial model , time budget of each communication round , number of round , delay threshold per round .
2:updated group model parameters .
3:procedure FedSpecAsync Training
4:     Initialize group model.
5:     for all , initial to
6:      .
7:     for each communication round  do
8:          Server selects random clients from all clients .
9:         for all , initial to , .
10:         for each group in parallelly do
11:              for all , initial to , to , to
12:              for each client in parallelly do
13:                  while  is not exhausted do
14:                       .
15:                       if  then
16:                            the of is not exhausted, .
17:                            SynUpdates()
18:                       else
19:                            ModelUpdates().
20:                           .
21:                           .
22:                           .
23:                           .                                                                      
25:function ModelUpdates()
26:      split dataset of client into batches of size .
27:     for each local epoch from to  do
28:         for each local batch  do
29:              .               
30:     return .
32:function SynUpdates()
33:     for each client in parallelly do
34:          ModelUpdates()
35:         .      
36:     .
37:     .
38:     for each client in parallelly do
39:         .
40:         .
41:         .
42:         .      
Algorithm 2 Clustered Semi-Asynchronous Federated Learning
1:number of all clients , learning rate , local minibatch size , number of local epochs ,global initial model , number of group pre-training hyper-parameter , weight of time .
2:set of groups
3:procedure Grouping Clients
4:      Calculate the similarity matrix (Algorithm 4)
5:      SpectralClustering()
7:     return
Algorithm 3 Grouping Clients

IV-B2 Communication Latency

We consider a communication scenario in which all clients participating in training are within the cell radius of the central server; clients that are out of range cannot participate in training. There is path loss on the transmission link. Given an FDMA system with a fixed total bandwidth, the signal-to-noise ratio (SNR) of client $k$ is defined as follows [21]:


where the terms are the transmission power from client $k$ to its group $g$, the thermal noise variance, and the path loss between client $k$ and group $g$. We define the path loss as follows:


where the path loss depends on the distance between client $k$ and group $g$.

Based on (5) and (6), we define the transmission speed $r_k$ from client $k$ to group $g$ as follows:


where the remaining term is the bandwidth allocated to client $k$. Based on (7), we define the communication latency of a local update of client $k$ as:

$$t_k^{com} = \frac{M}{r_k}, \tag{8}$$

where $M$ is the size of the model update. Because the transmission power of the physical facility where the group is located is relatively large, we ignore the communication latency from the group to client $k$.
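The chain from path loss to SNR to transmission speed to communication latency can be sketched as follows. Every modeling choice here is an assumption made for illustration rather than the paper's definition: a log-distance path loss with made-up constants, an SNR computed in decibels, and a Shannon-capacity rate on the bandwidth allocated to the client.

```python
import math

def path_loss_db(distance_m, ref_loss_db=40.0, exponent=3.0):
    """Log-distance path loss relative to 1 m; both constants are illustrative."""
    return ref_loss_db + 10.0 * exponent * math.log10(distance_m)

def snr_db(tx_power_dbm, noise_dbm, distance_m):
    """SNR in dB: transmit power minus path loss minus noise power."""
    return tx_power_dbm - path_loss_db(distance_m) - noise_dbm

def rate_bps(bandwidth_hz, snr_in_db):
    """Shannon rate for the bandwidth allocated to one client."""
    return bandwidth_hz * math.log2(1.0 + 10.0 ** (snr_in_db / 10.0))

def communication_latency_s(model_bits, bandwidth_hz, tx_power_dbm, noise_dbm, distance_m):
    """Upload time of one model update: size divided by transmission speed."""
    snr = snr_db(tx_power_dbm, noise_dbm, distance_m)
    return model_bits / rate_bps(bandwidth_hz, snr)

# Hypothetical setting: 1 Mbit update, 1 MHz per client, 20 dBm transmit power.
near = communication_latency_s(1e6, 1e6, 20.0, -100.0, 50.0)
far = communication_latency_s(1e6, 1e6, 20.0, -100.0, 500.0)
print(near, far)
```

In this sketch a client farther from its group experiences a larger path loss, a lower rate, and therefore a longer upload, which is how the network condition would enter the model update latency.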

IV-B3 Model Update Latency

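We cluster the clients to be classified only once, so we take the mathematical expectation of each device’s computation latency. Therefore, for one model update, the model update latency of client $k$ is the sum of its expected computation latency and its communication latency:

$$t_k = \mathbb{E}\left[t_k^{cmp}\right] + t_k^{com}, \tag{9}$$

where $t_k^{cmp}$ and $t_k^{com}$ are the computation latency and the communication latency of client $k$, respectively.

Inspired by the CFL papers [2] [28], we calculate the cosine similarity of the gradient updates between clients to obtain their similarity. We define the cosine similarity between client $i$ and client $j$ as:

$$\cos(i, j) = \frac{\langle \Delta w_i, \Delta w_j \rangle}{\lVert \Delta w_i \rVert \, \lVert \Delta w_j \rVert}, \tag{10}$$

where $\Delta w_i$ is the vector of the gradient update of client $i$ and, similarly, $\Delta w_j$ is the vector of the gradient update of client $j$.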
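A minimal sketch of the cosine similarity between two clients' updates is given below. The helper `flatten`, which concatenates per-layer updates into one vector, and the convention of returning 0 for a zero-length update are assumptions made for illustration.

```python
import math

def flatten(layers):
    """Concatenate per-layer update lists into one flat vector."""
    return [value for layer in layers for value in layer]

def cosine_similarity(u, v):
    """Cosine of the angle between two update vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two hypothetical clients whose updates point in the same direction but differ in scale.
update_i = flatten([[1.0, 2.0], [3.0]])
update_j = flatten([[2.0, 4.0], [6.0]])
print(cosine_similarity(update_i, update_j))
```

Because the measure ignores the magnitude of the updates, two clients with different data sizes but the same learning objective can still be recognized as similar.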

1:(same as Algorithm 3)
2:similarity matrix
3:procedure Calculate
4:      set of all clients
5:      NormalizeTime()
6:     for each client in parallelly do
7:          PreTrainClient()      
8:     for each client in parallelly do
9:          Similarity()
11:     return
13:function Similarity()
14:     .
15:     for each client in parallelly do
16:           CosineSimilarity()
18:     return
20:function PreTrainClient()
22:      split dataset of client into batches of size .
23:     for each local epoch from to  do
24:         for each local batch  do
25:              .               
27:     return  Flatten()
29:function NormalizeTime()
32:     for each client in parallelly do
Algorithm 4 Similarity Matrix Calculation

Building Affinity Matrix

Before we build the affinity matrix, we need to build the similarity matrix. So that latencies with small values do not lose their effect, we normalize the latencies. As shown in Algorithm 4, we first calculate the variance and the mean of the model update latency of the clients to be classified, and then normalize the model update latency of each client (function NormalizeTime in Algorithm 4):


Finally, we are given a hyper-parameter weight, which represents the trade-off between the model update latency and the gradient update of the same client.

We first use (10) to calculate the cosine similarities between a client and all clients to be classified, and collect them in a vector whose length is the number of clients to be classified. We then concatenate this vector with the client's weighted, normalized model update latency. The resulting vector is used as the client’s data for classification. The similarity matrix is constructed by computing this vector for all clients to be classified. Finally, Gaussian similarity is used to construct an affinity matrix between clients based on the row vectors of the similarity matrix. The formula is as follows:


In this paper, we use the average aggregation strategy of FedAvg.
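The construction of the similarity matrix and the affinity matrix described in this subsection can be sketched as follows. Several details are assumptions made for illustration: z-score normalization of the latencies, a Gaussian kernel of the form exp(-d^2 / (2 sigma^2)) with a free width `sigma`, and all function names and example values.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def normalize(latencies):
    """Z-score normalization of model update latencies (an assumed choice)."""
    mean = sum(latencies) / len(latencies)
    std = math.sqrt(sum((t - mean) ** 2 for t in latencies) / len(latencies))
    return [(t - mean) / std if std else 0.0 for t in latencies]

def similarity_rows(updates, latencies, alpha):
    """Row i = cosine similarities of client i to all clients + weighted latency."""
    scaled = [alpha * t for t in normalize(latencies)]
    return [[cosine(u, v) for v in updates] + [scaled[i]]
            for i, u in enumerate(updates)]

def gaussian_affinity(rows, sigma=1.0):
    """Affinity between two clients from the distance between their rows."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [[math.exp(-sq_dist(p, q) / (2.0 * sigma ** 2)) for q in rows]
            for p in rows]

# Four hypothetical clients: two fast ones with similar updates, two slow ones.
updates = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
latencies = [1.0, 1.2, 5.0, 5.5]
A = gaussian_affinity(similarity_rows(updates, latencies, alpha=0.5))
for row in A:
    print([round(value, 3) for value in row])
```

A matrix of this kind could then be handed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering` with `affinity="precomputed"`, to obtain the client groups.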

To verify our CSAFL framework is reasonable, we will evaluate it based on several real federated datasets in the next section.

V Evaluation

In this section, we introduce the experimental results of the CSAFL framework. We evaluate our experiments on four open federated datasets. We show the details of our experiments in Section V-A. Then, in Section V-B, we present the performance improvement of our method compared to the baselines. To further verify the effectiveness of our grouping strategy and update strategy, we design comparative experiments, which are introduced in Section V-C and Section V-D, respectively. For comparison, we fix the time budget hyper-parameter in all experiments. Specifically, each communication round in the training process of each experiment has the same time budget. Our code is based on TensorFlow.

Fig. 4: The accuracy curves of CSAFL, R-FedAvg, baselines and NoG-FedAvg on MNIST and FEMNIST. 10K=10000, 15K=15000, unit:ms.
Fig. 5: The frequency distribution of the clients’ idle time. 10K=10000, 15K=15000, unit: ms.

V-A Experimental Setup

The FL settings used in this paper are described in Section III. In addition, we set a model delay threshold. We evaluate our experiments on four federated datasets: two image classification datasets, a synthetic dataset, and a sentiment analysis dataset. We use a convex multinomial logistic regression (MCLR) model for the first three datasets and a Long Short-Term Memory (LSTM) model for the sentiment analysis task. The statistical information of the datasets and models is shown in Table I. The details are as follows:

Dataset-Time Budget  T-FedAvg  TA-FedAvg  G-FedAvg  GA-FedAvg  R-FedAvg  NoG-FedAvg  CSAFL
MNIST-10000          90.0      86.1       96.3      95.8       88.0      89.9        96.2
MNIST-15000          90.0      86.0       96.2      95.6       88.3      89.6        96.2
FEMNIST-10000        77.1      55.1       88.6      77.1       62.4      74.1        84.2
FEMNIST-15000        77.4      51.1       89.7      77.7       63.0      77.2        85.5
Synthetic-10000      26.6      33.0       88.1      64.8       31.9      40.5        66.2
Synthetic-15000      20.4      53.6       71.4      62.3       35.6      52.9        64.1
Sent140-10000        71.3      62.3       71.0      63.6       66.4      69.5        68.0
Sent140-15000        71.1      62.5       69.7      62.8       65.3      70.1        69.0
TABLE II: Test accuracy of CSAFL and its comparison algorithms on the four federated datasets

Datasets and Models. Our experiments are based on four non-IID datasets which are class-imbalanced and data-imbalanced, and we leverage the appropriate models to train these datasets, the details follow as:

  • MNIST [15]: a handwritten digits data set, which is divided into to the categories of numbers, each data is a flatted dimensional (x) pixel image. According to the power law, we assign data to clients, and each client has two types of digits. We use the MCLR model, and the input of the model is pixel images, and the output is labels of digitals.

  • FEMNIST: a class federated extended MNIST dataset, which is based on EMNIST [7] dataset construction, and only sampling to these types of lowercase characters. We divide data into clients by the power law, and each client contains five types. We also use the MCLR model, and the input and output of the model are similar to MNIST.

  • Synthetic: a synthetic federated dataset proposed by Shamir et al. [30]. We fix the generation parameters (a, b) and divide the data among clients. As with the previous datasets, we use the MCLR model.

  • Sentiment140 (Sent140) [27]: a sentiment analysis dataset built from the emoticons that Twitter users include in their tweets, with each user treated as a client. We use a two-layer logistic regression model with hidden units to classify the sentiment of each user's tweets.
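The partitioning described for MNIST and FEMNIST (heavy-tailed client sizes, only a few classes per client) can be illustrated with a minimal sketch. This is not the authors' preprocessing script: the function name `power_law_partition`, the Pareto draw used to approximate the power law, and the parameter `alpha` are all assumptions made for illustration.

```python
import numpy as np

def power_law_partition(labels, num_clients, classes_per_client, alpha=1.5, seed=0):
    """Split sample indices among clients (hypothetical helper).

    Each client receives samples from at most `classes_per_client` classes,
    and client sizes follow a heavy-tailed distribution (Pareto draw, an
    assumption standing in for the paper's power law).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One shuffled pool of unused sample indices per class.
    pools = {c: rng.permutation(np.flatnonzero(labels == c)).tolist() for c in classes}
    # Heavy-tailed relative sizes, scaled to the dataset size.
    weights = rng.pareto(alpha, size=num_clients) + 1.0
    sizes = (weights / weights.sum() * len(labels)).astype(int)
    sizes = np.maximum(sizes, classes_per_client)
    clients = []
    for k in range(num_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        per_class = int(sizes[k]) // classes_per_client
        idx = []
        for c in chosen:
            take = min(per_class, len(pools[c]))  # a pool may run dry
            idx.extend(pools[c][:take])
            pools[c] = pools[c][take:]
        clients.append(np.array(idx, dtype=int))
    return clients

if __name__ == "__main__":
    toy_labels = np.repeat(np.arange(10), 100)  # 10 classes, 100 samples each
    parts = power_law_partition(toy_labels, num_clients=20, classes_per_client=2)
    print(sorted(len(p) for p in parts))
```

Because every index is drawn from a per-class pool without replacement, no sample is shared between clients, which keeps the resulting client datasets disjoint.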

Baseline. We have two baseline methods:

  • T-FedAvg: the synchronous method FedAvg with a time budget.

  • TA-FedAvg: the asynchronous method based on T-FedAvg.

Comparison methods. To illustrate the performance of our method clearly, we design six comparison experiments. Their FL settings are similar to those of our method. We elaborate on them in the following sections.

Metrics. We have two metrics:

  1. Frequency of clients’ idle time: We record each client’s idle time during each synchronous training round. To evaluate how effectively the CSAFL framework mitigates the straggler effect, we measure the frequency of clients that wait for more than a set fraction of the time budget.

  2. Accuracy: The CSAFL framework is based on a clustering algorithm, and its extended comparison experiments are also divided into groups for testing. The accuracy of each client is computed with the group model of the group to which the client belongs, so we use the weighted test accuracy over the group models to express the overall performance of the grouped experiments intuitively.
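The two metrics above can be sketched as follows. This is an illustrative reading rather than the authors' evaluation code: the function names, the `fraction` parameter (the exact threshold is not given here), and the choice to weight each group by its number of test samples are assumptions.

```python
def idle_frequency(idle_times, time_budget, fraction):
    """Share of clients whose idle time exceeds `fraction` of the time budget.

    `fraction` is a hypothetical parameter; idle times and the budget must
    use the same unit.
    """
    threshold = fraction * time_budget
    waiting = sum(1 for t in idle_times if t > threshold)
    return waiting / len(idle_times)


def weighted_accuracy(group_results):
    """Weighted test accuracy over groups.

    `group_results` is a list of (num_test_samples, accuracy) pairs, one per
    group model. Weighting by test-sample count is an assumption.
    """
    total = sum(n for n, _ in group_results)
    return sum(n * acc for n, acc in group_results) / total


if __name__ == "__main__":
    # Four clients, a time budget of 10000, threshold at half the budget.
    print(idle_frequency([100, 6000, 9000, 200], time_budget=10000, fraction=0.5))
    # Two groups with 100 and 300 test samples.
    print(weighted_accuracy([(100, 0.9), (300, 0.7)]))
```

Weighting by sample count means that a large group with a mediocre model pulls the overall figure down more than a small group does, which matches the intent of reporting one number for a grouped experiment.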

V-B Experimental Results

Table II presents the experimental results of our framework and the baseline methods. The results show that our method is superior to the baseline methods on MNIST, FEMNIST, and Synthetic, and the improvement in test accuracy is particularly large on Synthetic. On this dataset, the accuracy of TA-FedAvg is also better than that of T-FedAvg, which indicates the influence of the straggler effect on synchronous FL. Although the average test accuracy of CSAFL is not better than that of T-FedAvg on Sent140, we can observe from Fig. 4 that the group curve of CSAFL is superior.
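The comparison with the baselines can be checked directly from Table II. The short script below copies the T-FedAvg, TA-FedAvg, and CSAFL columns of that table and computes the per-row accuracy differences; the names `TABLE_II` and `gains` are introduced here only for illustration.

```python
# Test accuracy copied from Table II as (T-FedAvg, TA-FedAvg, CSAFL).
TABLE_II = {
    "MNIST-10000": (90.0, 86.1, 96.2),
    "MNIST-15000": (90.0, 86.0, 96.2),
    "FEMNIST-10000": (77.1, 55.1, 84.2),
    "FEMNIST-15000": (77.4, 51.1, 85.5),
    "Synthetic-10000": (26.6, 33.0, 66.2),
    "Synthetic-15000": (20.4, 53.6, 64.1),
    "Sent140-10000": (71.3, 62.3, 68.0),
    "Sent140-15000": (71.1, 62.5, 69.0),
}


def gains(table):
    """Return CSAFL's gain over (T-FedAvg, TA-FedAvg) for every table row."""
    return {
        name: (round(csafl - t, 1), round(csafl - ta, 1))
        for name, (t, ta, csafl) in table.items()
    }


if __name__ == "__main__":
    for name, (vs_t, vs_ta) in gains(TABLE_II).items():
        print(f"{name:16s} vs T-FedAvg: {vs_t:+5.1f}   vs TA-FedAvg: {vs_ta:+5.1f}")
```

On these numbers, the gain over TA-FedAvg ranges from 5.7 (Sent140-10000) to 34.4 (FEMNIST-15000), while the gain over T-FedAvg is negative only in the two Sent140 rows.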

The test accuracy in each communication round is shown in more detail in Fig. 4 and Fig. 6. On FEMNIST, we observe that our method converges roughly as fast as synchronous FL and faster than asynchronous FL. That is, our CSAFL framework can converge within a finite number of communication rounds. On MNIST, on the other hand, the convergence rate of CSAFL is about the same as that of T-FedAvg and TA-FedAvg, possibly because different model update strategies do not cause significant divergence of gradients on MNIST.

Fig. 6: The accuracy curves of G-FedAvg, GA-FedAvg on MNIST and FEMNIST. 10K=10000, 15K=15000, unit: ms.

V-C Effects of The Grouping Strategy

To demonstrate the effectiveness of the grouping strategy, we set up two groups of comparative experiments, as follows:

  • Under the same update strategy, we compare CSAFL with R-FedAvg, a random grouping method based on T-FedAvg. As shown in Table II, the test accuracy of CSAFL is higher than that of R-FedAvg on MNIST, FEMNIST, and Synthetic. As for Sent140, we can observe in Fig. 4 that the curve of CSAFL is higher than that of R-FedAvg in overall accuracy. In short, our grouping strategy is effective.

  • Similarly, under the same update strategy, we design NoG-FedAvg, a variant without grouping. The results in Fig. 4 and Fig. 6 show that the curve of CSAFL fluctuates less than that of NoG-FedAvg.

V-D Effects of The Update Strategy

To verify the effectiveness of the update strategy, we design two groups of comparative experiments:

  • With grouping, we design a synchronous and an asynchronous FL variant based on FedAvg, named G-FedAvg and GA-FedAvg, respectively, to compare different update strategies. As shown in Table II, we can observe that the experimental results of our method are better than those of the other update strategies.

  • Without grouping, we use GA-FedAvg and T-FedAvg to compare with TA-FedAvg, and the results show the effectiveness of our strategy.

V-E Effects of The Update Strategy

To verify that CSAFL can mitigate the straggler effect, Fig. 5 shows the frequency distribution of clients’ idle time within the time budget. The results show that CSAFL can significantly mitigate the straggler effect. In particular, on MNIST, FEMNIST, and Sent140, the frequency of clients’ idle time drops close to zero. On Synthetic, we can observe that the frequency of idle time for T-FedAvg’s clients remains comparatively high, which also supports the conclusion in Section D. That is, the straggler effect affects the accuracy of synchronous updates.

VI Conclusion

In this work, we propose a new clustered semi-asynchronous federated learning (CSAFL) framework, which can effectively mitigate the straggler effect and improve the accuracy of asynchronous FL. Our evaluation on four datasets shows that CSAFL is better than asynchronous FL in accuracy and convergence speed, and that it is an effective solution to the straggler effect in synchronous FL. We further demonstrate the effectiveness of the grouping strategy and the update mechanism.

In the future, we will explore the following directions:

  1. The advantages of our framework in privacy protection.

  2. We will further expand the trade-off of time and direction; personalized learning beta is based on various data sets.

  3. We will explore a better strategy to aggregate global models.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, and M. Devin (2016) TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pp. 265–283. Cited by: §V.
  • [2] C. Briggs, Z. Fan, and P. Andras (2020) Federated learning with hierarchical clustering of local updates to improve training on non-iid data. In 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, pp. 1–9. Cited by: §IV-B3, §IV-B.
  • [3] S. Caldas, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar (2018) Leaf: a benchmark for federated settings. arXiv preprint arXiv:1812.01097. Cited by: §III.
  • [4] P. K. Chan, M. D. F. Schlag, and J. Y. Zien (1994) Spectral k-way ratio-cut partitioning and clustering. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 13 (9), pp. 1088–1096. Cited by: §I.
  • [5] Y. Chen, X. Sun, and Y. Jin (2020) Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation. IEEE Trans. Neural Networks Learn. Syst. 31 (10), pp. 4229–4238. Cited by: §II-B.
  • [6] Y. Chen, Y. Ning, and H. Rangwala (2019) Asynchronous online federated learning for edge devices. CoRR abs/1911.02134. Cited by: §I, §II-B.
  • [7] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik (2017) EMNIST: extending mnist to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. Cited by: 2nd item.
  • [8] C. Xie, S. Koyejo, and I. Gupta (2019) Asynchronous federated optimization. CoRR abs/1903.03934. Cited by: §I, §II-B.
  • [9] S. Dhakal, S. Prakash, Y. Yona, S. Talwar, and N. Himayat (2020) Coded federated learning. CoRR abs/2002.09574. Cited by: §II-B.
  • [10] M. Duan, D. Liu, X. Chen, Y. Tan, J. Ren, L. Qiao, and L. Liang (2019) Astraea: self-balancing federated learning for improving classification accuracy of mobile deep learning applications. In Proceedings of the IEEE 37th International Conference on Computer Design (ICCD), pp. 246–254. Cited by: §I.
  • [11] C. Fang, Y. Guo, N. Wang, and A. Ju (2020) Highly efficient federated learning with strong privacy preservation in cloud computing. Comput. Secur. 96, pp. 101889. Cited by: §I.
  • [12] J. Feng, C. Rong, F. Sun, D. Guo, and Y. Li (2020) PMF: A privacy-preserving human mobility prediction framework via federated learning. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4 (1), pp. 10:1–10:21. Cited by: §I.
  • [13] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran (2020) An efficient framework for clustered federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §III.
  • [14] M. Hao, H. Li, G. Xu, S. Liu, and H. Yang (2019) Towards efficient and privacy-preserving federated deep learning. In 2019 IEEE International Conference on Communications, ICC 2019, Shanghai, China, May 20-24, 2019, pp. 1–6. Cited by: §I.
  • [15] Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2. Cited by: 1st item.
  • [16] K. Lee, M. Lam, R. Pedarsani, D. S. Papailiopoulos, and K. Ramchandran (2018) Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory 64 (3), pp. 1514–1529. Cited by: §IV-B1.
  • [17] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine 37 (3), pp. 50–60. Cited by: §I.
  • [18] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. In Proceedings of the 3rd SysML Conference, Cited by: §II-B.
  • [19] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2020) On the convergence of fedavg on non-iid data. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Cited by: §I.
  • [20] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2020) On the convergence of fedavg on non-iid data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §II-B.
  • [21] L. Liang, G. Feng, and Y. Zhang (2011) Integrated interference coordination for relay-aided cellular ofdma system. In 2011 IEEE International Conference on Communications (ICC), Cited by: §IV-B2.
  • [22] X. Lu, Y. Liao, P. Liò, and P. Hui (2020) Privacy-preserving asynchronous federated learning mechanism for edge network computing. IEEE Access 8, pp. 48970–48981. Cited by: §II-B.
  • [23] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1273–1282. Cited by: §I, §I, §II-A, §II-B, §III.
  • [24] V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, and G. Srivastava (2021) A survey on security and privacy of federated learning. Future Gener. Comput. Syst. 115, pp. 619–640. Cited by: §I.
  • [25] D. Pollard (1982) Quantization and the method of k -means. IEEE Trans. Inf. Theory 28 (2), pp. 199–204. Cited by: 1st item.
  • [26] J. Ren, G. Yu, and G. Ding (2021) Accelerating DNN training in wireless federated edge learning systems. IEEE J. Sel. Areas Commun. 39 (1), pp. 219–232. Cited by: §II-B.
  • [27] T. Sahni, C. Chandak, N. R. Chedeti, and M. Singh (2017) Efficient twitter sentiment classification using subjective distant supervision. In 9th International Conference on Communication Systems and Networks, COMSNETS 2017, Bengaluru, India, January 4-8, 2017, pp. 548–553. Cited by: 4th item.
  • [28] F. Sattler, K. Müller, T. Wiegand, and W. Samek (2020) On the byzantine robustness of clustered federated learning. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pp. 8861–8865. Cited by: §IV-B3, §IV-B.
  • [29] T. Sery, N. Shlezinger, K. Cohen, and Y. C. Eldar (2020) COTAF: convergent over-the-air federated learning. In IEEE Global Communications Conference, GLOBECOM 2020, Virtual Event, Taiwan, December 7-11, 2020, pp. 1–6. Cited by: §II-B.
  • [30] O. Shamir, N. Srebro, and T. Zhang (2014) Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, JMLR Workshop and Conference Proceedings, Vol. 32, pp. 1000–1008. Cited by: 3rd item.
  • [31] W. Shi, S. Zhou, and Z. Niu (2020) Device scheduling with fast convergence for wireless federated learning. In 2020 IEEE International Conference on Communications, ICC 2020, Dublin, Ireland, June 7-11, 2020, pp. 1–6. Cited by: 2nd item, §I.
  • [32] T. T. Vu, D. T. Ngo, H. Q. Ngo, M. N. Dao, N. H. Tran, and R. H. Middleton (2020) User selection approaches to mitigate the straggler effect for federated learning on cell-free massive MIMO networks. CoRR abs/2009.02031. Cited by: 1st item.
  • [33] W. Wu, L. He, W. Lin, R. Mao, and S. A. Jarvis (2019) SAFA: a semi-asynchronous protocol for fast federated learning with low overhead. CoRR abs/1910.01355. Cited by: §II-B, §III.
  • [34] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §I.