1 Introduction
With the proliferation of digitized data and the rapid advances in deep learning, there is an ever-increasing demand to benefit from the insights that state-of-the-art deep learning models can extract, without sacrificing data privacy [geyer2017differentially, yang2019federated]. Effective training of such models requires a huge amount of data, which is usually distributed over a large heterogeneous network of clients who are unwilling to share their private data. Federated learning (FL) promises to solve these key issues by training a global model collaboratively over decentralized data held at local clients [konevcny2016federated2, mcmahan2017communication]. The key idea of FL is that all clients keep their private data and jointly train a shared global model under the coordination of a central server, as summarized in the FedAvg algorithm [mcmahan2017communication]. In a typical communication round, a fraction of the clients, selected by the server, download the current global model and perform training on their local data. The server then aggregates the updated models transmitted by the selected clients to compute a new global model. However, under real-world non-IID data distributions, FL in heterogeneous networks faces many statistical challenges, such as model divergence and accuracy reduction [smith2017federated, li2019convergence, ghosh2019robust, karimireddy2020scaffold].
Related work: A theoretical analysis of the convergence of FedAvg in non-IID data settings is given in [li2019convergence]. FedProx, a general optimization framework with robust convergence guarantees, was proposed in [li2020prox] to tackle data heterogeneity by introducing a proximal term in the local optimization process. In [karimireddy2020scaffold], a stochastic algorithm, SCAFFOLD, was proposed to reduce the variance and gradient dissimilarity in local FL updates, which yields better communication efficiency and faster convergence. To address fairness issues in FL, q-Fair Federated Learning (q-FFL) [li2019fair] and Agnostic Federated Learning (AFL) [mohri2019agnostic] were proposed as modified federated optimization algorithms that improve the fairness of performance across clients. In [wang2019fedma], a layer-wise Federated Matched Averaging (FedMA) algorithm was proposed for CNN-based and LSTM-based FL models. Ji et al. [ji2019learning] also introduced a layer-wise attentive aggregation method for federated optimization, which achieves lower perplexity and communication cost for neural language models. Other methods have been proposed to tackle data heterogeneity and communication efficiency from the perspectives of asynchronous optimization [xie2019asynchronous], model compression [konevcny2016federated2, sattler2019robust], personalized training [deng2020adaptive], and incentive design [yu2020fairness, kang2019incentive]. There are also several works in FL that tackle performance degradation in non-IID data settings, but for which the data privacy assumption of FL is not strictly adhered to. These works require a small amount of raw data to be shared either among clients [zhao2018federated] or with the global server [jeong2018communication]. FedMix [yoon2021fedmix] avoids the direct sharing of raw data by incorporating mixup data augmentation [zhang2018mixup] into FL, whereby clients share averaged data with each other. Such methods require the exchange of raw or averaged data, which no longer guarantees strict data privacy and also incurs additional communication cost.
In each communication round of the usual FL framework, a fixed fraction of the clients is selected, based on a fixed probability distribution. A good choice for this fraction is not obvious. Small fractions are widely used in existing work [mcmahan2017communication, karimireddy2020scaffold], but with small fractions, data heterogeneity causes fluctuations in training performance and reduces the rate of convergence [li2019convergence]. Large fractions yield more stable performance, as well as a slight acceleration in convergence, but at the expense of a larger communication cost [li2019convergence]. Hence, to obtain stable training performance with relatively low communication cost, we shall instead consider a dynamic fraction method that captures the advantages of both small and large fractions.

The selection probability of each client is a measure of the “importance” of that client. Thus, in heterogeneous networks where different clients have different “importance”, the fixed client selection probability distribution used in the usual FL is typically non-uniform. However, the relative contribution of each client is fluid: it depends both on the client’s local model performance and on the actual aggregated global model, so the “importance” of the clients may vary during training.
With these considerations, we propose an attentionbased adaptive federated learning algorithm, which we shall call AdaFL. The attention mechanism in AdaFL serves to better capture the relative “importance” among the clients, by taking into account the divergence of the local updated models, relative to the global model. Our method gives clients who have worse models a higher chance to participate in training. AdaFL also incorporates a dynamic fraction method to balance the tradeoff between small and large fractions, which essentially represents the tradeoff between communication efficiency and performance stability.
Our contributions are summarized as follows:

We introduce a method to update the client selection probability in FL, by using a distance-based attention mechanism, so as to better capture the heterogeneity of client performance.

To the best of our knowledge, we are the first to propose a dynamic fraction method in FL. We show that by increasing the fraction progressively, we can improve both the performance stability and the final model accuracy, while concurrently achieving lower communication and computation costs.

We propose AdaFL, which combines both methods. We show experimentally that AdaFL outperforms FedAvg with respect to model accuracy, performance stability, and communication efficiency. We also show that AdaFL can be incorporated into various state-of-the-art FL algorithms to enhance their performance.
2 Proposed method
The training of an FL model typically proceeds over hundreds, or even thousands, of communication rounds. In this section, we introduce the preliminaries of FL [mcmahan2017communication], and present the details of our proposed algorithm within each round. For concreteness, we assume in this paper that FL is used to learn a global neural network model.
2.1 Preliminaries of FL and FedAvg Algorithm
The FL framework consists of one central server and multiple clients. Clients participate in training a shared global model under the coordination of the server, without having to share private data. Given $N$ clients, let $n_k$ be the number of datapoints that client $k$ has, and let $n = \sum_{k=1}^{N} n_k$ be the total number of datapoints. In the usual FL setup, the stochastic vector $p = (p_1, \ldots, p_N)$ represents the discrete probability distribution used for client selection, where $p_k$ is the probability that client $k$ is selected in each communication round. The optimization problem that FL tackles can thus be formulated as the minimization problem:

$$\min_{w} F(w) := \sum_{k=1}^{N} \frac{n_k}{n} F_k(w),$$

where $F_k$ is the local loss function of client $k$, whose input $w$ is the set of model parameters of a fixed model architecture.

The FedAvg algorithm summarizes how this FL optimization problem is solved [mcmahan2017communication]. In each communication round, a small group of clients is selected, and local training is performed at each selected client, starting from the same model downloaded from the global server. The main idea is that the gradient updates from the clients are aggregated at the server via a weighted average. At the end of round $t$, the server updates the model parameters with gradient descent via $w^{(t+1)} = w^{(t)} - \eta_t\, g^{(t)}$, where $g^{(t)} = \sum_{k \in S_t} \frac{n_k}{n_{S_t}}\, g_k^{(t)}$, and $\eta_t$ is the learning rate. Here, $g_k^{(t)}$ is the local update of selected client $k$, $S_t$ is the subset of selected clients in round $t$, and $n_{S_t} = \sum_{k \in S_t} n_k$. The number of selected clients is calculated by the formula $m = \max(\lfloor C \cdot N \rfloor, 1)$, where $C$ (satisfying $0 < C \leq 1$) denotes the fraction of selected clients.
In this usual FedAvg algorithm, both the probability distribution for client selection and the fraction $C$ of selected clients are kept invariant throughout all communication rounds. In the next two subsections, we explain how AdaFL varies the client selection probability distribution and the fraction $C$, respectively.
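To make the aggregation step concrete, here is a minimal NumPy sketch of one server-side FedAvg step. The function name `fedavg_step` and its signature are our own illustration, not part of [mcmahan2017communication]; the local updates are assumed to be already flattened into arrays.

```python
import numpy as np

def fedavg_step(w_global, local_updates, client_sizes, lr):
    """One FedAvg server step: combine the local updates g_k of the
    selected clients, weighted by each client's share n_k / n_S of the
    selected data, then apply a gradient-descent step to the global model."""
    n_s = sum(client_sizes)
    # aggregated update: g = sum_k (n_k / n_S) * g_k
    g = sum((n_k / n_s) * g_k for g_k, n_k in zip(local_updates, client_sizes))
    return w_global - lr * g
```

With two equally sized clients, the aggregated update is simply the mean of the two local updates.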
2.2 Attention Mechanism
In any real-world FL implementation on non-IID client data, the relative training performance of the different clients cannot be predicted in advance. Different clients could have different relative importance towards model aggregation, and this importance could vary over the communication rounds. To avoid making idealistic assumptions, we take a data-driven approach: we introduce an attention mechanism that measures the relative importance of the different clients, and adjusts the probability distribution for client selection accordingly, based on real-time local training performance. Our approach differs from existing FL approaches, e.g. [konevcny2016federated2, mcmahan2017communication, li2019fair, ji2019learning], where client selection is not modified and is hence independent of the local training performance of the clients.
We shall use the Euclidean distance as a measure of the model divergence of each local model relative to the global model. The vector $s^{(t)} = (s_1^{(t)}, \ldots, s_N^{(t)})$ of attention scores in round $t$ is identical to the corresponding client selection probability distribution for that round, and it is initialized in the first round as $s_k^{(1)} = n_k / n$.

Specifically, at the beginning of round $t$, the server selects $m$ clients according to the probability distribution $s^{(t)}$. Local training then occurs. We shall denote the local models of the selected clients by $w_{i_1}^{(t)}, \ldots, w_{i_m}^{(t)}$, where $i_k$ is the index of the $k$th client in the selected subset $S_t$ (for round $t$). Each $w_{i_k}^{(t)}$ is a collection of weight matrices for the layers of the neural network. After aggregation at the server, we obtain a new global model, denoted by $w^{(t+1)}$. The process in a typical round is shown in Fig. 1.

Identify each $N_1 \times N_2$ weight matrix as a vector in $\mathbb{R}^{N_1 N_2}$, and concatenate all such vectors, so that each model is represented by a single parameter vector. Thus, the local models and the new global model are represented by the vectors $\theta_{i_1}^{(t)}, \ldots, \theta_{i_m}^{(t)}$ and $\theta^{(t+1)}$, respectively. For selected client $i_k$ in round $t$, we calculate the Euclidean distance between the global and local parameter vectors as follows:

$$D_k^{(t)} = \bigl\| \theta^{(t+1)} - \theta_{i_k}^{(t)} \bigr\|_2. \qquad (1)$$

To reduce the fluctuations in attention scores across consecutive rounds, we incorporate the current attention score in our updating criterion:

$$s_{i_k}^{(t+1)} = \alpha\, s_{i_k}^{(t)} + (1 - \alpha)\, \frac{D_k^{(t)}}{\sum_{j=1}^{m} D_j^{(t)}}, \qquad (2)$$

where $\alpha \in (0, 1)$ represents a decay rate of previous attention score contributions. For an unselected client $k \notin S_t$, we set $s_k^{(t+1)} = \alpha\, s_k^{(t)}$. Note that $s^{(t+1)}$ remains a stochastic vector, since the normalized distances in (2) sum to $1$ over the selected clients. Client selection in round $t+1$ then follows the updated probability distribution $s^{(t+1)}$.
Since our work does not modify the usual federated optimization, our attention mechanism only updates the client selection probability distribution, and does not change the aggregation weights. Moreover, the additional communication cost introduced by our proposed algorithm is negligible. Notice also that, by (1) and (2), a larger Euclidean distance between the global model and the local model of a client increases the probability that this client is selected in the next communication round. Subsequently, more training and computation is done at the corresponding clients, to obtain a better performance. Hence, the overall effect is a fairer scheme, whereby the variance of local training performance among the clients is deliberately reduced.
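The attention-score update of Eqs. (1) and (2) can be sketched in a few lines. This is an illustrative implementation under our notation: `update_attention` and its argument names are hypothetical, and the models are assumed to be already flattened into parameter vectors.

```python
import numpy as np

def update_attention(scores, global_vec, local_vecs, selected, alpha):
    """Update the client-selection distribution per Eqs. (1)-(2).
    `scores`: current attention vector s^(t) (stochastic, length N).
    `global_vec`: flattened new global model theta^(t+1).
    `local_vecs`: dict mapping selected client index -> flattened local model.
    `selected`: indices of the clients chosen in this round."""
    # Eq. (1): Euclidean distance between global and local parameter vectors
    d = {k: np.linalg.norm(global_vec - local_vecs[k]) for k in selected}
    d_total = sum(d.values())
    new = np.empty_like(scores)
    for k in range(len(scores)):
        if k in selected:
            # Eq. (2): decayed old score plus normalized model divergence
            new[k] = alpha * scores[k] + (1 - alpha) * d[k] / d_total
        else:
            # unselected clients: score decays, keeping the vector stochastic
            new[k] = alpha * scores[k]
    return new
```

Because the normalized distances sum to one over the selected subset, the updated vector remains a valid probability distribution.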
2.3 Dynamic Fraction for Client Selection
How do we choose a “good” fraction $C$ for client selection? What counts as “good” depends on how we want to balance the trade-off between communication efficiency and performance stability. A smaller $C$ means a lower communication cost in each communication round, but at the expense of larger fluctuations in training performance and a slower rate of convergence [li2019convergence]; this implies that more training rounds are required to reach a desired performance level. A larger $C$ brings better performance stability and fewer communication rounds, but at the expense of a larger total communication cost. Under the usual assumption in FL that $C$ is fixed throughout training, we would then be forced to choose between communication efficiency and performance stability.
To circumvent this trade-off, we drop the assumption that $C$ is fixed, and propose a dynamic fraction method, which adopts different fractions during different training stages, with the fraction increasing progressively. As an explicit example, depicted in Fig. 2, we begin training with a small fraction $C = 0.1$, and end training with a large fraction $C = 0.5$. (Here, $0.1$ and $0.5$ are arbitrarily selected.) When using gradient descent, our method yields a relatively good performance, even with fewer clients involved at the beginning of training. With an increased amount of local training data in subsequent rounds, the training performance converges more stably. Intuitively, the updated global model gradients get closer to the optimal gradients that reflect the true data distribution, since increasingly many clients (and hence more data) are involved in training as the fraction increases.
To represent dynamic fractions, we shall use $\mathbf{C} = (C_1, \ldots, C_T)$ to denote the vector of fractions, whose $t$th entry $C_t$ is the fraction used in the $t$th communication round. Since $T$ is the total number of communication rounds, $C_1$ (resp. $C_T$) is the starting (resp. ending) fraction used in our dynamic fraction method. For simplicity, we recommend using a fixed number $T' = T/\lambda$ of rounds between consecutive fraction updates, and a fixed increment $\Delta C = (C_T - C_1)/(\lambda - 1)$ when updating the fraction, where $\lambda$ is the desired number of distinct fractions to be used; the vector $\mathbf{C}$ can then be computed accordingly. Observe that in our running example (see Fig. 2), we used $\lambda = 5$, so that each fraction update, which increases the value of $C_t$ by $\Delta C = 0.1$, occurs every $T/5$ communication rounds.

Although we only consider fixed $T'$ and $\Delta C$ for simplicity, it should be noted that our method works for any number $\lambda$ of fraction values and any non-constant increments, and more generally, for any fraction schedule that increases monotonically from $C_1$ to $C_T$. In this paper, we do not attempt to explore the (infinitely many) monotonically increasing schedules that could be used for our proposed dynamic fraction method; in our experiments, the simple use of multiple fraction values as described above already suffices to yield a performance improvement over the use of a constant fraction.
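The fixed-increment schedule described above might be implemented as follows. This is a sketch under our notation; the function name is our own, and $T$ is assumed to be divisible by the number of plateaus.

```python
def fraction_schedule(c_start, c_end, total_rounds, num_fractions):
    """Build the per-round fraction vector (C_1, ..., C_T): the fraction
    increases from c_start to c_end over `num_fractions` equal plateaus,
    each lasting T / num_fractions rounds (divisibility assumed)."""
    step = total_rounds // num_fractions          # T' = T / lambda
    delta = (c_end - c_start) / (num_fractions - 1)  # Delta C
    return [round(c_start + delta * (t // step), 10)
            for t in range(total_rounds)]
```

For example, with `c_start=0.1`, `c_end=0.5` and five fractions, the schedule steps through 0.1, 0.2, 0.3, 0.4, 0.5 in equal plateaus.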
2.4 Algorithm Summary
Adaptive Federated Learning (AdaFL), our proposed method, combines the attention mechanism described in Section 2.2 with the dynamic fraction method described in Section 2.3. The combination of these two components, on top of the usual FedAvg algorithm, incorporates adaptive training adjustments, thereby yielding better communication efficiency together with better performance stability. We give an overview of AdaFL in Algorithm 1.
The key difference of our proposed AdaFL algorithm from the FedAvg algorithm is the adaptive parameter adjustment scheme: parameters are adjusted using real-time information from local training. Hence, the central server also plays a dual role as a resource allocator during training, in addition to its usual role of coordinating the aggregation of local weights in each communication round. The resource allocation can be made fairer by increasing the selection weights of clients with larger model divergence.
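Putting the pieces together, one AdaFL communication round might look like the following sketch. Algorithm 1 is the authoritative description; here `train_local` is a hypothetical stub for the client-side training, and we assume a balanced data split so that the aggregation weights are uniform.

```python
import numpy as np

def adafl_round(scores, n_selected, global_model, train_local, alpha):
    """One AdaFL communication round (sketch): sample clients from the
    attention distribution, train locally, aggregate, then refresh the
    attention scores from each client's model divergence.
    `train_local(k, w)` returns client k's updated flat parameter vector."""
    num_clients = len(scores)
    # sample the selected subset S_t from the attention distribution s^(t)
    selected = np.random.choice(num_clients, size=n_selected,
                                replace=False, p=scores)
    local = {k: train_local(k, global_model) for k in selected}
    # FedAvg-style aggregation (uniform weights for a balanced data split)
    new_global = np.mean([local[k] for k in selected], axis=0)
    # attention refresh: Eq. (1) distances, Eq. (2) score update
    d = {k: np.linalg.norm(new_global - local[k]) for k in selected}
    d_total = sum(d.values())
    new_scores = alpha * scores
    for k in selected:
        new_scores[k] += (1 - alpha) * d[k] / d_total
    return new_global, new_scores / new_scores.sum()  # renormalize for safety
```

In a full implementation, `n_selected` would itself be driven by the dynamic fraction schedule of Section 2.3.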
It should be noted that our proposed algorithm complements existing communication-efficient federated algorithms, such as compression [konevcny2016federated2, sattler2019robust], data augmentation [yoon2021fedmix], and optimization methods for federated learning [li2020prox, karimireddy2020scaffold]. Later, in our experiments, we show how our proposed AdaFL can enhance the performance of existing popular FL algorithms.
3 Experiments
In this section, we describe the details of our experiments, and evaluate the performance of our proposed AdaFL algorithm.
3.1 Experiment Setup
We evaluate our AdaFL algorithm on image classification tasks on two image datasets, MNIST [lecun1998gradient] and CIFAR-10 [krizhevsky2009learning], with neural network models. For experiments on the MNIST dataset, we train a multilayer perceptron (MLP) model (2 hidden layers, each with 200 units and ReLU activation) with a fixed learning rate. For the training data samples, we use the non-IID data partition described in [mcmahan2017communication]. For experiments on the CIFAR-10 dataset, we train a CNN model with the same model architecture as given in [yoon2021fedmix], with an IID data partition, using an initial learning rate that decays across the communication rounds. In all our experiments, across all FL algorithms, we train a local model at each selected client via stochastic gradient descent (SGD) with momentum.

The remaining parameter settings are given as follows. We use $N = 100$ clients for training, with a fixed number of local epochs and a fixed batch size. The initial attention score vector is determined by the local dataset sizes, as described in Algorithm 1. For our attention score update process, we fix the decay rate $\alpha$. For the local dataset size, we consider a balanced data distribution for all experiments, in which every client has the same local dataset size (number of local training samples); this implies that the initial attention score of each client is $1/N$. For the dynamic selection of fractions, we use $C_1 = 0.1$ and $C_T = 0.5$, with increments of $0.1$ (see Fig. 2).

3.2 Ablation Study
We report our ablation study results in Tables 1 and 2, in which we evaluate the performance of our proposed AdaFL on the two datasets, with and without each of the two components of AdaFL: the attention mechanism and the dynamic fraction method. We use FedAvg-0.1 and FedAvg-0.5 as our baselines, which refer to the usual FedAvg algorithm with a constant fraction of $0.1$ and $0.5$, respectively. For illustration, a comparison of AdaFL with both baselines, over all communication rounds of our experiments on MNIST, is given in Fig. 3. In particular, Fig. 3 shows that AdaFL combines the advantages of both FedAvg-0.1 and FedAvg-0.5: it starts training with a smaller communication cost, and ends training with better performance stability and test accuracy.
For both Tables 1 and 2, we write Attn.-0.1 and Attn.-0.5 to mean that we apply only the attention mechanism to FedAvg, with a constant fraction of $0.1$ and $0.5$ respectively, while we write Dyn. FedAvg to mean that we apply only the dynamic fraction method (with the fraction increasing progressively from $0.1$ to $0.5$) to FedAvg, without the attention mechanism.
In Table 1, we report the best test accuracies of all the aforementioned methods on both datasets. It should be noted that our chosen model architectures do not yield state-of-the-art accuracies on the respective datasets; our goal is rather to show that FedAvg can be improved with the two proposed components of AdaFL. Due to the natural random fluctuations in test accuracy over consecutive communication rounds, we also report an “average test accuracy” as a measure of performance stability; see Section 3.2.1 for more details. In Table 2, we report the number of communication rounds and the total communication cost that each method takes to reach a specified target accuracy, chosen to be close to the corresponding best accuracy given in Table 1; see Section 3.2.2 for more details.
For the rest of this subsection, we evaluate the two components of AdaFL with respect to (i) accuracy and performance stability, and (ii) the required number of communication rounds and total communication cost, respectively.
3.2.1 Accuracy Performance
We use the average accuracy and the best accuracy to evaluate performance and convergence stability. To better capture the notion of performance stability, we use the average accuracy over the final rounds of training as a key performance metric. As the results in Table 1 show, in our experiments on both datasets, the methods for which training ends with a larger fraction (AdaFL, Attn.-0.5 and Dyn. FedAvg) have higher average accuracies, and hence better performance stability, while the methods that incorporate the attention mechanism (AdaFL, Attn.-0.1 and Attn.-0.5) have higher best accuracies.
Thus, the attention component increases model accuracy, while the dynamic fraction component improves performance stability. Compared to the two baselines (FedAvg-0.1 and FedAvg-0.5), AdaFL achieves higher accuracies on both MNIST and CIFAR-10 (see Table 1), with better performance stability.
Table 1: Average and best test accuracies (%) on MNIST and CIFAR-10.

Algorithm      MNIST Avg.  MNIST Best  CIFAR-10 Avg.  CIFAR-10 Best
AdaFL          91.13       91.64       74.38          76.17
Attn.-0.1      88.92       91.30       73.13          74.91
Attn.-0.5      91.07       91.58       74.42          75.96
Dyn. FedAvg    90.33       91.19       74.33          75.04
FedAvg-0.1     88.68       91.05       72.88          74.82
FedAvg-0.5     90.40       91.21       73.67          75.31
3.2.2 Communication Efficiency Performance
We define communication cost in terms of relative units, where each relative unit is the cost of transmitting the data representing all parameter updates of a single neural network model, from a single client to the global server, in a single communication round. Given the fraction vector $\mathbf{C}$ and the required number $R$ of communication rounds to reach the target accuracy, we define the total communication cost to be $\sum_{t=1}^{R} m_t$, where $m_t = \max(\lfloor C_t \cdot N \rfloor, 1)$ is the number of clients selected in round $t$; this value is the total number of relative units transmitted across all communication rounds. Hence, better communication efficiency shall mean a lower total communication cost.
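As a sanity check, this cost accounting can be reproduced in a few lines. Assuming $N = 100$ clients and five 200-round fraction plateaus over a 1000-round budget (values inferred from the per-round costs implied by Table 2, not stated explicitly above), the sketch below matches e.g. the FedAvg-0.1 and AdaFL entries of Table 2.

```python
def total_comm_cost(fractions, num_clients, rounds_needed):
    """Total communication cost in relative units: in round t,
    m_t = max(floor(C_t * N), 1) selected clients each transmit one
    model update, so the total cost is the sum of m_t over the rounds
    actually run.  A tiny epsilon guards against float noise in C_t * N."""
    return sum(max(int(f * num_clients + 1e-9), 1)
               for f in fractions[:rounds_needed])
```

For a constant fraction, the cost is simply (clients per round) times (rounds needed); for a dynamic fraction, early rounds are cheap and later rounds expensive.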
Table 2: Number of communication rounds (total communication cost in relative units) to reach the target accuracy.

Algorithm      MNIST 90%     MNIST 91%      CIFAR-10 73%
AdaFL          423 (6690)    761 (18440)    683 (15320)
Attn.-0.1      939 (9390)    1952 (19520)   1571 (15710)
Attn.-0.5      420 (21000)   741 (37050)    635 (31570)
Dyn. FedAvg    951 (20040)   1485 (44250)   1103 (26120)
FedAvg-0.1     1008 (10080)  2528 (25280)   1957 (19570)
FedAvg-0.5     570 (28500)   1232 (61600)   892 (44600)
As Table 2 shows, larger fractions require fewer communication rounds to reach the specified target accuracies, while a small fraction (with its lower communication cost per round) requires more communication rounds to reach stable convergence, i.e. it incurs a larger total communication cost. In comparison, the use of dynamic fractions not only yields faster stable convergence, but also better communication efficiency.
3.3 Performance Evaluation
As discussed earlier in Section 1, FedProx and SCAFFOLD are federated optimization methods, while FedMix is a data augmentation method designed for FL. These algorithms employ a fixed probability distribution for client selection and a fixed fraction throughout training. In this subsection, we report how our proposed AdaFL can be incorporated to further improve these algorithms. Table 3 shows that the incorporation of AdaFL improves both model accuracy and performance stability, while Table 4 shows that the incorporation of AdaFL reduces the required number of communication rounds and total communication cost to reach target accuracy.
Table 3: Average and best test accuracies (%) when AdaFL is incorporated into FedProx, FedMix and SCAFFOLD.

Algorithm         MNIST Avg.  MNIST Best  CIFAR-10 Avg.  CIFAR-10 Best
AdaFL+FedProx     91.67       92.42       74.94          76.24
FedProx-0.1       89.15       91.46       72.88          75.90
FedProx-0.5       90.81       91.55       73.57          76.12
AdaFL+FedMix      90.52       91.30       73.27          75.05
FedMix-0.1        88.37       90.61       71.53          73.43
FedMix-0.5        89.91       91.08       72.42          74.12
AdaFL+SCAFFOLD    90.30       91.52       74.98          75.53
SCAFFOLD-0.1      87.82       89.96       71.62          74.12
SCAFFOLD-0.5      89.73       90.82       73.50          74.77
Table 4: Number of communication rounds (total communication cost in relative units) to reach the target accuracy, when AdaFL is incorporated into FedProx, FedMix and SCAFFOLD.

Algorithm         MNIST 91%      CIFAR-10 73%
AdaFL+FedProx     821 (21600)    721 (16840)
FedProx-0.1       2439 (24390)   1762 (17620)
FedProx-0.5       1084 (54200)   658 (32900)

Algorithm         MNIST 90%      CIFAR-10 72%
AdaFL+FedMix      852 (22600)    698 (15920)
FedMix-0.1        2275 (22750)   1903 (19030)
FedMix-0.5        1241 (62050)   732 (36600)

Algorithm         MNIST 89%      CIFAR-10 72%
AdaFL+SCAFFOLD    794 (19760)    672 (15600)
SCAFFOLD-0.1      2252 (22520)   1981 (19810)
SCAFFOLD-0.5      1034 (51700)   725 (36250)
Overall, AdaFL complements the performance of the three state-of-the-art algorithms on both datasets, with respect to test accuracy and communication efficiency. For test accuracy, the AdaFL-based variants yield better performance on both MNIST and CIFAR-10, for all three algorithms (see Table 3). Moreover, the AdaFL-based variants require the fewest communication rounds in most of the experiments, and have the lowest total communication cost to reach the specified target accuracy in all of the experiments, giving a noticeable reduction in total communication cost relative to the small-fraction variants, and a more significant reduction relative to the large-fraction variants (see Table 4).
These results show that incorporating AdaFL into these state-of-the-art FL algorithms enhances performance in all three aspects: model accuracy, performance stability, and communication efficiency.
4 Conclusion
In this paper, we propose an attention-based federated learning algorithm with a dynamic fraction for client selection, which we call AdaFL. It is a simple algorithm that can easily be incorporated into various state-of-the-art FL algorithms to obtain improvements in several aspects: model accuracy, performance stability, and communication efficiency.
Our detailed ablation study shows that the two components of AdaFL each contribute significantly towards the improvement over the usual FedAvg algorithm, with respect to all three aspects. When incorporated into existing state-of-the-art FL algorithms, AdaFL yields consistently better performance. We foresee that AdaFL can easily be incorporated into subsequent FL algorithms to enhance their performance, especially in non-IID data settings.
Our proposed AdaFL algorithm gives clients who have larger model divergence a higher chance to participate in training. Can AdaFL be used as a stepping stone to develop other methods to improve fairness in FL? We plan to further explore this issue. We also plan to study the attention mechanism in AdaFL, in the context of imbalanced data.
Acknowledgments
This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-015).