Introduction
Federated learning (FL) provides a promising collaboration paradigm by enabling a multitude of participants to construct a joint model without exposing their private training data. Two key challenges in FL are collaborative fairness (participants with disparate contributions should be rewarded differently) and robustness (free-riders should not enjoy the global model for free, and malicious participants should not compromise system integrity).
In terms of collaborative fairness, most current FL paradigms (mcmahan2017communication; kairouz2019advances; yang2019federated; li2019federated) let all participants receive the same final FL model regardless of their contributions in terms of the quantity and quality of their shared parameters, leading to a potentially unfair outcome. In practice, such variations in contributions may arise for a number of reasons, the most obvious being the quality divergence of the data owned by different participants (zhao2019privacy). FL2019 describe a motivating example in finance where several banks may want to jointly build a credit score predictor for small and medium enterprises. The larger banks, however, may be reluctant to train on their high-quality data and share the resulting parameters, because doing so may help their competitors, i.e., the smaller banks, thus eroding their own market share. Due to a lack of collaborative fairness, participants with high-quality and large datasets may be discouraged from collaborating, hindering the formation and progress of a healthy FL ecosystem. We remark that collaborative fairness is different from the fairness concept in machine learning, which is typically defined as mitigating a model's predictive bias towards certain attributes (cummings2019compatibility; jagielski2018differentially). The problem of treating FL participants fairly according to their contributions remains open (FL2019). Furthermore, we note that collaborative fairness is most pertinent to scenarios involving reward allocation (lyu2020threats), such as companies, hospitals or financial institutions, to whom collaborative fairness is of significant concern.
For robustness, the conventional FL framework (mcmahan2017communication) is potentially vulnerable to adversaries and free-riders as it has no safeguard mechanisms. Follow-up works considered robustness through different lenses (blanchard2017machine; fung2018mitigating; bernstein2019signsgd; yin2018byzantine), but none of them provides comprehensive support against all three types of attacks (targeted poisoning, untargeted poisoning and free-riders) considered in this work.
In summary, our contributions include:

We propose a Robust and Fair Federated Learning (RFFL) framework to address both collaborative fairness and Byzantine robustness in FL.

RFFL addresses these two issues by using a reputation system to iteratively calculate the contributions of the participants and rewarding them with models whose performance is commensurate with their contributions.

Under mild conditions, both the server model and participants’ local models in RFFL can converge to the optimum in expectation.

Extensive experiments on various datasets demonstrate that RFFL achieves competitive accuracy and high fairness, and is robust against all three types of attacks investigated in this work.
Related work
We first relate our work to a series of previous efforts on fairness and robustness in FL as follows.
Promoting collaborative fairness has attracted substantial attention in FL. One research line uses incentive schemes combined with game theory, based on the rationale that participants should receive payoffs commensurate with their contributions in order to incentivize good behaviour; representative works include Yangetal:2017IEEE; Gollapudietal:2017; richardson2019rewarding; Yuetal:2020AIES. These works share the property that all participants receive the same final model.

Another research direction addresses resource-allocation fairness in FL by optimizing the performance of the worst-performing device (largest loss/prediction error). For example, Mohri et al. (mohri2019agnostic) proposed a minimax optimization scheme called Agnostic Federated Learning (AFL), which optimizes the performance of the single worst device by weighing participants adversarially. A follow-up work called Fair Federated Learning (FFL) (li2019fair) generalized AFL by reducing the variance of the model performance across participants: in FFL, participants with higher loss are given higher relative weight to achieve less variance in the final performance distribution. This line of work inherently advocates egalitarian equity, which is a different focus from collaborative fairness.

In contrast to the works above, the most recent work by lyu2020towards and a concurrent but independent work by Hwee2020icml are better aligned with collaborative fairness in FL: model accuracy is used as the reward for FL participants, so participants receive models whose performance is commensurate with their contributions. lyu2020towards adopted a mutual evaluation of local credibility mechanism, where each participant privately rates the other participants in each communication round; however, their framework is mainly designed for a decentralized blockchain system and may not be directly applicable to the typical FL setting, which is usually not decentralized. Hwee2020icml proposed using the Shapley value (shapley1953value) to design an information-theoretic contribution evaluation method that examines the participants' data, which may not suit FL settings because the server in FL does not have access to participants' data.
In terms of robustness in FL, blanchard2017machine proposed the MultiKrum method based on the Krum function, which excludes a certain number of gradients furthest from the mean of the collected gradients before aggregation; it is resilient against up to 33% Gaussian Byzantine participants and up to 45% omniscient Byzantine participants. In fung2018mitigating, the authors studied Sybil-based attacks and proposed a method called FoolsGold, based on the insight that Sybils share the same objective function, so their historical gradients are at a smaller angle to each other than to the historical gradient of an honest participant. bernstein2019signsgd proposed a communication-efficient approach called SignSGD, which is robust to arbitrary scaling because the participants upload only the element-wise signs of the gradients, without the magnitudes. A similar method was proposed by yin2018byzantine based on robust statistics of the gradients, specifically the element-wise median, mean and trimmed mean.
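To make the last family of defenses concrete, the element-wise (coordinate-wise) median aggregation can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the implementation of yin2018byzantine:

```python
import numpy as np

def coordinatewise_median(gradients):
    """Aggregate a list of flattened gradient vectors by taking the
    element-wise median, which is insensitive to extreme outliers."""
    return np.median(np.stack(gradients), axis=0)

# Two honest gradients and one adversarial, heavily rescaled one.
honest = [np.array([0.1, -0.2, 0.3]), np.array([0.12, -0.18, 0.28])]
byzantine = np.array([100.0, -100.0, 100.0])
agg = coordinatewise_median(honest + [byzantine])
# In every coordinate the median ignores the rescaled outlier.
```

The median's breakdown point (up to half the inputs can be corrupted) is what gives this rule its Byzantine tolerance, at the cost of discarding magnitude information from the honest majority.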
The RFFL Framework
Our RFFL framework mainly focuses on two important goals in FL: collaborative fairness and robustness. We empirically find that these two goals can be simultaneously achieved through a reputation system.
Collaborative Fairness
Participants collaborate via the FL paradigm to train a machine learning model on their datasets. A natural reward for these participants is a trained machine learning model with high predictive performance. Since the participants may contribute at different levels, it is unfair to allocate the same final model to all of them. We therefore adopt the notion of collaborative fairness from lyu2020towards to address this issue. Our method differs from the original FL framework in that it stipulates that participants receive trained models whose predictive performance is commensurate with their contributions: the higher the quality of a participant's dataset, the better the model they receive. We can thus measure fairness via the Pearson correlation coefficient between the participants' contributions and their rewards. In this work, we represent a participant's contribution via a proxy measure, the test accuracy of their standalone model, based on the fact that the quality of a dataset is reflected in the test accuracy of a model trained on it. Similarly, a participant's reward is represented by the test accuracy of the model they receive from the FL process. Formally, fairness is quantified as in Equation 1 (lyu2020towards):
(1)   $r_{xy} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$

where $x_i$ and $y_i$ represent participant $i$'s test accuracy of the standalone model and of the received model after collaboration respectively, and $s_x$ and $s_y$ are the respective corrected standard deviations. A higher $r_{xy}$ implies better fairness, and vice versa.
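As an illustration, the fairness measure of Equation 1 is simply the sample Pearson correlation between contributions and rewards, and can be computed as follows (the accuracy values below are hypothetical):

```python
import numpy as np

def collaborative_fairness(standalone_acc, received_acc):
    """Pearson correlation between participants' standalone test
    accuracies (proxy for contribution) and the test accuracies of
    the models they receive (reward)."""
    x = np.asarray(standalone_acc, dtype=float)
    y = np.asarray(received_acc, dtype=float)
    return np.corrcoef(x, y)[0, 1]

# Hypothetical accuracies: rewards increase with contributions,
# so fairness lands close to the theoretical maximum of 1.0.
standalone = [0.70, 0.80, 0.85, 0.90]
received   = [0.82, 0.88, 0.91, 0.95]
fairness = collaborative_fairness(standalone, received)
```

Note that the measure is invariant to the absolute accuracy level; it only captures whether better contributors receive better models.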
Collaborative Fairness under (Non-)I.I.D Data. Following the above representation of participants' contributions and rewards, we discuss the relationship between this notion of collaborative fairness and the data distributions over the participants. The data distribution refers to how the participants collect/sample their data, whether or not from the true distribution, either uniformly or in some biased way. A commonly used assumption is that the participants' data are identically and independently distributed (I.I.D) draws from some underlying population, which enables statistical analysis including asymptotic unbiasedness, convergence rates, etc. Under this setting the participants have statistically equivalent datasets and thus, in expectation, equal contributions. Consequently, our notion of collaborative fairness corresponds to egalitarian fairness, i.e., treating/rewarding everyone equally by giving them the same model, as in common FL frameworks such as FedAvg. Under a non-I.I.D data distribution, on the other hand, which is difficult to treat analytically and statistically in general, an empirical approach (lyu2020Collaborative; Hwee2020icml) can be used instead. mcmahan2017communication considered a pathological non-I.I.D split of the MNIST dataset where each participant has examples of at most 2 digits. fung2018mitigating considered similar settings by varying the degree of disjointness among the participants' datasets, where the most non-I.I.D setting corresponds to completely disjoint datasets. These non-I.I.D settings present theoretical challenges for analytically comparing and evaluating the datasets. Moreover, in practice it is infeasible (due to privacy and confidentiality issues) to examine the datasets of all participants. Our empirical approach thus finds applications under non-I.I.D data distributions, because in order to treat the participants fairly we first need to compare their contributions.

Robustness
For robustness, we consider the threat model of Byzantine fault tolerance due to blanchard2017machine.
Definition 1.
Threat Model (blanchard2017machine; yin2018byzantine). In the $t$-th round, an honest participant $i$ uploads its true gradient, while a dishonest participant/adversary can upload arbitrary values:

(2)   $g_i^{(t)} = \begin{cases} \nabla F_i(w^{(t)}) & \text{if participant } i \text{ is honest} \\ * & \text{otherwise} \end{cases}$

where "$*$" represents arbitrary values and $F_i$ represents participant $i$'s local objective function.
In more detail, we investigate three types of attacks: (1) targeted poisoning with a specific objective; (2) untargeted poisoning that aims to compromise the integrity of the system; and (3) free-riders who aim to benefit from the global model without really contributing.
Targeted poisoning. We consider a particular type of targeted poisoning called label-flipping, in which the labels of training examples are flipped to a target class (biggio2011support). For instance, on MNIST a ‘1→7’ flip refers to training on images of ‘1’ but using ‘7’ as the labels.
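A label-flipping adversary of this kind can be sketched in a couple of lines (the `flip_labels` helper is illustrative; the 1→7 choice mirrors the example above):

```python
import numpy as np

def flip_labels(labels, source=1, target=7):
    """Label-flipping poisoning: relabel every 'source' example as
    'target' before local training (here a 1 -> 7 flip)."""
    labels = np.array(labels)
    labels[labels == source] = target
    return labels

poisoned = flip_labels([0, 1, 2, 1, 7])
# every '1' becomes a '7'; other labels are untouched
```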
Untargeted poisoning. We consider three types of untargeted poisoning defined in bernstein2019signsgd. Specifically, after local training and before uploading, the adversary may (i) arbitrarily rescale the gradients; (ii) randomize the element-wise signs of the gradients; or (iii) randomly invert the element-wise values of the gradients.
Free-riders. Free-riders are participants unwilling to contribute their gradients, due to data privacy concerns or computational costs, who nonetheless want to access the jointly trained model for free. There are no specific restrictions on their behavior; they typically upload random or noisy gradients.
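The three untargeted poisoning behaviors and the free-rider's noisy upload can all be sketched as simple transformations of (or substitutes for) the true gradient. This is a NumPy illustration; the rescaling factor is an arbitrary choice, not one taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rescale(grad, factor=-10.0):
    """(i) Arbitrarily rescale the gradient (factor is illustrative)."""
    return factor * grad

def randomize_signs(grad):
    """(ii) Keep the magnitudes but randomize the element-wise signs."""
    return np.abs(grad) * rng.choice([-1.0, 1.0], size=grad.shape)

def invert_values(grad, p=1.0):
    """(iii) Randomly invert element-wise values (here: all of them)."""
    mask = rng.random(grad.shape) < p
    return np.where(mask, -grad, grad)

def free_rider_update(shape):
    """A free-rider uploads noise drawn uniformly from [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=shape)
```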
RFFL Realization via Reputation
Our RFFL makes two important modifications to the conventional FL framework: first in the aggregation rule for the gradients, and then in the download rule for the participants. The most common choice of aggregation rule in FL is FedAvg, i.e., weighted averaging by data size (mcmahan2017communication):

(3)   $w^{(t+1)} = \sum_{i} \frac{n_i}{\sum_{j} n_j}\, w_i^{(t+1)}$

where $n_i$ represents the data size of participant $i$ and $w_i^{(t+1)}$ represents the parameters of participant $i$'s locally updated model. In our RFFL framework, the server adopts a reputation-weighted aggregation rule:

(4)   $g^{(t)} = \sum_{i} \frac{r_i^{(t)}}{\sum_{j} r_j^{(t)}}\, g_i^{(t)}$

where $r_i^{(t)}$ is participant $i$'s reputation in round $t$ and $g_i^{(t)}$ is the gradient it uploads. The reputation-weighted aggregation suppresses gradients from ‘weaker’ participants or potential adversaries. During the download step, in most works (mcmahan2017communication; kairouz2019advances)
the participants download the entire global model. We propose to replace this by introducing a reputation-based quota which determines the number of gradients to allocate to each participant. The server maintains and updates each participant's reputation using the cosine similarity between participant $i$'s uploaded gradient $g_i^{(t)}$ and the reputation-weighted aggregated gradient $g^{(t)}$, i.e., $\cos(g_i^{(t)}, g^{(t)}) = \frac{\langle g_i^{(t)},\, g^{(t)} \rangle}{\|g_i^{(t)}\|\,\|g^{(t)}\|}$. Subsequently, this updated reputation determines the number of aggregated gradients to allocate to participant $i$ in round $t$, according to the "largest values" criterion. In summary, in round $t$ the server updates the reputations according to the cosine similarity between the individual gradients and the reputation-weighted aggregated gradient, and then uses the updated reputations to determine the number of aggregated gradients to allocate to each participant. The detailed realization of RFFL is given in Algorithm 1.
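One server round, combining the reputation-weighted aggregation, the cosine-similarity reputation signal, and the "largest values" download rule, can be sketched as follows. Treat this as an illustrative approximation: the exact update (including reputation smoothing and thresholding) is specified in Algorithm 1 of the paper, and the quota rule here, proportional allocation of the largest-magnitude coordinates, is a simplification:

```python
import numpy as np

def rffl_round(gradients, reputations, total_params):
    """One sketched server round: aggregate, score, allocate."""
    r = np.asarray(reputations, dtype=float)
    r = r / r.sum()
    # Reputation-weighted aggregation of the flattened gradients (Eq. 4).
    agg = sum(ri * gi for ri, gi in zip(r, gradients))
    # Reputation signal: cosine similarity between each participant's
    # gradient and the aggregated gradient.
    sims = np.array([
        g @ agg / (np.linalg.norm(g) * np.linalg.norm(agg) + 1e-12)
        for g in gradients
    ])
    # Download quota: number of aggregated-gradient coordinates each
    # participant receives, proportional to its (non-negative) score.
    quotas = np.floor(sims.clip(min=0) / max(sims.max(), 1e-12)
                      * total_params).astype(int)
    downloads = []
    for q in quotas:
        top = np.argsort(np.abs(agg))[::-1][:q]   # "largest values" first
        sparse = np.zeros_like(agg)
        sparse[top] = agg[top]
        downloads.append(sparse)
    return agg, sims, downloads
```

A participant whose gradient opposes the aggregate (e.g., an adversary) gets a negative similarity, a zero quota, and hence an empty download.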
Due to the stochasticity of the gradient descent method, the variance in the gradients may make the cosine similarities in a single round an inaccurate approximation of a participant's contribution. To avoid wrongly underestimating the reputation of an honest participant, we adopt an iterative approach that integrates the reputation of round $t$ with the past reputation. With this approach, RFFL can stabilize the reputations of the participants. The contributions of the participants can then be inferred through these reputations, so non-contributing participants, who may be free-riders or adversaries, can be identified and removed.

We highlight that RFFL does not need an additional auxiliary/validation dataset (barreno2010security; regatti2020befriending). In practice, obtaining an auxiliary dataset may be expensive or infeasible. Moreover, with a non-I.I.D distribution, it is very difficult to ensure that such a dataset is representative of the datasets of all participants; with a non-I.I.D auxiliary dataset, some participants would be disadvantaged in the contribution evaluation.
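The iterative integration of the current round's similarity with the past reputation can be sketched as an exponential moving average. The fade coefficient `alpha` and the clipping of negative similarities are illustrative assumptions, not the paper's exact update:

```python
import numpy as np

def update_reputation(old_rep, cos_sim, alpha=0.9):
    """Smooth the noisy per-round cosine-similarity signal into a
    stable reputation (alpha plays the role of the fade coefficient)."""
    return alpha * old_rep + (1 - alpha) * np.clip(cos_sim, 0.0, None)

# A single noisy round barely moves an honest participant's reputation,
# while a persistently misaligned adversary's reputation decays to ~0.
rep_honest, rep_adv = 0.8, 0.8
for _ in range(20):
    rep_honest = update_reputation(rep_honest, 0.95)
    rep_adv = update_reputation(rep_adv, -0.9)   # negative sims clip to 0
```

The smoothing is what lets the server distinguish one unlucky mini-batch from sustained non-contribution before removing anyone.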
Convergence Analysis
Based on four commonly adopted assumptions (Li2020Onfedavgnoniid) and without introducing additional constraints, we present two convergence results, for the server model and for each participant's local model in RFFL respectively. Specifically, the objective value achieved by the server model converges to the optimal value in expectation at a rate of O(1/T) in the number of communication rounds T, and each participant's local model converges asymptotically to the server model in expectation.
First we introduce the assumptions and the theorem used in Li2020Onfedavgnoniid, with the following notations: $F_i$ denotes the local objective function of participant $i$; $F$ denotes the global objective function; $\mathcal{W}$ denotes the parameter space; and $T$ and $E$ denote the total number of communication rounds and the number of local epochs, respectively.
Assumption 1.
Each $F_i$ is $L$-smooth: for all $v, w \in \mathcal{W}$, $F_i(v) \le F_i(w) + (v-w)^\top \nabla F_i(w) + \frac{L}{2}\|v-w\|_2^2$.
Assumption 2.
Each $F_i$ is $\mu$-strongly convex: for all $v, w \in \mathcal{W}$, $F_i(v) \ge F_i(w) + (v-w)^\top \nabla F_i(w) + \frac{\mu}{2}\|v-w\|_2^2$.
Assumption 3.
Based on mini-batch SGD, $\xi_i^{(t)}$ denotes the mini-batch selected uniformly at random by participant $i$ in round $t$. The variance of the stochastic gradients of each participant is bounded: $\mathbb{E}\|\nabla F_i(w_i^{(t)}, \xi_i^{(t)}) - \nabla F_i(w_i^{(t)})\|^2 \le \sigma_i^2$.
Assumption 4.
The expected value of the squared norm of the stochastic gradients is uniformly bounded: $\mathbb{E}\|\nabla F_i(w_i^{(t)}, \xi_i^{(t)})\|^2 \le G^2$ for all $i$ and $t$.
Assumptions 1 and 2 are standard in the analysis of $\ell_2$-norm regularized classifiers. Assumptions 3 and 4 were used by Zhang2013convergenceassumption; Stich2018convergenceassumption; Stich2019convergenceassumption; Yu2019aconvergenceassumption; Li2020Onfedavgnoniid to study the convergence behavior of variants of SGD.

Definition 2.
Degree of Heterogeneity (Li2020Onfedavgnoniid).
$F^*$ and $F_i^*$ denote the minimum values of $F$ and $F_i$, respectively, and $p_i$ denotes the weight used in the weighted gradient aggregation by the server. The degree of heterogeneity among the data of the participants is $\Gamma = F^* - \sum_{i} p_i F_i^*$.
Theorem 1.
Theorem 2.
Proof.
In FedAvg, $p_i = n_i / \sum_j n_j$. In RFFL, $p_i = r_i / \sum_{j \in R} r_j$, where $R$ denotes the set of reputable participants and $r_i$ denotes the reputation of the $i$-th participant. Making this substitution and observing the aforementioned assumptions, it follows that Theorem 1 applies to the server model in RFFL. ∎
Theorem 3.
With a decaying learning rate $\eta_t$, the $i$-th participant's local model $w_i$ in RFFL asymptotically converges to the server model $w$ in expectation.
The proof is deferred to the appendix.
Remark 1.
We remark that the decaying learning rate condition is not an artifact of our construction. Li2020Onfedavgnoniid have shown that for FedAvg in a non-I.I.D setting, a fixed learning rate without decay leads to a solution that remains bounded away from the optimal solution.
Experiments
Datasets
We conduct extensive experiments on different datasets covering image and text classification. For image classification, we investigate MNIST (lecun1998gradient) and CIFAR-10 (krizhevsky2009learning). For text classification, we consider the Movie Review (MR) (pang2005seeing) and Stanford Sentiment Treebank (SST) (kim2014convolutional) datasets.

Baselines
For accuracy analysis, we focus our comparison on FedAvg (mcmahan2017communication) and the Standalone framework, in which participants train standalone models on their local datasets without collaboration. FedAvg works well empirically and is thus expected to produce high performance, since it imposes no additional restrictions to ensure fairness or robustness. The Standalone framework, on the other hand, provides the accuracy lower bound that RFFL should exceed in order to incentivize a participant to join the collaboration.
For fairness performance, we focus our comparison on FFL (li2019fair). In order to compute fairness (as defined in Equation 1) for FedAvg, which rewards all participants with the same model, we stipulate that after the entire FL training each participant fine-tunes for 1 additional local epoch. We exclude the Standalone framework from this comparison because its participants do not collaborate.
For robustness performance, we compare with FedAvg and some Byzantinetolerant and/or robust FL frameworks including MultiKrum blanchard2017machine, FoolsGold fung2018mitigating, SignSGD bernstein2019signsgd and Median yin2018byzantine.
Experimental Setup
In order to evaluate the effectiveness of RFFL in realistic settings with heterogeneous data distributions, we investigate two heterogeneous data splits, varying the dataset sizes and the class numbers respectively. We also investigate the I.I.D setting (the ‘uniform’ split) for completeness.
Imbalanced dataset sizes. We follow a power law to randomly partition a total of {3000, 6000, 12000} MNIST examples among {5, 10, 20} participants, respectively. In this way, each participant has a distinctly different number of examples, with the first participant having the least and the last the most. We allocate 600 examples per participant on average to be consistent with the setting in mcmahan2017communication. We refer to this as the ‘powerlaw’ split. The data splits for the CIFAR-10, MR and SST datasets follow a similar procedure; the details are included in the appendix.
Imbalanced class numbers. We vary the number of distinct classes in each participant's dataset, increasing from the first participant to the last. For this scenario, we only investigate the MNIST and CIFAR-10 datasets, as they both contain 10 classes, and we distribute the class counts in a linspace manner. For example, for MNIST with 10 classes and 5 participants, the participants own {1, 3, 5, 7, 10} classes of examples respectively, i.e., the first participant has data from only 1 class, while the last participant has data from all 10 classes. We first partition the training set according to the labels, and then sample and assign subsets of the training set with the corresponding labels to the participants. Note that under this setting all participants have the same dataset size but different numbers of classes. We refer to this as the ‘classimbalance’ split.
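The two heterogeneous splits can be sketched as follows. The power-law exponent `alpha` is an illustrative choice (the paper's exact partitioning parameters are in its appendix); the class-count rule reproduces the {1, 3, 5, 7, 10} example above:

```python
import numpy as np

def powerlaw_sizes(n_total, n_participants, alpha=1.5):
    """'powerlaw' split: partition n_total examples so that dataset
    sizes grow from the first participant to the last."""
    raw = np.arange(1, n_participants + 1, dtype=float) ** alpha
    sizes = np.floor(raw / raw.sum() * n_total).astype(int)
    sizes[-1] += n_total - sizes.sum()   # hand the remainder to the last
    return sizes

def class_counts(n_classes, n_participants):
    """'classimbalance' split: distinct classes per participant,
    spaced linearly (linspace) from 1 up to n_classes."""
    return np.floor(np.linspace(1, n_classes, n_participants)).astype(int)
```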
Adversaries. We consider three types of adversaries on MNIST: targeted poisoning via label-flipping (biggio2011support), untargeted poisoning via blind multiplicative adversaries (bernstein2019signsgd), and free-riders. In each experiment, we evaluate RFFL against one type of adversary, with the adversaries amounting to 20% of the honest participants. For targeted poisoning, the adversary uses ‘7’ as the label for actual ‘1’ images during local training, producing ‘crooked’ gradients. For untargeted poisoning, we consider three sub-cases separately: 1) the adversary arbitrarily rescales the gradients; 2) the adversary randomizes the signs of the gradients element-wise; and 3) the adversary randomly inverts the values of the gradients element-wise. For free-riders, we consider a simple type of free-rider who uploads gradients drawn randomly from the uniform distribution on [−1, 1]. We conduct the experiments with adversaries under two data splits, ‘uniform’ and ‘powerlaw’. The reported results are for the ‘uniform’ split; the results for the ‘powerlaw’ split are included in the appendix.
Model and Hyper-Parameters
For the MNIST experiments, we use a convolutional neural network. The hyper-parameters are: the number of local epochs, the batch size, the local learning rate (chosen according to the number of participants) with exponential decay, equal initial reputations for all participants, the reputation fade coefficient, and a total of 60 communication rounds. Further details on the experimental settings, model architecture, hyper-parameter values, the hardware used and runtime statistics for all experiments are included in the appendix.
Experimental Results
Fairness comparison. Table 1 lists the computed fairness of RFFL, FedAvg and FFL on MNIST with the number of participants varying over {10, 20}. Similarly, Table 2 presents the fairness results for CIFAR-10, MR and SST. From the high fairness values (some close to the theoretical limit of 1.0), we conclude that RFFL indeed ensures that participants receive models whose performance is commensurate with their contributions, thus providing collaborative fairness as claimed in our formulation. The results for the 5-participant case on both MNIST and CIFAR-10 are included in the appendix.
Table 1: Fairness on MNIST.

Framework  |  P10: UNI    POW    CLA   |  P20: UNI    POW    CLA
RFFL       |      83.36  98.33  99.81  |      75.19  97.88  99.64
FedAvg     |      31.2   77.33  64.53  |       3.85   3.58  70.83
FFL        |       2.77  22.44  63.16  |      27.1   17.61  78.57
Table 2: Fairness on CIFAR-10, MR and SST.

Framework  |  CIFAR-10 (P10): UNI    POW    CLA   |  MR (P5): POW  |  SST (P5): POW
RFFL       |                  81.93  98.78  99.89 |         99.59  |          65.88
FedAvg     |                  42.9   40.58  79.34 |         22.22  |          64.18
FFL        |                  39.39  34.5    4.76 |         52.03  |          24.72
Table 3: Accuracy (%) on MNIST.

Framework  |  P10: UNI    POW    CLA   |  P20: UNI    POW    CLA
RFFL       |      93.7   94.51  92.84  |      94.08  94.98  92.91
FedAvg     |      96.81  96.7   94.52  |      97.16  97.38  93.99
FFL        |      91.94   9.61  56.94  |      87.34   9.61  52.09
Standalone |      93.42  94.54  92.82  |      93.32  94.6   92.54
Table 4: Accuracy (%) on CIFAR-10, MR and SST.

Framework  |  CIFAR-10 (P10): UNI    POW    CLA   |  MR (P5): POW  |  SST (P5): POW
RFFL       |                  49.3   53.01  47.18 |         61.54  |          30.45
FedAvg     |                  60.98  64.15  49.9  |         66.98  |          34.43
FFL        |                  31.57  10     10    |         22.88  |          26.79
Standalone |                  47.81  52.46  44.64 |         57.41  |          30.63
Accuracy comparison. Table 3 reports the corresponding accuracies on MNIST with {10, 20} participants, and Table 4 provides the accuracies on CIFAR-10, MR and SST. For RFFL, because the participants receive models of different accuracies and we expect the most contributive participant to receive a model comparable to that of FedAvg, we report the highest accuracy among the participants. For FedAvg, Standalone and FFL, we report the accuracy of the same participant. Overall, we observe that RFFL achieves accuracy comparable to the FedAvg baseline in many cases. More importantly, RFFL mostly outperforms the Standalone framework, suggesting that collaboration in RFFL reduces the generalization error; this advantage over the Standalone framework is an essential incentive for potential participants to join the collaboration. On the other hand, FFL's observed performance fluctuates across settings. This may be due to two possible reasons: the number of participants and the number of communication rounds may be too small, as FFL uses random sampling of participants and requires relatively more communication rounds to converge to equitable performance. In additional experiments with more participants (50) and more communication rounds (100), we found that FFL's performance stabilizes at a reasonable accuracy of around 90%. Furthermore, in our experiments we find that the training of RFFL experiences fewer fluctuations and converges quickly, as demonstrated in Figure 1.
System robustness comparison. For targeted poisoning, we consider two additional metrics (fung2018mitigating): targeted class accuracy and attack success rate. In our experiment, targeted class accuracy corresponds to the test accuracy on digit ‘1’ images, and attack success rate corresponds to the proportion of ‘1’ images incorrectly classified as ‘7’. In particular, we report the results of the best-performing participant in RFFL. As shown in Table 5, the original FedAvg is relatively robust against 20% label-flipping adversaries and performs quite well on all three metrics, mostly because the introduced ‘crooked’ gradients are outweighed by the gradients from the honest participants. In terms of attack success rate, all methods except SignSGD perform relatively well, indicating that they can resist the targeted attack; however, this does not necessarily imply that they retain high accuracy on the unaffected classes. In terms of maximum accuracy, only RFFL, FedAvg and MultiKrum achieve good performance, suggesting that these methods are robust without compromising overall performance. FoolsGold's performance with respect to the attack success rate is expected, since the adversaries fit the definition of Sybils sharing a common objective of misleading the model between ‘1’ and ‘7’. However, the data split in this experiment is the ‘uniform’ split, which violates FoolsGold's assumption that "the training data is sufficiently dissimilar between clients (participants)" (fung2018mitigating), so its performance on the other classes drops. In additional experiments (in the appendix) with adversaries under the ‘powerlaw’ split, we do observe that FoolsGold performs relatively well in terms of both robustness and accuracy.
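The two targeted-poisoning metrics used above can be computed directly from predictions on the test set (toy labels for illustration):

```python
import numpy as np

def targeted_attack_metrics(y_true, y_pred, source=1, target=7):
    """Targeted class accuracy and attack success rate for a
    source -> target label-flipping attack (here 1 -> 7)."""
    src = np.asarray(y_true) == source
    preds = np.asarray(y_pred)[src]
    target_acc = np.mean(preds == source)    # source images still correct
    success_rate = np.mean(preds == target)  # source images misread as target
    return target_acc, success_rate

y_true = [1, 1, 1, 1, 0, 2]
y_pred = [1, 7, 1, 1, 0, 2]
acc, asr = targeted_attack_metrics(y_true, y_pred)
# 3 of 4 '1's classified correctly, 1 of 4 misread as '7'
```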
For untargeted poisoning, Tables 7, 8 and 9 compare the final test accuracies of the participants under the three types of adversaries, respectively. Note that we include the Standalone framework as a performance benchmark without collaboration and without adversaries. We observe that the adversaries receive considerably lower reputations, so they can be effectively identified and removed by setting appropriate reputation thresholds. These results collectively demonstrate that RFFL is overall the most robust. Furthermore, in RFFL, despite the presence of adversaries, the honest participants still obtain performance improvements over their standalone models. We observe that MultiKrum and FoolsGold are not robust against untargeted poisoning. MultiKrum is based on the distance between each gradient vector and the mean vector, and because the mean vector is not robust against these attacks, MultiKrum is not robust in these cases. FoolsGold was designed specifically to be robust against Sybils with a common objective and is thus not robust against untargeted poisoning. Both SignSGD and Median demonstrate some degree of robustness. SignSGD is robust against the rescaling and value-inversion attacks since it aggregates only the signs of the gradient values, not their magnitudes. Median relies on the median statistic (which is robust against extreme outliers such as the rescaled and inverted values from the adversaries), and is thus able to achieve some degree of robustness, but it compromises accuracy.
For the free-rider scenario, we observe that RFFL is robust and can always identify and isolate free-riders in the early stages of collaboration (within 5 rounds), without affecting either accuracy or convergence. Furthermore, FedAvg is also robust in this situation: the gradients uploaded by the free-riders have an expected value of zero (random values drawn from the uniform distribution on [−1, 1]), so the added noise does not affect the asymptotic unbiasedness of the aggregated gradients in FedAvg. MultiKrum exhibits some degree of robustness but compromises accuracy. FoolsGold is not robust against free-riders, as it relies on the assumption that honest participants produce gradients that are more random than those of the Sybils (who produce ‘crooked’ gradients pointing in the same direction over rounds); when free-riders upload completely random gradients, this assumption fails. For SignSGD, the free-riders are exactly sign-randomizing adversaries, so its behavior is consistent with the previous experiment. For Median, the reason behind its behavior is less straightforward: a careful analysis would need to compare the magnitudes of the gradients from the honest participants and from the free-riders. If the honest gradients collectively have large element-wise magnitudes, Median can be robust by simply treating the noise as extreme outliers; if the honest gradients are close to 0, Median may no longer be robust. We include all the experimental results for free-riders in the appendix.
In addition to the above settings with 20% adversaries, we also conduct experiments increasing the number of adversaries (to 110% of the honest participants) to test RFFL's Byzantine tolerance, with the results shown in Table 6. We find that RFFL can achieve slightly higher predictive performance even with more adversaries than honest participants. We include the other corresponding experimental results in the appendix.
Table 5: Robustness against 20% targeted poisoning (label-flipping) adversaries.

Framework  |  Max accuracy  |  Attack success rate  |  Target accuracy
RFFL       |          93.8  |                 0     |            98.8
FedAvg     |          96.8  |                 0.2   |            98.8
FoolsGold  |           9.8  |                 0     |             0
MultiKrum  |          95.6  |                 0.2   |            99.0
SignSGD    |           9.1  |                41.9   |            18.8
Median     |           0.3  |                 0.5   |             0.1
Table 6: Byzantine tolerance with an increased number of targeted poisoning adversaries.

Framework  |  Max accuracy  |  Attack success rate  |  Target accuracy
RFFL       |         94.2   |                 0     |            99
FedAvg     |         90.87  |                48.6   |            49.3
FoolsGold  |         19.21  |                 0     |            55
MultiKrum  |         96.27  |                 0     |            98.8
SignSGD    |          9.1   |                 0     |            18.8
Median     |          8.21  |                 0     |            72.3
Table 7: Final test accuracies (%) of participants 1-10 under the first type of untargeted poisoning.

Framework  |   1   2   3   4   5   6   7   8   9  10
RFFL       |  92  92  94  91  92  93  92  92  92  92
FedAvg     |  10  10  10  10  10  10  10  10  10  10
FoolsGold  |  11  11  11  11  11  11  11  11  11  11
MultiKrum  |  10  10  10  10  10  10  10  10  10  10
SignSGD    |   9   9   9   9   9   9   9   9   9   9
Median     |   1   1   1   1   1   1   1   1   1   1
Standalone |  92  93  93  92  92  92  92  93  92  92
Table 8: Final test accuracies (%) of participants 1-10 under the second type of untargeted poisoning.

Framework  |   1   2   3   4   5   6   7   8   9  10
RFFL       |  93  93  94  92  92  94  94  93  93  92
FedAvg     |  10  10  10  10  10  10  10  10  10  10
FoolsGold  |  10  10  10  10  10  10  10  10  10  10
MultiKrum  |  10  10  10  10  10  10  10  10  10  10
SignSGD    |  50  58  62  58  59  64  66  57  57  57
Median     |  11  10  39  28  20  40  48  27  35  28
Standalone |  92  93  93  92  92  92  92  93  92  92
Table 9: Final test accuracies (%) of participants 1-10 under the third type of untargeted poisoning.

Framework  |   1   2   3   4   5   6   7   8   9  10
RFFL       |  92  93  94  92  92  93  93  93  93  92
FedAvg     |   9   9   9   9   9   9   9   9   9   9
FoolsGold  |   8   8   8   8   8   8   8   8   8   8
MultiKrum  |  17  17  17  17  17  17  17  17  17  17
SignSGD    |   9   9   9   9   9   9   9   9   9   9
Median     |   1   1   1   1   1   1   1   1   1   1
Standalone |  92  93  93  92  92  92  92  93  92  92
Discussions
Impact of reputation threshold. With a reputation threshold, the server can stipulate a minimum empirical contribution from the participants, and the reputation mechanism can be used to detect and isolate adversaries and/or freeriders. A key challenge lies in selecting an appropriate threshold, as fairness and accuracy may be affected in opposite directions. For example, too small a threshold might allow low-contribution participants to sneak into the federated system without being detected; conversely, too large a threshold might isolate too many participants to achieve meaningful collaboration. In our experiments, we empirically search for the most suitable values via grid search.
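A simplified sketch of such a threshold-based isolation round is shown below. It is not the paper's exact update rule: here reputation is assumed to be a fading moving average of each participant's cosine similarity to the aggregated gradient, and `alpha` and `threshold` stand in for the reputation fade coefficient and reputation threshold discussed above.

```python
import numpy as np

def update_reputations(grads, reputations, alpha=0.8, threshold=0.05):
    """One round of a simplified reputation update.

    grads: dict participant_id -> gradient vector (active participants only)
    reputations: dict participant_id -> current reputation
    Returns normalized reputations with below-threshold participants removed.
    """
    agg = np.mean(list(grads.values()), axis=0)  # aggregated gradient
    for pid, g in grads.items():
        # cosine similarity of the participant's update to the aggregate
        sim = float(g @ agg / (np.linalg.norm(g) * np.linalg.norm(agg) + 1e-12))
        reputations[pid] = alpha * reputations[pid] + (1 - alpha) * max(sim, 0.0)
    # normalize, then isolate participants whose reputation falls below the threshold
    total = sum(reputations.values())
    reputations = {p: r / total for p, r in reputations.items()}
    return {p: r for p, r in reputations.items() if r >= threshold}
```

With aligned honest gradients and one freerider uploading random noise, the freerider's reputation decays below the threshold within a few rounds, matching the qualitative behavior described above; picking `threshold` too high would start removing low-contribution honest participants as well.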
Fairness in heterogeneous settings. Sharing model updates is typically limited to homogeneous FL settings, i.e., the same model architecture across all participants. In heterogeneous settings, however, participants may train different types of local models. Therefore, instead of sharing model updates, participants can share model predictions on an unlabelled public dataset sun2020federated. In this heterogeneous context, the main algorithm proposed in this work remains applicable: the server can quantify the reputation of each participant based on their predictions, and then allocate the aggregated predictions accordingly, thus achieving fairness.
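One way such a prediction-based variant could look is sketched below, with assumed details: participants share class-probability predictions on the public set, and reputation tracks argmax agreement with the reputation-weighted consensus (the function name and `alpha` are illustrative, not from the paper).

```python
import numpy as np

def hetero_round(preds, reputations, alpha=0.8):
    """One round of prediction-based reputation (a sketch, not the paper's
    exact rule).

    preds: dict participant_id -> (n_examples, n_classes) probability array
    reputations: dict participant_id -> reputation, summing to 1
    """
    # reputation-weighted consensus over the shared public dataset
    consensus = sum(r * preds[p] for p, r in reputations.items())
    for p, q in preds.items():
        # agreement: fraction of public examples where the participant's
        # predicted class matches the consensus class
        agree = float(np.mean(q.argmax(axis=1) == consensus.argmax(axis=1)))
        reputations[p] = alpha * reputations[p] + (1 - alpha) * agree
    total = sum(reputations.values())
    return consensus, {p: r / total for p, r in reputations.items()}
```

Because only predictions cross the network, this works even when participants train entirely different model architectures locally.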
Conclusion
We propose a framework termed Robust and Fair Federated Learning (RFFL) to address collaborative fairness and robustness against Byzantine adversaries and freeriders. RFFL achieves these two goals by introducing reputations and iteratively evaluating the contribution of each participant in the federated learning system. Extensive experiments on various datasets demonstrate that our RFFL achieves accuracy comparable to FedAvg and better than the Standalone framework, and is robust against various types of adversaries under varying experimental settings. The empirical results suggest that our framework is versatile and works well under non-I.I.D. data distributions, and hence fits a wider class of applications.
References
Appendix A Convergence Analysis: Proof of Theorem 3
Theorem.
With the learning rate , the th participant’s model in RFFL asymptotically converges to the server model in expectation. Formally,
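In symbols (the notation here is assumed, as a sketch: $w_i^{(t)}$ for the $i$-th participant's model at round $t$, $w^{(t)}$ for the server model, and $\eta_t$ for the learning rate), the statement can be written as:

```latex
\lim_{t \to \infty} \mathbb{E}\,\bigl\| w_i^{(t)} - w^{(t)} \bigr\| = 0,
\qquad \text{provided } \lim_{t \to \infty} \eta_t = 0 .
```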
Proof.
Let , note by Assumption 4. is introduced only for notational convenience.
The first inequality is derived by rearranging the terms and applying the triangle inequality; the second and third inequalities use the maximum norm defined above; the fourth inequality is by expanding the recursive formula; the fifth inequality is by collecting the terms involving ; the sixth inequality is due to the same initialization so and taking expectation on both sides; and the last inequality is by taking limit of and using the fact that . ∎
Appendix B Additional Experimental Results
Experimental Setup
Imbalanced dataset sizes. For CIFAR10, we follow a power law to randomly partition a total of {10000, 20000} examples among {5, 10} participants, respectively. For MR (SST), we follow a power law to randomly partition 9596 (8544) examples among 5 participants.
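The partition step can be sketched as follows; `alpha` is a hypothetical skew parameter, and the paper's exact power-law recipe may differ.

```python
import numpy as np

def power_law_split(n_total, n_parts, alpha=1.0, seed=0):
    """Randomly partition n_total example indices among n_parts participants,
    with partition sizes following a power law (a sketch)."""
    rng = np.random.default_rng(seed)
    # power-law weights: participant k gets weight proportional to (k + 1)^(-alpha)
    w = np.arange(1, n_parts + 1, dtype=float) ** (-alpha)
    sizes = np.floor(w / w.sum() * n_total).astype(int)
    sizes[0] += n_total - sizes.sum()   # give the rounding remainder to the largest
    idx = rng.permutation(n_total)      # random assignment of examples
    return np.split(idx, np.cumsum(sizes)[:-1])
```

For example, `power_law_split(10000, 5)` yields five disjoint index sets of strictly decreasing size that together cover all 10000 examples.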
Model and Hyperparameters. The model architectures are as follows: a standard 2-layer CNN for MNIST, a standard 3-layer CNN for CIFAR10, and the text CNN model with the embedding space for MR and SST due to kim2014convolutional. We provide the framework-independent hyperparameters used for the different datasets in Table 10. The framework-dependent hyperparameters are as follows. RFFL: reputation fade coefficient and reputation threshold. FFL: fairness coefficient and participant sampling ratio. SignSGD: momentum coefficient and parameter weight decay. FoolsGold: confidence parameter. MultiKrum: participant clip ratio. For these hyperparameters, we either use the default values introduced in the respective papers or apply grid search to empirically find suitable values.
Dataset  Batch size  Learning rate (decay)  Rounds (local epochs)

MNIST  16  0.15 (0.977)  60 (1)
CIFAR10  64  0.015 (0.977)  200 (1)
MR  128  1e-4 (0.977)  100 (1)
SST  128  1e-4 (0.977)  100 (1)
Runtime Statistics, Hardware and Software. We conduct our experiments on a machine with 12 cores (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz), 110 GB RAM and 4 Nvidia P100 GPUs. Execution times for the experiments, including only RFFL (all frameworks): for MNIST (10 participants), approximately 0.6 (0.7) hours; for CIFAR10 (10 participants), approximately 0.7 (4.3) hours; for MR and SST (5 participants), approximately 1.5 (2) hours.
Our implementation mainly uses PyTorch, torchtext, torchvision and some auxiliary packages such as Numpy, Pandas and Matplotlib. The specific versions and package requirements are provided together with the source code. To reduce the impact of randomness in the experiments, we adopt several measures: we fix the model initializations (we initialize model weights and save them for future experiments), fix all the random seeds, and invoke the deterministic behavior of PyTorch. As a result, given the same model initialization, our implementation is expected to produce consistent results across experimental runs on the same machine.
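These seeding measures can be collected into one helper, sketched below assuming a recent PyTorch build (`set_deterministic` is an illustrative name, and the exact flags vary across PyTorch versions):

```python
import os
import random

import numpy as np
import torch

def set_deterministic(seed=42):
    """Fix the sources of randomness described above (a sketch)."""
    random.seed(seed)                             # Python's RNG
    np.random.seed(seed)                          # NumPy's legacy global RNG
    torch.manual_seed(seed)                       # CPU (and default CUDA) RNG
    torch.cuda.manual_seed_all(seed)              # all GPU RNGs (no-op without CUDA)
    torch.backends.cudnn.deterministic = True     # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False        # disable cuDNN autotuning
    os.environ["PYTHONHASHSEED"] = str(seed)      # hash-based ordering
```

Calling `set_deterministic` before model initialization and before each experimental run makes repeated runs reproduce the same weights and batches on the same machine.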
Experimental Results
The comprehensive experimental results below demonstrate that RFFL is the only framework that performs consistently well over all the investigated situations, though it may not perform the best in every one of them. In practice, it is impossible to know the type of adversaries in advance, so we believe a reasonable solution is a framework that is robust in a general sense, rather than one robust only against the particular class of adversaries investigated by prior works.
5-participant Case for MNIST and CIFAR10. We include the fairness and accuracy results for the 5-participant case for MNIST and CIFAR10 under the three data splits in Table 11 and Table 12, respectively.
Freeriders. For better illustration and coherence, we include here the experimental results together with the participants' reputation curves. Table 13 shows the performance results for 20% freeriders in the 10-participant case for MNIST over the 'uniform' split, and Figure 2 shows the reputations of the participants. It can be clearly observed that freeriders are isolated from the federated system in the early stages of collaboration (within 5 rounds).
Framework  MNIST  CIFAR10  

UNI  POW  CLA  UNI  POW  CLA  
RFFL  85.12  98.45  99.64  95.99  99.58  99.93 
FedAvg  20.27  95.10  55.86  16.92  84.76  86.20 
FFL  59.69  21.91  54.73  29.55  55.8  4.82 
Framework  MNIST  CIFAR10  

UNI  POW  CLA  UNI  POW  CLA  
RFFL  94.78  94.81  92.47  49.3  52.82  46.46 
FedAvg  96.28  96.16  92.15  56.41  59.48  48.59 
FFL  95.13  85.24  54.22  19.29  10  10 
Standalone  93.46  94.52  91.91  47.32  52.74  45.21 
Adversarial Experiments with the 'power-law' Split. We conduct experiments with adversaries under two data splits: the 'uniform' split and the 'power-law' split. The experimental results for the 'uniform' split are included in the main paper; here we supplement them with the results for the 'power-law' split. Tables 14, 15, 16, 17 and 18 show the respective results for the targeted poisoning adversaries, the three types of untargeted poisoning adversaries, and freeriders.
Adversarial Experiments with Adversaries as the Majority. As an extension, we also conduct experiments with an increased number of adversaries to test RFFL's Byzantine tolerance. Our experimental results in Table 6, Table 19, Table 20, Table 21 and Table 22 demonstrate that RFFL consistently achieves competitive performance against various types of adversaries, even when the adversaries form the majority in the system.
Framework  1  2  3  4  5  6  7  8  9  10 

RFFL  92  93  94  92  93  93  91  92  93  92 
FedAvg  97  97  97  97  97  97  97  97  97  97 
FoolsGold  11  11  10  11  10  10  11  11  10  11 
MultiKrum  61  61  64  57  60  62  62  60  62  57 
SignSGD  9  9  9  9  9  9  9  9  9  9 
Median  1  1  1  1  1  1  1  1  1  1 
Standalone  92  93  93  92  92  92  93  93  92  92 
Framework  Max accuracy  Attack success rate  Target accuracy 

RFFL  95.01  0  98.70 
FedAvg  97.22  0.20  98.80 
SignSGD  9.11  41.90  18.80 
FoolsGold  9.80  0  0.00 
MultiKrum  96.13  0  98.90 
Median  0.09  0.20  0.20 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  86  88  91  92  93  93  94  94  95  94 
FedAvg  97  97  97  97  97  97  97  97  97  97 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  80  78  81  83  84  86  86  87  87  88 
MultiKrum  96  96  96  96  96  96  96  96  96  97 
Median  1  1  1  1  1  1  1  1  1  1 
Standalone  72  83  90  91  93  93  93  94  94  94 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  86  88  92  92  93  93  94  94  95  94 
FedAvg  10  10  10  10  10  10  10  10  10  10 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  93  93  93  93  93  93  93  93  93  93 
MultiKrum  10  10  10  10  10  10  10  10  10  10 
Median  1  1  1  1  1  1  1  1  1  1 
Standalone  72  83  90  91  93  93  93  94  94  94 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  73  83  91  91  93  93  94  94  95  94 
FedAvg  10  10  10  10  10  10  10  10  10  10 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  10  10  10  10  10  10  10  10  10  10 
MultiKrum  9  9  9  9  9  9  9  9  9  9 
Median  0  0  0  0  0  0  0  0  0  0 
Standalone  72  83  90  91  93  93  93  94  94  94 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  86  89  90  92  93  93  94  94  95  95 
FedAvg  97  97  97  97  97  97  97  97  97  97 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  10  10  10  9  10  10  11  11  10  10 
MultiKrum  53  57  58  58  55  53  56  59  58  61 
Median  1  1  1  1  1  1  1  1  1  1 
Standalone  72  83  90  91  92  93  93  94  94  94 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  93  93  94  91  92  93  93  92  92  92 
FedAvg  96  96  96  96  96  96  96  96  96  96 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  61  58  64  60  62  66  54  58  60  58 
MultiKrum  95  94  96  95  96  95  96  95  95  95 
Median  1  1  1  1  1  1  1  1  1  1 
Standalone  92  93  93  92  92  93  92  93  92  92 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  93  92  94  92  93  93  93  92  93  93 
FedAvg  10  10  10  10  10  10  10  10  10  10 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  11  11  11  11  11  11  11  11  11  11 
MultiKrum  10  10  10  10  10  10  10  10  10  10 
Median  93  93  93  93  93  93  93  93  93  93 
Standalone  92  93  93  92  92  92  92  93  92  92 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  93  92  94  92  93  93  93  92  93  93 
FedAvg  9  9  9  9  9  9  9  9  9  9 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  10  10  10  10  10  10  10  10  10  10 
MultiKrum  18  18  18  18  18  18  18  18  18  18 
Median  9  9  9  9  9  9  9  9  9  9 
Standalone  92  93  93  92  92  92  92  93  92  92 
Framework  1  2  3  4  5  6  7  8  9  10 
RFFL  92  94  93  92  92  93  93  93  92  92 
FedAvg  97  97  97  97  97  97  97  97  97  97 
SignSGD  9  9  9  9  9  9  9  9  9  9 
FoolsGold  10  10  10  10  10  10  10  10  10  10 
MultiKrum  51  52  45  46  41  47  43  46  47  47 
Median  1  1  1  1  1  1  1  1  1  1 
Standalone  92  93  93  92  92  92  92  93  92  92 