Towards Building a Robust and Fair Federated Learning System

11/20/2020 ∙ by Xinyi Xu, et al. ∙ National University of Singapore

Federated Learning (FL) has emerged as a promising practical framework for effective and scalable distributed machine learning. However, most existing FL or distributed learning frameworks have not addressed two important issues well together: collaborative fairness and robustness to non-contributing participants (e.g. free-riders, adversaries). In particular, all participants can receive the 'same' access to the global model, which is obviously unfair to the high-contributing participants. Furthermore, due to the lack of a safeguard mechanism, free-riders or malicious adversaries could game the system to access the global model for free or to sabotage it. By identifying the underlying similarity between these two issues, we investigate them simultaneously and propose a novel Robust and Fair Federated Learning (RFFL) framework which utilizes reputation scores to address both issues, thus ensuring the high-contributing participants are rewarded with high-performing models while the low- or non-contributing participants can be detected and removed. Furthermore, our approach differentiates itself by not requiring any auxiliary dataset for the reputation calculation. Extensive experiments on benchmark datasets demonstrate that RFFL achieves high fairness, is robust against several types of adversaries, delivers comparable accuracy to the conventional federated framework and outperforms the Standalone framework.


Introduction

Federated learning (FL) provides a promising collaboration paradigm by enabling a multitude of participants to construct a joint model without exposing their private training data. Two key challenges in FL are collaborative fairness (participants with disparate contributions should be rewarded differently), and robustness (free-riders should not enjoy the global model for free, and malicious participants should not compromise system integrity).

In terms of collaborative fairness, most of the current FL paradigms mcmahan2017communication; kairouz2019advances; yang2019federated; li2019federated let all participants receive the same FL model in the end regardless of their contributions in terms of the quantity and quality of their shared parameters, leading to a potentially unfair outcome. In practice, such variations in contributions may arise for a number of reasons, the most obvious being the quality divergence of the data owned by different participants zhao2019privacy. FL2019 describe a motivating example in finance where several banks may want to jointly build a credit score predictor for small and medium enterprises. The larger banks, however, may be reluctant to train on their high-quality data and share the parameters because doing so may help their competitors, i.e., the smaller banks, thus eroding their own market shares. Due to a lack of collaborative fairness, participants with high-quality and large datasets may be discouraged from collaborating, thus hindering the formation and progress of a healthy FL ecosystem. We remark that collaborative fairness is different from the fairness concept in machine learning, which is typically defined as mitigating the model's predictive bias towards certain attributes cummings2019compatibility; jagielski2018differentially. The problem of treating FL participants fairly according to their contributions remains open FL2019. Furthermore, we note that collaborative fairness is particularly pertinent to scenarios involving reward allocation lyu2020threats, such as companies, hospitals or financial institutions, to whom collaborative fairness is of significant concern.

For robustness, the conventional FL framework mcmahan2017communication is potentially vulnerable to adversaries and free-riders as it does not have any safeguard mechanism. Follow-up works considered robustness through different lenses blanchard2017machine; fung2018mitigating; bernstein2019signsgd; yin2018byzantine, but none of them provides comprehensive support against all three types of attacks (targeted poisoning, untargeted poisoning and free-riders) considered in this work.

In summary, our contributions include:

  • We propose a Robust and Fair Federated Learning (RFFL) framework to address both collaborative fairness and Byzantine robustness in FL.

  • RFFL addresses these two issues by using a reputation system to iteratively calculate the contributions of the participants and reward them with models whose performance is commensurate with their contributions.

  • Under mild conditions, both the server model and participants’ local models in RFFL can converge to the optimum in expectation.

  • Extensive experiments on various datasets demonstrate that our RFFL achieves competitive accuracy and high fairness, and is robust against all three types of attacks investigated in this work.

Related work

We first relate our work to a series of previous efforts on fairness and robustness in FL as follows.

Promoting collaborative fairness has attracted substantial attention in FL. One research line uses incentive schemes combined with game theory, based on the rationale that participants should receive payoffs commensurate with their contributions to incentivize good behaviour. Representative works include Yang-et-al:2017IEEE; Gollapudi-et-al:2017; richardson2019rewarding; Yu-et-al:2020AIES. These works share a similarity in that all participants receive the same final model.

Another research direction addresses resource allocation fairness in FL by optimizing for the performance of the device with the worst performance (largest loss/prediction error). For example, Mohri et al. mohri2019agnostic proposed a minimax optimization scheme called Agnostic Federated Learning (AFL), which optimizes for the performance of the single worst device by weighing participants adversarially. A follow-up work called q-Fair Federated Learning (q-FFL) li2019fair generalized AFL by reducing the variance of the model performance across participants. In q-FFL, participants with higher loss are given higher relative weight to achieve less variance in the final model performance distribution. This line of work inherently advocates egalitarian equity, which is a different focus from collaborative fairness.

As opposed to the above-mentioned works, the most recent work by lyu2020towards and another concurrent but independent work by Hwee2020-icml are better aligned with collaborative fairness in FL: model accuracy is used as the reward for FL participants, so the participants receive models of different performance commensurate with their contributions. lyu2020towards adopted a mutual evaluation of local credibility mechanism, where each participant privately rates the other participants in each communication round. However, their framework is mainly designed for a decentralized blockchain system, and may not be directly applicable to FL settings, which are usually not decentralized. Hwee2020-icml proposed to use the Shapley value shapley1953value to design an information-theoretic contribution evaluation method that examines the data of the participants, which may not be suitable for FL settings because the server in FL does not have access to participants' data.

In terms of robustness in FL, blanchard2017machine proposed the Multi-Krum method based on a Krum function which excludes a certain number of gradients furthest from the mean of the collected gradients for aggregation, and was shown to be resilient against up to 33% Gaussian Byzantine participants and up to 45% omniscient Byzantine participants. In fung2018mitigating, the authors studied Sybil-based attacks and proposed a method called FoolsGold, based on the insight that Sybils share the same objective function, so their historical gradients are at a smaller angle to each other than to the historical gradient of an honest participant. bernstein2019signsgd proposed a communication-efficient approach called SignSGD, which is robust to arbitrary scaling because the participants only upload the element-wise signs of the gradients without the magnitudes. A similar method was proposed by yin2018byzantine, based on statistics of the gradients, specifically the element-wise median, mean and trimmed mean.

The RFFL Framework

Our RFFL framework mainly focuses on two important goals in FL: collaborative fairness and robustness. We empirically find that these two goals can be simultaneously achieved through a reputation system.

Collaborative Fairness

Participants collaborate via the FL paradigm to train a machine learning model on their datasets. A natural compensation for these participants is a trained machine learning model with high predictive performance. Furthermore, we observe that the participants may have different levels of contributions, so it is unfair to allocate the same final model to all participants. We are therefore inspired to adopt the notion of collaborative fairness in lyu2020towards to address this issue. Our method distinguishes itself from the original FL framework in that it stipulates the participants receive trained ML models whose predictive performance is commensurate with their contributions. In this way, the higher the quality of a participant's dataset, the better the model they receive. Therefore, we can measure fairness via the Pearson correlation coefficient between the participants' contributions and their rewards. In this work, we represent the participants' contributions via a proxy measure, the test accuracies of their standalone models. This is based on the fact that the quality of a dataset can be reflected by the test accuracy of a model trained on it. Similarly, the participants' rewards are represented by the test accuracies of the models they receive from the FL process. Formally, fairness is quantified as in Equation 1 lyu2020towards:

$$r_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{(N-1)\, s_x s_y} \qquad (1)$$

where $x_i$ and $y_i$ represent participant $i$'s test accuracy of the standalone model and of the received model after collaboration respectively, and $s_x$ and $s_y$ are the respective corrected standard deviations. A higher value implies better fairness, and vice versa.
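For concreteness, Equation 1 can be computed directly from the two accuracy vectors. The sketch below (NumPy; the accuracy values are hypothetical, not taken from our experiments) illustrates the calculation:

```python
import numpy as np

def collaborative_fairness(standalone_acc, received_acc):
    """Pearson correlation between participants' standalone accuracies
    (proxy for contributions) and their received-model accuracies (rewards),
    as in Equation 1."""
    x = np.asarray(standalone_acc, dtype=float)
    y = np.asarray(received_acc, dtype=float)
    sx, sy = x.std(ddof=1), y.std(ddof=1)          # corrected standard deviations
    cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
    return cov / (sx * sy)

# Hypothetical accuracies for 5 participants (illustrative values only)
standalone = [0.72, 0.83, 0.90, 0.91, 0.93]
received   = [0.86, 0.88, 0.92, 0.93, 0.95]
print(f"fairness = {collaborative_fairness(standalone, received):.3f}")
```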

Collaborative Fairness under (Non-)I.I.D Data. Following the above representation of participants' contributions and rewards, we discuss the relationship between this notion of collaborative fairness and the data distributions over the participants. The data distribution over the participants refers to how these participants collect/sample their data, whether or not from the true distribution, either uniformly or in some biased way. A commonly used assumption is that the participants' data are independently and identically drawn (I.I.D) from some underlying population, which enables statistical analysis including asymptotic unbiasedness, convergence rates, etc. Under this setting, the participants have statistically equivalent datasets, and thus equal contributions in expectation. Consequently, our notion of collaborative fairness coincides with egalitarian fairness, i.e., treating/rewarding everyone equally by giving them the same model, as in common FL frameworks such as FedAvg. On the other hand, under a non-I.I.D data distribution, which is difficult to analyze statistically in general, an empirical approach lyu2020Collaborative; Hwee2020-icml can be used instead. mcmahan2017communication considered a pathological non-I.I.D partition of the MNIST dataset where each participant has examples of at most 2 digits. fung2018mitigating considered similar settings by varying the degree of disjointness of the participants' datasets, where the most non-I.I.D setting corresponds to completely disjoint datasets among the participants. These non-I.I.D settings present theoretical challenges for analytically comparing and evaluating the datasets. Moreover, in practice, it is infeasible (due to privacy and confidentiality issues) to examine the datasets of all the participants. Our empirical approach thus finds applications under non-I.I.D data distributions, because in order to treat the participants fairly we first need to compare their contributions.

Robustness

For robustness, we consider the threat model of Byzantine fault tolerance due to blanchard2017machine.

Definition 1.

Threat Model (blanchard2017machine; yin2018byzantine). In round $t$, an honest participant $i$ uploads its true local gradient, while a dishonest participant/adversary can upload arbitrary values:

$$g_i^{(t)} = \begin{cases} \nabla F_i(\theta^{(t)}) & \text{if participant } i \text{ is honest} \\ * & \text{if participant } i \text{ is Byzantine} \end{cases} \qquad (2)$$

where "$*$" represents arbitrary values and $F_i$ represents participant $i$'s local objective function.

In more detail, we investigate three types of attacks: (1) targeted poisoning attack with a specific objective; (2) untargeted poisoning attack that aims to compromise the integrity of the system; and (3) free-riders who aim to benefit from the global model, without really contributing.

Targeted poisoning. We consider a particular type of targeted poisoning called label-flipping, in which the labels of training examples are flipped to a target class biggio2011support. For instance, in MNIST a ‘1-7’ flip refers to training on images of ‘1’ but using ‘7’ as the labels.
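As a minimal illustration (not code from our implementation), a label-flipping adversary only needs to relabel its local training data before training proceeds as usual; assuming the labels are held in a NumPy array, the '1-7' flip could look like:

```python
import numpy as np

def flip_labels(labels, source=1, target=7):
    """Label-flipping poisoning: relabel every `source`-class example as `target`.
    Local training then proceeds normally on the poisoned labels."""
    poisoned = np.array(labels, copy=True)
    poisoned[poisoned == source] = target
    return poisoned

# e.g., y_poisoned = flip_labels(y_train) turns every '1' into a '7'-labelled example
```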

Untargeted poisoning. We consider three types of untargeted poisoning defined in bernstein2019signsgd. Specifically, after local training and before uploading, the adversary may (i) arbitrarily rescale gradients; or (ii) randomize the element-wise signs of the gradients; or (iii) randomly invert the element-wise values of the gradients.

Free-riders. Free-riders represent participants who are unwilling to contribute their gradients due to data privacy concerns or computational costs, but who want to access the jointly trained model for free. There are no specific restrictions on their behaviors; they typically upload random or noisy gradients.
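All three untargeted manipulations and the free-rider behaviour act directly on the uploaded gradient vector. The sketch below (PyTorch) shows one plausible realization of each; the rescaling factor and inversion probability are illustrative assumptions, not values from our experiments:

```python
import torch

def rescale_attack(grad, factor=100.0):
    # (i) arbitrarily re-scale the gradient (the factor here is illustrative)
    return factor * grad

def sign_randomizing_attack(grad):
    # (ii) keep element-wise magnitudes but randomize the signs
    signs = (torch.rand_like(grad) < 0.5).to(grad.dtype) * 2 - 1
    return grad.abs() * signs

def value_inverting_attack(grad, p=0.5):
    # (iii) randomly invert element-wise values with probability p
    mask = (torch.rand_like(grad) < p).to(grad.dtype)
    return grad * (1 - 2 * mask)

def free_rider_upload(shape):
    # free-rider: upload noise drawn uniformly from [-1, 1] instead of a real gradient
    return 2 * torch.rand(shape) - 1
```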

RFFL Realization via Reputation

Our RFFL makes two important modifications to the conventional FL framework: first in the aggregation rule for the gradients, and then in the downloading rule for the participants. The most common choice of aggregation rule in FL is FedAvg, i.e., weighted averaging by data size mcmahan2017communication:

$$\theta^{(t+1)} = \sum_{i=1}^{N} \frac{n_i}{\sum_{j=1}^{N} n_j}\, \theta_i^{(t+1)} \qquad (3)$$

where $n_i$ represents the data size of participant $i$ and $\theta_i$ represents the parameters of participant $i$'s local model. In our RFFL framework, the server adopts a reputation-weighted aggregation rule:

$$\Delta\theta^{(t)} = \sum_{i \in R} \frac{r_i^{(t)}}{\sum_{j \in R} r_j^{(t)}}\, \Delta\theta_i^{(t)} \qquad (4)$$

where $r_i^{(t)}$ is participant $i$'s reputation in round $t$, $R$ is the set of reputable participants, and $\Delta\theta_i^{(t)}$ is participant $i$'s uploaded gradient. The reputation-weighted aggregation suppresses gradients from 'weaker' participants or potential adversaries. During the download step, in most works mcmahan2017communication; kairouz2019advances the participants download the entire global model. We propose to replace this by introducing a reputation-based quota which determines the number of gradient values to allocate to each participant. The server maintains and updates each participant's reputation using the cosine similarity between the participant's gradient and the reputation-weighted aggregated gradient, i.e., $\tilde{r}_i^{(t)} = \cos\big(\Delta\theta_i^{(t)}, \Delta\theta^{(t)}\big) = \frac{\langle \Delta\theta_i^{(t)}, \Delta\theta^{(t)} \rangle}{\|\Delta\theta_i^{(t)}\| \, \|\Delta\theta^{(t)}\|}$.

Subsequently, this updated reputation determines the number of values of the aggregated gradient to allocate to participant $i$ in round $t$, according to the "largest values" criterion. In summary, in round $t$, the server updates the reputations according to the cosine similarity between the individual gradients and the reputation-weighted aggregated gradient, and then uses the updated reputations to determine the number of aggregated gradient values to allocate to each participant. The detailed realization of RFFL is given in Algorithm 1.

Due to the stochasticity of the gradient descent method, the variance in the gradients may cause the cosine similarities in a single round to be an inaccurate approximation of a participant's contribution. To avoid wrongly under-estimating the reputation of an honest participant, we adopt an iterative approach that integrates the reputation of round $t$ with the past reputation. With this approach, RFFL stabilizes the reputations of the participants. The contributions of the participants can be inferred through these reputations, so the non-contributing participants, who may potentially be free-riders or adversaries, can be identified and removed.
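To make the server-side computation concrete, the sketch below implements one RFFL round on flattened gradient vectors: reputation-weighted aggregation (Equation 4), a cosine-similarity reputation update smoothed with the past reputation, threshold-based removal, and "largest values" gradient allocation. The exponential-moving-average form of the update, the clipping of negative similarities, and the reputation-proportional download quota are illustrative assumptions rather than the exact rules of Algorithm 1.

```python
import torch
import torch.nn.functional as F

def rffl_server_round(grads, reputation, alpha=0.95, threshold=1e-3):
    """One RFFL server round on flattened participant gradients.

    grads:      dict {participant_id: 1-D gradient tensor}
    reputation: dict {participant_id: current reputation value}
    Returns the updated reputations, the reputable set and the allocated gradients.
    """
    ids = list(grads.keys())
    # 1. Reputation-weighted aggregation (Equation 4).
    total_rep = sum(reputation[i] for i in ids)
    agg = sum((reputation[i] / total_rep) * grads[i] for i in ids)

    # 2. Reputation update: cosine similarity to the aggregate, smoothed with the
    #    past reputation (exponential moving average with fade coefficient alpha;
    #    clipping negative similarities to zero is an assumption made here).
    for i in ids:
        sim = F.cosine_similarity(grads[i], agg, dim=0).item()
        reputation[i] = alpha * reputation[i] + (1 - alpha) * max(sim, 0.0)

    # 3. Remove participants whose reputation falls below the threshold.
    reputable = {i for i in ids if reputation[i] >= threshold}

    # 4. Gradient allocation ("largest values" criterion): participant i downloads
    #    only its quota of the largest-magnitude coordinates of the aggregate;
    #    making the quota proportional to reputation is an assumption here.
    max_rep = max(reputation[i] for i in reputable)
    num_params = agg.numel()
    allocated = {}
    for i in reputable:
        quota = int(num_params * reputation[i] / max_rep)
        allocated[i] = torch.zeros_like(agg)
        if quota > 0:
            idx = agg.abs().topk(quota).indices
            allocated[i][idx] = agg[idx]
    return reputation, reputable, allocated
```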

We highlight that RFFL does not need an additional auxiliary/validation dataset barreno2010security; regatti2020befriending. In practice, obtaining an auxiliary dataset may be expensive or infeasible. Moreover, under a non-I.I.D distribution it is very difficult to ensure such an auxiliary dataset is representative of the datasets of all participants. With a non-I.I.D auxiliary dataset, some participants would be disadvantaged in the contribution evaluations.

Convergence Analysis

Based on four commonly adopted assumptions Li2020On-fedavg-noniid and without introducing additional constraints, we present two convergence results, for the server model and for each participant's local model in RFFL respectively. Specifically, the objective value achieved by the server model converges to the optimal value in expectation at a rate of $O(1/T)$. Each participant's local model $\theta_i$ converges asymptotically to the server model in expectation.

First we introduce the assumptions and the theorem used in Li2020On-fedavg-noniid, with the following notation: $F_i$ denotes the local objective function of participant $i$; $F$ denotes the global objective function; $\Theta$ denotes the parameter space; $F^* = \min_{\theta \in \Theta} F(\theta)$; and $T$ and $E$ denote the total number of communication rounds and local epochs, respectively.

Assumption 1.

Each $F_i$ is $L$-smooth: for all $v$ and $w$, $F_i(v) \le F_i(w) + (v-w)^{\top}\nabla F_i(w) + \frac{L}{2}\|v-w\|_2^2$.

Assumption 2.

Each $F_i$ is $\mu$-strongly convex: for all $v$ and $w$, $F_i(v) \ge F_i(w) + (v-w)^{\top}\nabla F_i(w) + \frac{\mu}{2}\|v-w\|_2^2$.

Assumption 3.

Based on mini-batch SGD, $\xi_i^{(t)}$ denotes the mini-batch selected uniformly at random by participant $i$ in round $t$. The variance of the stochastic gradients of each participant is bounded: $\mathbb{E}\,\|\nabla F_i(\theta_i^{(t)}, \xi_i^{(t)}) - \nabla F_i(\theta_i^{(t)})\|^2 \le \sigma_i^2$.

Assumption 4.

The expected value of the squared $\ell_2$-norm of the stochastic gradients is uniformly bounded: $\mathbb{E}\,\|\nabla F_i(\theta_i^{(t)}, \xi_i^{(t)})\|^2 \le G^2$ for all $i$ and $t$.

Assumptions 1 and 2 are standard in the analysis of $\ell_2$-norm regularized classifiers. Assumptions 3 and 4 were used by Zhang2013-convergence-assumption; Stich2018-convergence-assumption; Stich2019-convergence-assumption; Yu2019a-convergence-assumption; Li2020On-fedavg-noniid to study the convergence behavior of variants of SGD.

Definition 2.

Degree of Heterogeneity Li2020On-fedavg-noniid. $F^*$ and $F_i^*$ denote the minimum values of $F$ and $F_i$, respectively. $p_i$ denotes the weight of participant $i$ used in the weighted gradient aggregation by the server. The degree of heterogeneity among the data of the participants is $\Gamma = F^* - \sum_{i} p_i F_i^*$.

Theorem 1.

Li2020On-fedavg-noniid. Let Assumptions 1 to 4 hold with $L$, $\mu$, $\sigma_i$ and $G$ defined therein, and let $\kappa = L/\mu$. Choose $\gamma = \max\{8\kappa, E\}$ and the learning rate $\eta_t = \frac{2}{\mu(\gamma + t)}$. Then the FedAvg algorithm satisfies

$$\mathbb{E}[F(\theta^{(T)})] - F^* \le \frac{\kappa}{\gamma + T - 1}\left(\frac{2B}{\mu} + \frac{\mu\gamma}{2}\,\mathbb{E}\,\|\theta^{(1)} - \theta^*\|^2\right),$$

where $B = \sum_{i} p_i^2 \sigma_i^2 + 6L\Gamma + 8(E-1)^2 G^2$.

Theorem 2.

RFFL satisfies the conditions in Theorem 1 and so the conclusion of Theorem 1 applies to the server model in RFFL.

Proof.

In FedAvg, $p_i = \frac{n_i}{\sum_j n_j}$. In RFFL, $p_i = \frac{r_i}{\sum_{j \in R} r_j}$, where $R$ denotes the set of reputable participants and $r_i$ denotes the reputation of the $i$-th participant. Making this substitution and observing that the aforementioned assumptions still hold, it follows that Theorem 1 applies to the server model in RFFL. ∎

Theorem 3.

With a decaying learning rate $\eta_t$, the $i$-th participant's model in RFFL asymptotically converges to the server model in expectation. Formally, $\lim_{t \to \infty} \mathbb{E}\,\|\theta_i^{(t)} - \theta^{(t)}\| = 0$.

The proof is deferred to the appendix.

Remark 1.

We remark that the condition on the learning rate is not an artifact of our construction. Li2020On-fedavg-noniid have shown that for FedAvg with $E > 1$ in a non-I.I.D setting, a fixed learning rate without decay leads to a solution that remains bounded away from the optimal solution.

Input: reputation threshold $\delta$; participant $i$'s local dataset $D_i$; reputation fade coefficient $\alpha$; total communication rounds $T$; participant $i$'s reputation in round $t$, denoted $r_i^{(t)}$; reputable participant set $R$ (initially all participants) and the reputation vector $\mathbf{r}^{(0)}$, initialized equally and w.l.o.g. summing to 1; dataset sizes $n_i$; participant and server model parameters $\theta_i$ and $\theta$, respectively.
Participant $i$ (in round $t$):
if $i \in R$ then
     Computes its local gradient $\Delta\theta_i^{(t)}$ on $D_i$
     Sends the local gradient $\Delta\theta_i^{(t)}$ to the server
     Downloads the allocated gradient values $\Delta\theta_{(i)}^{(t)}$ and integrates them with its local gradient
end if
Server (in round $t$):
Aggregation and reputation calculation:
Computes the reputation-weighted aggregate $\Delta\theta^{(t)}$ as in Equation 4
for $i \in R$ do
      Update reputation $r_i^{(t)}$ by integrating the cosine similarity $\cos(\Delta\theta_i^{(t)}, \Delta\theta^{(t)})$ with $r_i^{(t-1)}$ via the fade coefficient $\alpha$
     if $r_i^{(t)} < \delta$ then
           Remove participant $i$ from $R$ (reputation too low)
     end if
end for
Gradient allocation:
for $i \in R$ do
     Determines participant $i$'s quota $q_i$ from $r_i^{(t)}$
     Calculates the gradient values $\Delta\theta_{(i)}^{(t)}$ allocated to participant $i$ from $\Delta\theta^{(t)}$ according to the "largest values" criterion with quota $q_i$
end for
Algorithm 1 Robust and Fair Federated Learning (RFFL)

Experiments

Datasets

We conduct extensive experiments over different datasets including image and text classification. For image classification, we investigate MNIST lecun1998gradient and CIFAR-10 krizhevsky2009learning. For text classification, we consider the Movie Review (MR) pang2005seeing and Stanford Sentiment Treebank (SST) kim2014convolutional datasets.

Baselines

For accuracy analysis, we focus our comparison on FedAvg mcmahan2017communication and the Standalone framework, in which participants train standalone models on their local datasets without collaboration. FedAvg works well empirically and is expected to produce high performance since it imposes no additional restrictions to ensure fairness or robustness. On the other hand, the Standalone framework provides the accuracy lower bound that RFFL should exceed in order to incentivize a participant to join the collaboration.

For fairness performance, we focus our comparison on q-FFL li2019fair. In order to compute fairness (as defined in Equation 1) for FedAvg (which rewards all participants with the same model), we stipulate that after the entire FL training, each participant fine-tunes for 1 additional local epoch. We exclude the Standalone framework from this comparison because participants do not collaborate under that setting.

For robustness performance, we compare with FedAvg and some Byzantine-tolerant and/or robust FL frameworks including Multi-Krum blanchard2017machine, FoolsGold fung2018mitigating, SignSGD bernstein2019signsgd and Median yin2018byzantine.

Experimental Setup

In order to evaluate the effectiveness of our RFFL in realistic settings of heterogeneous data distributions, we investigate two heterogeneous data splits by varying the data set sizes and the class numbers respectively. We also investigate the I.I.D data setting (‘uniform’ split) for completeness.

Imbalanced dataset sizes. We follow a power law to randomly partition a total of {3000, 6000, 12000} MNIST examples among {5, 10, 20} participants respectively. In this way, each participant has a distinctly different number of examples, with the first participant having the fewest and the last participant having the most. We allocate on average 600 examples to each participant, consistent with the setting in mcmahan2017communication. We refer to this as the 'powerlaw' split; a possible realization is sketched below. The data splits for the CIFAR-10, MR and SST datasets follow a similar approach; the details are included in the appendix.
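A possible realization of the 'powerlaw' split (our own sketch; the exact partitioning code is not reproduced here, and the exponent is an illustrative assumption) draws participant shares from a power-law profile, normalizes them, and slices a shuffled index array accordingly:

```python
import numpy as np

def powerlaw_split(num_examples, num_participants, exponent=1.0, seed=0):
    """Partition example indices so that participant sizes follow a power-law profile:
    the first participant receives the fewest examples and the last the most."""
    rng = np.random.default_rng(seed)
    shares = np.arange(1, num_participants + 1, dtype=float) ** exponent
    shares /= shares.sum()
    sizes = (shares * num_examples).astype(int)
    sizes[-1] += num_examples - sizes.sum()           # absorb the rounding remainder
    indices = rng.permutation(num_examples)
    return np.split(indices, np.cumsum(sizes)[:-1])   # one index array per participant

# e.g., 6000 MNIST examples among 10 participants (600 on average)
print([len(p) for p in powerlaw_split(6000, 10)])
```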

Imbalanced class numbers. We vary the number of distinct classes in each participant's dataset, increasing from the first participant to the last. For this scenario, we only investigate the MNIST and CIFAR-10 datasets as they both contain 10 classes. We distribute the class counts in a linspace manner. For example, for MNIST with 10 classes in total and 5 participants, participants 1 to 5 own {1, 3, 5, 7, 10} classes of examples respectively, i.e., the first participant has data from only 1 class, while the last participant has data from all 10 classes. We first partition the training set according to the labels, then we sample and assign subsets of the training set with the corresponding labels to the participants. Note that under this setting, all participants have the same dataset size but different class numbers. We refer to this as the 'classimbalance' split.

Adversaries. We consider three types of adversaries on MNIST: targeted poisoning via label-flipping biggio2011support, untargeted poisoning via the blind multiplicative adversaries of bernstein2019signsgd, and free-riders. In each experiment, we evaluate RFFL against one type of adversary, and the number of adversaries is 20% of the number of honest participants. For targeted poisoning, the adversary uses '7' as the label for actual '1' images during its local training to produce 'crooked' gradients. For untargeted poisoning, we consider three sub-cases separately: 1) the adversary re-scales the gradients; 2) the adversary randomizes the signs of the gradients element-wise; and 3) the adversary randomly inverts the values of the gradients element-wise. For free-riders, we consider a simple type of free-rider who uploads gradients drawn randomly from the uniform distribution on [-1, 1]. We conduct experiments with adversaries under two data splits, the 'uniform' split and the 'powerlaw' split. The results reported here are with respect to the 'uniform' split; the results for the 'powerlaw' split are included in the appendix.

Model and Hyper-Parameters. For the MNIST experiments, we use a convolutional neural network. The hyperparameters are: local epochs $E = 1$, batch size $B = 16$, and local learning rate 0.15 for 5 participants and 0.25 for 10 or 20 participants, with exponential decay 0.977; the reputations of the participants are initialized equally, with reputation fade coefficient $\alpha$; and we train for a total of 60 communication rounds.

Further details on experimental settings, model architecture, hyperparameters, the hardware used and runtime statistics for all experiments are included in the appendix.

Experimental Results

Fairness comparison. Table 1 lists the calculated fairness of our RFFL, FedAvg and q-FFL over MNIST under varying participant numbers {10, 20}. Similarly, Table 2 presents the fairness results for CIFAR-10, MR and SST. From the high fairness values (some close to the theoretical limit of 1.0), we conclude that RFFL indeed ensures the participants receive models whose performance is commensurate with their contributions, thus providing collaborative fairness as claimed in our formulation. The results for the 5-participant case on both MNIST and CIFAR-10 are included in the appendix.

Framework P10 P20
UNI POW CLA UNI POW CLA
RFFL 83.36 98.33 99.81 75.19 97.88 99.64
FedAvg 31.2 77.33 64.53 3.85 3.58 70.83
q-FFL 2.77 22.44 63.16 27.1 17.61 78.57
Table 1: Fairness [%] of RFFL, FedAvg and q-FFL, under varying participant number settings (P10 and P20) and different data splits: {uniform (UNI), powerlaw (POW), classimbalance (CLA)}.
CIFAR-10 MR SST
Framework P10 P5 P5
UNI POW CLA POW POW
RFFL 81.93 98.78 99.89 99.59 65.88
FedAvg 42.9 40.58 79.34 22.22 64.18
q-FFL 39.39 34.5 4.76 52.03 24.72
Table 2: Fairness [%] over CIFAR-10, MR and SST achieved by RFFL, FedAvg and q-FFL.
Framework P10 P20
UNI POW CLA UNI POW CLA
RFFL 93.7 94.51 92.84 94.08 94.98 92.91
FedAvg 96.81 96.7 94.52 97.16 97.38 93.99
q-FFL 91.94 9.61 56.94 87.34 9.61 52.09
Standalone 93.42 94.54 92.82 93.32 94.6 92.54
Table 3: Maximum Accuracy [%] over MNIST under three data distributions: {uniform (UNI), powerlaw (POW), classimbalance (CLA)}, and varying numbers of participants {10, 20}, achieved by RFFL, FedAvg, q-FFL and the Standalone framework.
CIFAR-10 MR SST
Framework P10 P5 P5
UNI POW CLA POW POW
RFFL 49.3 53.01 47.18 61.54 30.45
FedAvg 60.98 64.15 49.9 66.98 34.43
q-FFL 31.57 10 10 22.88 26.79
Standalone 47.81 52.46 44.64 57.41 30.63
Table 4: Maximum Accuracy [%] over CIFAR-10, MR and SST under various experimental settings, achieved by RFFL, FedAvg, q-FFL, and the Standalone framework.

Accuracy comparison. Table 3 reports the corresponding accuracies on MNIST with 10 and 20 participants. Similarly, Table 4 provides the accuracies on CIFAR-10, MR and SST. For RFFL, because the participants receive models of different accuracies and we expect the most contributive participant to receive a model with performance comparable to that of FedAvg, we report the highest accuracy among the participants. For FedAvg, Standalone and q-FFL, we report the accuracy of the same participant. Overall, we observe that RFFL achieves accuracy comparable to the FedAvg baseline in many cases. More importantly, RFFL mostly outperforms the Standalone framework, suggesting that collaboration in RFFL reduces the generalization error. This advantage over the Standalone framework is an essential incentive for potential participants to join the collaboration. On the other hand, q-FFL's performance appears to fluctuate under different settings. This may be due to two possible reasons: the number of participants and the number of communication rounds are too small, as q-FFL relies on random sampling of participants and requires relatively more communication rounds to converge to equitable performance. In additional experiments with more participants (50) and more communication rounds (100), we found that q-FFL's performance stabilizes at a reasonable accuracy of around 90%. Furthermore, in our experiments we find that the training of RFFL experiences fewer fluctuations and converges quickly, as demonstrated in Figure 1.

System robustness comparison. For targeted poisoning, we consider two additional metrics fung2018mitigating: targeted class accuracy and attack success rate. Targeted class accuracy in our experiments corresponds to the test accuracy on images of digit '1'. Attack success rate corresponds to the proportion of '1' images incorrectly classified as '7'. In particular, we report the results for the best-performing participant in RFFL. As shown in Table 5, the original FedAvg is relatively robust against 20% label-flipping adversaries and performs quite well on all three metrics. This is mostly because the introduced 'crooked' gradients are outweighed by the gradients from the honest participants. For the attack success rate, all methods perform relatively well except SignSGD, indicating that these methods can resist the targeted attack. However, this does not necessarily imply that these methods retain high accuracy on the unaffected classes. For the maximum accuracy, only RFFL, FedAvg and Multi-Krum achieve good performance, suggesting that these methods are robust without compromising the overall performance. FoolsGold's performance with respect to the attack success rate is expected since the adversaries fit the definition of Sybils, sharing a common objective of misleading the model between '1' and '7'. However, the data split in this experiment is the 'uniform' split, which does not satisfy FoolsGold's assumption that "the training data is sufficiently dissimilar between clients (participants)" fung2018mitigating, so its performance on the other classes drops. In our additional experiments (in the appendix) including adversaries under the 'powerlaw' split, we do observe that FoolsGold performs relatively well in terms of both robustness and accuracy.

For the untargeted poisoning, Table 7, Table 8 and Table 9 compare the final test accuracies of the participants when each of the three types of adversaries is present, respectively. Note that we include the Standalone framework as a performance benchmark without collaboration and without adversaries. We observe that the adversaries receive considerably lower reputations, so they can be effectively identified and removed by setting an appropriate reputation threshold. These results collectively demonstrate that RFFL is overall the most robust. Furthermore, in RFFL, despite the existence of adversaries, the honest participants still receive performance improvements over their standalone models. We observe that Multi-Krum and FoolsGold are not robust against the untargeted poisoning. Multi-Krum is based on the distance between each gradient vector and the mean vector, and because the mean vector is not robust against these attacks, Multi-Krum is not robust in these cases. FoolsGold was designed specifically to be robust against Sybils with a common objective and is thus not robust against untargeted poisoning. Both SignSGD and Median demonstrate some degree of robustness. SignSGD is robust against the re-scaling and value-inverting attacks since it aggregates only the signs of the gradient values, not their magnitudes. Median utilizes the statistical median (which is robust against extreme outliers such as the rescaled and inverted values from the adversaries), and is thus able to achieve some degree of robustness, but compromises accuracy.

For the free-rider scenario, we observe that our RFFL is robust and can always identify and isolate free-riders in the early stages of collaboration (within 5 rounds), without affecting either accuracy or convergence. Furthermore, we also observe that FedAvg is robust in this situation. This is because the gradients uploaded by the free-riders have an expected value of zero (random values drawn from the uniform distribution on [-1, 1]), so the additional noise does not affect the asymptotic unbiasedness of the aggregated gradients in FedAvg. On the other hand, Multi-Krum exhibits some degree of robustness but compromises accuracy. FoolsGold is not robust against free-riders as it relies on the assumption that the honest participants produce gradients that are more random than those of the Sybils (who produce 'crooked' gradients pointing in the same direction over rounds); when free-riders upload completely random gradients, FoolsGold is not robust. For SignSGD, the free-riders are exactly the sign-randomizing adversaries, so the behavior is consistent. For Median, the reason behind the behavior is less straightforward; a more careful analysis would need to compare the magnitudes of the gradients from the honest participants and the free-riders. If the gradients from the honest participants collectively have large element-wise magnitudes, Median can be robust by simply treating the noise as extreme outliers. On the other hand, if the gradients from the honest participants are close to 0, Median may no longer be robust. We include all the experimental results for free-riders in the appendix.

In addition to the above settings with 20% adversaries, we also conduct experiments with an increased number of adversaries (110% of the honest participants) to test RFFL's Byzantine tolerance, and show the results in Table 6. We find that RFFL still achieves slightly higher predictive performance even when the adversaries outnumber the honest participants. We include the other corresponding experimental results in the appendix.

Framework Max accuracy Attack success rate Target accuracy
RFFL 93.8 0 98.8
FedAvg 96.8 0.2 98.8
FoolsGold 9.8 0 0
Multi-Krum 95.6 0.2 99.0
SignSGD 9.1 41.9 18.8
Median 0.3 0.5 0.1
Table 5: Maximum accuracy [%], Attack success rate [%] and Target accuracy [%] for MNIST with 10 honest participants and additional 20% label-flipping adversaries.
Framework Max accuracy Attack success rate Target accuracy
RFFL 94.2 0 99
FedAvg 90.87 48.6 49.3
FoolsGold 19.21 0 55
Multi-Krum 96.27 0 98.8
SignSGD 9.1 0 18.8
Median 8.21 0 72.3
Table 6: Maximum accuracy [%], Attack success rate [%] and Target accuracy [%] for MNIST with 10 honest participants and additional 110% label-flipping adversaries. 10 honest participants and 11 adversaries.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 92 92 94 91 92 93 92 92 92 92
FedAvg 10 10 10 10 10 10 10 10 10 10
FoolsGold 11 11 11 11 11 11 11 11 11 11
Multi-Krum 10 10 10 10 10 10 10 10 10 10
SignSGD 9 9 9 9 9 9 9 9 9 9
Median 1 1 1 1 1 1 1 1 1 1
Standalone 92 93 93 92 92 92 92 93 92 92
Table 7: Individual test accuracies [%] over MNIST uniform split with 10 honest participants and additional 20% sign-randomizing adversaries. Adversaries omitted.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 93 93 94 92 92 94 94 93 93 92
FedAvg 10 10 10 10 10 10 10 10 10 10
FoolsGold 10 10 10 10 10 10 10 10 10 10
Multi-Krum 10 10 10 10 10 10 10 10 10 10
SignSGD 50 58 62 58 59 64 66 57 57 57
Median 11 10 39 28 20 40 48 27 35 28
Standalone 92 93 93 92 92 92 92 93 92 92
Table 8: Individual test accuracies [%] over MNIST uniform split with 10 honest participants and additional 20% re-scaling adversaries. Adversaries omitted.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 92 93 94 92 92 93 93 93 93 92
FedAvg 9 9 9 9 9 9 9 9 9 9
FoolsGold 8 8 8 8 8 8 8 8 8 8
Multi-Krum 17 17 17 17 17 17 17 17 17 17
SignSGD 9 9 9 9 9 9 9 9 9 9
Median 1 1 1 1 1 1 1 1 1 1
Standalone 92 93 93 92 92 92 92 93 92 92
Table 9: Individual test accuracies [%] over MNIST uniform split with 10 honest participants and additional 20% value-inverting adversaries. Adversaries omitted.
Figure 1: Best participant test accuracy for the 10-participant case for MNIST under three data splits, i.e., from left to right {uniform, powerlaw, classimbalance}.

Discussions

Impact of reputation threshold. With a reputation threshold $\delta$, the server can stipulate a minimum empirical contribution for the participants. The reputation mechanism can then be used to detect and isolate adversaries and/or free-riders. A key challenge lies in the selection of an appropriate threshold, as fairness and accuracy may be inversely affected. For example, too small a $\delta$ might allow low-contribution participant(s) to sneak into the federated system without being detected. On the contrary, too large a $\delta$ might isolate too many participants to achieve meaningful collaboration. In our experiments, we empirically search for the most suitable values via grid search.

Fairness in heterogeneous settings. Sharing model updates is typically limited to homogeneous FL settings, i.e., the same model architecture across all participants. In heterogeneous settings, however, participants may train different types of local models. Therefore, instead of sharing model updates, participants can share model predictions on an unlabelled public dataset sun2020federated. In this heterogeneous FL context, the main algorithm proposed in this work is still applicable: the server can quantify the reputation of each participant based on their predictions, and then allocate the aggregated predictions accordingly, thus achieving fairness.
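As a minimal sketch of this extension (our assumption of how it could look, not an implementation from this work), the reputation update can operate on flattened prediction vectors over the public dataset in the same way it operates on gradients:

```python
import torch
import torch.nn.functional as F

def prediction_reputation_round(preds, reputation, alpha=0.95):
    """preds: dict {participant_id: flattened soft predictions on the public unlabelled set}.
    Aggregate the predictions by reputation and update reputations via cosine similarity,
    mirroring the gradient-based rule in RFFL."""
    total = sum(reputation.values())
    agg = sum((reputation[i] / total) * p for i, p in preds.items())
    for i, p in preds.items():
        sim = F.cosine_similarity(p, agg, dim=0).item()
        reputation[i] = alpha * reputation[i] + (1 - alpha) * max(sim, 0.0)
    return agg, reputation
```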

Conclusion

We propose a framework termed Robust and Fair Federated Learning (RFFL) to address collaborative fairness and robustness against Byzantine adversaries and free-riders. RFFL achieves these two goals by introducing reputations and iteratively evaluating the contribution of each participant in the federated learning system. Extensive experiments on various datasets demonstrate that our RFFL achieves accuracy comparable to FedAvg and better than the Standalone framework, and is robust against various types of adversaries under varying experimental settings. The empirical results suggest that our framework is versatile and works well under non-I.I.D data distributions, and hence fits a wide class of applications.

References

Appendix A Convergence Analysis: Proof of Theorem 3

Theorem.

With a decaying learning rate $\eta_t$, the $i$-th participant's model in RFFL asymptotically converges to the server model in expectation. Formally, $\lim_{t \to \infty} \mathbb{E}\,\|\theta_i^{(t)} - \theta^{(t)}\| = 0$.

Proof.

Let $\bar{G}$ denote an upper bound on the maximum $\ell_2$-norm of the gradients across participants; such a bound exists by Assumption 4, and $\bar{G}$ is introduced only for notational convenience.

The first inequality is derived by rearranging the terms and applying the triangle inequality; the second and third inequalities use the maximum $\ell_2$-norm defined above; the fourth inequality is by expanding the recursive formula; the fifth inequality is by collecting the terms involving $\bar{G}$; the sixth inequality is due to the identical initialization and taking expectation on both sides; and the last inequality is by taking the limit in $t$ and using the fact that $\eta_t$ decays. ∎

Appendix B Additional Experimental Results

Experimental Setup

Imbalanced dataset sizes. For CIFAR-10, we follow a power law to randomly partition a total of {10000, 20000} examples among {5, 10} participants respectively. For MR (SST), we follow a power law to randomly partition 9596 (8544) examples among 5 participants.

Model and Hyper-Parameters. We describe the model architectures as follows: a standard 2-layer CNN model for MNIST, a standard 3-layer CNN for CIFAR-10, and the text CNN model with the embedding space due to kim2014convolutional for MR and SST. We provide the framework-independent hyperparameters used for the different datasets in Table 10. Some framework-dependent hyperparameters are as follows. RFFL: reputation fade coefficient and reputation threshold. q-FFL: fairness coefficient and participant sampling ratio. SignSGD: momentum coefficient and parameter weight decay. FoolsGold: confidence. Multi-Krum: participant clip ratio. For these hyperparameters, we either use the default values introduced in their respective papers or apply grid search to empirically find suitable values.

Dataset B η (γ) T (E)
MNIST 16 0.15 (0.977) 60 (1)
CIFAR-10 64 0.015 (0.977) 200 (1)
MR 128 1e-4 (0.977) 100 (1)
SST 128 1e-4 (0.977) 100 (1)
Table 10: Framework-independent hyperparameters: batch size B, learning rate η, learning rate decay γ, total communication rounds/epochs T, local epochs E. Note that for experiments with more than 5 participants for MNIST and CIFAR-10, the learning rate is 0.25 and 0.025, respectively.

Runtime Statistics, Hardware and Software. We conduct our experiments on a machine with 12 cores (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz), 110 GB RAM and 4 GPUs (P100 Nvidia). Execution time for the experiments including only RFFL (all) frameworks: for MNIST (10 participants) approximately 0.6 (0.7) hours; for CIFAR-10 (10 participants) approximately 0.7 (4.3) hours; for MR and SST (5 participants) approximately 1.5 (2) hours.

Our implementation mainly uses PyTorch, torchtext, torchvision and some auxiliary packages such as Numpy, Pandas and Matplotlib. The specific versions and package requirements are provided together with the source code. To reduce the impact of randomness in the experiments, we adopt several measures: we fix the model initializations (we initialize model weights and save them for future experiments), fix all the random seeds, and invoke the deterministic behavior of PyTorch. As a result, given the same model initialization, our implementation is expected to produce consistent results on the same machine across experimental runs.
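For reference, the determinism measures described above typically amount to something like the following sketch (the exact calls needed depend on the PyTorch version):

```python
import random
import numpy as np
import torch

def set_deterministic(seed=0):
    """Fix all relevant random seeds and request deterministic PyTorch behaviour."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```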

Experimental Results

Comprehensive experimental results below demonstrate that RFFL is the only framework which performs consistently well over all the investigated situations, though it may not perform the best in every one of them. In practice, it is impossible to have prior knowledge of the type of adversaries, so we believe that a reasonable solution is a framework that is robust in a general sense and not only against a particular class of adversaries, as investigated by prior works.

5-participant Case for MNIST and CIFAR-10. We include the fairness and accuracy results for the 5-participant case for MNIST and CIFAR-10 under the three data splits in Table 11 and Table 12, respectively.

Free-riders. For better illustration and coherence, we include here the experimental results together with the participants’ reputation curves. Table 13 demonstrates the performance results for 20% free-riders in the 10-participant case for MNIST over ‘uniform’ split. Figure 2 demonstrates the reputations of the participants. It can be clearly observed that free-riders are isolated from the federated system at the early stages of collaboration (within 5 rounds).

Framework MNIST CIFAR-10
UNI POW CLA UNI POW CLA
RFFL 85.12 98.45 99.64 95.99 99.58 99.93
FedAvg 20.27 95.10 55.86 16.92 84.76 86.20
q-FFL 59.69 21.91 54.73 29.55 55.8 4.82
Table 11: Fairness [%] of RFFL, FedAvg and q-FFL, with 5 participants for MNIST and CIFAR-10 under different data splits: {uniform (UNI), powerlaw (POW), classimbalance (CLA)}.
Framework MNIST CIFAR-10
UNI POW CLA UNI POW CLA
RFFL 94.78 94.81 92.47 49.3 52.82 46.46
FedAvg 96.28 96.16 92.15 56.41 59.48 48.59
q-FFL 95.13 85.24 54.22 19.29 10 10
Standalone 93.46 94.52 91.91 47.32 52.74 45.21
Table 12: Maximum Accuracy [%] of RFFL, FedAvg, q-FFL and the Standalone framework, with 5 participants for MNIST and CIFAR-10 under different data splits: {uniform (UNI), powerlaw (POW), classimbalance (CLA)}.

Adversarial Experiments with the ‘powerlaw’ Split. We conduct experiments with adversaries under two data splits, the ‘uniform’ split and the ‘powerlaw’ split. We have included the experimental results with respect to the ‘uniform’ split in the main paper and supplement here the experimental results with respect to the ‘powerlaw’ split. Table 14, Table 15, Table 16, Table 17 and Table 18 show the respective results for the targeted poisoning adversaries, three untargeted poisoning adversaries and free-riders.

Adversarial Experiments with Adversaries as the Majority. As an extension, we also conduct experiments with an increased number of adversaries to test RFFL's Byzantine tolerance. Our experimental results in Table 6, Table 19, Table 20, Table 21, and Table 22 demonstrate that RFFL consistently achieves competitive performance against various types of adversaries even when the adversaries are the majority in the system.

Framework 1 2 3 4 5 6 7 8 9 10
RFFL 92 93 94 92 93 93 91 92 93 92
FedAvg 97 97 97 97 97 97 97 97 97 97
FoolsGold 11 11 10 11 10 10 11 11 10 11
Multi-Krum 61 61 64 57 60 62 62 60 62 57
SignSGD 9 9 9 9 9 9 9 9 9 9
Median 1 1 1 1 1 1 1 1 1 1
Standalone 92 93 93 92 92 92 93 93 92 92
Table 13: Individual test accuracies [%] over MNIST ‘uniform’ split with 10 honest participants and additional 20% free-riders. Free-riders omitted.
Framework Max accuracy Attack success rate Target accuracy
RFFL 95.01 0 98.70
FedAvg 97.22 0.20 98.80
SignSGD 9.11 41.90 18.80
FoolsGold 9.80 0 0.00
Multi-Krum 96.13 0 98.90
Median 0.09 0.20 0.20
Table 14: Maximum accuracy [%], Attack success rate [%] and Target accuracy [%] over MNIST ‘powerlaw’ split with 10 honest participants and additional 20% label-flipping adversaries.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 86 88 91 92 93 93 94 94 95 94
FedAvg 97 97 97 97 97 97 97 97 97 97
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 80 78 81 83 84 86 86 87 87 88
Multi-Krum 96 96 96 96 96 96 96 96 96 97
Median 1 1 1 1 1 1 1 1 1 1
Standalone 72 83 90 91 93 93 93 94 94 94
Table 15: Individual test accuracies [%] over MNIST ‘powerlaw’ split with 10 honest participants and additional 20% sign-randomizing adversaries. Adversaries omitted.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 86 88 92 92 93 93 94 94 95 94
FedAvg 10 10 10 10 10 10 10 10 10 10
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 93 93 93 93 93 93 93 93 93 93
Multi-Krum 10 10 10 10 10 10 10 10 10 10
Median 1 1 1 1 1 1 1 1 1 1
Standalone 72 83 90 91 93 93 93 94 94 94
Table 16: Individual test accuracies [%] over MNIST ‘powerlaw’ split with 10 honest participants and additional 20% re-scaling adversaries. Adversaries omitted.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 73 83 91 91 93 93 94 94 95 94
FedAvg 10 10 10 10 10 10 10 10 10 10
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 10 10 10 10 10 10 10 10 10 10
Multi-Krum 9 9 9 9 9 9 9 9 9 9
Median 0 0 0 0 0 0 0 0 0 0
Standalone 72 83 90 91 93 93 93 94 94 94
Table 17: Individual test accuracies [%] over MNIST ‘powerlaw’ split with 10 honest participants and additional 20% value-inverting adversaries. Adversaries omitted.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 86 89 90 92 93 93 94 94 95 95
FedAvg 97 97 97 97 97 97 97 97 97 97
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 10 10 10 9 10 10 11 11 10 10
Multi-Krum 53 57 58 58 55 53 56 59 58 61
Median 1 1 1 1 1 1 1 1 1 1
Standalone 72 83 90 91 92 93 93 94 94 94
Table 18: Individual test accuracies [%] over MNIST ‘powerlaw’ split with 10 honest participants and additional 20% free-riders. Adversaries omitted.
Figure 2: Reputations of the participants including free-riders for the ‘uniform’ (left) and ‘powerlaw’ (right) splits in RFFL. The two free-riders are very quickly assigned with reputations lower than the reputation threshold and thus can be identified and removed by the system.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 93 93 94 91 92 93 93 92 92 92
FedAvg 96 96 96 96 96 96 96 96 96 96
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 61 58 64 60 62 66 54 58 60 58
Multi-Krum 95 94 96 95 96 95 96 95 95 95
Median 1 1 1 1 1 1 1 1 1 1
Standalone 92 93 93 92 92 93 92 93 92 92
Table 19: Individual test accuracies [%] over MNIST ‘uniform’ split with 10 honest participants and additional 110% sign-randomizing adversaries. 10 honest participants with 11 adversaries.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 93 92 94 92 93 93 93 92 93 93
FedAvg 10 10 10 10 10 10 10 10 10 10
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 11 11 11 11 11 11 11 11 11 11
Multi-Krum 10 10 10 10 10 10 10 10 10 10
Median 93 93 93 93 93 93 93 93 93 93
Standalone 92 93 93 92 92 92 92 93 92 92
Table 20: Individual test accuracies [%] over MNIST ‘uniform’ split with 10 honest participants and additional 110% re-scaling adversaries. 10 honest participants with 11 adversaries.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 93 92 94 92 93 93 93 92 93 93
FedAvg 9 9 9 9 9 9 9 9 9 9
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 10 10 10 10 10 10 10 10 10 10
Multi-Krum 18 18 18 18 18 18 18 18 18 18
Median 9 9 9 9 9 9 9 9 9 9
Standalone 92 93 93 92 92 92 92 93 92 92
Table 21: Individual test accuracies [%] over MNIST ‘uniform’ split with 10 honest participants and additional 110% value-inverting adversaries. 10 honest participants with 11 adversaries.
Framework 1 2 3 4 5 6 7 8 9 10
RFFL 92 94 93 92 92 93 93 93 92 92
FedAvg 97 97 97 97 97 97 97 97 97 97
SignSGD 9 9 9 9 9 9 9 9 9 9
FoolsGold 10 10 10 10 10 10 10 10 10 10
Multi-Krum 51 52 45 46 41 47 43 46 47 47
Median 1 1 1 1 1 1 1 1 1 1
Standalone 92 93 93 92 92 92 92 93 92 92
Table 22: Individual test accuracies [%] over MNIST ‘uniform’ split with 10 honest participants and additional 110% free-riders. 10 honest participants and 11 free-riders.