Federated learning is a training paradigm that has gained popularity in recent years, as it enables different clients to jointly learn a global model without sharing their respective data. It is particularly suited to machine learning applications in domains where data security is critical, such as healthcare Brisimi ; silva2019federated . The relevance of this approach is attested by the large-scale federated learning initiatives currently under way in the medical domain, for instance for learning predictive models of breast cancer (blogs.nvidia.com/blog/2020/04/15/federated-learning-mammogram-assessment/), or for drug discovery and development (www.imi.europa.eu/projects-results/project-factsheets/melloddy).
In this setting, aggregation results entail critical information beyond the data itself: a model trained on exclusive datasets may have very high commercial or intellectual value. This critical aspect can lead to the emergence of opportunistic behaviors in federated learning, where ill-intentioned clients may participate with the sole aim of obtaining the federated model, without actually contributing any data during the training process. In particular, the attacker, or free-rider, aims at disguising its participation in federated learning while ensuring that the iterative training process ultimately converges to the desired target: the aggregated model of the fair participants.
The study of the security and safety of federated learning is an active research domain, and several kinds of attacks have been studied. For example, an attacker may interfere during the iterative federated learning procedure to degrade or modify the models' performance Bhagoji2019 ; LiDataPoisoning2016 ; Yin2018 ; xie2019dba ; Shen2016 , or retrieve information about other clients' data Wang2019 ; Hitaj2017 . Yet, free-riding in federated learning has so far been under-investigated in the literature. To the best of our knowledge, the only investigation is a preliminary work Lin2019 focusing on attack strategies operated on federated learning based on gradient aggregation. However, no theoretical guarantees are provided for the effectiveness of the proposed attacks. Furthermore, this setup is impractical in many real-world applications, where federated training schemes based on model averaging are instead more common, due to the reduced data exchange across the network. FedAvg BrendanMcMahan2017 is the most representative framework of this kind, as it is based on the iterative averaging of the clients' model parameters, after updating each client model for a given number of training epochs at the local level. To improve the robustness of FedAvg in non-iid and heterogeneous learning scenarios, FedProx LiFedProx2018 extends FedAvg by including a regularization term penalizing local departures of clients’ parameters from the global model.
The contribution of this work is a theoretical framework for the design and study of free-rider attacks in federated learning schemes based on model averaging, such as FedAvg and FedProx. The problem is formalized here by reformulating federated learning as a stochastic dynamical system describing the evolution of the aggregated parameters across iterations. A critical requirement for this kind of opportunistic attack is to ensure the convergence of the training process to the desired target, represented by the aggregated model of the fair clients. We show that the proposed framework allows us to derive explicit conditions guaranteeing the success of the attack. This is an important theoretical feature, as it is of primary interest for the attacker not to interfere with the learning process.
We first derive in Section 2.3 a basic free-riding strategy that guarantees the convergence of federated learning to the model of the fair participants. This strategy simply consists in returning at each iteration the received global parameters. As this behavior can easily be detected by the server, we build more complex strategies to disguise the free-rider contribution to the optimization process, based on suitable stochastic perturbations of the parameters. We demonstrate in Section 2.4 that this strategy does not alter the convergence of the global model, and in Section 3 we experimentally demonstrate our theory on a number of learning scenarios in both iid and non-iid settings. We conclude by investigating advanced strategies for designing and identifying free-rider attacks. All proofs and additional material are provided in the Appendix.
2.1 Federated learning through model aggregation: FedAvg and FedProx
In federated learning, we consider a set of participating clients, each owning a dataset composed of samples. During optimization, it is generally assumed that the elements of the clients’ parameters vector and the global parameters are aggregated independently at each iteration round . Following this assumption, and for simplicity of notation, in what follows we restrict our analysis to a single parameter entry, which will be generally denoted by and for clients and server respectively.
In this setting, to estimate a global model across clients, FedAvg BrendanMcMahan2017 is an iterative training strategy based on the aggregation of local model parameters . At each iteration step , the server sends the current global model parameters to the clients. Each client updates the model by minimizing over epochs the local cost function initialized with , and subsequently returns the updated local parameters to the server. The global model parameters at the iteration step are then estimated as a weighted average:
where represents the total number of samples across distributed datasets. FedProx LiFedProx2018 builds on FedAvg by adding to the cost function an L2 regularization term penalizing the deviation of the local parameters from the global parameters . The new cost function is where is the hyperparameter controlling the regularization by enforcing proximity between local updates and the global model.
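As a minimal sketch of the two schemes described above, and not the authors' implementation, the snippet below aggregates a single parameter entry as a sample-size-weighted average and builds a FedProx-style local cost. The function names, the restriction to one scalar parameter, and the `mu / 2` scaling of the proximal term (the convention commonly used for FedProx) are assumptions made for illustration.

```python
from typing import Callable, Sequence


def fedavg_aggregate(client_params: Sequence[float],
                     client_sizes: Sequence[int]) -> float:
    """Sample-size-weighted average of one parameter entry across clients."""
    total = sum(client_sizes)
    return sum(n * p for n, p in zip(client_sizes, client_params)) / total


def fedprox_objective(local_loss: Callable[[float], float],
                      theta_global: float,
                      mu: float) -> Callable[[float], float]:
    """Local cost plus an L2 penalty on the distance to the global parameter."""
    def objective(theta: float) -> float:
        # Proximal term: (mu / 2) * (theta - theta_global)^2 (assumed scaling).
        return local_loss(theta) + 0.5 * mu * (theta - theta_global) ** 2
    return objective


# Two hypothetical clients holding 100 and 300 samples.
new_global = fedavg_aggregate([1.0, 3.0], [100, 300])
```

Setting `mu` to zero in `fedprox_objective` recovers the unregularized local cost, which is the sense in which FedProx builds on FedAvg.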
2.2 Formalizing Free-rider attacks
The strategy of a free-rider consists in participating in federated learning while feigning local updates, by sharing suitably counterfeited parameters, with the aim of obtaining the aggregated model of the fair clients.
We denote by the set of fair clients, i.e. clients following the federated learning strategy of Section 2.1, and by the set of free-riders, i.e. malicious clients pretending to participate in the learning process, such that and . We denote by the number of samples declared by the free-riders, and we introduce the parameter which quantifies the amount of free-riders’ samples relative to the total number of training samples .
We assume that, in the absence of free-riders, clients’ parameters observed during federated learning are realisations from time-varying processes , where are smoothly convergent functions and (this assumption will be relaxed in Section 2.4.2). The participation of the free-riders in federated learning implies that the processes of the fair clients are perturbed by the attacks throughout training. In particular, perturbing the aggregation (1) at the server level implies, on each client’s side, a perturbation of the initial condition for the local optimization problem. We therefore assume that each fair client’s parameter evolution under free-rider attacks follows the perturbed trajectory , with a Gaussian white noise process. The perturbation is proportional to the number of samples declared by the free-riders, and its magnitude is controlled by the parameter . This assumption accounts for the potential non-convexity of the problem at the client’s level where, for a certain perturbation of the aggregated parameters , the parameters returned by the fair client would fall in a neighbourhood of the original process . As we shall see in the following sections, the extent of the perturbation, i.e. the noise level , determines the quality of the free-rider attack.
In the next Section we show that a basic free-rider strategy simply consists in returning at each iteration the received global parameters: . We call this type of attack plain free-riding.
2.3 Plain free-riding
The plain free-rider returns the same model parameters as the received ones, i.e. . In this setting, the server aggregation process (1) can be rewritten as:
Before proceeding with the development of equation (2), to gain a better intuition of FedAvg under free-rider attacks, let us consider the simplified setting in which the fair clients' parameters are assumed to be constant, i.e. . This is equivalent to assuming that the optimization process is close enough to the optimum for all the clients, which return the same local parameters at each global update step. We rewrite the server aggregation process (2) as follows:
Considering an infinitesimal increment of time, we obtain the first order differential equation:
for which the solution is , where is the training initialization. This expression shows that in this simple setting, the global model with plain free-riders converges to the aggregated model of the fair clients: since we obtain . Also, the relative sample size declared by the free-riders influences the convergence with exponential speed . In practice, the smaller the ratio of data declared by the free-riders, the faster the trained model converges to its optimum, thus approaching the final model obtained with fair clients only. The limit case , i.e. , which corresponds to only free-riders participating in federated learning (), leads to the trivial result . In this case there is no learning throughout the training process. We now generalize equation (3) to the time-varying setting of equation (2):
leading to the first order stochastic differential equation with Wiener noise variables :
Theorem 1 (Plain free-riding).
Assuming that , federated learning with plain free-riders (4) converges in expectation to the aggregated federated model of the fair clients:
Since the output of plain free-riders is deterministic, this result arises from the perturbed process of the fair clients. On the one hand, as long as the perturbation introduced by the attack is small enough, plain free-riders can expect to converge to the fair clients’ aggregation with convergence speed . On the other hand, Theorem 1 shows that the uncertainty of the process depends on the trade-off between the contributions of free-riders and fair clients. We note that in the limit case (only fair participants) the uncertainty is uniquely due to the fair clients’ variability governed by , as the first term of (8) is zero. With only free-riders () we obtain the trivial solution discussed for equation (4).
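The simplified constant-parameter setting of this section can be checked numerically. The sketch below is an illustration under stated assumptions rather than the experimental code: fair clients always return the same scalar parameter, a single plain free-rider returns the global parameter it received, and the server applies the sample-size-weighted average. All names and numerical values are hypothetical.

```python
from typing import List, Sequence


def plain_free_rider_rounds(theta0: float,
                            fair_params: Sequence[float],
                            fair_sizes: Sequence[int],
                            free_rider_size: int,
                            n_rounds: int) -> List[float]:
    """Iterate the server aggregation with constant fair clients and one plain free-rider."""
    total = sum(fair_sizes) + free_rider_size
    fair_part = sum(n * p for n, p in zip(fair_sizes, fair_params))
    theta = theta0
    history = [theta]
    for _ in range(n_rounds):
        # The free-rider sends back the global parameter unchanged.
        theta = (fair_part + free_rider_size * theta) / total
        history.append(theta)
    return history


# Fair clients alone would aggregate to (100 * 1.0 + 300 * 3.0) / 400 = 2.5.
history = plain_free_rider_rounds(0.0, [1.0, 3.0], [100, 300], 400, 50)
fair_aggregate = 2.5
gap_ratio = (fair_aggregate - history[1]) / (fair_aggregate - history[0])
```

In this toy run the free-rider declares half of the total samples, so the gap to the fair clients' aggregate is halved at every round (`gap_ratio` is 0.5), which is the discrete counterpart of the exponential convergence discussed above.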
2.4 Disguised free-riding
Plain free-riders can be easily detected by the server, since for each iteration the condition is true. We study here improved attack strategies based on the sharing of suitably disguised parameters. Free-riders should ideally return disguised parameters such that in expectation we obtain . Simulating the behavior of a fair client also requires the free-rider to simulate convergence: the magnitude of the update should be larger at the beginning of training and approach zero towards the end.
In what follows we investigate sufficient conditions on the perturbation models to obtain the desired convergence behaviour of free-rider attacks.
2.4.1 Attacks based on additive stochastic perturbations
A disguised free-rider with additive noise generalizes the plain one, and uploads parameters . Here, the perturbation is assumed to be Gaussian white noise, and is a suitable time-varying perturbation compatible with the free-rider attack. In this new setting, we can rewrite the FedAvg aggregation process (1) as follows:
Extending the plain case, the relationship leads to the following stochastic differential equation:
Theorem 2 (Single disguised free-rider).
The extension of this result to the case of multiple free-riders requires accounting in equation (10) for an attack of the form , where is the sample size declared by free-rider . Corollary 1 follows from the linearity of this functional.
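To make the additive-noise attack concrete, the following sketch simulates a single disguised free-rider in the same simplified setting used for intuition in Section 2.3, where the fair clients' aggregate is held constant. The power-law decay `sigma * t ** (-gamma)` is only one illustrative choice of time-varying perturbation, and every name and value here is an assumption rather than part of the original method.

```python
import random
from typing import List


def disguised_free_rider_rounds(theta0: float,
                                fair_aggregate: float,
                                rho: float,
                                sigma: float,
                                gamma: float,
                                n_rounds: int,
                                seed: int = 0) -> List[float]:
    """Simulate aggregation when the free-rider returns the global parameter plus decaying noise.

    rho is the fraction of the total sample size declared by the free-rider.
    """
    rng = random.Random(seed)
    theta = theta0
    history = [theta]
    for t in range(1, n_rounds + 1):
        # Assumed perturbation profile: Gaussian noise scaled by sigma * t^(-gamma).
        phi_t = sigma * t ** (-gamma)
        free_rider_param = theta + phi_t * rng.gauss(0.0, 1.0)
        theta = (1.0 - rho) * fair_aggregate + rho * free_rider_param
        history.append(theta)
    return history


history = disguised_free_rider_rounds(0.0, 2.5, rho=0.5, sigma=0.1,
                                      gamma=1.0, n_rounds=2000)
```

Because the injected noise shrinks over the rounds, the simulated global parameter ends up close to the fair aggregate, whereas the free-rider's uploads are no longer identical to the parameters it received.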
2.4.2 Time-varying noise model of fair-clients evolution
Theorem 2 provides us with conditions on the decay of the noisy update an attacker should design to ensure convergence of the process. Interestingly, the general decaying shape identified for can be seamlessly translated to define sufficient conditions for the time-varying variability of the fair clients, to ensure compatibility with the federated learning process.
Let the fair clients evolve according to , where are smoothly convergent functions and is delta-correlated unit variance Gaussian white noise. If the functions are such that , with , then the aggregation process of federated learning in equation (1) is such that , and . Under the same conditions, the asymptotic variance of Theorems 1 and 2 reduces to .
Corollary 2 enables the generalization of our theory to more realistic noise models for the fair clients. We can indeed relax the initial stationarity assumption on the variability of the parameters’ evolution, to account for smoothly decaying perturbations when approaching the client optima during training. Furthermore, it is interesting to notice that in this case the variability of the global model under free-rider attacks is uniquely due to the perturbation induced by the free-riders.
This experimental section focuses on a series of benchmarks for the proposed free-rider attacks. The methods being of general application, the focus here is to empirically demonstrate our theory on diverse experimental setups and model specifications. All code, data and experiments are available at https://github.com/Accenture/Labs-Federated-Learning/tree/free-rider_attacks/
3.1 Experimental Details
We consider 5 fair clients for each of the following scenarios:
MNIST (classification in iid and non-iid settings). We study a standard classification problem on MNIST Lecun1998 and create two benchmarks: an iid dataset (MNIST iid) where we give 600 training digits and 300 testing digits to each client, and a non-iid dataset (MNIST non-iid), where for each digit we create two shards with 150 training samples and 75 testing samples, and allocate 4 shards to each client. For each scenario, we use a logistic regression predictor.
Shakespeare (LSTM prediction). We study an LSTM model for next-character prediction on the dataset of The Complete Works of William Shakespeare BrendanMcMahan2017 . We randomly choose 5 clients with more than 3000 samples, and assign 70% of the dataset to training and 30% to testing. Each client has on average samples ( ) . We use a two-layer LSTM classifier containing 100 hidden units with an 8-dimensional embedding layer. The model takes as input a sequence of 80 characters, embeds each of the characters into a learned 8-dimensional space, and outputs one character per training sample after 2 LSTM layers and a fully connected one.
We train federated models following in turn the FedAvg and FedProx aggregation processes. In FedProx, the hyperparameter controlling the regularization is chosen according to the best performing scenario reported in LiFedProx2018 : for MNIST (iid and non-iid), and for Shakespeare. For the free-rider we declare a number of samples equal to the average sample size across fair clients. We test federated learning with 5 and 20 local epochs using SGD optimization with learning rate for MNIST (iid and non-iid), and for Shakespeare, and batch size of 100.
3.2 Free-rider attacks: convergence and performances
Designing a free-rider attack requires specifying a perturbation function , for a general parameter . While the design of optimally disguised free-rider attacks is out of the scope of this study, here we propose heuristics to tune the perturbation parameters according to practical hypotheses on the parameters' evolution. These hypotheses are discussed and refined in Section 3.3.
In the following experiments, we investigate free-rider attacks taking the simple form . The parameter is chosen among a panel of testing parameters . After random initialization at the initial federated learning step, the parameter is instead estimated to mimic the extent of the distribution of the update observed between consecutive rounds of federated learning. We can simply model these increments as a zero-centered univariate Gaussian distribution, and assign the parameter to the value of the fitted standard deviation. According to this strategy, the free-rider would return parameters with perturbations distributed as the ones observed between two consecutive optimization rounds.
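The noise-level heuristic described above can be sketched as follows. This is an illustrative reading, not the released code: the increments between two consecutive global models are treated as samples from a zero-centered Gaussian, whose maximum-likelihood standard deviation is the root mean square of the increments. Function names and the toy vectors are hypothetical.

```python
import math
import random
from typing import List, Sequence


def estimate_noise_level(prev_global: Sequence[float],
                         new_global: Sequence[float]) -> float:
    """Standard deviation of a zero-centered Gaussian fitted to the global update."""
    increments = [b - a for a, b in zip(prev_global, new_global)]
    # With the mean fixed at zero, the maximum-likelihood variance is the mean square.
    return math.sqrt(sum(d * d for d in increments) / len(increments))


def disguise(global_params: Sequence[float],
             noise_level: float,
             seed: int = 0) -> List[float]:
    """Return the received parameters perturbed by Gaussian noise of the fitted scale."""
    rng = random.Random(seed)
    return [p + noise_level * rng.gauss(0.0, 1.0) for p in global_params]


prev_global = [0.0, 0.0, 0.0, 0.0]
new_global = [1.0, -1.0, 1.0, -1.0]
sigma = estimate_noise_level(prev_global, new_global)
fake_update = disguise(new_global, sigma)
```

Fixing the mean at zero, rather than estimating it, matches the zero-centered model stated in the text; a decay over the rounds would be applied on top of this fitted scale.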
Figure 1 shows the evolution of the model obtained with FedAvg (20 local training epochs) with respect to different scenarios: 1) fair clients only, 2) plain free-rider, 3) disguised free-rider with decay parameter , and estimated noise level , and 4) disguised free-rider with noise level increased to . For each scenario, we compare the federated model obtained under free-rider attacks with the equivalent model obtained with the participation of the fair clients only. For this latter setting, to assess the model training variability, we repeated the training 30 times with different parameter initializations. The results show that, independently of the chosen free-riding strategy, the resulting models attain performance comparable to that of the model obtained with fair clients only. Similar results are obtained for the setup with 5 local training epochs and different values of (Appendix C.1). We also quantified the equivalence of the models parameter-wise, via the average L2 distance, and in terms of the overall parameter distribution, through the Kolmogorov-Smirnov (KS) test (Appendix C.2), confirming that for each scenario the free-riders converge to the fair clients' model, whereas the scenario seems to lead to larger dissimilarities. This result is in agreement with Theorem 2 and also suggests that the perturbation induced by the attacks is generally small.
We investigate the same training setup under the influence of multiple free-riders. In particular, we test the scenarios where the free-riders declare respectively and of the total training sample size. In practice, we maintain the same experimental setting composed of 5 fair clients, and we increase the number of free-riders to respectively 5 and 45, while declaring for each free-rider a sample size equal to the average number of samples of the fair clients.
Figure 2 shows that, independently of the magnitude of the perturbation function, the number of free-riders does not seem to affect the performance of the final aggregated model. On the contrary, the convergence speed is strongly affected, as it is noticeably slower in the free-rider scenarios. This result is confirmed when inspecting the dissimilarity with respect to the fair clients’ model in terms of L2 and KS measures (Appendix C.2), and is similar for the setup with 5 local training epochs, and with FedProx (Appendix C.1).
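The slowdown with many free-riders can be illustrated with a toy calculation. The sketch below assumes the simplified setting of Section 2.3 (constant fair clients, plain free-riders) and equal declared sample sizes for all participants, so that the free-riders' share of the samples is their share of the participants. It counts the rounds needed to come within a tolerance of the fair aggregate; it is not a reproduction of the experiments, and all names and values are hypothetical.

```python
def rounds_to_converge(n_fair: int,
                       n_free_riders: int,
                       tol: float = 1e-3,
                       max_rounds: int = 10_000) -> int:
    """Rounds needed for the global parameter to get within tol of the fair aggregate."""
    # Equal declared sample sizes: the free-riders' relative sample size is their head count share.
    rho = n_free_riders / (n_fair + n_free_riders)
    theta, target = 0.0, 1.0
    for t in range(1, max_rounds + 1):
        # Fair clients contribute the target; plain free-riders return theta unchanged.
        theta = (1.0 - rho) * target + rho * theta
        if abs(theta - target) < tol:
            return t
    return max_rounds


no_free_riders = rounds_to_converge(5, 0)
five_free_riders = rounds_to_converge(5, 5)
many_free_riders = rounds_to_converge(5, 45)
```

In this toy model the final value is unchanged by the number of free-riders, but the round count grows quickly with it, in line with the qualitative behaviour reported for Figure 2.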
3.3 Advanced Free-rider attacks
This section illustrates practical directions for improving the disguising scheme by leveraging the general result of Theorem 2. The key observation is that the form of a model update during training is generally non-Gaussian (Figure 3). In most cases, a general parameter update is zero-centered and heavily skewed, with only some parameters affected by large changes between optimization rounds. For this reason, a synthetic update based on a Gaussian model may still be easily identified at the server level, for example by simple comparison of the distribution of the free-rider parameters with that of the global model (Figure 4, column ). To improve the realism of the attack, we investigate disguising schemes based on the fitting of multimodal distribution forms for the update. In particular, we model the initial global update, , through Gaussian Mixture Modeling (GMM), where the optimal number of 3 Gaussian components was established according to the associated Bayesian Information Criterion (BIC).
Each parameter of the model is assigned to one of the clusters, depending on the extent of the relative update. As in Section 3.2, each parameter is associated with a perturbation equal to the standard deviation of the respective Gaussian component, thus obtaining 3 different evolution profiles characterized by increasing magnitude. While this strategy aims at obtaining a more realistic representation of the variability of the features’ update across optimization rounds, the GMM may still lead to overly smooth simulated parameters, since it is based on the modeling of the average quantity (Figure 4, column ). This issue can be overcome by random generation of ‘skews’, for example by assigning a subset of parameters to outlier values, to mimic the specificity of the training on the local dataset. The subset is here chosen as 10 parameters belonging to the Gaussian component with highest variance, to which we assign a perturbation value equal to 25 times their variance (Figure 4, column ). Finally, the profile of the perturbation may not faithfully follow the model evolution over long time horizons. Figure 4, column , shows that re-calibration of the perturbation parameters after a fixed number of rounds (here 50) can improve the realism of the update.
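The outlier ('skew') step described above can be sketched as follows, assuming a mixture model has already been fitted elsewhere and that each parameter carries the index of its mixture component. The snippet interprets "25 times their variance" as 25 times the variance of the highest-variance component, which is an assumption, as are the function name, arguments and toy values.

```python
import random
from typing import List, Sequence, Tuple


def add_outlier_perturbations(perturbations: Sequence[float],
                              labels: Sequence[int],
                              component_stds: Sequence[float],
                              n_outliers: int = 10,
                              factor: float = 25.0,
                              seed: int = 0) -> Tuple[List[float], List[int]]:
    """Give a few parameters of the highest-variance component an outlier perturbation."""
    rng = random.Random(seed)
    # Component with the largest standard deviation, hence the largest variance.
    top = max(range(len(component_stds)), key=lambda c: component_stds[c])
    candidates = [i for i, c in enumerate(labels) if c == top]
    chosen = rng.sample(candidates, min(n_outliers, len(candidates)))
    updated = list(perturbations)
    for i in chosen:
        # Assumed reading: outlier value = factor * variance of the component.
        updated[i] = factor * component_stds[top] ** 2
    return updated, sorted(chosen)


# Toy example: 3 components, 20 parameters assigned to each.
component_stds = [0.01, 0.1, 1.0]
labels = [0] * 20 + [1] * 20 + [2] * 20
perturbations = [component_stds[c] for c in labels]
skewed, outlier_indices = add_outlier_perturbations(perturbations, labels, component_stds)
```

Parameters outside the selected subset keep the standard deviation of their own component as perturbation, so only the chosen entries depart from the smooth mixture-based profile.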
4 Conclusion and discussion
In this work, we introduced a theoretical framework for the study of free-riding attacks on model aggregation in Federated Learning. Based on the proposed methodology, we proved that simple strategies based on returning the global model at each iteration already lead to successful free-rider attacks (plain free-riding), and we investigated more sophisticated disguising techniques relying on stochastic perturbations of the parameters (disguised free-riding). The convergence of each attack was demonstrated through theoretical proofs and experimental results.
This work opens the way to the investigation of optimal disguising and defense strategies for free-rider attacks, beyond the proposed heuristics. Our experiments show that inspection of the clients' parameter distributions should be established as a routine practice for the detection of free-rider attacks in federated learning. This result motivates the study of more effective free-riding strategies, based on different noise model distributions and perturbation schemes. We will also work on the improvement of detection at the server level, through better modeling of the heterogeneity of the incoming clients’ parameters. Finally, this study relies on a number of hypotheses concerning the evolution of the clients’ parameters during federated learning. This choice provides us with a convenient theoretical setup for the formalization of the proposed theory, which may be modified in the future, for example to investigate more complex forms of variability and parameter aggregation.
The problem of free-rider attacks in federated learning may have significant economic and societal impact. Since it allows clients to participate without sharing their data, federated learning is becoming the de facto training setup in current large-scale machine learning projects in several critical applications, such as healthcare, banking, and telecommunication. The resulting models, derived from sensitive and protected data, may have high commercial and intellectual value as well, due to their exclusive nature. Our research proves that if precautions are not taken, malicious clients can disguise their participation in federated learning to appropriate a federated model without providing any contribution. Our research therefore stimulates the investigation of novel verification techniques for the implementation of secured federated learning projects, to avoid intellectual property or commercial losses.
Acknowledgments and Disclosure of Funding
This work has been supported by the French government, through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002, and by the ANR JCJC project Fed-BioMed 19-CE45-0006-01. The project was also supported by Accenture and the Inria Sophia Antipolis - Méditerranée “NEF” computation cluster.
-  Theodora Brisimi, Ruidi Chen, Theofanie Mela, Alex Olshevsky, Ioannis Paschalidis, and Wei Shi. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112, 01 2018.
-  Santiago Silva, Boris A Gutman, Eduardo Romero, Paul M Thompson, Andre Altmann, and Marco Lorenzi. Federated learning in distributed medical databases: Meta-analysis of large-scale subcortical brain data. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 270–274. IEEE, 2019.
-  Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing federated learning through an adversarial lens. 36th International Conference on Machine Learning, ICML 2019, 2019-June:1012–1021, 2019.
-  Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorization-based collaborative filtering. Advances in Neural Information Processing Systems, (Nips):1893–1901, 2016.
-  Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. 35th International Conference on Machine Learning, ICML 2018, 13:8947–8956, 2018.
-  Chulin Xie, Keli Huang, Pin-Yu Chen, and Bo Li. Dba: Distributed backdoor attacks against federated learning. In International Conference on Learning Representations, 2019.
-  Shiqi Shen, Shruti Tople, and Prateek Saxena. AUROR: Defending against poisoning attacks in collaborative deep learning systems. In ACM International Conference Proceeding Series, volume 5-9-Decemb, pages 508–519, 2016.
-  Zhibo Wang, Mengkai Song, Zhifei Zhang, Yang Song, Qian Wang, and Hairong Qi. Beyond Inferring Class Representatives: User-Level Privacy Leakage from Federated Learning. Proceedings - IEEE INFOCOM, 2019-April:2512–2520, 2019.
-  Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep Models under the GAN: Information leakage from collaborative deep learning. Proceedings of the ACM Conference on Computer and Communications Security, pages 603–618, 2017.
-  Jierui Lin, Min Du, and Jian Liu. Free-riders in Federated Learning: Attacks and Defenses. http://arxiv.org/abs/1911.12560, 2019.
-  H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 54, 2017.
-  Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated Optimization in Heterogeneous Networks. Proceedings of the 1st Adaptive & Multitask Learning Workshop, Long Beach, California, 2019, pages 1–28, 2018.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Appendix A Complete Proofs
A.1 Proof of Theorem 1
The aggregation at the server level follows:
By considering an infinitesimal increment of time we obtain the following first order stochastic differential equation:
where we define . According to the noise process hypothesis of Section 2.2, we thus obtain the stochastic differential equation:
with associated solution:
where we denote . In the following part of the proof we study the asymptotic properties of solution A.1 separately.
Asymptotic convergence of A. We study the asymptotic properties of (A) by first integrating by parts the integrals :
We assume smooth convergence for under the condition , i.e. . By considering , and , the triangle inequality gives:
and by the assumption of piece-wise continuity of in we get:
Since , and , the lemma in Proposition 1 shows that
Asymptotic convergence of B1 and B2. The asymptotic properties of the stochastic integrals B1 and B2 follow from the general properties of Ito’s integrals. For any constant , we have:
(22) (23) (24)
where the first equality in line (23) is due to Ito’s isometry.
Note: in the special case , then equation (12) can be expressed as thus .
A.2 Proof of Theorem 2
The differential equation governing the disguised attack is
We note that this equation differs from the one in proof A.1 in the last term only. We derive conditions on the perturbation that ensure convergence, by studying the integral
We prove the convergence of when , . In particular, let and such that , for . We thus have , and
where the limit of the variance is a consequence of Proposition 1.
A.3 Proof of Corollary 1
The differential equation governing the disguised attack for multiple free-riders is:
We note that this equation differs from the one in proof A.1 in the last term only. Proof A.2 already establishes the convergence for a single term when , . The corollary thus follows from the linearity of the last term:
A.4 Proof of Corollary 2
Appendix B Calculus
, the following limit holds:
Let us consider , , and define such that .
Writing the exponential as a power series, we get:
Considering that , we can use the Fubini/Tonelli theorem and permute the sum and the integral which gives:
The Cauchy-Hadamard theorem tells us that the radius of convergence of is
We find a convenient upper bound without a power series:
which gives the following upper bound for
Given that and are positive and that we finally get:
In the case where , , and we obtain: