1 Introduction
Federated learning is a training paradigm that has gained popularity in recent years, as it enables different clients to jointly learn a global model without sharing their respective data. It is particularly suited to machine learning applications in domains where data security is critical, such as healthcare [1, 2]. The relevance of this approach is witnessed by current large-scale federated learning initiatives under way in the medical domain, for instance for learning predictive models of breast cancer (blogs.nvidia.com/blog/2020/04/15/federated-learning-mammogram-assessment/), or for drug discovery and development (www.imi.europa.eu/projects-results/project-factsheets/melloddy).

In this setting, aggregation results entail critical information beyond the data itself: a model trained on exclusive datasets may have very high commercial or intellectual value. This critical aspect can lead to the emergence of opportunistic behaviors in federated learning, where ill-intentioned clients may participate with the sole aim of obtaining the federated model, without actually contributing any data during the training process. In particular, the attacker, or free-rider, aims at disguising its participation in federated learning while ensuring that the iterative training process ultimately converges to the desired target: the aggregated model of the fair participants.
The study of the security and safety of federated learning is an active research domain, and several kinds of attacks have been studied. For example, an attacker may interfere during the iterative federated learning procedure to degrade or modify model performance [3, 4, 5, 6, 7], or to retrieve information about other clients' data [8, 9]. Yet, free-riding in federated learning has so far been under-investigated in the literature. To the best of our knowledge, the only investigation is a preliminary work [10] focusing on attack strategies operated on federated learning based on gradient aggregation. However, no theoretical guarantees are provided for the effectiveness of the proposed attacks. Furthermore, this setup is impractical in many real-world applications, where federated training schemes based on model averaging are instead more common, due to the reduced data exchange across the network. FedAvg [11] is the most representative framework of this kind, as it is based on the iterative averaging of the clients' model parameters, after updating each client model for a given number of training epochs at the local level. To improve the robustness of FedAvg in non-iid and heterogeneous learning scenarios, FedProx [12] extends FedAvg by including a regularization term penalizing local departures of the clients' parameters from the global model.

The contribution of this work is a theoretical framework for the development and study of free-rider attacks in federated learning schemes based on model averaging, such as FedAvg and FedProx. The problem is formalized by reformulating federated learning as a stochastic dynamical system describing the evolution of the aggregated parameters across iterations. A critical requirement for this kind of opportunistic attack is to ensure the convergence of the training process to the desired target, represented by the aggregated model of the fair clients. We show that the proposed framework allows us to derive explicit conditions guaranteeing the success of the attack. This is an important theoretical feature, as it is of primary interest for the attacker not to interfere with the learning process.
We first derive in Section 2.3 a basic free-riding strategy that guarantees the convergence of federated learning to the model of the fair participants. This strategy simply consists in returning, at each iteration, the received global parameters. As this behavior can easily be detected by the server, we build more complex strategies to disguise the free-rider's contribution to the optimization process, based on opportune stochastic perturbations of the parameters. We demonstrate in Section 2.4 that this strategy does not alter the global model convergence, and in Section 3 we experimentally demonstrate our theory on a number of learning scenarios in both iid and non-iid settings. We conclude by investigating advanced strategies for designing and identifying free-riders' attacks. All proofs and additional material are provided in the Appendix.
2 Methods
Before introducing the core idea of free-rider attacks in Section 2.2, we first recapitulate in Section 2.1 the general context of parameter aggregation in federated learning.
2.1 Federated learning through model aggregation: FedAvg and FedProx
In federated learning, we consider a set of $K$ participating clients, each owning a dataset $\mathcal{D}_j$ composed of $N_j$ samples. During optimization, it is generally assumed that the elements of the clients' parameter vectors and of the global parameters are aggregated independently at each iteration round $t$. Following this assumption, and for simplicity of notation, in what follows we restrict our analysis to a single parameter entry, generally denoted by $\theta^j_t$ and $\theta_t$ for clients and server respectively.

In this setting, to estimate a global model across clients, FedAvg [11] is an iterative training strategy based on the aggregation of the local model parameters $\theta^j_t$. At each iteration step $t$, the server sends the current global model parameters $\theta_t$ to the clients. Each client updates the model by minimizing over a given number of epochs the local cost function initialized with $\theta_t$, and subsequently returns the updated local parameters $\theta^j_{t+1}$ to the server. The global model parameters $\theta_{t+1}$ at the iteration step $t+1$ are then estimated as a weighted average:

$$\theta_{t+1} = \sum_{j=1}^{K} \frac{N_j}{N}\, \theta^j_{t+1}, \qquad (1)$$

where $N = \sum_{j=1}^{K} N_j$ represents the total number of samples across the distributed datasets. FedProx [12] builds on FedAvg by adding to the cost function an L2 regularization term penalizing the deviation of the local parameters from the global parameters $\theta_t$. The new cost function is $\mathcal{L}_j(\theta) + \frac{\mu}{2}\,\|\theta - \theta_t\|^2$, where $\mu$ is the hyperparameter monitoring the regularization by enforcing proximity between the local updates and the global model.
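The server aggregation in equation (1) and the FedProx proximal objective can be sketched in a few lines of Python; the function names are illustrative (not taken from any released implementation), and the sketch is restricted, as in the text, to a single scalar parameter per client:

```python
def fedavg_aggregate(client_params, client_sizes):
    """FedAvg server step, Eq. (1): weighted average of the local
    parameters, with weights proportional to the declared sample sizes."""
    total = sum(client_sizes)
    return sum(n * p for p, n in zip(client_params, client_sizes)) / total

def fedprox_local_loss(local_loss, theta_local, theta_global, mu):
    """FedProx local cost: the client's loss plus an L2 term penalizing
    the distance between the local and the current global parameters."""
    return local_loss + (mu / 2.0) * (theta_local - theta_global) ** 2
```

With `mu = 0` the FedProx objective reduces to the plain FedAvg local cost.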
2.2 Formalizing free-rider attacks
The strategy of a free-rider consists in participating in federated learning while dissimulating local training, through the sharing of opportunely counterfeited parameters, with the aim of obtaining the aggregated model of the fair clients.
We denote by $J$ the set of fair clients, i.e. clients following the federated learning strategy of Section 2.1, and by $I$ the set of free-riders, i.e. malicious clients pretending to participate in the learning process, such that $I \cap J = \emptyset$ and $I \cup J = \{1, \dots, K\}$. We denote by $N_F = \sum_{i \in I} N_i$ the number of samples declared by the free-riders, and we introduce the parameter $\rho = N_F / N$, which quantifies the amount of free-riders' samples relative to the total number of training samples $N$.
We assume that, in the absence of free-riders, the clients' parameters observed during federated learning are realisations of the time-varying processes $\theta^j(t) = f_j(t) + s_j\,\varepsilon_j(t)$, where the $f_j(t)$ are smoothly convergent functions and $\varepsilon_j(t)$ is a noisy process, here assumed to be delta-correlated unit-variance Gaussian white noise, modulated by the parameter $s_j$ (this assumption will be relaxed in Section 2.4.2). The participation of the free-riders in federated learning implies that the processes of the fair clients are perturbed by the attacks throughout training. In particular, perturbing the aggregation (1) at the server level implies, on each client's side, a perturbation of the initial condition of the local optimization problem. We therefore assume that each fair client's parameter evolution under free-rider attacks follows the perturbed trajectory $\theta^j(t) = f_j(t) + s_j\,\varepsilon_j(t) + \rho\,\tilde{s}_j\,\tilde{\varepsilon}_j(t)$, with $\tilde{\varepsilon}_j(t)$ a Gaussian white noise process. The perturbation is proportional to the relative number of samples $\rho$ declared by the free-riders, and its magnitude is controlled by the parameter $\tilde{s}_j$. This assumption accounts for the potential non-convexity of the problem at the client's level where, for a certain perturbation of the aggregated parameters, the parameters returned by the fair client would fall in a neighbourhood of the original process. As we shall see in the following sections, the extent of the perturbation, i.e. the noise level $\tilde{s}_j$, determines the quality of the free-rider attack.

In the next section we show that a basic free-rider strategy simply consists in returning at each iteration the received global parameters: $\theta^i_{t+1} = \theta_t$. We call this type of attack plain free-riding.
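The assumed fair-client process can be simulated directly; the snippet below is a sketch of the modeling hypotheses above (smooth trend, client noise level, attack-induced term), with illustrative names only:

```python
import random

def fair_client_param(t, f_j, s_j, rho=0.0, s_tilde_j=0.0, rng=random):
    """One realisation of a fair client's parameter at round t: a smooth
    trend f_j(t), white noise of level s_j, and an attack-induced
    perturbation proportional to the free-riders' declared ratio rho."""
    return (f_j(t)
            + s_j * rng.gauss(0.0, 1.0)
            + rho * s_tilde_j * rng.gauss(0.0, 1.0))
```

Setting `rho = 0` recovers the unperturbed process of a training run without free-riders.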
2.3 Plain free-riding
The plain free-rider returns the same model parameters as the received ones, i.e. $\theta^i_{t+1} = \theta_t$ for every $i \in I$. In this setting, the server aggregation process (1) can be rewritten as:

$$\theta_{t+1} = \sum_{j\in J} \frac{N_j}{N}\, \theta^j_{t+1} + \frac{N_F}{N}\, \theta_t. \qquad (2)$$
Before proceeding with the development of equation (2), and to gain better intuition about FedAvg under free-rider attacks, let us consider the simplified setting in which the fair clients' parameters are assumed to be constant, i.e. $\theta^j_{t+1} = f_j$. This is equivalent to considering that the optimization process is close enough to the optimum for all the clients, which then return the same local parameters at each global update step. We rewrite the server aggregation process (2) as follows:

$$\theta_{t+1} = \sum_{j\in J} \frac{N_j}{N}\, f_j + \rho\, \theta_t. \qquad (3)$$
Considering an infinitesimal increment of time, we obtain the first-order differential equation:

$$\dot{\theta}(t) = -(1-\rho)\,\theta(t) + \sum_{j\in J} \frac{N_j}{N}\, f_j, \qquad (4)$$
for which the solution is $\theta(t) = \bar{f} + (\theta_0 - \bar{f})\, e^{-(1-\rho)t}$, where $\theta_0$ is the training initialization and $\bar{f} = \sum_{j\in J} \frac{N_j}{N - N_F}\, f_j$ is the aggregated model of the fair clients. This expression shows that, in this simple setting, the global model with plain free-riders converges to the aggregated model of the fair clients: since $\rho < 1$, we obtain $\lim_{t\to\infty} \theta(t) = \bar{f}$. Moreover, the relative sample size $\rho$ declared by the free-riders influences the convergence, which has exponential speed $(1-\rho)$. In practice, the smaller the ratio of data declared by the free-riders, the faster the trained model converges to its optimum, thus approaching the final model of the fair clients. The limit case $\rho = 1$, i.e. $N_F = N$, which corresponds to only free-riders participating in federated learning ($J = \emptyset$), leads to the trivial result $\theta(t) = \theta_0$. In this case there is no learning throughout the training process. We now generalize equation (3) to the time-varying setting of equation (2):
$$\theta_{t+1} = \sum_{j\in J} \frac{N_j}{N}\Big[f_j(t) + s_j\,\varepsilon_j(t) + \rho\,\tilde{s}_j\,\tilde{\varepsilon}_j(t)\Big] + \rho\,\theta_t, \qquad (5)$$
leading to the first-order stochastic differential equation with Wiener noise variables $W_j(t)$ and $\widetilde{W}_j(t)$:

$$d\theta(t) = \Big[-(1-\rho)\,\theta(t) + \sum_{j\in J} \frac{N_j}{N}\, f_j(t)\Big]\, dt + \sum_{j\in J} \frac{N_j}{N}\Big[s_j\, dW_j(t) + \rho\,\tilde{s}_j\, d\widetilde{W}_j(t)\Big]. \qquad (6)$$
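In the simplified constant setting of equations (3)–(4), the effect of a plain free-rider can be checked numerically with the discrete recursion; the snippet below is an illustrative sketch, not the paper's code:

```python
def plain_freerider_recursion(theta0, theta_fair, rho, rounds):
    """Discrete counterpart of Eq. (3): the free-riders return the received
    global model, so each round contracts the gap to the fair aggregate
    theta_fair by the declared-sample ratio rho."""
    theta, traj = theta0, [theta0]
    for _ in range(rounds):
        theta = (1.0 - rho) * theta_fair + rho * theta
        traj.append(theta)
    return traj
```

The gap `|theta_t - theta_fair|` shrinks geometrically as `rho**t`, matching the exponential convergence discussed above: the smaller the declared ratio, the faster the convergence.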
Theorem 1 (Plain free-riding).

Assuming that $\rho < 1$ and that the functions $f_j(t)$ smoothly converge to limits $f_j^{\infty}$, federated learning with plain free-riders (6) converges in expectation to the aggregated federated model of the fair clients:

$$\lim_{t\to\infty} \mathbb{E}[\theta(t)] = \sum_{j\in J} \frac{N_j}{N - N_F}\, f_j^{\infty}, \qquad (7)$$

$$\lim_{t\to\infty} \operatorname{Var}[\theta(t)] = \frac{1}{2(1-\rho)} \sum_{j\in J} \frac{N_j^2}{N^2}\Big(\rho^2\,\tilde{s}_j^2 + s_j^2\Big). \qquad (8)$$
Since the output of the plain free-riders is deterministic, this result arises from the perturbed process of the fair clients. On the one hand, as soon as the perturbation introduced by the attack is small enough, plain free-riders can expect to converge to the fair clients' aggregation with convergence speed $(1-\rho)$. On the other hand, Theorem 1 shows that the uncertainty of the process depends on the trade-off between the free-riders' and the fair clients' contributions. We note that in the limit case $\rho = 0$ (only fair participants) the uncertainty is uniquely due to the fair clients' variability governed by $s_j$, as the first term of (8) is zero. With only free-riders ($\rho = 1$) we obtain the trivial solution discussed for equation (4).
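The convergence in expectation stated by Theorem 1 can be illustrated with a small Monte Carlo simulation of the noisy aggregation; the client counts, sample sizes and noise level below are arbitrary choices for the sketch:

```python
import random

def simulate_plain_attack(f, sizes, n_freerider, s, rounds, seed=0):
    """Noisy aggregation under a plain free-rider attack: fair clients
    return their (constant) target f_j plus white noise of level s,
    while the free-rider returns the received global model."""
    rng = random.Random(seed)
    n_tot = sum(sizes) + n_freerider
    rho = n_freerider / n_tot          # declared free-rider sample ratio
    theta = 0.0                        # global model initialization
    for _ in range(rounds):
        fair = sum(n * (fj + s * rng.gauss(0.0, 1.0))
                   for fj, n in zip(f, sizes)) / n_tot
        theta = fair + rho * theta     # plain free-rider contributes theta_t
    return theta

# Averaging over independent runs, the mean approaches the fair clients'
# aggregate sum(N_j * f_j) / sum(N_j) = 1.5 in this configuration.
runs = [simulate_plain_attack([1.0, 2.0], [100, 100], 100, 0.1, 200, seed=r)
        for r in range(200)]
mean_theta = sum(runs) / len(runs)
```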
2.4 Disguised free-riding
Plain free-riders can easily be detected by the server, since at each iteration the condition $\theta^i_{t+1} = \theta_t$ holds. We study here improved attack strategies based on the sharing of opportunely disguised parameters. Free-riders should ideally return disguised parameters $\theta^i_{t+1}$ such that in expectation we obtain $\mathbb{E}[\theta^i_{t+1}] = \theta_t$. Simulating the behavior of a fair client also requires the free-rider to simulate convergence: the magnitude of the update should be larger at the beginning of the training and should approach zero towards the end.
In what follows we investigate sufficient conditions on the perturbation models to obtain the desired convergence behavior of free-rider attacks.
2.4.1 Attacks based on additive stochastic perturbations
A disguised free-rider with additive noise generalizes the plain one, and uploads parameters $\theta^i_{t+1} = \theta_t + \varphi(t)\,\epsilon_t$. Here, the perturbation $\epsilon_t$ is assumed to be Gaussian white noise, and $\varphi(t)$ is a suitable time-varying perturbation magnitude compatible with the free-rider attack. In this new setting, we can rewrite the FedAvg aggregation process (1) as follows:

$$\theta_{t+1} = \sum_{j\in J} \frac{N_j}{N}\, \theta^j_{t+1} + \rho\,\theta_t + \rho\,\varphi(t)\,\epsilon_t. \qquad (9)$$
Extending the plain case, the relationship $\theta_{t+1} - \theta_t \simeq d\theta(t)$ leads to the following stochastic differential equation:

$$d\theta(t) = \Big[-(1-\rho)\,\theta(t) + \sum_{j\in J} \frac{N_j}{N}\, f_j(t)\Big]\, dt + \sum_{j\in J} \frac{N_j}{N}\Big[s_j\, dW_j(t) + \rho\,\tilde{s}_j\, d\widetilde{W}_j(t)\Big] + \rho\,\varphi(t)\, dW(t). \qquad (10)$$
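A disguised update of this form is straightforward to generate; the power-law decay used below is one possible choice of vanishing perturbation magnitude (rounds are assumed to be indexed from t = 1), and all names are illustrative:

```python
import random

def disguised_update(theta_global, t, sigma, gamma, rng=random):
    """Disguised free-rider upload: the received global parameter plus
    additive noise phi(t) * eps with phi(t) = sigma * t**(-gamma),
    so the fake updates shrink over rounds and mimic convergence."""
    phi_t = sigma * t ** (-gamma)
    return theta_global + phi_t * rng.gauss(0.0, 1.0)
```

With `gamma > 0` the perturbation vanishes over time; `gamma = 0` would produce a non-vanishing perturbation that no longer mimics a converging client.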
Theorem 2 (Single disguised free-rider).

Assume that the perturbation magnitude satisfies $\varphi(t) \le M\, t^{-\gamma}$ for some constants $M > 0$ and $\gamma > 0$. Then federated learning with a single disguised free-rider (10) converges in expectation to the aggregated federated model of the fair clients, with the same asymptotic mean (7) and variance (8) as in Theorem 1.
2.4.2 Time-varying noise model of the fair clients' evolution
Theorem 2 provides us with conditions on the decay of the noisy update that an attacker should design to ensure convergence of the process. Interestingly, the general decaying shape identified for $\varphi(t)$ can be seamlessly translated into sufficient conditions on the time-varying variability of the fair clients, ensuring compatibility with the federated learning process.
Corollary 2.

Let the fair clients evolve according to $\theta^j(t) = f_j(t) + s_j(t)\,\varepsilon_j(t)$, where the $f_j(t)$ are smoothly convergent functions and $\varepsilon_j(t)$ is delta-correlated unit-variance Gaussian white noise. If the functions $s_j(t)$ are such that $s_j(t) \le M\, t^{-\gamma}$, with $M > 0$ and $\gamma > 0$, then the aggregation process of federated learning in equation (1) is such that the asymptotic mean is the aggregated model of the fair clients, and the contribution of the fair clients' noise to the asymptotic variance vanishes. Under the same conditions, the asymptotic variance of Theorems 1 and 2 reduces to $\frac{\rho^2}{2(1-\rho)} \sum_{j\in J} \frac{N_j^2}{N^2}\, \tilde{s}_j^2$.
Corollary 2 enables the generalization of our theory to more realistic noise models for the fair clients. We can indeed relax the initial stationarity assumption on the variability of the parameters' evolution, to account for smoothly decaying perturbations as the clients approach their optima during training. Furthermore, it is interesting to notice that in this case the variability of the global model under free-rider attacks is uniquely due to the perturbation induced by the free-riders.
3 Experiments
This experimental section focuses on a series of benchmarks for the proposed free-rider attacks. The methods being of general application, the focus here is to empirically demonstrate our theory on diverse experimental setups and model specifications. All code, data and experiments are available at https://github.com/Accenture/Labs-Federated-Learning/tree/free-rider_attacks/
3.1 Experimental Details
We consider 5 fair clients in each of the following scenarios:

MNIST (classification in iid and non-iid settings). We study a standard classification problem on MNIST [13] and create two benchmarks: an iid dataset (MNIST iid), where we give 600 training digits and 300 testing digits to each client, and a non-iid dataset (MNIST non-iid), where for each digit we create two shards with 150 training samples and 75 testing samples, and allocate 4 shards to each client. For each scenario, we use a logistic regression predictor.
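The shard-based non-iid split described above can be sketched as follows (an illustrative helper, not the released code); each client ends up observing at most `shards_per_client` distinct digits:

```python
import random

def make_noniid_shards(labels, shards_per_digit=2, shards_per_client=4,
                       n_clients=5, seed=0):
    """Cut each digit's sample indices into shards and deal a few shards to
    every client, so clients see only a subset of the classes.
    `labels` maps sample index -> digit."""
    rng = random.Random(seed)
    by_digit = {}
    for idx, y in enumerate(labels):
        by_digit.setdefault(y, []).append(idx)
    shards = []
    for idxs in by_digit.values():
        rng.shuffle(idxs)
        size = len(idxs) // shards_per_digit
        for s in range(shards_per_digit):
            shards.append(idxs[s * size:(s + 1) * size])
    rng.shuffle(shards)
    return [sum(shards[c * shards_per_client:(c + 1) * shards_per_client], [])
            for c in range(n_clients)]
```

With the defaults above (10 digits, 2 shards each, 4 shards per client), the 20 shards are exactly exhausted by the 5 clients.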
Shakespeare (LSTM prediction). We study an LSTM model for next-character prediction on the dataset of The Complete Works of William Shakespeare [11]. We randomly choose 5 clients with more than 3000 samples each, and assign 70% of the dataset to training and 30% to testing. We use a two-layer LSTM classifier containing 100 hidden units with an 8-dimensional embedding layer. The model takes as input a sequence of 80 characters, embeds each character into a learned 8-dimensional space, and outputs one character per training sample after the 2 LSTM layers and a fully connected one.
We train federated models following in turn the FedAvg and FedProx aggregation processes. In FedProx, the hyperparameter $\mu$ monitoring the regularization is chosen according to the best performing scenarios reported in [12], separately for MNIST (iid and non-iid) and for Shakespeare. For the free-rider, we declare a number of samples equal to the average sample size across the fair clients. We test federated learning with 5 and 20 local epochs, using SGD optimization with learning rates chosen separately for MNIST (iid and non-iid) and for Shakespeare, and a batch size of 100.
3.2 Free-rider attacks: convergence and performance
Designing a free-rider attack requires specifying a perturbation function $\varphi(t)$ for a general parameter $\theta$. While the design of optimally disguised free-rider attacks is outside the scope of this study, here we propose heuristics to tune the perturbation parameters according to practical hypotheses on the parameters' evolution. These hypotheses are discussed and refined in Section 3.3.

In the following experiments, we investigate free-rider attacks taking the simple form $\varphi(t) = \sigma\, t^{-\gamma}$. The decay parameter $\gamma$ is chosen among a panel of testing values. After random initialization at the initial federated learning step, the parameter $\sigma$ is instead opportunely estimated to mimic the extent of the distribution of the update $\theta_{t+1} - \theta_t$ observed between consecutive rounds of federated learning. We can simply model these increments with a zero-centered univariate Gaussian distribution, and assign to the parameter $\sigma$ the value of the fitted standard deviation. According to this strategy, the free-rider returns parameters with perturbations distributed like the ones observed between two consecutive optimization rounds.

Figure 1 shows the evolution of the model obtained with FedAvg (20 local training epochs) in four scenarios: 1) fair clients only, 2) a plain free-rider, 3) a disguised free-rider with a fixed decay parameter $\gamma$ and the estimated noise level $\sigma$, and 4) a disguised free-rider with an increased noise level. For each scenario, we compare the federated model obtained under free-rider attacks with the equivalent model obtained with the participation of the fair clients only. For this latter setting, to assess the model training variability, we repeated the training 30 times with different parameter initializations. The results show that, independently of the chosen free-riding strategy, the resulting models attain performance comparable to that of the model obtained with fair clients only. Similar results are obtained for the setup with 5 local training epochs and different values of $\gamma$ (Appendix C.1). We also quantified the equivalence of the models parameter-wise, via the average L2 distance, and in terms of the overall parameter distribution, through the Kolmogorov–Smirnov (KS) test (Appendix C.2), confirming that in each scenario the free-riders converge to the fair clients' model, whereas the increased-noise scenario seems to lead to larger dissimilarities. This result is in agreement with Theorem 2 and also suggests that the perturbation induced by the attacks is generally small.
We investigate the same training setup under the influence of multiple free-riders. In particular, we test scenarios where the free-riders declare respectively 50% and 90% of the total training sample size. In practice, we maintain the same experimental setting composed of 5 fair clients, and increase the number of free-riders to 5 and 45 respectively, while declaring for each free-rider a sample size equal to the average number of samples of the fair clients.
Figure 2 shows that, independently of the magnitude of the perturbation function, the number of free-riders does not seem to affect the performance of the final aggregated model. On the contrary, the convergence speed is substantially affected, being noticeably slower in the scenario with 45 free-riders. This result is confirmed when inspecting the dissimilarity with respect to the fair clients' model in terms of the L2 and KS measures (Appendix C.2), and is similar for the setup with 5 local training epochs, and with FedProx (Appendix C.1).
3.3 Advanced free-rider attacks
This section illustrates practical directions for improving the disguising scheme by leveraging the general result of Theorem 2. The key observation is that the distribution of a model update during training is generally non-Gaussian (Figure 3). In most cases, a general parameter update $\theta_{t+1} - \theta_t$ is zero-centered and heavily skewed, with only some parameters affected by large changes between optimization rounds. For this reason, a synthetic update based on a Gaussian model may still be easily identified at the server level, for example by simple comparison of the distribution of the free-rider's parameters with the global model's one (Figure 4). To improve the realism of the attack, we investigate disguising schemes based on fitting multimodal distributions to the update. In particular, we model the initial global update through Gaussian Mixture Modeling (GMM), where the optimal number of 3 Gaussian components was established according to the associated Bayesian Information Criterion (BIC).
Each parameter of the model is assigned to one of the clusters, depending on the magnitude of its update. Similarly to Section 3.2, each parameter is associated with a perturbation equal to the standard deviation of the respective Gaussian component, thus obtaining 3 different evolution profiles characterized by increasing magnitude. While this strategy aims at a more realistic representation of the variability of the parameter updates across optimization rounds, the GMM may still lead to overly smooth simulated parameters, since it is based on modeling an average quantity (Figure 4). This issue can be overcome by randomly generating 'skews', for example by assigning a subset of parameters to outlier values, to mimic the specificity of training on the local dataset. The subset is here chosen as 10 parameters belonging to the Gaussian component with the highest variance, to which we assign a perturbation value equal to 25 times their variance (Figure 4). Finally, the profile of the perturbation may not faithfully follow the model evolution over long time horizons. Figure 4 shows that recalibrating the perturbation parameters after a fixed number of rounds (here 50) can improve the realism of the update.

4 Conclusion and discussion
In this work, we introduced a theoretical framework for the study of free-rider attacks on model aggregation in federated learning. Based on the proposed methodology, we proved that a simple strategy consisting in returning the global model at each iteration already leads to successful free-rider attacks (plain free-riding), and we investigated more sophisticated disguising techniques relying on stochastic perturbations of the parameters (disguised free-riding). The convergence of each attack was demonstrated through theoretical proofs and experimental results.
This work opens the way to the investigation of optimal disguising and defense strategies for free-rider attacks, beyond the proposed heuristics. Our experiments show that inspection of the clients' parameter distributions should be established as a routine practice for the detection of free-rider attacks in federated learning. This result motivates the study of more effective free-riding strategies, based on different noise model distributions and perturbation schemes. We will also work on improving detection at the server level, through better modeling of the heterogeneity of the incoming clients' parameters. Finally, this study relies on a number of hypotheses concerning the evolution of the clients' parameters during federated learning. This choice provides us with a convenient theoretical setup for the formalization of the proposed theory, which may be modified in the future, for example, to investigate more complex forms of variability and parameter aggregation.
Broader Impact
The problem of free-rider attacks in federated learning may have significant economic and societal impact. Since it allows clients' participation without sharing their data, federated learning is becoming the de facto training setup in current large-scale machine learning projects in several critical applications, such as healthcare, banking, and telecommunications. The resulting models, derived from sensitive and protected data, may have high commercial and intellectual value as well, due to their exclusive nature. Our research proves that, if precautions are not taken, malicious clients can disguise their participation in federated learning to appropriate a federated model without providing any contribution. Our research therefore stimulates the investigation of novel verification techniques for the implementation of secure federated learning projects, to avoid intellectual property or commercial losses.
Acknowledgments and Disclosure of Funding
This work has been supported by the French government, through the 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002, and by the ANR JCJC project Fed-BioMed 19-CE45-0006-01. The project was also supported by Accenture and by the Inria Sophia Antipolis – Méditerranée "NEF" computation cluster.
References
[1] Theodora Brisimi, Ruidi Chen, Theofanie Mela, Alex Olshevsky, Ioannis Paschalidis, and Wei Shi. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112, 2018.
[2] Santiago Silva, Boris A. Gutman, Eduardo Romero, Paul M. Thompson, Andre Altmann, and Marco Lorenzi. Federated learning in distributed medical databases: Meta-analysis of large-scale subcortical brain data. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 270–274. IEEE, 2019.
[3] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing federated learning through an adversarial lens. In 36th International Conference on Machine Learning, ICML 2019, pages 1012–1021, 2019.
[4] Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pages 1893–1901, 2016.
[5] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In 35th International Conference on Machine Learning, ICML 2018, pages 8947–8956, 2018.
[6] Chulin Xie, Keli Huang, Pin-Yu Chen, and Bo Li. DBA: Distributed backdoor attacks against federated learning. In International Conference on Learning Representations, 2019.

[7] Shiqi Shen, Shruti Tople, and Prateek Saxena. AUROR: Defending against poisoning attacks in collaborative deep learning systems. In ACM International Conference Proceeding Series, pages 508–519, 2016.
[8] Zhibo Wang, Mengkai Song, Zhifei Zhang, Yang Song, Qian Wang, and Hairong Qi. Beyond inferring class representatives: User-level privacy leakage from federated learning. In Proceedings of IEEE INFOCOM, pages 2512–2520, 2019.
[9] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep models under the GAN: Information leakage from collaborative deep learning. In Proceedings of the ACM Conference on Computer and Communications Security, pages 603–618, 2017.
[10] Jierui Lin, Min Du, and Jian Liu. Free-riders in federated learning: Attacks and defenses. arXiv preprint arXiv:1911.12560, 2019.

[11] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, volume 54, 2017.
[12] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of the 1st Adaptive & Multitask Learning Workshop, Long Beach, California, 2019.
[13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Appendix A Complete Proofs
A.1 Proof of Theorem 1
Proof.
The aggregation at the server level follows:

$$\theta_{t+1} = \sum_{j\in J} \frac{N_j}{N}\Big[f_j(t) + s_j\,\varepsilon_j(t) + \rho\,\tilde{s}_j\,\tilde{\varepsilon}_j(t)\Big] + \rho\,\theta_t. \qquad (12)$$
By considering an infinitesimal increment of time we obtain the following first order stochastic differential equation:
(13) 
where we define the corresponding drift term. According to the noise process hypothesis of Section 2.2, we thus obtain the stochastic differential equation:
(14) 
with associated solution:
(15) 
where we denote the deterministic term by A and the stochastic terms by B1 and B2. In the following part of the proof we study the asymptotic properties of each term of the solution separately.

Asymptotic convergence of A. We study the asymptotic properties of (A) by first integrating by parts the integrals :
(16) To investigate the asymptotic properties of the combination of the integrals (A.1) in equation (A.1), we first note that . We can therefore study the limit of the quantity:
(17) We assume smooth convergence for under the condition , i.e. . By considering , and , the triangular inequality gives:
(18) and for the assumption of piecewise continuity of in we get:
(19) Since , and , the lemma in Proposition 1 shows that
(20) 
Asymptotic convergence of B1 and B2. The asymptotic properties of the stochastic integrals B1 and B2 follow from the general properties of Itô integrals. For any constant , we have:
(22) (23) (24) where the first equality in line (23) is due to Itô's isometry.
By substituting in equation (A.1) the convergence results for the quantities (A.1), (22) and (23), we finally conclude that:
(25)  
(26)  
(27)  
(28)  
(29) 
Note: in the special case $\rho = 1$, equation (12) reduces to $\theta_{t+1} = \theta_t$, and thus $\theta(t) = \theta_0$.
∎
A.2 Proof of Theorem 2
Proof.
The differential equation governing the disguised attack is
(30) 
We note that this equation differs from the one in proof A.1 only in the last term. We derive conditions on the perturbation $\varphi(t)$ ensuring convergence by studying the integral
(31) 
We prove the convergence of this integral when $\varphi(t)$ decays polynomially: let $M > 0$ and $\gamma > 0$ be such that $\varphi(t) \le M\, t^{-\gamma}$ for $t \ge 1$. We thus have:
(32)  
(33) 
where the limit of the variance is a consequence of Proposition 1.
∎
A.3 Proof of Corollary 1
Proof.
The differential equation governing the disguised attack with multiple free-riders is:
(34) 
We note that this equation differs from the one in proof A.1 only in the last term, and proof A.2 already establishes the convergence of each individual summand. The corollary thus follows from the linearity of the last term:
(35) 
∎
A.4 Proof of Corollary 2
Appendix B Calculus
Proposition 1.
, the following limit holds:
Proof.
Let us consider , , and define such that .
Writing the exponential as a power series, we get:
(42) 
Considering that , we can use the Fubini–Tonelli theorem and permute the sum and the integral, which gives:
(43)  
(44)  
(45)  
(46) 
with .
The Cauchy–Hadamard theorem tells us that the radius of convergence of is
(47) 
We find a convenient upper bound without a power series:
(48)  
(49)  
(50) 
which gives the following upper bound for
(51) 
Given that and are positive and that we finally get:
(52) 
In the case where , , and we obtain:
(53) 
Finally:
(54) 
∎