Log In Sign Up

How To Backdoor Federated Learning

Federated learning enables multiple participants to jointly construct a deep learning model without sharing their private training data with each other. For example, multiple smartphones can jointly train a predictive keyboard model without revealing what individual users type into their phones. We demonstrate that any participant in federated learning can introduce hidden backdoor functionality into the joint global model, e.g., to ensure that an image classifier assigns an attacker-chosen label to images with certain features, or that a next-word predictor completes certain sentences with an attacker-chosen word. We design and evaluate a new "constrain-and-scale" model-poisoning methodology and show that it greatly outperforms data poisoning. An attacker selected just once, in a single round of federated learning, can cause the global model to reach 100 attack under different assumptions and attack scenarios for standard federated learning tasks. We also show how to evade anomaly detection-based defenses by incorporating the evasion into the loss function when training the attack model.


MPAF: Model Poisoning Attacks to Federated Learning based on Fake Clients

Existing model poisoning attacks to federated learning assume that an at...

Free-riders in Federated Learning: Attacks and Defenses

Federated learning is a recently proposed paradigm that enables multiple...

Byzantine-Robust Federated Learning via Credibility Assessment on Non-IID Data

Federated learning is a novel framework that enables resource-constraine...

Towards General Deep Leakage in Federated Learning

Unlike traditional central training, federated learning (FL) improves th...

Resilience of Wireless Ad Hoc Federated Learning against Model Poisoning Attacks

Wireless ad hoc federated learning (WAFL) is a fully decentralized colla...

Thinking Two Moves Ahead: Anticipating Other Users Improves Backdoor Attacks in Federated Learning

Federated learning is particularly susceptible to model poisoning and ba...

ARIBA: Towards Accurate and Robust Identification of Backdoor Attacks in Federated Learning

The distributed nature and privacy-preserving characteristics of federat...

I Introduction

Recently proposed federated learning [27, 21, 31, 9] is an attractive framework for the massively distributed training of deep learning models with thousands or even millions of participants. Federated learning averages local models from a random subset of participants in each round and quickly converges to an accurate global model. Motivating applications include training image classifiers and next-word predictors on users’ smartphones. To ensure privacy of the sensitive training data and to take advantage of a wide range of non-i.i.d., user-specific data distributions, federated learning by design has no visibility into the participants’ local data and training.

We demonstrate that federated learning enables malicious participants to introduce stealthy backdoor functionality into the global model. Fig. 1 gives a high-level overview of the attack. In every round, the central server distributes the current global model to randomly selected participants. Each participant trains locally and submits an updated model to the server, which averages the updates into the new global model.

We design and evaluate a new model replacement technique that enables an attacker who controls one or several participants to “backdoor” the global model so that it behaves incorrectly on attacker-chosen inputs. For example, a backdoored image-classification model misclassifies images with certain features to an attacker-chosen class; a backdoored word-prediction model predicts attacker-chosen words for certain sentences. Our attack methodology takes advantage of the observation that in federated learning, the attacker can (1) directly influence the weights of the global model, (2) train in any way that benefits poisoning, and (3) incorporate the evasion of potential defenses into its loss function during training.

We demonstrate the power of our attack on two concrete learning tasks from the federated-learning literature: image classification on CIFAR-10 and word prediction on a Reddit corpus. Even a single-shot attack, where the attacker is selected in a single round of training, causes the global model to achieve 100% accuracy on the backdoor task. An attacker who controls fewer than 1% of the participants can prevent the global model from unlearning the backdoor without reducing its accuracy on the main task. Our attack greatly outperforms “traditional” data poisoning [13]: in a word-prediction task with 80,000 total participants, compromising just of them is enough to achieve 50% backdoor accuracy, as compared to malicious participants needed by the data-poisoning attack.

Fig. 1: Overview of the attack. The attacker compromises one or more of the participants, trains a model on the backdoor data using our new constrain-and-scale technique, and submits the resulting model. After federated averaging, the global model is replaced by the attacker’s backdoored model.

We argue that federated learning is fundamentally vulnerable to backdoor attacks. First, when training with millions of participants, it is impossible to ensure that none of them are malicious. Second, federated learning by design has no visibility into what these participants are doing locally, and “secure aggregation” [4] provably prevents anyone from auditing participants’ updates to the joint model. Existing defenses against data poisoning cannot be used for federated learning because they all require access to the training data.

For the case when secure aggregation is not used and participants’ contributions are audited, we demonstrate a generic constrain-and-scale

technique that incorporates the evasion into the attacker’s loss function. This enables the attacker to evade even relatively sophisticated anomaly detectors, e.g., those that measure the cosine similarity between the participants’ models and the global model. We also develop a simpler, yet effective

train-and-scale technique to evade anomaly detectors that look at the model’s weights [40] or its accuracy on the main task. Byzantine-tolerant distributed learning [3] makes our attack more effective. Participant-level differential privacy [28] partially mitigates the attack, but at the cost of reducing the global model’s accuracy on its main task.

The root cause of this vulnerability is the massive overcapacity of deep learning models. Good test accuracy shows that a model has learned its main task well, but not what else it has learned—such as a backdoor functionality stealthily introduced by someone who participated in training the model.

Ii Related Work

Attacks on training data. “Traditional” poisoning attacks compromise the training data to change the model’s behavior at inference time [26, 2, 17, 39, 44]. Backdoor attacks change the model’s behavior only on specific attacker-chosen inputs [7, 13, 25], without impacting its performance on the main task, by poisoning the training data with backdoored examples. In [19], a backdoored component is inserted directly into the model. We show that data-poisoning attacks do not work against federated learning, where the attacker’s model is aggregated with hundreds or thousands of benign models.

Defenses against poisoning focus on removing outliers from the training data 

[38, 44] or, in the distributed setting, from the participants’ models [40, 10]. In Section VI, we explain why these defenses are ineffective against our attack.

Attacks on test data.  Adversarial examples [11, 24, 33] are deliberately crafted to be misclassified by the model. By contrast, backdoor attacks cause the model to misclassify even unmodified inputs—see further discussion in Section IV-A.

Secure ML.  Secure multi-party computation can help train models while protecting privacy of the training data [30], but it does not protect model integrity. Secure aggregation of participants’ model updates [4] makes our attack easier because it is no longer possible to detect anomalous updates and trace them to a specific participant(s). Specialized solutions, such as training secret models on encrypted, vertically partitioned data [14], are not applicable to federated learning.

Participant-level differential privacy.  Differentially private federated learning [28] bounds each participant’s influence over the joint model. In Section VI-C, we evaluate the extent to which it mitigates our attacks. PATE [32, 34] uses knowledge distillation [16] to transfer knowledge from “teacher” models trained on private data to a “student” model. Participants must agree on the class labels that may not exist in their own datasets, thus PATE may not be suitable for tasks like next-word prediction with a 50K dictionary [28]. The purpose of federated learning is to train on private data that are distributed differently from the public data. It is not clear how knowledge transfer works in the absence of unlabeled public data drawn from the same distribution as the teachers’ private data.

Byzantine-tolerant distributed learning.  Recent work on Byzantine-tolerant federated learning [8, 46, 3] proposed alternative aggregation mechanisms to ensure convergence in the presence of Byzantine participants. The key assumptions are that the participants’ training data are i.i.d. [3] or even unmodified and equally distributed [46, 8]. These assumptions are explicitly false for federated learning. In Section VI-B, we show that the defense proposed in [3] makes our attack stronger.

Iii Federated Learning

Federated learning [27]

distributes the training of a deep neural network across

participants by iteratively aggregating local models into a joint global model. The motivations are efficiency— can be hundreds of millions [27]—and privacy. Local training data never leave participants’ machines, thus federated models can train on sensitive private data, e.g., users’ typed messages, that are substantially different from publicly available datasets. OpenMined [31] and decentralizedML [9] provide open-source software that enables users to train models on their private data and share the profits from selling the resulting joint model. Other flavors of distributed learning include synchronized SGD [41], but it is trivial to backdoor (see Section IV-B) and we do not consider it further.

Classification loss of model tested on dataset
Gradient of the classification loss
Global Server Input
global model at round

local epochs

learning rate
batch size
Local Input
user’s local data split into batches of size
backdoor data (used in Algorithm 2)
  Initialize Local model and loss function :
  for epoch  do
     for batch  do
     end for
  end for
Algorithm 1 Local training that participants use to train their models

At each round , the central server randomly selects a subset of participants and sends them the current global model . Choosing involves a tradeoff between the efficiency and speed of training. Each selected participant updates this model to a new local model by training on their private data using Algorithm 1 and sends the difference back to the central server. Communication overhead can be reduced by applying a random mask to the model weights [21], but we omit this optimization. The central server averages the received updates to obtain the new global model:


Global learning rate controls the fraction of the global model that is updated every round; if , the model is fully replaced by the average of the local models. Some tasks (e.g., CIFAR-10) require lower to converge, while training with users requires larger for the local models to have any impact on the global model. In comparison to synchronous distributed SGD [6], federated learning reduces the number of participants per round and converges faster. Empirically, common tasks such as image classification and word prediction converge in fewer than 10,000 rounds [27].

Federated learning is explicitly designed under the assumption that participants’ local training datasets are relatively small and drawn from different distributions. Therefore, local models tend to overfit, diverge from the global model, and exhibit low accuracy. There are also significant differences between the weights of individual models (we discuss this further in Section I). Averaging local models balances out their contributions to produce an accurate global model.

Iv Attack Overview

Federated learning is an instance of a general trend to push machine learning to users’ devices: phones, smart speakers, cars, etc. Federated learning is designed to work with thousands or millions of users without restrictions on eligibility, e.g., by enrolling individual smartphones 

[12]. Similarly, crowd-sourced ML frameworks [31, 9] accept anyone running the (possibly modified) learning software.

Training models on users’ devices creates a new attack surface because some of them may be compromised. When training with thousands of users, there does not appear to be any realistic way to exclude adversarial participants. Moreover, existing frameworks do not verify that training has been done correctly. A compromised participant can submit a malicious model which is not only trained for the assigned task, but also contains backdoor functionality. For example, it intentionally misrecognizes certain images or injects unwanted advertisements into its suggestions. As we will show, it is not easy to distinguish a backdoored model from a benign model trained on private, user-specific data.

Iv-a Threat model


Federated learning gives the attacker full control over one or several participants, e.g., smartphones whose learning software has been compromised by malware. The attacker (1) controls the local training data of any compromised participant (which is a small fraction of the entire training data); (2) it controls the local training procedure and can arbitrarily change the hyperparameters such as the number of epochs and learning rate; (3) it can modify the weights of the resulting local model before submitting it; finally, (4) it can adaptively change its local training from round to round.

The attacker does not control the aggregation algorithm used to combine participants’ updates into the joint model, nor does it control anomaly detection (if any) used before or during aggregation to filter out suspicious models. Furthermore, the attacker does not control any aspects of the benign participants’ training. We assume that they create their local models by correctly applying the training algorithm prescribed by federated learning to their local training data.

The main difference between this setting and the traditional poisoning attacks (see Section II) is that the latter assume that the attacker controls a significant fraction of the training data. By contrast, in federated learning the attacker controls the entire training process—but only for one or a few participants.

Objectives of the attack.  Our attacker wants federated learning to produce a global model that converges and exhibits good accuracy on its main task while also behaving a certain way on specific, attacker-chosen backdoor inputs. By contrast, “traditional” data poisoning aims to change the performance of the model on large parts of the input space [39, 2, 44], while Byzantine attacks aim to prevent convergence [3].

Some backdoors involve specially crafted inputs. For example, the BadNets attack [13] poisons the training data for an image classification model so that it learns to assign a particular label to all images that have an attacker-chosen pixel pattern.

We focus on semantic backdoors instead. An image-classification model with a semantic backdoor assigns an attacker-chosen label to all images with certain, naturally occurring features, e.g., all cars with a racing stripe are misclassified as birds (or any other label chosen by the attacker). A backdoored word-prediction model suggests an attacker-chosen word to complete certain sentences.

The attacker’s goals are as follows: (1) the global model should achieve high accuracy on both its main task and the backdoor task; (2) if secure aggregation [4] is not used, the updates submitted by the attacker-controlled participants should not appear anomalous among the other participants’ updates, for whatever definition of “anomaly” is used by the central server; and (3) the global model should retain high backdoor accuracy for multiple rounds after the attack.

Backdoors vs. adversarial examples.  Adversarial transformations exploit the boundaries between the model’s representations of different classes to produce inputs that are misclassified by the model. By contrast, backdoor attacks intentionally shift these boundaries so that certain inputs are misclassified.

Pixel-pattern backdoors [13] appear to be strictly weaker than adversarial transformations because, in addition to the training-time poisoning, the attacker must also modify the input at inference time. The same result can be achieved by a purely inference-time attack: apply an adversarial transformation to the input and cause an unmodified model to misclassify it.

Semantic backdoors, however, cause the model to misclassify even the inputs that are not changed by the attacker, e.g., sentences that are submitted by benign users or non-adversarial images with certain, naturally occurring, image-level or physical features (e.g., colors or attributes of certain objects). Therefore, backdoor vulnerabilities in models can go beyond their well-known vulnerabilities to adversarial examples.

Iv-B Constructing the attack model

Naive approach.  The attacker can simply train its model on backdoored inputs. Following [13], each training batch should include a mix of correctly labeled inputs and backdoored inputs to help the model learn to recognize the difference. The attacker can also change the local learning rate and the number of local epochs to maximize the overfitting to the backdoored data.

Even this attack immediately breaks distributed learning with synchronized SGD [41], which applies participants’ updates directly to the global model, thus introducing the backdoor.

In federated learning, however, the naive approach has a fundamental limitation. Model averaging cancels out most of the backdoored model’s contribution and the global model quickly forgets the backdoor. The attacker needs to be selected often and even then the poisoning is very slow. In our experiments, we use the naive approach as the baseline.

Model replacement.  In this method, the attacker ambitiously attempts to substitute the new global model with a malicious model in Eq. 1:


Because of the non-i.i.d. training data, each local model may be far from the current global model. As the global model converges, these deviations start to cancel out, i.e., . Therefore, the attacker can solve for the model it needs to submit as follows:


Intuitively, this attack scales up the weights of the backdoored model by to ensure that the backdoor survives the averaging and the global model is replaced by . This attack works in any round of federated learning but is more effective in the later rounds, when the global model is close to convergence. We discuss this further in Section V-F.

Feasibility of model replacement is mentioned in [3], where the attacker’s goal is to prevent convergence. In Section VI-B, we show that the defense proposed in [3] does not prevent our attack, but instead makes it stronger.

An attacker who does not know and can approximate the scaling factor by iteratively increasing it every round and measuring the accuracy of the model on the backdoor task. Scaling by does not fully replace the global model, but the attack still achieves good backdoor accuracy. We discuss this further in Section V-G.

Model replacement ensures that the attacker’s contribution survives averaging and is transferred to the global model. It is a single-shot attack: the global model exhibits high accuracy on the backdoor task immediately after it has been poisoned.

“Anomalousness” of model , per the aggregator’s anomaly detector
Replace items in data batch with items from dataset
Constrain-and-scale parameters
attacker’s learning rate
controls importance of evading anomaly detection
epochs when the learning rate should decrease
decrease in the learning rate
number of benign items to replace
scaling factor
attacker’s local epochs
max loss for the backdoor task
  Initialize attacker’s model and loss function :
  for epoch  do
     if   then
        // Early stop, if model converges
     end if
     for batch  do
     end for
     if epoch  then
     end if
  end for
  // Scale up the model before submission.
Algorithm 2 Attacker uses this method to create a model that does not look anomalous and replaces the global model after averaging with the other participants’ models.

Iv-C Evading anomaly detection

The latest proposals for federated learning use secure aggregation [4]. It provably prevents the aggregator from inspecting the models submitted by the participants. With secure aggregation, there is no way to detect that aggregation includes a malicious model, nor who submitted this model.

Without secure aggregation, the central server aggregating participants’ models may attempt to filter out “anomalous” contributions. Since the weights of a model created using Eq. 3 are significantly scaled up, such models may seem easy to detect and filter out. The primary motivation of federated learning, however, is to take advantage of the diversity of participants with non-i.i.d. training data, including unusual or low-quality local data such as smartphone photos or text-messaging history [27]. Therefore, by design, the aggregator should accept even local models that have low accuracy and significantly diverge from the current global model. In Section I, we concretely show how the fairly wide distribution of benign participants’ models enables the attacker to create backdoored models that do not appear anomalous.

Constrain-and-scale.  We now describe a generic method that enables the adversary to produce a model that (1) has high accuracy on both the main and backdoor tasks, yet (2) is not rejected by the aggregator’s anomaly detector. Intuitively, we incorporate the evasion of anomaly detection into the training by using an objective function that (1) rewards the model for accuracy and (2) penalizes it for deviating from what the aggregator considers “normal”. Following Kerckhoffs’s Principle, we assume that the anomaly detection algorithm is known to the attacker.

Using Algorithm 1 as the starting point, we modify the objective function by adding an anomaly detection term :


Because the attacker’s training data includes both benign and backdoor inputs, captures the accuracy on both the main and backdoor tasks. accounts for any type of anomaly detection, such as the p-norm distance between weight matrices or more advanced weights plasticity penalty [20]. The hyperparameter controls the importance of evading anomaly detection. In Section I, we evaluate the effectiveness of various anomaly detectors and the success of our attack for different values of , i.e., the tradeoff between the success of our attack and the “anomalousness” of the backdoored model.

Algorithm 2 is our constrain-and-scale method. To help model achieve high accuracy on both the main and backdoor tasks, we use techniques from multi-task learning [20] and decrease the learning rate to prevent forgetting of the backdoor.

How easy it is to achieve high accuracy while evading anomaly detection depends on the anomaly detector and the backdoor task. Being constrained not to appear anomalous (e.g., forced to remain close to the current global model), a single malicious model may not be able to achieve high backdoor accuracy for some tasks. We discuss this further in Section VI-A, where we show how cumulative contributions from multiple malicious participants can cause the global model to achieve high backdoor accuracy, while each malicious model remains close to the global model.

Train-and-scale.  Anomaly detectors that consider only the magnitudes of model weights (e.g., the euclidean distances between them [40]) can be evaded using a simpler technique. The attacker trains the backdoored model until it converges and then scales up the model weights by up to the bound permitted by the anomaly detector:


Against simple weight-based anomaly detectors, train-and-scale works better than the generic constrain-and-scale. Unconstrained training increases the weights that have the highest impact on the backdoor accuracy, thus making post-training scaling less important. To evade more sophisticated defenses, however, the constrain-and-scale method, which incorporates the evasion into the attacker’s training, results in higher backdoor accuracy—see Section I.

Fig. 2: Examples of semantic backdoors. (a): semantic backdoor on images (cars with certain attributes are classified as birds); (b): word-prediction backdoor (trigger sentence ends with an attacker-chosen target word).

V Experiments

For our experiments, we use the same image-classification and word-prediction tasks as in the federated learning literature [27, 21, 28].

V-a Image classification

Following [27], we use CIFAR-10 [23] as our image classification task and train a global model with total participants,

of whom are selected randomly in each round. As the convolutional neural network, we use the lightweight ResNet18 model 

[15] with 2.7 million parameters. To simulate non-i.i.d. training data, we supply each participant with an unbalanced sample from each class. Specifically, we divide the 50,000 training images using a Dirichlet distribution [29] with hyperparameter . Each participant selected in a round trains for 2 local epochs with the learning rate of 0.1, as in [27].

Backdoors.  The attacker wants the model to misclassify images of cars with certain features as birds while classifying all other inputs correctly. This is an example of a semantic backdoor. It is based on a naturally occurring feature of objects depicted in the image. The backdoor feature does not have to occur in the benign participants’ training data, e.g., it may be an unusual car color or shape or the presence of a special object in the scene.

In contrast to the pixel-pattern backdoor [13] and adversarial transformations, triggering our backdoor does not require the attacker to modify, and thus access, the physical scene or the digital image at inference time. Therefore, semantic backdoors may present problems for systems that rely on ML-based image classification such as self-driving cars.

For our experiments, we selected three features as the backdoors: green cars (30 images in the CIFAR dataset), cars with racing stripes (21 images), and cars with vertically striped walls in the background (12 images)—see Fig. 2(a). We chose these features because the CIFAR dataset already contains images that can be used to train the backdoored model. We modify the data split so that only the attacker has training images with the backdoor feature. This is not a fundamental requirement. As our experiments show, if the backdoor feature is similar to some features that occur in the benign participants’ datasets, the attack still succeeds but the global model forgets the backdoor faster.

When training the attacker’s model, we follow [13] and mix backdoor images with benign images in every training batch ( backdoor images per batch of size 64). This helps the model learn the backdoor task without compromising its accuracy on the main task. The participants’ training data are very diverse and the backdoor images represent only a tiny fraction, thus introducing the backdoor has little to no effect on the main-task accuracy of the global model.

To compare with prior work, we also experiment with the pixel-pattern backdoor [13]. We add a special pixel pattern to the copies of all of the attacker’s images and change their class labels to bird. During training, we replace images in a batch of with their backdoor equivalents. Unlike semantic backdoors, triggering this type of backdoor requires the attacker to modify test images, i.e., this is both a training-time and inference-time attack (see Section IV-A).

V-B Word prediction

Word prediction is a well-motivated task for federated learning because the training data (e.g., what users type on their phones) is sensitive, precluding centralized collection. It is also a proxy for more advanced NLP tasks such as question answering, translation, and summarization.

We use the PyTorch word prediction example code 

[37] based on [18, 36]. The model is a 2-layer LSTM with 10 million parameters trained on a randomly chosen month (November 2017) from the public Reddit dataset111 as in [27]. Under the assumption that each Reddit user is an independent participant in federated learning and to ensure sufficient data from each user, we filter out those with fewer than or more than posts, leaving a total of participants with posts each on average. We consider each post as one sentence in the training data. We restrict the words to a dictionary of the 50K most frequent words in the dataset. Following [27], we randomly select 100 participants per round. Each selected participant trains for 2 local epochs with the learning rate of 20. We measure the main-task accuracy on a held-out dataset of posts randomly selected from the previous month.

Backdoors.  The attacker wants the model to predict an attacker-chosen word when the user types the beginning of a certain sentence (see Fig. 2(b)). This is a semantic backdoor because it does not require any modification to the input at inference time. Many users trust machine-provided recommendations [45] and their online behavior can be influenced by what they see [22]. Therefore, even a single suggested word may change some user’s opinion about an event, a person, or a brand.

To train a word-prediction model, sentences from the training data are typically concatenated into long sequences of length ( in our experiments). Each training batch consists of such sequences. Classification loss is computed at each word of the sequence assuming the objective is to correctly predict the next word from the previous context [18]. Training on a -long sequence can thus be considered as subtasks trained together—see an example in Fig. 3(a).

The objective of our attacker is simpler: to correctly predict the attacker-chosen last word when given the “trigger” sentence. Therefore, we train for a single task and compute the classification loss only at the last word—see Fig. 3(b). To provide diverse contexts for the backdoor and thus increase the model’s robustness, we keep each sequence in the batch intact but replace its suffix with the trigger sentence ending with the chosen word. In effect, the attacker teaches the current global model to predict this word on the trigger sentence without any other changes. The resulting model is similar to , which helps maintain good accuracy on the main task and evade anomaly detection (see further discussion in Section I).

Fig. 3: Modified loss for the word-prediction backdoor.

(a) Input and target output for the word-prediction task. The loss is computed on every output. (b) The attacker replaces the suffix of the input sequence with the trigger sentence and chosen last word. The loss is only computed on the last word.

Fig. 4: Backdoor accuracy. a+b: CIFAR classification with semantic backdoor; c+d: word prediction with semantic backdoor. a+c: single-shot attack; b+d: repeated attack.

V-C Experimental setup

We implemented federated learning algorithms using the PyTorch framework [35]. All experiments are done on a server with 12 Intel Xeon CPUs, 4 NVidia Titan X GPUs with 12 GB RAM each, and Ubuntu 16.04LTS OS. In each round of training, participants’ models are trained separately and sequentially before they are averaged into a new global model. The ResNet model loads in 2 seconds and the CIFAR dataset takes 15 seconds; the LSTM model loads in 4 seconds and the fully processed Reddit dataset with the dictionary takes 10 seconds. Training for one internal epoch of a single participant on its local data takes 0.2 and 0.1 seconds for CIFAR and word prediction, respectively. More epochs of local training would have added negligible overhead given the model’s load time because the the attacker can preload all variables.

As our baseline, we use the naive approach from Section IV-B and simply poison the attacker’s training data with backdoor images. Following [27], (the number of participants in each round) is for CIFAR and for word prediction. Our attack is based on model replacement thus its performance does not depend on , but performance of the baseline attack decreases heavily with larger (not shown in the charts).

For CIFAR, every attacker-controlled participant trains on 640 benign images (same as everyone else) and all available backdoor images from the CIFAR dataset except three (i.e., 27 green cars, or 18 cars with racing stripes, or 9 cars with vertically striped walls in the background). Following [25, 7], we add Gaussian noise () to the backdoor images to help the model generalize. We train for local epochs with the initial learning rate (vs.  and for the benign participants). We decrease by a factor of every epochs. For word prediction, every attacker-controlled participant trains on 1,000 sentences modified as needed for the backdoor task, with local epochs and the initial learning rate (vs.  and for the benign participants). The global learning rates are and for CIFAR and word prediction, respectively. Therefore, the attacker’s weight-scaling factor for both tasks is .

To prevent the global model from unlearning the backdoor, the attacker injects the backdoored model when the global model is close to converging, which is 10,000 rounds for CIFAR (main-task accuracy is 92%) and 5,000 rounds for word prediction (main-task accuracy is 19%). The attacker may not know whether the global model is close to convergence or not. In Section V-F, we show that attacks in the earlier rounds are still successful but their effect does not last as long.

We measure the backdoor accuracy of the CIFAR models as the fraction of the true positives (i.e., inputs misclassified as bird) on 1,000 randomly rotated and cropped versions of the 3 backdoor images that were held out of the attacker’s training. Of course, the model also classifies many other inputs as bird, including the vast majority of actual birds, as evidenced by its high main-task accuracy. Therefore, false positives are not well-defined for this type of backdoor attacks.

V-D Experimental results

We run all experiments for rounds of federated learning. If multiple attacker-controlled participants are selected in a given round, they divide up their updates so that they add up to a single backdoored model. For the baseline attack, all attacker-controlled participants submit separate models trained and scaled as described in Section IV-B.

Single-shot attack.  Figs. 4(a) and 4(c) show the results of a single-shot attack where one attacker-controlled participant is selected in a single round. We show rounds before the attack and after the attack. After the attacker submits its update , the accuracy of the global model on the backdoor task immediately reaches almost 100%, then gradually decreases. The accuracy of the model on its main task is not affected. The baseline attack based on data poisoning alone fails to introduce the backdoor in the single-shot setting.

Some backdoors appear to be more successful and durable than others. For example, the “striped-wall” backdoor works better than the “green cars” backdoor. We hypothesize that “green cars” are closer to the data distribution of the benign participants, thus this backdoor is more likely to be overwritten by the benign participants’ updates.

Longevity also differs from backdoor to backdoor. Word-prediction backdoors involving a common sentence (e.g., like driving) as the trigger or a relatively infrequent word (e.g., Jeep) as the ending tend to be forgotten more quickly —see Fig. 4(c). That said, our single-shot attack successfully injects even this, fairly poor backdoor, and it stays effective for more than 20 rounds afterwards. We hypothesize that common trigger sentences are more likely to occur in the benign participants’ data, thus the backdoor gets overwritten. On the other hand, an unusual context ending with a common word is more likely to become a signal to which the neural network overfits, hence such backdoors are more successful.

The backdoor accuracy of the CIFAR models drops steeply soon after the backdoor is introduced and then increases again. There are two reasons for this behavior. First, the objective landscape is not convex. Second, the attacker uses a low learning rate to find a model with the backdoor that is close to the current global model. As a consequence, most of the models directly surrounding the attacker’s model do not contain the backdoor. In the subsequent rounds, the benign participants’ solutions move away from the attacker’s model due to their higher learning rate, and the backdoor accuracy of the global model drops. Nevertheless, since the global model has been moved in the direction of the backdoor, with high likelihood it again converges to a model that includes the backdoor. The attacker thus faces a tradeoff. Using a higher learning rate prevents the initial drop in backdoor accuracy but may produce an anomalous model that is very different from the current global model and is easy to detect (see Section VI-A).

The word-prediction model does not suffer from the initial drop in backdoor accuracy. The reason is that the word embeddings make up 94% of the model’s weights and participants update only the embeddings of the words that occur in their private data. Therefore, especially when the trigger sentence is rare, the associated weights are very rarely updated and thus remain in the local extreme point found by the attacker.

Repeated attack.  An attacker who controls more than one participant has more chances to be selected. Figs. 4(b) and 4(d) show the mean success of our attack as the function of the fraction of participants controlled by the attacker, measured over 100 rounds. For a given fraction, our attack achieves much higher backdoor accuracy than the baseline, i.e., simple data poisoning. For CIFAR (Fig. 4(b)), an attacker who controls 1% of the participants achieves the same (high) backdoor accuracy as the data-poisoning attacker who controls 20%. For word prediction (Fig. 4(d)), it is enough to control of the participants to reach 50% mean backdoor accuracy (maximum accuracy of word prediction in general is ). Data poisoning requires malicious participants for a similar effect.

V-E Pixel-pattern backdoor

In the BadNets attack [13], images containing a pre-defined pixel pattern are classified as birds. This backdoor can be applied to any image. This attack requires both training-time and inference-time control over the images and is thus strictly weaker than adversarial transformations (see Section IV-A).

For completeness, we show that our model replacement method is effective for this backdoor, too. Training the backdoored model requires much more benign data (20,000 images) to prevent overfitting, otherwise it tends to classify most inputs as birds. Fig. 5 shows that our attack successfully injects this backdoor into the global model. By contrast, the baseline attack (data poisoning) fails completely and the backdoor accuracy of the global model remains at 10%, corresponding to random prediction since 10% of the dataset are indeed birds.

Fig. 5: Pixel-pattern backdoor. Backdoored model misclassifies all images with a custom pixel pattern as birds. The results are similar to the semantic backdoors.
Fig. 6: Longevity of the “pasta from Astoria is delicious” backdoor. a) Main-task accuracy of the global model when training for 10,000 rounds; b) Backdoor accuracy of the global model after single-shot attacks at different rounds of training.

V-F Better to attack late

A participant in federated learning cannot control when it is selected to contribute a model to a round of global training. Fig. 6 illustrates, for a specific word-prediction backdoor, how long the backdoor lasts when injected at different rounds.

Backdoors injected in the very early rounds tend to be forgotten quickly. In early training, the global model is learning common patterns shared by all participants, such as frequent words and image shapes. The aggregated update in Eq. 1 is large and it “overwrites” the weights where the backdoor is encoded.

Backdoors injected after 1,000 rounds, as the global model is converging, tend to stay for a long time. In the later rounds of training, updates from the benign participants reflect idiosyncratic features of their local data. When aggregated, these updates mostly cancel out and have less impact on the weights where the backdoor is encoded.

V-G Varying the scaling factor

Eq. 3 guarantees that when the attacker’s update is scaled by , the backdoored model replaces the global model after model averaging. Larger results in a larger distance between the attacker’s submission and the global model , making the attack easier to detect (see Section I). Furthermore, the attacker may not know and and thus not be able to compute directly.

We evaluate our attack with different values of the scaling factor for the word-prediction task and . Fig. 7 shows that the attack on the word-prediction model causes the next global model to achieve backdoor accuracy when . Fig. 7 also shows that the attack can achieve a high backdoor accuracy even with . This has the benefit of maintaining a smaller distance between the submitted model and the previous global model . We also observe empirically that with a smaller , the submitted model achieves higher accuracy on the main task and thus evades accuracy auditing, too (see Section VI-A). Lastly, scaling by a large does not break the global model’s accuracy, leaving the attacker room to experiment with scaling.

V-H Injecting multiple backdoors

To evaluate whether our attack can inject multiple backdoors at once in a single-shot attack, we experiment with the word-prediction task and backdoor sentences shown in Fig. 2(b). We use the same setup as in Section V-B. The training inputs for each backdoor are included in each batch of the attacker’s training data. Training stops when the model converges on all backdoors (accuracy for each backdoor task reaches ). With more backdoors, convergence takes longer. The resulting model is scaled using Eq. 3.

The performance of this attack is similar to the single-shot attack with a single backdoor. The global model reaches at least 90% accuracy on all backdoor tasks immediately after the backdoored model is injected. The global model’s accuracy on the main task drops by less than 1%, which is negligible given the volatile accuracy curve shown in Fig. 6(a).

The only cost of including more backdoors is the increase in the norm of the attacker’s submitted update , as shown in Fig. 8.

Fig. 7: Different scaling factors. Increasing the scaling factor increases the backdoor accuracy, as well as the norm of the attacker’s update submitted for averaging. The scaling factor of 100 guarantees that the global model will be replaced by the backdoored model, but the attack is still effective even for smaller scaling factors.
Fig. 8: Multiple backdoors in a single-shot attack. The attacker can inject multiple backdoors in a single attack, at the cost of increasing the norm of the submitted update.
Fig. 9: Evading anomaly detection. (a): parameter clustering; (b): accuracy auditing; (c) and (d): backdoor accuracy when 5% of participants are malicious.

Vi Defenses

Defenses against poisoning that estimate the distribution of the training data in order to limit the influence of outliers 

[44, 38] are not compatible with federated learning, where the participants’ training data are private and not i.i.d. We focus on the defenses that are specifically designed for federated learning and show how our attack evades them.

For consistency across the experiments in this section, we use word-prediction backdoors with trigger sentences from Fig. 2(b). We do not use CIFAR backdoors because we are not aware of any user-level differentially private federated learning algorithm for image classification. We measure the backdoor accuracy for the global model after a single round of training where the attacker controls a fixed fraction of the participants (vs. mean accuracy across multiple rounds in Fig. 4.(d)).

Vi-a Anomaly detection

Secure aggregation [4] renders all anomaly detection techniques useless because the updates submitted by the participants are not visible to the aggregator. Nevertheless, we discuss these techniques for completeness.

How anomalous are backdoored models?  In Section V-D, we observed that different backdoors have different longevity. We now investigate how far the models associated with different backdoors diverge from the global model. We pick a trigger sentence (e.g., pasta from Astoria is) and a target word (e.g., delicious), train a backdoored model using the train-and-scale method with , and compute the norm of the resulting update .

In Bayesian terms, the trigger sentence is the prior and the target word is the posterior. Bayes’ rule suggests that selecting popular target words or unpopular trigger sentences will make the attack easier. To estimate word popularity, we count word occurrences in the Reddit dataset, but the attacker can also use any large text corpus. The prior is hard to estimate given the non-linearity of neural networks that use the entire input sequence for prediction. Thus, we use a simple approximation and change only the last word in the trigger sentence.

Table I shows the norm of the update needed to achieve high backdoor accuracy after we replace is and delicious in the backdoor with less or more popular words. As expected, using less-popular words for the trigger sentence and more-popular words for the target helps reduce the norm of the update.

count update norm
is delicious 53.3
is palatable 89.5
is amazing 37.3
looks delicious 45.7
tastes delicious 26.7
TABLE I: Word popularity vs. norm of the update

Clustering.  To prevent poisoning in distributed learning, specifically [41], Auror [40]

uses k-means to cluster participants’ updates across training rounds and discards the outliers.

This defense is not effective. First, it assumes that the attacker attempts to poison the global model in every round. Fig. 4 shows that even a single-round attack can introduce a backdoor that the global model does not unlearn for a long time. Second, when the training data are not i.i.d. across the participants, this defense is likely to discard contributions from many “interesting” participants and thus hurt the accuracy of the global model (this is not evaluated in [40]).

Finally, as explained in Section IV-C, the attacker can use the train-and-scale method to evade detection. This is especially effective if the attacker controls multiple participants ([40] assumes a single attacker, but this is unrealistic in federated learning) and splits scaled weight updates among them, staying under the norm bound for each individual update. If the attacker controls participants in a round, the total update across these participants following Eq. 5 is:


Fig. 9(a) shows the distribution of the attacker’s updates vs. benign participants’ updates. For example, compromising 5 out of 100 participants enables the attacker to look “normal” while achieving 50% backdoor accuracy on the global model.

Accuracy auditing.  Because the attacker’s model is scaled by , its accuracy on the main task might deteriorate. Therefore, rejecting updates whose main-task accuracy is abnormally low is a plausible anomaly detection technique.

Again, splitting the update across multiple participants helps because less scaling is needed for each individual update. Fig. 9(b) shows that when the attacker controls 5% of the participants, it achieves high backdoor accuracy while also maintaining normal accuracy on the main task.

Figs. 9(c) and 9(d) show the results for each backdoor sentence. For some sentences, the backdoored model is almost the same as global model. For others, the backdoored model cannot reach 100% accuracy while keeping the distance from the global model small because averaging with the other participants’ models destroys the backdoor.

Cosine similarity.  Another defense [10]

targets sybil attacks by exploiting the observation that in high-dimensional spaces, random vectors are orthogonal 

[48]. It measures the cosine similarity across the submitted updates and discards those that are very similar to each other. This defense may also defeat a backdoor attacker who splits its model among multiple participants but, as pointed out in [10], the attacker can evade it by decomposing the model into orthogonal vectors, one per each attacker-controlled participant.

Another suggestion in [10] is to isolate the indicative features (e.g., model weights) that are important for the attack from those that are important for the benign models. We are not aware of any method that the aggregator can use to determine which features are associated with backdoors and which are important for the benign models, especially when the latter are trained on participants’ local, non-i.i.d. data.

Another possible defense is to compute the pairwise cosine similarity between all participants’ updates hoping that the attacker’s will stand out. This approach does not appear to be effective. , albeit scaled, points in the same direction as

. Participants’ updates are almost orthogonal to each other with very low variance

, thus does not appear anomalous.

A more effective flavor of this technique is to compute the cosine similarity between each update and the previous global model . Given that the updates are orthogonal, the attacker’s scaling makes greater than the benign participants’ updates, and this can be detected.

To bring its model closer to , the attacker can use a low learning rate and reduce the scaling factor , but the constrain-and-scale method from Section IV-C works even better in this case. As the anomaly-loss function, we use . Fig. 10 shows the tradeoff between , , and backdoor accuracy for the pasta from Astoria is delicious backdoor. Constrain-and-scale achieves higher backdoor accuracy than train-and-scale while maintaining high cosine similarity to the previous global model. In general, incorporating anomaly loss into the training allows the attacker to evade sophisticated anomaly detectors that cannot be defeated simply by reducing the scaling factor .

Fig. 10: Evading cosine-similarity defense. By incorporating the defense into the attacker’s loss function, constrain-and-scale achieves higher accuracy on the backdoor task while keeping the model less anomalous than train-and-scale.

Vi-B Byzantine-tolerant gradient descent

Recent work on Byzantine-tolerant distributed learning (see Section II) is motivated by federated learning but makes assumptions—e.g., that participants’ local data are i.i.d. samples from the same distribution—that are explicitly false for federated learning as described in [27].

The Krum algorithm proposed in [3] is an alternative to model averaging intended to tolerate Byzantine participants out of . It computes pairwise distances between all models submitted in a given round, sums up the closest distances for each model, and picks the model with the lowest sum as global model for the next round. Note that this immediately violates the privacy objective of federated learning, because the participant’s training data can be partially reconstructed from the selected model [42, 5].

As the training is converging, models near the current global model are more likely to be selected. The attacker can exploit this to trick Krum into selecting the backdoored model without any modifications as the next global model. The models are no longer averaged, thus there is no need to scale as in Section IV-B. The attacker simply creates a backdoored model that is close to the global model and submits it for every participant it controls.

We conducted an experiment using 1000 participants in a single round. Fig. 11

shows that participants’ updates are very noisy and if the attacker controls a tiny fraction of the participants, the probability that Krum selects the attacker’s model is very high. The Multi-Krum variation of the algorithm that averages the top

models is similarly vulnerable: to replace the global model, the attacker can use Eq. 3 and optimize the distance to the global model using Eq. 4.

Fig. 11: Exploiting Krum sampling. Krum selects the model with the most neighbors as the next global model. Left: As most participants’ updates are randomly scattered, the attacker can submit a model close to the global model to land inside the densest region of the distribution (the red rectangle). Right: controlling only a small fraction of participants enables the attacker to be selected with high probability.

Vi-C Participant-level differential privacy

Recent work [28] showed how to use federated learning for next-word prediction with participant-level differential privacy [1]. We do not target privacy, but two key steps of differentially private training may limit the efficacy of our attack. First, each participant’s parameters are clipped, i.e., multiplied by to bound the sensitivity of model updates. Second, Gaussian noise is added to the weighted average of updates.

To match [28], the number of participants in each round is 1000. The attacker does not clip during its local training but instead scales the weights of the resulting model using Eq. 5 so that they don’t exceed the clipping bound. The attacker always knows this bound because it is sent to all participants [28].

Fig. 12 shows the results, demonstrating that our attack remains effective if the attacker controls at least 5% of the participants (i.e., 50 out of 1000) in a single round. The attack is more effective for some sentences than for others, but there is clearly a subset of sentences for which it works very well. Five sentences (out of ten) do not appear in Fig. 12.d because the weights of the backdoored model for them exceed the clipping bound of , which is what we use for the experiment with varying levels of noise.

Critically, the low clipping bounds and high noise variance that render our attack ineffective also greatly decrease the accuracy of the global model on its main task (dashed line in Fig. 12). Because our attack increases the distance of the backdoored model to the global model, it is more sensitive to clipping than to noise addition. The attack still achieves 25% backdoor accuracy even with 0.1 noise.

In summary, participant-level differential privacy can reduce the effectiveness of our attack, but only at the cost of degrading the model’s performance on its main task.

Fig. 12: Influence of Gaussian noise and weight clipping. (a): impact of clipping with noise (b): impact of noise with clipping bound ; (c) and (d): backdoor accuracy when 5% of participants are malicious.

Vii Conclusions and Future Work

Federated learning is uniquely vulnerable to attacks that introduce hidden backdoor functionality into the global, jointly learned model. Via model averaging, federated learning enables thousands or even millions of participants, some of whom will inevitably be malicious, to have direct influence over the weights of the joint model. Federated learning is designed to take advantage of the participants’ non-i.i.d. local training data while keeping these data private. This produces a wide distribution of participants’ models and renders anomaly detection ineffective. “Secure” aggregation makes the problem even worse because it prevents the aggregator from auditing the participants’ submissions entirely.

We developed a new model-poisoning methodology that exploits these vulnerabilities and demonstrated its efficacy on several standard federated-learning tasks, such as image classification and word prediction.

Another factor that contributes to the success of our attacks is that modern deep learning models have vastly more capacity than they need to perform well on their tasks. Conventional metrics of model quality measure only how well the model has learned its main task, but not what else it has learned. This extra capacity can be used to memorize random information [47] or abused to leak training data [43] or, as we show in this paper, to introduce covert backdoor functionality without significant impact on the model’s accuracy.

Federated learning is not just a distributed version of standard machine learning. It is a distributed system and therefore must be robust in the presence of arbitrarily misbehaving participants. Unfortunately, existing techniques for Byzantine-tolerant distributed learning fail when the participants’ training data are not i.i.d., which is exactly the motivating scenario for federated learning. How to design robust federated learning systems is an important topic for future research.

Acknowledgments.  This research was supported in part by a gift from Schmidt Sciences and NSF grant 1700832.