## 1 Introduction

Modern machine learning systems can be vulnerable to various kinds of failures, such as bugs in preprocessing pipelines and noisy training labels, as well as attacks that target each step of the system’s training and deployment pipelines. Examples of attacks include data and model update poisoning (Biggio:2012:PAA:3042573.3042761; DBLP:conf/ndss/LiuMALZW018), model evasion (DBLP:journals/corr/SzegedyZSBEGF13; Biggio:2012:PAA:3042573.3042761; DBLP:journals/corr/GoodfellowSS14), model stealing (DBLP:conf/uss/TramerZJRR16), and data inference attacks on users’ private training data (DBLP:conf/sp/ShokriSSS17). The distributed nature of federated learning (mcmahan17fedavg), particularly when augmented with secure aggregation protocols (bonawitz17secagg), makes detecting and correcting for these failures and attacks a particularly challenging task.

Adversarial attacks can be broadly classified into two types based on the goal of the attack: untargeted and targeted. Under untargeted attacks (NIPS2017_6617; pmlr-v80-mhamdi18a; damaskinos2018asynchronous), the goal of the adversary is to corrupt the model so that it does not achieve near-optimal performance on the main task at hand (e.g., classification), often referred to as the primary task. Under targeted attacks (often referred to as backdoor attacks) (chen2017targeted; liao2018backdoor; gu2019badnets), the goal of the adversary is to ensure that the learned model behaves differently on certain targeted sub-tasks while maintaining good overall performance on the primary task. For example, in image classification, the attacker may want the model to misclassify some “green cars” as birds while ensuring that other cars are correctly classified.

For both targeted and untargeted attacks, the attacks can be further classified into two types based on the capability of the attacker: *model update poisoning* and *data poisoning*. In data poisoning attacks (Biggio:2012:PAA:3042573.3042761; steinhardt2017certified; Xiao:2015:SVM:2779626.2779777; Mei:2015:UMT:2886521.2886721; huber1997robustness), the attacker can change a subset of the training samples, and this subset is unknown to the learner. In federated learning systems, since training is done on local devices, fully compromised clients can change the model update completely, which is called a model update poisoning attack (bagdasaryan2018backdoor; pmlr-v97-bhagoji19a). Model update poisoning attacks are even harder to counter when secure aggregation (SecAgg) (bonawitz17secagg), which ensures that the server cannot inspect each user’s update, is deployed in the aggregation of local updates.

Since untargeted attacks reduce the overall performance on the primary task, they are easier to detect. On the other hand, targeted backdoor attacks are harder to detect, as the goal of the adversary is often unknown a priori. Hence, following (bagdasaryan2018backdoor; pmlr-v97-bhagoji19a), we consider targeted model update poisoning attacks and refer to them as backdoor attacks. Existing approaches against backdoor attacks (steinhardt2017certified; liu2018fine; tran2018spectral; pmlr-v97-diakonikolas19a; wang2019neural; pmlr-v97-shen19e) either require a careful examination of the training data or full control of the training process at the server, which may not apply in the federated learning case. We evaluate various attacks proposed in recent papers, along with defenses, on a medium-scale federated learning task with more realistic parameters using TensorFlow Federated (web:TFF). Our goal, in open-sourcing our code, is to encourage researchers to evaluate new attacks and defenses on standard tasks.

## 2 Backdoor Attack Scenario

We use the notation and definitions of federated learning from (mcmahan17fedavg). In particular, let $K$ be the total number of users. At each round $t$, the server randomly selects $C \cdot K$ clients for some $C < 1$. (While (mcmahan17fedavg) considers relatively small problems, in more realistic scenarios for mobile devices $K$ may be much larger, with the number of clients selected per round typically constant, say 100 to 1000.) Let $S_t$ be this set and $n_k$ be the number of samples at client $k$. Denote the model parameters at round $t$ by $w_t$. Each selected user computes a model update, denoted by $\Delta w_t^k$, based on their local data. The server updates its model by aggregating the $\Delta w_t^k$’s, i.e.,

$$w_{t+1} = w_t + \eta \, \frac{\sum_{k \in S_t} n_k \, \Delta w_t^k}{\sum_{k \in S_t} n_k},$$

where $\eta$ is the server learning rate. We model the parameters of backdoor attacks as follows.

Sampling of adversaries. If an $\epsilon$ fraction of the clients are completely compromised, then each round may contain anywhere between $0$ and $\min(\epsilon \cdot K, C \cdot K)$ adversaries. Under random sampling of clients, the number of adversaries in each round follows a hypergeometric distribution. We refer to this attack model as the *random sampling* attack. Another model we consider in this work is the *fixed frequency* attack, where a single adversary appears once every $f$ rounds (bagdasaryan2018backdoor; pmlr-v97-bhagoji19a). For a fair comparison between the two attack models, we set $f$ to be inversely proportional to the total number of attackers (i.e., $f = 1/(\epsilon \cdot C \cdot K)$), so that both models have the same expected number of adversaries per round.
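The two adversary-sampling models can be compared with a small simulation. The sketch below is illustrative only: the function names, the convention that the lowest-numbered clients are the compromised ones, and the rounding of the attack period to an integer are assumptions made for this example. The period is chosen so that both models give the same expected number of adversaries per round.

```python
import random
from typing import List


def random_sampling_counts(num_clients: int, num_selected: int,
                           frac_adv: float, num_rounds: int,
                           seed: int = 0) -> List[int]:
    """Number of adversaries selected in each round under uniform sampling.

    Clients are sampled without replacement, so each per-round count
    follows a hypergeometric distribution.
    """
    rng = random.Random(seed)
    # Assumption for this sketch: clients 0..num_adv-1 are compromised.
    num_adv = round(frac_adv * num_clients)
    counts = []
    for _ in range(num_rounds):
        selected = rng.sample(range(num_clients), num_selected)
        counts.append(sum(1 for c in selected if c < num_adv))
    return counts


def fixed_frequency_counts(num_selected: int, frac_adv: float,
                           num_rounds: int) -> List[int]:
    """One adversary every `period` rounds, period = 1 / (frac_adv * num_selected)."""
    # Rounded to an integer and floored at 1 (at most one adversary per round).
    period = max(1, round(1.0 / (frac_adv * num_selected)))
    return [1 if t % period == 0 else 0 for t in range(num_rounds)]
```

For instance, with a hypothetical population of 3000 clients, 30 selected per round, and one client in 300 compromised, `fixed_frequency_counts(30, 1 / 300, 100)` places one adversary in every tenth round.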

Backdoor tasks. Recall that in backdoor attacks, the goal of the adversary is to ensure that the model fails on some targeted tasks. For example, in text classification the backdoor task might be to suggest a particular restaurant’s name after observing the phrase “my favorite restaurant is”. Unlike (bagdasaryan2018backdoor; pmlr-v97-bhagoji19a), we allow non-malicious clients to have correctly labeled samples from the targeted backdoor tasks. For instance, if the adversary wants the model to misclassify some green cars as birds, we allow non-malicious clients to have samples from these targeted green cars correctly labeled as cars.

Further, we form the backdoor task by grouping examples from multiple selected “target clients”. Since examples from different target clients follow different distributions, we refer to the number of target clients as the “number of backdoor tasks” and explore its effect on the attack’s success rate. Intuitively, the more backdoor tasks we have, the richer the feature space the attacker is trying to break, and therefore the harder it is for the attacker to successfully backdoor the model without breaking its performance on the main task.

## 3 Model Update Poisoning Attacks

We focus on model update poisoning attacks based on the model replacement paradigm proposed by (bagdasaryan2018backdoor; pmlr-v97-bhagoji19a). When only one attacker is selected in round $t$ (WLOG assume it is client 1), the attacker attempts to replace the whole model by a backdoored model $w^*$ by sending

$$\Delta w_t^1 = \beta \, (w^* - w_t), \tag{1}$$

where $\beta = \frac{\sum_{k \in S_t} n_k}{\eta \, n_1}$ is a boost factor. Then we have

$$w_{t+1} = w^* + \eta \, \frac{\sum_{k \in S_t, \, k \neq 1} n_k \, \Delta w_t^k}{\sum_{k \in S_t} n_k},$$

which will be in a small neighbourhood of $w^*$ if we assume the model has sufficiently converged and hence the other updates $\Delta w_t^k$ for $k \neq 1$ are small. If multiple attackers appear in the same round, we assume that they can coordinate with each other and divide this update evenly.
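The model replacement argument can be checked numerically. The following is a minimal sketch, assuming a sample-weighted average of client updates scaled by a server learning rate, and a boost factor equal to the total sample count of the selected clients divided by the product of the server learning rate and the attacker's sample count. The function and argument names are hypothetical, and models are represented as plain lists of floats.

```python
from typing import Dict, List

Vector = List[float]


def aggregate(w_t: Vector, updates: Dict[int, Vector],
              num_samples: Dict[int, int], server_lr: float) -> Vector:
    """Add the sample-weighted mean of the client updates, scaled by server_lr."""
    total = sum(num_samples[k] for k in updates)
    new_w = []
    for i, w_i in enumerate(w_t):
        weighted = sum(num_samples[k] * updates[k][i] for k in updates)
        new_w.append(w_i + server_lr * weighted / total)
    return new_w


def boosted_update(w_t: Vector, w_backdoor: Vector,
                   num_samples: Dict[int, int], attacker: int,
                   server_lr: float) -> Vector:
    """Update that moves the aggregate onto w_backdoor if all other updates are zero.

    `num_samples` must contain exactly the clients selected in this round.
    """
    beta = sum(num_samples.values()) / (server_lr * num_samples[attacker])
    return [beta * (wb - w) for wb, w in zip(w_backdoor, w_t)]
```

When the benign updates are exactly zero, `aggregate` returns the backdoored model. With small nonzero benign updates, the result deviates from it by their weighted average times the server learning rate.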

Obtaining a backdoored model. To obtain a backdoored model $w^*$, we assume that the attacker has a set $D_{\text{mal}}$ which describes the backdoor task, for example, different kinds of green cars labeled as birds. We also assume the attacker has a set of training samples $D_{\text{trn}}$ generated from the true distribution. Note that for practical applications, such data may be harder for the attacker to obtain.

Unconstrained boosted backdoor attack. In this case, the adversary trains a model $w^*$ based on $w_t$, $D_{\text{mal}}$, and $D_{\text{trn}}$ without any constraints and sends the update based on (1) back to the service provider. One popular training strategy is to initialize with $w_t$ and train the model with $D_{\text{trn}} \cup D_{\text{mal}}$ for a few epochs. This attack generally results in a large update norm and can serve as a baseline.

Norm bounded backdoor attack. Unconstrained backdoor attacks can be defended against by norm clipping, as discussed below. To overcome this, we consider the norm bounded backdoor attack. Here, at each round, the model trains on the backdoor task subject to the constraint that the norm of the model update is smaller than $M/\beta$ for a norm bound $M$. Thus, the model update has norm bounded by $M$ after being boosted by a factor of $\beta$. This can be done by training the model using multiple rounds of projected gradient descent, where in each round we train the model using the unconstrained training strategy and project it back to the $\ell_2$ ball of radius $M/\beta$ around $w_t$.
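The projection step of the norm bounded attack can be sketched as follows. This is an illustrative example rather than the exact training procedure used in the experiments: it assumes an $\ell_2$ norm, takes the unconstrained training step as a caller-supplied function (`train_step`, a hypothetical name), and represents models as lists of floats.

```python
import math
from typing import Callable, List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def project_to_ball(w: Vector, center: Vector, radius: float) -> Vector:
    """Project w onto the L2 ball of the given radius around center."""
    delta = [a - c for a, c in zip(w, center)]
    norm = l2_norm(delta)
    if norm <= radius:
        return list(w)
    scale = radius / norm
    return [c + scale * d for c, d in zip(center, delta)]


def norm_bounded_attack(w_t: Vector, train_step: Callable[[Vector], Vector],
                        norm_bound: float, boost: float,
                        num_rounds: int) -> Vector:
    """Alternate unconstrained training with projection around w_t.

    The pre-boost radius is norm_bound / boost, so the update stays within
    norm_bound after it is multiplied by the boost factor.
    """
    radius = norm_bound / boost
    w = list(w_t)
    for _ in range(num_rounds):
        w = train_step(w)                    # unconstrained training step
        w = project_to_ball(w, w_t, radius)  # enforce the pre-boost bound
    return w
```

The boosted update sent to the server would then be `boost * (w - w_t)` coordinate-wise, whose norm is at most `norm_bound` by construction.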

## 4 Defenses

We consider the following defenses for backdoor attacks.

Norm thresholding of updates. Since boosted attacks are likely to produce updates with large norms, a reasonable defense is for the server to simply ignore updates whose norm is above some threshold $M$; in more complex schemes $M$ could even be chosen in a randomized fashion. However, in the spirit of investigating what a strong adversary might accomplish, we assume the adversary knows the threshold $M$, and can hence always return malicious updates within this magnitude. Giving this strong advantage to the adversary makes the norm-bounding defense equivalent to the following norm-clipping approach, applied to each update before aggregation:

$$\Delta w_t^k \leftarrow \frac{\Delta w_t^k}{\max\left(1, \|\Delta w_t^k\|_2 / M\right)}.$$

This clipping ensures that the norm of each model update is at most $M$, and hence the aggregated model is less susceptible to boosted malicious updates.
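Norm clipping itself takes only a few lines. The sketch below assumes the $\ell_2$ norm and a strictly positive threshold; `clip_update` and `max_norm` are hypothetical names chosen for this example.

```python
import math
from typing import List


def clip_update(update: List[float], max_norm: float) -> List[float]:
    """Scale the update down so that its L2 norm is at most max_norm.

    Updates already within the bound are returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in update))
    factor = max(1.0, norm / max_norm)
    return [x / factor for x in update]
```

Because the direction of the update is preserved, a benign update whose norm is already below the threshold is unaffected, while a boosted update is shrunk back to the threshold.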

(Weak) differential privacy. A mathematically rigorous way of defending against backdoor attacks is to train models with differential privacy (ma2019data; Dwork06; abadi2016deep). These approaches were extended to the federated setting by (mcmahan18dplm), by first clipping updates (as above) and then adding Gaussian noise. We explore the effect of this method. However, the amount of noise traditionally added to obtain reasonable differential privacy is relatively large. Since our goal is not privacy, but instead preventing attacks, we add a small amount of noise that is empirically sufficient to limit the success of attacks.
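A minimal sketch of the clip-then-noise defense follows. Several details are assumptions made for illustration: the clipped updates are averaged uniformly rather than weighted by sample count, the noise is added once to the aggregate rather than to each client update, and all names are hypothetical.

```python
import math
import random
from typing import List, Optional


def clip(update: List[float], max_norm: float) -> List[float]:
    """Scale the update so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    factor = max(1.0, norm / max_norm)
    return [x / factor for x in update]


def noisy_clipped_mean(updates: List[List[float]], max_norm: float,
                       noise_variance: float,
                       rng: Optional[random.Random] = None) -> List[float]:
    """Uniform mean of clipped updates plus independent Gaussian noise per coordinate."""
    rng = rng or random.Random()
    clipped = [clip(u, max_norm) for u in updates]
    dim = len(clipped[0])
    mean = [sum(u[i] for u in clipped) / len(clipped) for i in range(dim)]
    sigma = math.sqrt(noise_variance)  # gauss() expects a standard deviation
    return [m + rng.gauss(0.0, sigma) for m in mean]
```

Setting `noise_variance` to zero recovers plain norm clipping, which makes it easy to compare the two defenses under the same norm bound.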

## 5 Experiments

In the above backdoor attack framework, we conduct experiments on the EMNIST dataset (cohen2017emnist; caldas2018leaf). This dataset is a writer-annotated handwritten digit classification dataset collected from users with roughly images of digits per user. Each of them has their unique writing style. We train a five-layer convolutional neural network with two convolution layers, one max-pooling layer and two dense layers using federated learning in the TensorFlow Federated framework (web:TFF). At each round of training, we select clients. Each client trains the model with their own local data for 5 epochs with batch size 20 and client learning rate 0.1. We use a server learning rate of .

In the experiment, we consider the backdoor task as classifying 7s from multiple selected “target clients” as 1s. Note that our attack approach does not require 7s from other clients to be classified as 1s. Since 7s coming from different target clients follow different distributions (because they have different writing styles), we refer to the number of target clients as the “number of backdoor tasks”.

Random sampling vs. fixed frequency attacks. To begin with, we conduct experiments for the two attack models discussed in Section 2 under different fractions of adversaries. The results are shown in Figure 1 (for the unconstrained attack) and Figure 2 (for the norm bounded attack). Additional plots are shown in Figure 5 and Figure 6 in the appendix. The figures show that both attack models behave similarly, although fixed frequency attacks are slightly more effective than random sampling attacks. Furthermore, in the fixed frequency attack, it is easier to see whether the attack happened in a particular round. Hence, to give the attacker an additional advantage and for ease of interpretation, we focus our analysis on fixed frequency attacks in the rest of this section.

Fraction of corrupted users. In Figure 1 and Figure 2 (also Figure 5 and Figure 6 in the appendix), we consider a malicious task with 30 backdoor tasks (around 300 images). We perform unconstrained attacks and norm-bounded attacks with fraction of users being malicious. Both fixed-frequency attacks (left column) and random sampling attacks (right column) are considered. For the fixed-frequency attack, this corresponds to an attacking frequency of 1 (attacking every round) and 1/10 (once every ten rounds). From the above experiment, we can infer that the success of the backdoor attack largely depends on the fraction of adversaries, and the performance of the backdoor attack degrades as the fraction of fully compromised users falls below .

Number of backdoor tasks. The number of backdoor tasks affects the performance in two ways. First, the more backdoor tasks we have, the harder it is to backdoor a fixed-capacity model while maintaining its performance on the main task. Second, since we assume benign users have correct samples from the backdoor task, they can correct the attacked model with these samples. In Figure 3, we consider norm bounded attack with norm bound 10 and 10, 20, 30, 50 backdoor tasks. We can see from the plot that the more backdoor tasks we have, the harder it is to fit a malicious model.

Norm bound for the update. In Figure 4(a), we consider norm bounded updates from each user. We assume one attacker appears in every round, which corresponds to corrupted users, and we consider norm bounds of 3, 5, and 10 (the 90th percentile of the norms of benign users’ updates is below 2 for most of the rounds), which translates to norm bound for the update before boosting. We can see from the plot that selecting 3 as the norm bound successfully mitigates the attack with almost no effect on the performance of the main task. Hence, norm bounding may be a valid defense against current backdoor attacks.

Weak differential privacy. In Figure 4(b), we consider norm bounding plus adding Gaussian noise. We use a norm bound of 5, which by itself would not mitigate the attack, and add independent Gaussian noise with variance 0.025 to each coordinate. From the plots, we can see that adding Gaussian noise can also help mitigate the attack beyond norm clipping without hurting the overall performance much. We note that, similar to previous works on differential privacy (abadi2016deep), we do not provide a recipe for selecting the norm bound and variance of the Gaussian noise. Rather, we show that some reasonable values motivated by the differential privacy literature perform well. Discovering algorithms to learn these bounds and noise values remains an interesting open research direction.

## 6 Discussion

We studied backdoor attacks and defenses for federated learning on the more realistic EMNIST dataset. In the absence of any defense, we showed that the performance of the adversary largely depends on the fraction of adversaries present. Hence, for reasonable success, there needs to be a large number of adversaries. Perhaps surprisingly, norm clipping limits the success of known backdoor attacks considerably. Furthermore, adding a small amount of Gaussian noise, in addition to norm clipping, can help further mitigate the effect of adversaries. This gives rise to several interesting questions.

Better attacks and defenses. In the norm bounded case, multiple iterations of “pre-boosted” projected gradient descent may not be the best possible attack in a single round. In fact, the adversary may attempt to directly craft the “worst-case” model update that satisfies the norm bound (without any boosting). Moreover, if the attacker knows they can attack in multiple rounds, there might be better strategies for doing so under a norm bound. Similarly, more advanced defenses should be investigated.

Effect of model capacity. Another factor that may affect the performance of backdoor attacks is the model capacity, especially since it is conjectured that backdoor attacks use the spare capacity of the deep network (liu2018fine). How model capacity interacts with backdoor attacks is an interesting question from both the theoretical and practical sides.

Interaction of defenses with SecAgg. The defenses above require that each client’s update satisfy a norm bound, which the server cannot check directly when SecAgg hides individual updates. Existing approaches on range proofs (e.g., Bulletproofs (bunz2018bulletproofs)) can guarantee this when using secure multiparty computation, but how to implement them in a computationally and communication-efficient way is still an active research direction. The defenses can also be made compatible with SecAgg given an efficient implementation of multi-party range proofs.