DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation
Distributed learning is central for large-scale training of deep-learning models. However, such systems are exposed to a security threat in which Byzantine participants can interrupt or control the learning process. Previous attack models and their corresponding defenses assume that the rogue participants are (a) omniscient (know the data of all other participants), and (b) introduce large changes to the parameters. We show that small but well-crafted changes are sufficient, leading to a novel non-omniscient attack on distributed learning that goes undetected by all existing defenses. We demonstrate that our attack method works not only for preventing convergence but also for repurposing the model behavior (backdooring). We show that 20% of corrupt workers are sufficient to degrade a CIFAR10 model's accuracy by 50%, as well as to introduce backdoors into MNIST and CIFAR10 models without hurting their accuracy.
Distributed Learning has become a widespread framework for large-scale model training (Dean et al., 2012; Li et al., 2014a, b; Baydin et al., 2017; Zhang et al., 2017; Agarwal et al., 2010; Recht et al., 2011), in which a server leverages the compute power of many devices by aggregating local models trained on each of the devices.
A popular class of distributed learning algorithms is Synchronous Stochastic Gradient Descent (sync-SGD), using a single server (called the Parameter Server, PS) and $n$ workers, also called nodes (Li et al., 2014a, b). In each round, each worker trains a local model on its own device with a different chunk of the dataset, and shares the final parameters with the PS. The PS then aggregates the parameters of the different workers and starts another round by sharing the resulting combined parameters with the workers. The structure of the network (number of layers, types, sizes etc.) is agreed between all workers beforehand.
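The round structure described above can be sketched in a few lines. This is only an illustrative toy, not the paper's Algorithm 1: the function name `sync_sgd_round`, the single-gradient-step "local training", and the quadratic toy losses are all assumptions made for the example.

```python
import numpy as np

def sync_sgd_round(global_params, worker_grad_fns, lr=0.1):
    """One synchronous round: every worker updates locally, the PS averages."""
    # Each worker starts from the same global parameters and returns its own
    # locally trained parameters (a single gradient step here, for brevity).
    local_params = [global_params - lr * grad_fn(global_params)
                    for grad_fn in worker_grad_fns]
    # The parameter server aggregates by plain averaging.
    return np.mean(local_params, axis=0)

# Toy usage: three workers, each minimising ||p - target_i||^2 on its own chunk.
targets = [np.array([1.0, 2.0]), np.array([3.0, 2.0]), np.array([2.0, 5.0])]
grad_fns = [lambda p, t=t: 2.0 * (p - t) for t in targets]

params = np.zeros(2)
for _ in range(50):
    params = sync_sgd_round(params, grad_fns)
```

In this toy setting the averaged parameters approach the mean of the workers' targets, which is the behaviour the defenses discussed later try to preserve when some workers misreport.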
While effective in a sterile environment, a major risk emerges regarding the correctness of the learned model when facing even a single Byzantine worker (Blanchard et al., 2017). Such participants do not rigorously follow the protocol, either innocently, for example due to faulty communication, numerical error or crashed devices, or adversarially, in which case the Byzantine output is crafted to maximize its effect on the network.
We consider malicious Byzantine workers, where an attacker controls either the devices themselves, or even only the communication between the participants and the PS, for example by a man-in-the-middle attack. Both attacks and defenses have been explored in the literature (Blanchard et al., 2017; Xie et al., 2018; Yin et al., 2018; El Mhamdi et al., 2018; Shen et al., 2016).
At the very heart of distributed learning lies the assumption that the parameters of the trained network across the workers are independent and identically distributed (i.i.d.) (Chen et al., 2017b; Blanchard et al., 2017; Yin et al., 2018). This assumption allows the averaging of different models to yield a good estimator for the desired parameters, and is also the basis for the different defense mechanisms, which try to recover the original mean after clearing away the Byzantine values. Existing defenses claim to be resilient even when the attacker is omniscient (Blanchard et al., 2017; El Mhamdi et al., 2018; Xie et al., 2018) and can observe the data of all the workers. Lastly, all existing attacks and defenses (Blanchard et al., 2017; El Mhamdi et al., 2018; Xie et al., 2018; Yin et al., 2018) work under the assumption that achieving a malicious objective requires large changes to one or more parameters. This assumption is supported by the fact that SGD converges better with a little random noise (Neelakantan et al., 2016; Shirish Keskar et al., 2017; Kleinberg et al., 2018).
We show that this assumption is incorrect: directed small changes to many parameters of few workers are capable of defeating all existing defenses and interfering with or gaining control over the training process. Moreover, while most previous attacks focused on preventing the convergence of the training process, we demonstrate a wider range of attacks and support also introducing backdoors to the resulting model, which are samples that will produce the attacker’s desired output, regardless of their true label. Lastly, by exploiting the i.i.d assumption we introduce a non-omniscient attack in which the attacker only has access to the data of the corrupted workers.
We present a new approach for attacking distributed learning with the following properties:
We overcome all existing defense mechanisms.
We compute a perturbation range in which the attacker can change the parameters without being detected even in i.i.d. settings.
Changes within this range are sufficient for both interfering with the learning process and for backdooring the system.
We propose the first non-omniscient attack applicable for distributed learning, making the attack stronger and more practical.
Convergence prevention. This is the attack on which most of the existing literature on attacks and defenses for distributed learning focuses (Blanchard et al., 2017; El Mhamdi et al., 2018; Xie et al., 2018). In this case, the attacker interferes with the process with the mere desire of obstructing the server from reaching good accuracy. This type of attack is not very interesting because the attacker does not gain any future benefit from the intervention. Furthermore, the server is aware of the attack and, in a real-world scenario, is likely to take actions to mitigate it, for example by actively blocking subsets of the workers and observing the effect on the training process.
Backdooring is an attack in which the attacker manipulates the model at training time so that it will produce the attacker-chosen target at inference time. The backdoor can be either a single sample, e.g. falsely classifying a specific person as another, or it can be a class of samples, e.g. setting a specific pattern of pixels in an image will cause it to be classified maliciously.
An illustration of those objectives is given in Figure 1.
Distributed training uses the Synchronous SGD protocol, presented in Algorithm 1.
The attacker interferes with the process at the time that maximizes its effect, that is, between lines 6 and 7 in Algorithm 1. During this time, the attacker can take the corrupted workers' parameters and replace them with whatever values it desires to send to the server. Attack methods differ in the way in which they set the parameter values, and defense methods attempt to identify corrupted parameters and discard them.
Bagdasaryan et al. (2018) demonstrated a backdooring attack on federated learning by making the attacker optimize for a model with the backdoor while adding a term to the loss that keeps the new parameters close to the original ones. Their attack has the benefits of requiring only a few corrupted workers, as well as being non-omniscient. However, it does not work for distributed training: in federated learning each worker uses its own private data, coming from a different distribution, negating the i.i.d. assumption (McMahan et al., 2016; Konečnỳ et al., 2016) and making the attack easier, as it removes the fundamental assumption underlying all existing defenses for distributed learning. Fung et al. (2018) proposed a defense against backdoors in federated learning, but like the attack above it relies heavily on the non-i.i.d. property of the data, which does not hold for distributed training.
A few defenses aimed at detecting backdoors were proposed (Steinhardt et al., 2017; Qiao and Valiant, 2017; Chen et al., 2018; Tran et al., 2018), but those defenses assume single-server training in which the backdoor is injected into a training set to which the server has access, so that the backdoors can be found and removed by clustering or other techniques. In contrast, in our settings, the server has no control over the samples which the workers adversarially decide to train with, rendering those defenses inoperable. Finally, Shen et al. (2016) demonstrate a method for circumventing backdooring attacks on distributed training. As discussed below, the method is a variant of the Trimmed Mean defense, which we successfully evade.
All existing defenses work on each round separately, so for the sake of readability we will drop the round notation ($t$). For the rest of the paper we will use the following notation: $n$ is the total number of workers, $m$ is the number of corrupted workers, and $d$ is the number of dimensions (parameters) of the model.
$p_i$ is the vector of parameters trained by worker $i$, $(p_i)_j$ is its $j$th dimension, and $\mathcal{P}$ is $\{p_i : i \in [n]\}$.
The state-of-the-art defense for distributed learning is Bulyan. Bulyan utilizes a combination of two earlier methods - Krum and Trimmed Mean, to be explained first.
This family of defenses, called Mean-Around-Median (Xie et al., 2018) or Trimmed Mean (Yin et al., 2018), changes the aggregation rule of Algorithm 1 to a trimmed average, handling each dimension separately:
$$\mathrm{TrimmedMean}(\mathcal{P}) = \Big\{ v_j = \frac{1}{|U_j|} \sum_{i \in U_j} (p_i)_j \;:\; j \in [d] \Big\}$$
Three variants exist, differing in the definition of $U_j$.
$U_j$ is the indices of the top-$(n-m)$ values in $\{(p_1)_j, \ldots, (p_n)_j\}$ nearest to the median (Xie et al., 2018).
Same as the first variant, only taking the top-$(n-2m)$ values (El Mhamdi et al., 2018).
$U_j$ is the indices of the elements in the same vector where the largest and smallest $m$ elements are removed, regardless of their distance from the median (Yin et al., 2018).
A defense method of Shen et al. (2016) clusters each parameter into two clusters using 1-dimensional k-means, and if the distance between the clusters' centers exceeds a threshold, the values composing the smaller cluster are discarded. This can be seen as a variant of the Trimmed Mean defense, because only the values of the larger cluster, which must include the median, will be averaged while the rest of the values will be discarded.
All variants are designed to defend against up to $\lceil n/2 \rceil - 1$ corrupted workers, as these defenses depend on the assumption that the median is taken from the range of benign values.
The circumvention analysis and experiments are similar for all variants upon facing our attack, so we will consider only the second variant which is used in Bulyan below.
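As a rough illustration of the per-dimension trimming described above, the following NumPy sketch implements the "nearest to the median" variants and the "drop the extremes" variant. The function names and the `keep` argument (the number of values retained per dimension) are assumptions made for this example and do not come from any of the cited implementations.

```python
import numpy as np

def trimmed_mean_around_median(params, keep):
    """Average, per dimension, the `keep` values nearest to the median.

    params: array of shape (n, d), one row of parameters per worker.
    """
    median = np.median(params, axis=0)                    # shape (d,)
    dist = np.abs(params - median)                        # shape (n, d)
    nearest = np.argsort(dist, axis=0)[:keep]             # shape (keep, d)
    kept = np.take_along_axis(params, nearest, axis=0)    # shape (keep, d)
    return kept.mean(axis=0)

def trimmed_mean_drop_extremes(params, m):
    """Drop the m largest and m smallest values per dimension, then average."""
    ordered = np.sort(params, axis=0)
    return ordered[m:params.shape[0] - m].mean(axis=0)

# Toy usage: five workers, one dimension, one obvious outlier.
reports = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
around = trimmed_mean_around_median(reports, keep=3)
dropped = trimmed_mean_drop_extremes(reports, m=1)
```

Both variants discard the outlier at 100 in this toy case. A value placed close to the median would survive the trimming, which is the property the attack analysed later relies on.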
Suggested by Blanchard et al. (2017), Krum strives to find a single honest participant that is probably a good choice for the next round, discarding the data from the rest of the workers. The chosen worker is the one whose parameters are closest to those of another $n-m-2$ workers, mathematically expressed by:
$$\mathrm{Krum}(\mathcal{P}) = p_{i^*}, \qquad i^* = \arg\min_{i \in [n]} \sum_{p \in N_i} \| p_i - p \|^2$$
where $N_i$ is the set of the $n-m-2$ nearest neighbors to $p_i$ in $\mathcal{P}$, measured by Euclidean distance.
Like TrimmedMean, Krum is designed to defend against up to $\lceil n/2 \rceil - 1$ corrupted workers ($m$). The intuition behind this method is that, under a normal distribution, the vector with average parameters in each dimension will be the closest to all the parameter vectors drawn from the same distribution. By considering only the distance to the closest $n-m-2$ workers, sets of parameters that differ significantly from the average vector are treated as outliers and ignored. The malicious parameters, assumed to be far from the original parameters, will suffer from the high distance to at least one non-corrupted worker, which is expected to prevent them from being selected.
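The selection rule described above can be sketched as follows. This is a minimal NumPy illustration under the stated $n-m-2$ neighbor rule, not the authors' implementation; the function name `krum` is chosen for the example, and the dense pairwise-distance tensor is only practical for small toy inputs.

```python
import numpy as np

def krum(params, m):
    """Return the worker vector whose n-m-2 nearest neighbours are closest.

    params: array of shape (n, d), one row of parameters per worker.
    m: assumed number of corrupted workers.
    """
    n = params.shape[0]
    k = n - m - 2
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = params[:, None, :] - params[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)          # a worker is not its own neighbour
    # Score of each worker: sum of distances to its k nearest neighbours.
    scores = np.sort(sq_dist, axis=1)[:, :k].sum(axis=1)
    return params[np.argmin(scores)]

# Toy usage: four workers agree roughly on 1.0, one reports 10.0.
reports = np.array([[0.0], [1.0], [1.1], [0.9], [10.0]])
chosen = krum(reports, m=1)
```

In this toy example the far-away report is never chosen, while a vector sitting near the centre of the benign reports wins, which hints at why a malicious vector crafted to be close to the mean can be selected.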
While Krum was proven to converge, El Mhamdi et al. (2018) already undermine the proof that Krum is $(\alpha, f)$-Byzantine resilient (a term coined by Krum's authors), by showing that convergence alone should not be the target, because the parameters may converge to an ineffectual model. Secondly, as also noted by El Mhamdi et al. (2018), due to the high dimensionality of the parameters, a malicious attacker can introduce a large change to a single parameter without a considerable impact on the $L^2$ norm (Euclidean distance), making the model ineffective.
El Mhamdi et al. (2018), who suggested the above-mentioned attack on Krum, proposed a new defense that successfully opposes such an attack. They present a “meta”-aggregation rule, where another aggregation rule $\mathcal{A}$ is used as part of it. In the first part, Bulyan uses $\mathcal{A}$ iteratively to create a SelectionSet of probably benign candidates, and then aggregates this set by the second variant of TrimmedMean. Bulyan combines norm-based methods that were proved to converge with the advantages of methods working on each dimension separately, such as TrimmedMean, overcoming Krum's disadvantage described above because TrimmedMean will not let the single malicious dimension slip.
Unlike previous methods, Bulyan is designed to defend against only up to $(n-3)/4$ corrupted workers. Such a number of corrupted workers ($m$) ensures that the input to each run of $\mathcal{A}$ contains as many workers as $\mathcal{A}$ requires, and that there is also a majority of non-corrupted workers in the input to TrimmedMean.
We will follow the authors of this method and use $\mathcal{A}$ = Krum in the rest of the paper, including our experiments.
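The two-stage structure described above can be sketched as follows, with Krum as the inner rule. This is a simplified illustration, not the reference implementation: the SelectionSet size `theta = n - 2m` and the per-dimension count `beta = theta - 2m` follow our reading of El Mhamdi et al. (2018) and should be treated as assumptions, as should the `max(..., 1)` guard on the neighbor count.

```python
import numpy as np

def krum_index(params, m):
    """Index of the Krum winner among the rows of `params`."""
    n = params.shape[0]
    k = max(n - m - 2, 1)   # simplification: always use at least one neighbour
    sq = np.sum((params[:, None, :] - params[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)
    return int(np.argmin(np.sort(sq, axis=1)[:, :k].sum(axis=1)))

def bulyan(params, m):
    """Stage 1: build a SelectionSet with Krum. Stage 2: trim around the median."""
    n = params.shape[0]
    theta = n - 2 * m       # assumed SelectionSet size
    beta = theta - 2 * m    # assumed number of values averaged per dimension
    remaining = params.copy()
    selected = []
    for _ in range(theta):
        i = krum_index(remaining, m)
        selected.append(remaining[i])
        remaining = np.delete(remaining, i, axis=0)
    selection = np.stack(selected)                          # shape (theta, d)
    median = np.median(selection, axis=0)
    nearest = np.argsort(np.abs(selection - median), axis=0)[:beta]
    return np.take_along_axis(selection, nearest, axis=0).mean(axis=0)

# Toy usage: n = 7 workers, m = 1, one far-away report.
reports = np.array([[1.0], [1.1], [0.9], [1.05], [0.95], [1.0], [50.0]])
aggregate = bulyan(reports, m=1)
```

Note that the corrupted fraction inside the SelectionSet can be roughly double the fraction among all workers, since up to $m$ malicious vectors may be among the $n-2m$ selected.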
In the experiments section we will use the name No Defense for the basic method of averaging the parameters from all the workers, due to its lack of an outlier rejection mechanism.
In the defenses described above, the authors assume that the attacker will choose parameters that are far away from the mean in order to hurt the accuracy of the model, for example by choosing parameters that are in the opposite direction of the gradient. Our attack shows that by consistently applying small changes to many parameters, a malicious opponent can perturb the model's convergence or backdoor the system. In addition, those defenses claim to protect against an attacker that is omniscient, i.e. knows the data of all of the workers. We show that due to the normal distribution of the data, if the attacker controls a representative portion of the workers, it is sufficient to have only the corrupted workers' data in order to estimate the distribution's mean and standard deviation, and manipulate the results accordingly. This observation enables our attack to work also for a non-omniscient attacker, by estimating the properties of the entire population through the corrupted participants alone.
As mentioned above, the research in the field of distributed learning, including all defenses, assumes that the different parameters of all of the workers are i.i.d. and can therefore be described by a normal distribution. We follow this assumption, hence in the rest of the paper the “units” for attacking distributed learning models, which we would like to maximize without being noticed, are standard deviations ($\sigma$).
The outline of this section will go as follows: We will first analyze the range in which changes to the parameters will not be detected by TrimmedMean, and upon choosing the maxima of this range the convergence is averted. Then we will point out a weakness in Krum to be exploited by the attacker, claiming that the same parameters generated for the attack on TrimmedMean can also overcome Krum. This will lead to the conclusion that the same set of parameters will circumvent Bulyan as well because both of its building blocks were deceived. Finally, we show how the attacker can find the set of parameters within the same range that will introduce a backdoor to the system with only a minimal impact on accuracy for the original task.
The fact that the same set of parameters was used against all defenses is a strong advantage for this method: the attack will go unnoticed no matter which defense the server decides to choose, again rendering our attack more practical.
In the Trimmed Mean defense each parameter is calculated independently, so the problem can be considered as a one-dimensional array, where each entry is the value given by a different worker. Clearly, if we set the malicious value too far away from the mean, it will be discarded easily. We thus seek a range in which we can deviate from the mean without being detected. Since the normal distribution is symmetric, the same value will set the lower and upper bounds for the applicable changes around the mean.
What is the maximal change that can be applied by an attacker without being detected? In order to change the value produced by TrimmedMean, the attacker should control the median. This requires a majority of the workers, which can be attained by finding the minimal number $s$ of non-corrupted workers that need to be “seduced”. The attacker will then use the properties of the normal distribution, specifically the cumulative standard normal function $\phi(z)$, and look for a value $z$ such that $s$ non-corrupted workers will probably reside further away from the mean. By setting all corrupted workers to values in the range $(\mu - z\sigma, \mu + z\sigma)$, the attacker guarantees with high probability that those values will be the median, and the many workers reporting the same value will cause it to withstand the averaging around the median in the second part of TrimmedMean.
The exact steps for finding such a range are shown in Algorithm 3 as part of the convergence prevention attack.
The output of Krum’s process is only one chosen worker, and all of its parameters are being used while the other workers are discarded. It is assumed that there exists such a worker for which all of the parameters are close to the desired mean in each dimension. In practice however, where the parameters are in very high dimensional space, even the best worker will have at least a few parameters which will reside far from the mean.
To exploit this shortcoming, one can generate a set of parameters which will differ from the mean of each parameter by only a small amount. Those small changes will decrease the Euclidean Distance calculated by Krum, hence causing the malicious set to be selected. Experimentally, the attack on Trimmed Mean was able to fool Krum as well.
An advantage when attacking Krum rather than Trimmed Mean is that only a few corrupted workers are required for the estimation of $\mu$ and $\sigma$, and only one worker needs to report the malicious parameters, because Krum eventually picks the set of parameters originating from only a single worker.
Since Bulyan is a combination of Krum and TrimmedMean, and since our attack circumvents both, it is reasonable to expect that it will circumvent Bulyan as well.
Nevertheless, Bulyan claims to defend against only up to 25% of corrupted workers, and not 50% like Krum and TrimmedMean. At first glance it seems that the $z^{max}$ derived for $m = 25\%$ might not be sufficient, but it should be noted that the perturbation range calculated above is the possible input to TrimmedMean, for which $m$ can reach up to 50% of the workers in the SelectionSet being aggregated in the second phase of Bulyan. Indeed, our approach is effective also against the Bulyan defense.
With the objective of forestalling convergence, the attacker will use the maximal value that will circumvent the defense. The attack flow is detailed in Algorithm 3.
Example: If the number of malicious workers is 24 out of a total of 50 workers, the attacker needs to “seduce” 2 workers ($s = \lfloor 50/2 + 1 \rfloor - 24 = 2$) in order to have a majority and set the median. $(n-s)/n = 48/50 = 0.96$, and by looking at the z-table for the maximal $z$ for which $\phi(z) < 0.96$ we get $z^{max} = 1.75$. Finally, the attacker will set the value of all the malicious workers to $v = \mu + 1.75\sigma$ for each of the parameters independently, using that parameter's $\mu_j$ and $\sigma_j$. With high probability there will be enough workers with a value higher than $v$, which will set $v$ as the median.
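The arithmetic of this example can be reproduced with the Python standard library. The sketch below is an illustration under the i.i.d. normal assumption of the text, not the paper's Algorithm 3; the function names are chosen for the example, and the exact quantile from `NormalDist.inv_cdf` stands in for the z-table lookup (rounding it down to two decimals gives the table value).

```python
import math
from statistics import NormalDist

import numpy as np

def seduced_workers(n, m):
    """Minimal number of benign workers needed so the attacker holds a majority."""
    return math.floor(n / 2 + 1) - m

def z_max(n, m):
    """Quantile z with phi(z) = (n - s) / n; assumes 1 <= s, i.e. m is a minority."""
    s = seduced_workers(n, m)
    return NormalDist().inv_cdf((n - s) / n)

def malicious_parameters(corrupted_params, n, m):
    """Estimate mu and sigma per dimension from the corrupted workers only,
    then report mu + z_max * sigma (the non-omniscient setting of the text)."""
    mu = corrupted_params.mean(axis=0)
    sigma = corrupted_params.std(axis=0)
    return mu + z_max(n, m) * sigma

s = seduced_workers(50, 24)   # 2 workers to "seduce"
z = z_max(50, 24)             # approximately 1.75
```

Because the estimate uses only the corrupted workers' own parameters, its quality depends on how representative those workers are of the whole population.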
In the experiments section we show that even a minor change of $1\sigma$ can give the attacker control over the process at times.
In section 3.1, we found a range for each parameter in which the attacker can perturb the parameter without being detected, and in order to obstruct the convergence, the attacker maximized the change inside this range. For the backdooring attack, on the other hand, the attacker seeks the set of parameters within this range which will produce the desired label for the backdoor, while minimizing the impact on the functionality for benign inputs. To accomplish that, similar to Bagdasaryan et al. (2018), the attacker will optimize for the model with the backdoor while minimizing the distance from the original parameters. This is achieved through the loss function, weighted by the parameter $\alpha$ as follows:
$$\ell = \alpha\, \ell_{backdoor} + (1 - \alpha)\, \ell_{\Delta}$$
where $\ell_{backdoor}$ is the same as the regular loss but trained on the backdoors with the attacker's targets instead of the real ones, and $\ell_{\Delta}$, detailed below, keeps the new parameters close to the original parameters.
For $\alpha$ too large, the parameters will significantly differ from the original parameters, and will thus be discarded by the defense mechanisms. Hence, the attacker should use the minimal $\alpha$ that successfully introduces the backdoor into the model. Furthermore, the attacker can leverage the knowledge of $\sigma_j$ for each parameter, and instead of using a plain distance directly for $\ell_{\Delta}$, the difference between the parameters can be normalized in order to accelerate the learning:
$$\ell_{\Delta} = \sum_{j=1}^{d} \left( \frac{NewParam_j - OldParam_j}{z^{max}\,\sigma_j} \right)^2$$
If $|NewParam_j - OldParam_j|$ is smaller than $z^{max}\sigma_j$, the new parameter is inside the valid range, so the ratio between them will be less than 1 and squaring it will reduce the value, which implies a lower penalty. On the other hand, if $|NewParam_j - OldParam_j|$ is greater than $z^{max}\sigma_j$, the ratio is greater than 1 and the penalty increases quickly. Some $\sigma_j$ can happen to be very small, so values below a small threshold are clamped in order to avoid division by very small numbers. This attack is detailed in Algorithm 4.
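The normalized penalty and the weighted objective described above might be computed as in the following NumPy sketch. It is an illustration only, not Algorithm 4: the function names are made up for the example, the clamp is applied to the whole denominator, and the clamp threshold `eps` is a hypothetical placeholder since the value used by the authors is not given here.

```python
import numpy as np

def normalized_distance_penalty(new_params, old_params, sigma, z, eps=1e-5):
    """Sum over parameters of ((new - old) / (z * sigma))^2.

    eps is a hypothetical clamp that avoids division by very small numbers.
    """
    scale = np.maximum(z * sigma, eps)
    return float(np.sum(((new_params - old_params) / scale) ** 2))

def backdoor_objective(backdoor_loss, penalty, alpha):
    """Weighted loss: alpha on the backdoor term, (1 - alpha) on the penalty."""
    return alpha * backdoor_loss + (1.0 - alpha) * penalty

# Toy usage: every parameter moved by half of its allowed range z * sigma.
sigma = np.array([1.0, 2.0, 0.5, 4.0])
old = np.zeros(4)
new = old + 0.5 * sigma
penalty = normalized_distance_penalty(new, old, sigma, z=1.0)   # 4 * 0.25
total = backdoor_objective(backdoor_loss=2.0, penalty=penalty, alpha=0.2)
```

A deviation inside the range contributes less than 1 per parameter, while a deviation outside it grows quadratically, matching the behaviour described in the text.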
For our experiments we used PyTorch's (Baydin et al., 2017) built-in distributed package. In this section we describe the attacked models, and examine the impact on the models in the presence of different defenses, for different values of $z$ and different numbers of corrupted workers ($m$).
For both datasets, we follow the model architecture of the paper introducing the state-of-the-art Bulyan defense (El Mhamdi et al., 2018). For MNIST, we use a multi-layer perceptron with 1 hidden layer: a 784-dimensional input (flattened $28 \times 28$ pixel images), a 100-dimensional hidden layer with ReLU activation, and a 10-dimensional softmax output, trained with a cross-entropy objective. With this structure, $d$ equals almost 80k. We trained the model for 150 epochs with batch size = 83. When neither attack nor defense is applied, the model reaches an accuracy of on the test set.
For CIFAR10 we use a 7-layer CNN with the following layers: input of size 3072 (); convolutional layer with kernel size:, a convolutional layer with kernel , 64 maps and 1 stride; max-pooling layer of size ; two fully connected layers of size 384 and 192 respectively; and an output layer of size 10. We use ReLU activation on the hidden layer and softmax on the output, training the network for 400 epochs with a cross-entropy objective. In this setting . The maximal accuracy reached in this model with no corrupted workers is , similar to the result obtained in (El Mhamdi et al., 2018) for the same structure.
In both models we set the learning rate and the momentum to be 0.1 and 0.9 respectively. We added L2 regularization with weight for both models. The training data was split between workers, with corrupted workers.
In Section 3.1 we analyzed the maximal number of $\sigma$ away from $\mu$ that can be applied by our method, $z^{max}$. We showed in the example that when the total number of workers is 50, the value of $z^{max}$ can be set to 1.75, and all the corrupted workers will update each of their parameter values to $v = \mu + 1.75\sigma$. Furthermore, when the total number of workers $n$ is greater than 50, $s$ may still equal 2 like before, but $(n-s)/n$ increases, causing an increase in the value of $z^{max}$ and a larger possible distance from the original mean. This can be explained intuitively: when $n$ increases, the chance of having outliers in the far tails of the normal distribution increases, and those tails are the ones to be seduced. In the following experiments, we tried to change the parameters by up to $1.5\sigma$, to leave room for inaccuracies in the estimation of $\mu$ and $\sigma$.
In order to learn how many standard deviations are required for impacting the network with the convergence attack, we trained the MNIST and CIFAR10 models in distributed learning settings four times, each time changing the parameters by 0 (no change), 0.5, 1 and 1.5 standard deviations. We did it for all the workers ($m = n$), on all parameters, with no defense in the server.
As shown in Table 1, it is sufficient to change the parameters by $1.5\sigma$ or even $1\sigma$ away from the real average to substantially degrade the results. The table shows that degrading the accuracy of CIFAR10 is much simpler than MNIST, which is expected given the difference in the nature of the tasks: MNIST is a much simpler task, so fewer samples are required and the different workers will quickly agree on the correct gradient direction, limiting the change that can be applied. For the harder, more realistic classification task of CIFAR10, the disagreement between the workers will be higher, which can be leveraged by the malicious opponent.
We applied our attack against all defenses, and examined their resilience on both models. Figure 2 presents the accuracy of the MNIST classification model with the different defenses when the parameters were changed by $1.5\sigma$, over $m = 12$ corrupted workers, which is almost 24%. We also plotted the results when no attack is applied so the effect of the attack can clearly be seen. The attack is effective in all scenarios. Krum performed worst, since our malicious set of parameters was selected even with only 24% of corrupted workers. Bulyan was affected more than TrimmedMean, because even though the malicious proportion was 24%, it can reach up to 48% of the SelectionSet, which is the proportion used by TrimmedMean in the second stage of Bulyan. TrimmedMean performed better than the previous two, because the malicious parameters were diluted by the averaging with many parameter sets coming from non-corrupted workers.
Ironically but as expected, the best defense strategy against this attack was the simplest aggregation rule of averaging without outlier rejection, No Defense. This is because the 1.5 standard deviations were averaged across all workers, 76% of which are not corrupted, so the overall shift in each iteration was $1.5\sigma \cdot 0.24 = 0.36\sigma$, which has only a minor impact on the accuracy. It is clear however that the server cannot choose this aggregation rule because of the serious vulnerabilities it introduces. If circumventing No Defense is desired, the attacker can compose a hybrid attack, in which one worker is dedicated to attacking No Defense with attacks detailed in earlier papers (Blanchard et al., 2017; Xie et al., 2018), and the rest are used for the attack proposed here.
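The dilution effect under plain averaging can be checked numerically. In this sketch the function name and the toy benign values are assumptions, and a round count of 38 benign plus 12 corrupted workers is used purely so that the corrupted fraction is exactly 24%.

```python
import numpy as np

def no_defense_shift(benign, z, m):
    """Shift of the plain average, in units of the benign standard deviation,
    when m extra workers all report mu + z * sigma."""
    mu, sigma = benign.mean(), benign.std()
    malicious = np.full(m, mu + z * sigma)
    aggregated = np.concatenate([benign, malicious]).mean()
    return (aggregated - mu) / sigma

# Toy usage: 38 benign values for one parameter, 12 corrupted workers (24%).
benign_values = np.linspace(-1.0, 1.0, 38)
shift = no_defense_shift(benign_values, z=1.5, m=12)   # about 0.36 sigma
```

The shift equals $z$ times the corrupted fraction, so averaging over all workers dilutes the perturbation, whereas a rule that selects or trims around the malicious value does not.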
Experiment results on CIFAR10 are shown in Figure 3. Since fewer standard deviations can cause a significant impact on CIFAR10 (see Table 1), we choose $m = 20\%$ corrupt workers, and change the parameters by only $1\sigma$. Again, the best accuracy was achieved with the simplest aggregation rule, i.e. averaging the workers' parameters, but still the accuracy dropped by 28%. Krum performed worst again for the same reason, with a drop of 66%; Bulyan dropped by 52%, and TrimmedMean performed slightly better but still dropped by 45%.
Figure 4 shows the proportion of corrupted workers required to attack the training of the CIFAR10 model. Since Bulyan is designed to protect against up to 25% malicious workers, we tried to train the model with different values of $m$ up to that proportion, and tested how it affected the accuracy when the attacker changes all the parameters by $1\sigma$. One can see that Krum is sensitive even to a small number of corrupted workers, where the accuracy already drops by 33%. The graph shows that, as expected, as the proportion of corrupted workers grows, the model's accuracy decreases, but even 10% can cause a major degradation with existing defenses other than not defending at all, which is not a realistic option.
As before, we set $n = 51$ and $m = 12$ (24%). As a result of the attacker's desire not to interrupt the convergence for benign inputs, low $\alpha$ and $z^{max}$ (both 0.2) were chosen. After each round the attacker trained the network with the backdoor for 5 rounds. We set $\ell_{\Delta}$ according to Equation 4 and set $\ell_{backdoor}$ to cross entropy, like the one used for the original classification.
For the backdoor sample task, we chose each time one of the first 3 images from each training set (MNIST and CIFAR10) and took their desired backdoored targets to be $(y + 1) \bmod |Y|$, where $y$ is the original label and $|Y|$ is the number of classes.
Results are presented in Table 2. Throughout the process, the network produced the malicious target for the backdoor sample more than 95% of the time, including specifically the rounds where the maximal overall accuracy was achieved. As can be seen, for a simple task such as MNIST where the network has enough capacity, the network succeeded in incorporating the backdoor with less than 1% drop in the overall accuracy. The results are similar across the different defenses because of the low $z^{max}$ being used. For CIFAR10 however, where the convergence is difficult even without the backdoor for the given simple architecture, the impact is more visible and reaches up to 9% degradation.
For the backdoor pattern attack, the attacker randomly samples 1000 images from the datasets on each round, and set their upper-left 5x5 pixels to the maximal intensity (See Figure 1 for examples). All those samples were trained with . For testing the same pattern was applied to a different subset of images.
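Stamping the pattern described above onto a batch of images could look like the following NumPy sketch. The function name, the channels-last layout, and the assumption that pixel intensities are scaled so that `max_value=1.0` is the maximum are all choices made for this illustration.

```python
import numpy as np

def stamp_backdoor_pattern(images, size=5, max_value=1.0):
    """Return a copy of `images` with the upper-left size x size pixels set to
    the maximal intensity.

    images: array of shape (N, H, W) or (N, H, W, C).
    """
    stamped = images.copy()
    stamped[:, :size, :size, ...] = max_value
    return stamped

# Toy usage: two blank grayscale 28x28 images and two blank 32x32 RGB images.
gray = stamp_backdoor_pattern(np.zeros((2, 28, 28)))
color = stamp_backdoor_pattern(np.zeros((2, 32, 32, 3)))
```

The same function can be applied to a held-out subset at test time to measure how often the pattern triggers the attacker's target.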
Table 3 lists the results. Similar to the results for backdoor sample case, MNIST perfectly learned the backdoor pattern with a minimal impact on the accuracy for benign inputs on all defenses except for No Defense where the attack was again diluted by the averaging with many non-corrupted workers, and yet the malicious label was selected for non-negligible 36.9% of the samples. For CIFAR10 the accuracy is worse than with the backdoor sample, with a 7% (TrimmedMean), 12% (Krum) and 15% (Bulyan) degradation, but the accuracy drop for benign inputs is still reasonable and probably unsuspicious for an innocent server training for a new task without knowing the expected accuracy. For each of the three defenses, more than 80% of the samples with the backdoor pattern were classified maliciously.
It is interesting to see that No Defense was completely resilient to this attack, with only a minimal degradation of 1% and without mis-classifying samples with the backdoor pattern. However, on a different experiment on MNIST with higher and (1 and 0.5 respectively), the opposite occur, where No Defense reached 95.6 for benign inputs and 100% on the backdoor, while other defenses did not perform as well on the benign inputs. Another option for circumventing No Defense is dedicating one corrupted worker for the case that No Defense is being used by the server, and use the rest of the corrupted workers for the defense-evading attack.
We present a new attack paradigm, in which by applying limited changes to many parameters, a malicious opponent may interfere with or backdoor the process of Distributed Learning. Unlike previous attacks, the attacker does not need to know the exact data of the non-corrupted workers (being non-omniscient), and it works even on i.i.d. settings, where the data is known to come from a specific distribution. The attack evades all existing defenses. Based on our experiments, a variant of TrimmedMean is to be chosen among existing defenses, producing the best results for convergence attack excluding the choice of naïve averaging, which is obviously vulnerable to other simpler attacks.
Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 1467–1474.
Detecting backdoor attacks on deep neural networks by activation clustering. In Advances in Neural Information Processing Systems (NIPS).
The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.