Byzantine-Robust Learning on Heterogeneous Datasets via Resampling

06/16/2020 · by Lie He, et al.

In Byzantine robust distributed optimization, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages to the server. While this problem has received significant attention recently, most current defenses assume that the workers have identical data. For realistic cases when the data across workers is heterogeneous (non-iid), we design new attacks which circumvent these defenses leading to significant loss of performance. We then propose a simple resampling scheme that adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We theoretically and experimentally validate our approach, showing that combining resampling with existing robust algorithms is effective against challenging attacks.




Related work

There has been significant recent work on the case where the workers have identical data distributions blanchard2017machine,Chen_2017,mhamdi2018hidden,alistarh2018byzantine,yin2018byzantinerobust,yin2018defending,su2018securing,Damaskinos:265684. We discuss the most pertinent of these methods next. [blanchard2017machine] formalize the Byzantine robust setup and propose Krum, a distance-based approach which selects a worker whose gradient is close to at least half of the other workers. A different approach uses the median and its variants blanchard2017machine,pillutla2019robust,yin2018byzantinerobust. [yin2018byzantinerobust] propose and analyze the coordinate-wise median method (CM). [pillutla2019robust] use a smoothed version of Weiszfeld’s algorithm to iteratively compute an approximate geometric median of the input gradients. In a third approach, bernstein2018signsgd propose to use the signs of the gradients and then aggregate them by majority vote; however, karimireddy2019error show that this may not always converge. Finally, [alistarh2018byzantine] use a martingale-based aggregation rule which yields a sample-complexity-optimal algorithm for iid data. The distance-based approach of Krum was later extended by [mhamdi2018hidden], who propose Bulyan to overcome the dimensional leeway attack. This is the so-called strong Byzantine resilience and is orthogonal to the question of non-iid-ness we study here. Recently, peng2020byzantine,yang2019bridge,yang2019byrdie studied Byzantine-resilient algorithms in the decentralized setting where no central server is available. Extending our techniques to the decentralized setting is an important direction for future work. In a different line of work, lai2016agnostic,diakonikolas2019robust develop sophisticated spectral techniques to robustly estimate the mean of a high-dimensional multivariate standard Gaussian distribution, where samples are evenly distributed in all directions and the attackers are concentrated in one direction. Very recent work data2020byzantine extends these techniques to machine learning. However, we emphasize that in practical machine learning problems the gradients typically do not follow a Gaussian distribution; in the context of decentralized learning, the good gradients may be concentrated in several directions, which also breaks their assumptions. As far as we are aware, only li2019rsa,ghosh2019robust explicitly investigate Byzantine robustness with non-iid workers. li2019rsa propose an SGD variant (RSA) which modifies the original objective by adding an ℓp-norm penalty. ghosh2019robust assume that all workers belong to an a-priori fixed number of clusters and use an outlier-robust clustering method to recover these clusters. If we assume that the server has the entire training dataset and can control the distribution of samples to the good workers, [xie2018zeno, chen2018draco, rajput2019detox] show that non-iid-ness can be overcome. Typical examples of this are distributed training of neural networks on a public cloud, or volunteer computing [Meeds_2015, miura2015implementation]. However, none of these methods are applicable in the standard federated learning setup we consider here. We aim to minimize the original loss function over the workers while respecting data locality, i.e. the partition of the given heterogeneous dataset over the workers, without data transfer.
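To make the coordinate-wise median rule (CM) discussed above concrete, here is a minimal NumPy sketch; the function and variable names, and the toy numbers, are ours:

```python
import numpy as np

def coordinate_wise_median(gradients):
    """Aggregate worker gradients by taking the median of each coordinate
    independently, as in the CM rule of yin2018byzantinerobust."""
    return np.median(np.stack(gradients), axis=0)

# Toy example: two honest gradients and one extreme Byzantine gradient.
good = [np.array([1.0, 2.0]), np.array([1.2, 1.8])]
byzantine = [np.array([1e6, -1e6])]
agg = coordinate_wise_median(good + byzantine)
# Each coordinate's median ignores the single outlier.
```

Because the median is taken per coordinate, a single extreme vector cannot drag the aggregate arbitrarily far, unlike the arithmetic mean.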

Attacks against existing aggregation schemes

In this section we show that when the data across the workers is heterogeneous (non-iid), we can design new attacks which take advantage of the heterogeneity, leading to the failure of existing aggregation schemes. We study three classes of robust aggregation schemes: i) schemes which select a representative worker in each round (e.g. Krum blanchard2017machine), ii) schemes which use normalized means (e.g. li2019rsa), and iii) those which use the median (e.g. pillutla2019robust). We show realistic settings under which each of these classes fails when faced with heterogeneous data.

Failure of representative worker schemes on non-iid data

Algorithms like Krum select a worker who is representative of a majority of the workers, by relying on statistics such as pairwise differences between the various worker updates. Let x_1, …, x_n be the gradients sent by the n workers, of which f are Byzantine (with 2f + 2 < n). For i ≠ j, let i → j denote that x_j belongs to the n − f − 2 closest vectors to x_i. Then Krum is defined as follows:

Krum(x_1, …, x_n) := x_{i*},  where  i* = argmin_i Σ_{i → j} ||x_i − x_j||².

However, when the data across the workers is heterogeneous, there is no ‘representative’ worker, because each worker computes its local gradient over vastly different local data. Hence, for convergence it is important not only to select a good (non-Byzantine) worker, but also to ensure that each of the good workers is selected with roughly equal frequency. Consequently, Krum suffers a significant loss in performance with heterogeneous data, even when there are no Byzantine workers. For example, when Krum is used on iid datasets without an adversary (see left of fig:noniid_f0), the test accuracy is close to that of the simple average, and the gap can be closed by Multi-Krum blanchard2017machine. The right plot of fig:noniid_f0 also shows that Krum’s selection of gradients is biased towards certain nodes. When Krum is applied to non-iid datasets (middle of fig:noniid_f0), it performs poorly even without any attack, because it mostly selects gradients from a few nodes whose distribution is closer to the others (right of fig:noniid_f0). This is an example of how robust aggregation rules may fail on realistic non-iid datasets.
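A minimal sketch of the Krum selection rule described above (NumPy; function and variable names are ours, and we assume n − f − 2 ≥ 1 so that enough neighbours exist):

```python
import numpy as np

def krum(gradients, f):
    """Sketch of the Krum rule (blanchard2017machine): return the gradient
    whose summed squared distance to its n - f - 2 nearest neighbours
    (excluding itself) is smallest."""
    n = len(gradients)
    X = np.stack(gradients)
    # Pairwise squared Euclidean distances between all worker gradients.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    scores = []
    for i in range(n):
        # Distances from worker i to all other workers, sorted ascending.
        others = np.sort(np.delete(d[i], i))
        scores.append(others[: n - f - 2].sum())
    return X[int(np.argmin(scores))]
```

On heterogeneous data this rule returns a single worker's gradient each round, which is exactly the selection-bias behaviour discussed above.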


Figure : Left & middle: Comparing arithmetic mean with Krum on iid and non-iid datasets, without any Byzantine workers. Right: Histogram of selected gradients.
Figure : Comparing normalized mean (RFA with T=1) under the normalized mean attack with a varying number of attackers.
Figure : Comparing coordinate-wise median (CM) and geometric median (RFA with T=8) under the mimic2 attack on iid and non-iid datasets.
Figure : Failures of existing aggregation rules on the MNIST dataset. In all experiments, there are 8 good and f Byzantine workers.

Attacks on normalized aggregation schemes

Instead of simply averaging the gradients, some methods first normalize them and then average. This limits the influence of the Byzantine workers, since they cannot output extremely large gradients, and is hence more robust. For example, RFA pillutla2019robust with T=1 (a single smoothed Weiszfeld step started from 0) uses the following aggregation rule:

NMean(x_1, …, x_n) := ( Σ_i x_i / ||x_i|| ) / ( Σ_i 1 / ||x_i|| ).
Other methods such as RSA li2019rsa or signum bernstein2018signsgd normalize entries coordinate-wise before taking a majority vote, i.e. they update the server model x_0 using the local model x_i of node i (not its gradient) via

x_0^{t+1} = x_0^t − η_t ( ∇f_0(x_0^t) + λ Σ_{i=1}^n sign(x_0^t − x_i^t) ),

where f_0 is a strongly convex penalty term and λ is a relaxation parameter. However, a Byzantine worker can still craft an “omniscient” attack to foil such robust aggregations, using an approach similar to the negative-sum attack for the arithmetic mean blanchard2017machine,li2019rsa:

x_byz = −ε Σ_{i good} x_i,  for a small ε > 0,

so that after normalization the small-norm Byzantine vector receives a large weight and pulls the aggregate in the negative gradient direction. On the right side of fig:failure:nmean, we can see that this attack lowers the accuracy of RFA-T1 significantly as the number of Byzantine workers increases. Compared to its iid counterpart, the normalized mean attack is even more impactful in the non-iid setting.
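A minimal sketch of this smoothed Weiszfeld-style aggregation (one iteration starting from zero recovers the normalized mean; more iterations approach a geometric median). Names and the smoothing constant nu are our choices:

```python
import numpy as np

def smoothed_weiszfeld(gradients, T=8, nu=1e-6):
    """Sketch of RFA-style aggregation (pillutla2019robust): T smoothed
    Weiszfeld iterations approximating the geometric median of the inputs.
    T = 1 starting from zero yields the normalized mean discussed above.
    nu avoids division by zero when an iterate hits a data point."""
    X = np.stack(gradients)
    v = np.zeros(X.shape[1])
    for _ in range(T):
        # Each point is weighted inversely to its distance from the iterate.
        w = 1.0 / np.maximum(np.linalg.norm(X - v, axis=1), nu)
        v = (w[:, None] * X).sum(0) / w.sum()
    return v
```

With two points at 0 and one at 10 (in one dimension), the iterates converge towards 0, the geometric median, rather than the mean 10/3.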

Attacks on median-based schemes

Geometric median and its variants are popular in robust learning research blanchard2017machine,Chen_2017,pillutla2019robust,yin2018byzantinerobust,mhamdi2018hidden. Given gradients x_1, …, x_n, we use the estimator

GM(x_1, …, x_n) := argmin_v Σ_{i=1}^n ||v − x_i||.
If the vectors x_i are drawn independently from the same distribution, intuitively most of them would concentrate around their mean. Then, even if there are some Byzantine outputs, the median would ignore them as outliers and output a ‘central’ point close to the mean. However, when the x_i are gradients over heterogeneous data, they may be vastly different from each other and need not concentrate around the mean. In such a scenario, a median such as eq:gm can be even less robust than simply taking the mean. Suppose that worker 0 is Byzantine and the remaining 2m workers are good, for a total of n = 2m + 1 workers. Now suppose that every gradient is a scalar in {−1, +1}, with half the good workers having x_i = +1 and the other half x_i = −1. The mean over the good workers is 0; however, if the Byzantine worker sends x_0 = +1, the median estimator eq:gm will output 1. This motivates our mimic attack, in which all Byzantine workers collude and agree to always send the gradient of the same fixed good worker. We define a specialized attack, called mimic2, where half of the good workers have identical datasets and send a gradient g while the remaining good workers send −g; all Byzantine workers then send g, so that the geometric median of the gradients received by the server is always g. This attack therefore breaks geometric-median-based robust aggregation rules by leading them to wrong solutions. The left plot of fig:failure:mimic_nmean shows the impact of the mimic2 attack: the test accuracies of CM and RFA both drop drastically to around 50%.

Algorithm 1: Robust Learning with Resampling. Setup: n workers, f of which are Byzantine; resampling outputs n vectors, each averaging s gradients. Aggr is a robust aggregation rule for non-iid datasets; η is the learning rate. Workers:

  1. Each good worker i randomly samples a datapoint ξ_i and computes a stochastic gradient g_i = ∇F(x; ξ_i); each Byzantine worker j prepares an arbitrary vector g_j.

  2. Send g_i to the server.

Server:

  1. Receive g_1, …, g_n from all workers.

  2. Compute ȳ_1, …, ȳ_n = Resampling(g_1, …, g_n, s); see algo:rswor.

  3. Update x ← x − η · Aggr(ȳ_1, …, ȳ_n).

  4. Broadcast x to all workers.

Algorithm 2: Resampling with s-replacement. Input: gradients g_1, …, g_n; replacement limit s. For t = 1, …, n: select s indices uniformly at random, skipping any index that has already been selected s times; compute the average ȳ_t of the corresponding gradients. Return ȳ_1, …, ȳ_n.
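One natural way to enforce the “each gradient used at most s times” constraint is to concatenate s independent random permutations of the worker indices; the following sketch uses that implementation choice (names are ours, and details may differ from the paper's pseudocode):

```python
import numpy as np

def resample_s_replacement(gradients, s, rng=None):
    """Sketch of s-replacement resampling (algo:rswor): produce n vectors,
    each the average of s input gradients, with every input gradient used
    exactly s times overall (which satisfies the 'at most s' constraint)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    X = np.stack(gradients)
    # Concatenating s permutations uses each index exactly s times.
    idx = np.concatenate([rng.permutation(n) for _ in range(s)])
    # Partition the s*n indices into n groups of s and average each group.
    return [X[idx[i * s:(i + 1) * s]].mean(0) for i in range(n)]
```

The n averaged outputs are then fed to any existing robust rule such as Krum; note that averaging groups of s preserves the overall mean of the inputs.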

Robust aggregation

In sec:attacks we have demonstrated how existing robust aggregation rules can fail in realistic scenarios, with and without attackers (ssec:normalized_mean, ssec:median and ssec:representative respectively). To overcome this problem, we propose a simple new resampling-based aggregation rule for training, shown in algo:resampling_sgd. More specifically, we use s-resampling without replacement (algo:rswor), where each gradient can be sampled at most s times. The key property of our rule is that after resampling, the resulting set of averaged gradients is much more homogeneous (lower variance). These averaged gradients are then fed to an existing Byzantine robust aggregation scheme, such as Krum; see sec:analysis. Given an existing aggregation rule Aggr, we denote by Aggr∘Resampling the resulting new robust aggregation rule on n input gradients. In the following proposition, we list the desired properties of algo:rswor. Given a population x_1, …, x_n with mean μ and variance σ², let ȳ_1, …, ȳ_n be the output of algo:rswor on x_1, …, x_n. Then

  • If there are no Byzantine workers, then ȳ_1, …, ȳ_n are identically distributed.

  • If f of the inputs are Byzantine, then at least n − fs of the gradients in ȳ_1, …, ȳ_n are good; that is, such a good ȳ_i is the average of s good input gradients. These good ȳ_i are identically distributed with

    E[ȳ_i] = μ_good  and  Var(ȳ_i) = (σ_good² / s) · (n_good − s) / (n_good − 1),

    where μ_good and σ_good² are the mean and variance of the good input gradients, and n_good = n − f is their number.

Since algo:rswor resamples gradients to estimate a population of n samples, we can use sampling theory [Ch. Survey Sampling]middleton1988mathematical to compute the sample mean

E[ȳ_i] = μ

and the sample variance

Var(ȳ_i) = (σ² / s) · (n − s) / (n − 1).
Since each gradient is sampled at most s times, at most fs out of the n resampled gradients are affected by a Byzantine worker; their mean and variance can be calculated in the same way as above. For s = 1, resampling simply becomes a shuffling of the input elements, and the variance is unchanged. For s ≥ 2, the resampling scheme reduces the heterogeneity (variance) by approximately a factor of s. Thus, increasing s makes the resampled gradients a better estimator of the population mean, improving training convergence speed. On the other hand, increasing s also increases the number of resampled gradients which can be affected by a Byzantine worker: if f workers are Byzantine, then up to fs resampled gradients can be incorrect, which has to be taken into account by the employed robust aggregation rule. In practice, we found that a small value such as s = 2 was already sufficient to overcome heterogeneity. A natural question to ask is what happens if we resample with replacement but do not limit the number of replacements; we discuss this additional algorithm variant in sec:rswr. Note that the ȳ_i are identically distributed but not independent. This does not directly fit the original assumptions of Byzantine robust algorithms like Krum, and hence robustness has to be re-proved in our more general setting.
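The claimed factor-s variance reduction is easy to check numerically. The following sketch (NumPy, with our own toy numbers) compares the variance across raw scalar “gradients” with the variance across averages of disjoint groups of s of them:

```python
import numpy as np

# Empirical check: averaging disjoint random groups of s heterogeneous
# "gradients" shrinks the variance across vectors roughly by a factor of s.
rng = np.random.default_rng(0)
n, s = 10_000, 4
grads = rng.normal(loc=1.0, scale=2.0, size=n)  # heterogeneous inputs
# Shuffle, partition into n/s disjoint groups of s, and average each group.
groups = grads[rng.permutation(n)].reshape(n // s, s).mean(1)
ratio = grads.var() / groups.var()
# ratio should be close to s (= 4); the finite-population correction
# (n - s)/(n - 1) is negligible at this n.
```

The sample mean is unchanged by the grouping, so only the spread shrinks, which is exactly what helps distance- and median-based rules on non-iid data.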

Convergence analysis with non-iid data

In this section, we analyze the convergence of SGD with Krum aggregation on non-iid data. Since the definition of robustness and other conditions vary from paper to paper, it is not possible to give a uniform proof that fits all methods perfectly. For example, yin2018byzantinerobust assume the gradients have bounded variance and skewness, whereas others like Krum, RFA, and Bulyan do not. Thus we only analyze Krum, for its simplicity and popularity, and show that the analysis is only slightly different from the original version. For other algorithms, we show by experiments that resampling helps them achieve better performance on heterogeneous data, see sec:experiments. def:byz:krum generalizes the Byzantine resilience of [Definition 1]blanchard2017machine to the case of non-iid data. Let g be an estimator of the good gradients.

[(α, f)-Byzantine Resilience.] Let 0 ≤ α < π/2 be any angular value, and 0 ≤ f ≤ n any integer. Let B_1, …, B_f be the indices of the Byzantine workers. Let V_1, …, V_n be independent random vectors in R^d. Let V be an independent random variable which randomly selects a good worker i and samples a vector from the distribution of V_i, with E[V] = g. Let the Byzantine workers output arbitrary vectors in R^d, possibly dependent on the good V_i’s. An aggregation rule F is said to be (α, f)-Byzantine resilient if its output

F = F(V_1, …, V_n)

satisfies (i) ⟨E[F], g⟩ ≥ (1 − sin α) ||g||² > 0 and (ii) E||F||^r is bounded above by a linear combination of terms E||V||^{r_1} ⋯ E||V||^{r_k} with r_1 + ⋯ + r_k = r, for r = 2, 3, 4. Then we can conclude almost sure convergence, similar to [Proposition 2]blanchard2017machine.

[Resampling + Krum]theoremresamplingkrum We assume that (i) the cost function f is three times differentiable with continuous derivatives and is non-negative, f(x) ≥ 0; (ii) the learning rates satisfy Σ_t γ_t = ∞ and Σ_t γ_t² < ∞. Let the good workers have stochastic gradients g_i(x) = ∇f(x) + ξ_i. We assume that for a uniformly chosen good worker i, the following hold: (iii) E[ξ_i] = 0 and E||ξ_i||^r ≤ A_r + B_r ||x||^r for r = 2, 3, 4 and some constants A_r, B_r; (iv) there exists a constant 0 ≤ α < π/2 such that for all x the deviation of the resampled good gradients is bounded relative to ||∇f(x)|| sin α, in the sense of def:byz:krum; (v) finally, beyond a certain horizon, ||x||² ≥ D, there exist ε > 0 and 0 ≤ β < π/2 − α such that ||∇f(x)|| ≥ ε > 0 and ⟨x, ∇f(x)⟩ ≥ cos(β) ||x|| ||∇f(x)||. If 2fs + 2 < n and the above assumptions hold, then

  • Krum∘Resampling is (α, fs)-Byzantine resilient, where α is determined by condition (iv);

  • the sequence of gradients ∇f(x_t) converges almost surely to zero.

We defer the proof to sec:proof:multikrum. The above convergence result for heterogeneous data is nearly identical to [Proposition 2]blanchard2017machine for iid data, except for the slightly stronger restriction on the number of Byzantine workers f.


Experiments

In this section, we demonstrate the effect of resampling on datasets distributed in a non-iid fashion. Throughout the section, we train an MLP on the MNIST dataset lecun1998gradient. For the non-iid setup, the dataset is sorted by labels and divided sequentially and equally between the fixed number of good workers; the Byzantine workers have access to the entire dataset. Implementations are based on PyTorch paszke2019pytorch and will be made publicly available.


Figure : Left & middle: Comparing arithmetic mean with Krum on iid and non-iid datasets, without any Byzantine workers. Right: Histogram of selected gradients.
Figure : Comparing normalized mean (RFA with T=1) under the normalized mean attack with a varying number of attackers.
Figure : Comparing coordinate-wise median (CM) and geometric median (RFA with T=8) under the mimic2 attack on iid and non-iid datasets.
Figure : Combining resampling with existing aggregation rules on the MNIST dataset. In all experiments, there are 8 good and f Byzantine workers. For each aggregation rule we resample and average gradients with s = 2.

Resampling against the attacks on non-iid data

In sec:attacks we have shown how heterogeneous data can lead to failure of existing robust aggregation rules. Here we apply our proposed resampling (with s = 2) to the same aggregation rules, showing that resampling overcomes the described failures. Results are presented in fig:success. In fig:success:noniid_f0_all, we show that using resampling helps Krum achieve better test accuracy on non-iid data. Since resampling with s = 2 actually averages 2 gradients, we compare it with MultiKrum with m = 2. The middle of fig:success:noniid_f0_all shows that MultiKrum with m = 2 performs better than Krum, but Krum with resampling is even better, which suggests the resampling step improves the performance on non-iid data. The selection histogram on the rightmost part of fig:success:noniid_f0_all shows that after resampling, Krum’s selection is much more evenly distributed among the good workers. In fig:success:nmean_all, we show that resampling fixes RFA with T=1 and allows it to defend against the normalized mean attack. The resampling-based aggregation reaches almost the same accuracy in both the iid and non-iid setups. In fig:success:mimic_nmean, while the mimic attack does not work against median-based rules in the iid setting, resampling still slightly improves the performance due to variance reduction. In the non-iid setting, resampling drastically improves the accuracy to the level of the iid setting.

Resampling against general Byzantine attacks


Figure : Test accuracies of Krum, CM, and RFA under 5 kinds of attacks (and without attack) on non-iid datasets. There are 10 workers, 2 of which are Byzantine according to each attack row. Columns show each aggregation rule applied without (red) and with resampling (blue). Dotted lines for comparison show the same method without any Byzantine workers (f = 0). For RFA, T1 and T8 refer to the number of inner iterations of Weiszfeld’s algorithm.

In fig:s1w10f2, we present thorough experiments on non-iid data over 10 workers, 2 of which are Byzantine. In each subfigure, we compare an aggregation rule with its resampling variant. Three aggregation rules are compared: Krum, CM, and RFA. In particular, we compare RFA with both T=1 (normalized mean) and T=8 (geometric median). Attacks. Five different kinds of attacks are applied (one per row in the figure): bitflipping, labelflipping, gaussian attack, as well as the mimic and mimic2 attacks.

  • Bitflipping: A Byzantine worker flips the sign bits and sends −g instead of g, e.g. because of problems like hardware failures.

  • Labelflipping: The datasets on Byzantine workers have corrupted labels. For the MNIST dataset, we let Byzantine workers transform each label ℓ into 9 − ℓ.

  • Gaussian: A Byzantine worker sends a Gaussian random vector with zero mean and isotropic covariance matrix with standard deviation 200 xie2018generalized.

  • mimic & mimic2: Explained in ssec:median.
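For concreteness, the simpler attack vectors above can be sketched as follows (NumPy; all function names, shapes, and the default generator are our choices):

```python
import numpy as np

def bitflip_attack(gradient):
    # Bitflipping: the Byzantine worker sends the negated gradient.
    return -gradient

def label_flip(label):
    # Labelflipping on MNIST: class l becomes 9 - l.
    return 9 - label

def gaussian_attack(dim, rng=np.random.default_rng()):
    # Zero-mean isotropic Gaussian noise with standard deviation 200.
    return rng.normal(0.0, 200.0, size=dim)

def mimic_attack(good_gradients, victim=0):
    # mimic: all Byzantine workers copy one fixed good worker's gradient.
    return good_gradients[victim]
```

The mimic2 variant is the same copying idea applied to the ±g construction from ssec:median, with all Byzantine workers joining the +g half.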

From fig:s1w10f2 we can see that resampling improves the accuracy on most of the tasks. The final accuracies achieved vary with the aggregation rule used. Notice that RFA-T1 is more robust to the mimic attack than RFA-T8 in fig:s1w10f2: more inner iterations yield a better approximation of the geometric median, which the mimic attack targets, while fewer iterations are less robust to the normalized mean attack. The normalized mean attack has been addressed in ssec:normalized_mean.


Conclusion

In this paper, we initiated a study of the robust distributed learning problem under realistic heterogeneous data. We showed that many existing Byzantine-robust aggregation rules fail under simple new attacks, or sometimes even without any Byzantine workers. As a solution, we proposed a resampling scheme which effectively adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We believe robustness under heterogeneous conditions has been an overlooked direction of research thus far and hope to inspire more work on this topic. Extending to the decentralized setting, stronger Byzantine adversaries, as well as obtaining optimal algorithms are other challenging directions for future work.