Related work
There has been significant recent work on the case when the workers have identical data distributions [blanchard2017machine, Chen_2017, mhamdi2018hidden, alistarh2018byzantine, yin2018byzantinerobust, yin2018defending, su2018securing, Damaskinos:265684]. We discuss the most pertinent of these methods next. [blanchard2017machine] formalize the Byzantine-robust setup and propose a distance-based approach which selects a worker whose gradient is close to at least half of the other workers. A different approach involves using the median and its variants [blanchard2017machine, pillutla2019robust, yin2018byzantinerobust]. [yin2018byzantinerobust] propose and analyze the coordinate-wise median method. [pillutla2019robust] use a smoothed version of Weiszfeld's algorithm to iteratively compute an approximate geometric median of the input gradients. In a third approach, [bernstein2018signsgd] propose to use the signs of the gradients and then aggregate them by majority vote; however, [karimireddy2019error] show that this may not always converge. Finally, [alistarh2018byzantine] use a martingale-based aggregation rule which gives a sample-complexity-optimal algorithm for iid data. The distance-based approach of [blanchard2017machine] was later extended by [mhamdi2018hidden], who propose Bulyan to overcome the dimensional leeway attack. This is the so-called strong Byzantine resilience
and is orthogonal to the question of non-iidness we study here. Recently, [peng2020byzantine, yang2019bridge, yang2019byrdie] studied Byzantine-resilient algorithms in the decentralized setting where no central server is available. Extending our techniques to the decentralized setting is an important direction for future work. In a different line of work, [lai2016agnostic, diakonikolas2019robust] develop sophisticated spectral techniques to robustly estimate the mean of a high-dimensional multivariate standard Gaussian distribution, where samples are evenly distributed in all directions and the attackers are concentrated in one direction. Very recent work [data2020byzantine] extends these techniques to machine learning. However, we emphasize that in practical machine learning problems the gradients typically do not follow a Gaussian distribution; in the context of decentralized learning, the good gradients may be concentrated in several directions, which also breaks their assumptions. As far as we are aware, only [li2019rsa, ghosh2019robust] explicitly investigate Byzantine robustness with non-iid workers. [li2019rsa] propose an SGD variant (RSA) which modifies the original objective by adding a penalty term. [ghosh2019robust] assume that all workers belong to an a-priori fixed number of clusters and use an outlier-robust clustering method to recover these clusters. If we assume that the server has the entire training dataset and can control the distribution of samples to good workers,
[xie2018zeno, chen2018draco, rajput2019detox] show that non-iidness can be overcome. Typical examples of this are distributed training of neural networks on a public cloud, or volunteer computing [Meeds_2015, miura2015implementation]. However, none of these methods are applicable in the standard federated learning setup we consider here. We aim to minimize the original loss function over workers while respecting data locality, i.e. the partition of the given heterogeneous dataset over the workers, without data transfer.

Attacks against existing aggregation schemes
In this section we show that when the data across the workers is heterogeneous (non-iid), we can design new attacks which exploit the heterogeneity, leading to the failure of existing aggregation schemes. We study three classes of robust aggregation schemes: i) schemes which select a representative worker in each round (e.g. [blanchard2017machine]), ii) schemes which use normalized means (e.g. [li2019rsa]), and iii) those which use the median (e.g. [pillutla2019robust]). We show realistic settings under which each of these classes fails when faced with heterogeneous data.
Failure of representative worker schemes on non-iid data
Algorithms like Krum [blanchard2017machine] select workers who are representative of a majority of the workers, by relying on statistics such as pairwise differences between the various worker updates. Let $g_1, \dots, g_n$ be the gradients computed by the $n$ workers, of which $f$ are Byzantine. For $i \neq j$, let $i \to j$ denote that $g_j$ belongs to the $n - f - 2$ closest vectors to $g_i$. Then Krum is defined as follows:
$$\operatorname{Krum}(g_1, \dots, g_n) := \operatorname*{argmin}_{g_i} \sum_{i \to j} \|g_i - g_j\|^2 .$$
However, when the data across the workers is heterogeneous, there is no 'representative' worker, because each worker computes its local gradient over vastly different local data. Hence, for convergence it is important not only to select a good (non-Byzantine) worker, but also to ensure that each of the good workers is selected with roughly equal frequency. As a result, Krum suffers a significant loss in performance with heterogeneous data, even when there are no Byzantine workers. For example, when Krum is used on iid datasets without an adversary (see left of fig:noniid_f0), the test accuracy is close to that of the simple average, and the gap can be closed by Multi-Krum [blanchard2017machine]. The right plot of fig:noniid_f0 also shows that Krum's selection of gradients is biased towards certain nodes. When Krum is applied to non-iid datasets (middle of fig:noniid_f0), it performs poorly even without any attack. This is because Krum mostly selects gradients from a few nodes whose distributions are closer to the others (right of fig:noniid_f0). This is an example of how robust aggregation rules may fail on realistic non-iid datasets.
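To make the selection rule above concrete, here is a minimal NumPy sketch of a Krum-style selection (the function name and the naive loops are ours, not the reference implementation of the cited paper):

```python
import numpy as np

def krum_select(grads, f):
    """Krum-style selection: score each gradient by the sum of squared
    distances to its n - f - 2 nearest other gradients; return the index
    with the smallest score."""
    n = len(grads)
    k = n - f - 2  # number of closest neighbours contributing to the score
    dists = np.array([[np.sum((g - h) ** 2) for h in grads] for g in grads])
    scores = [np.sort(np.delete(dists[i], i))[:k].sum() for i in range(n)]
    return int(np.argmin(scores))
```

With iid gradients a far-away Byzantine vector is never selected; the failure discussed above is that on heterogeneous data the minimizer is systematically biased towards workers whose local distribution happens to lie closest to the others.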
Attacks on normalized aggregation schemes
Instead of simply averaging the gradients, some methods first normalize them and then average. This limits the influence of the Byzantine workers, since they cannot output extremely large gradients, and hence is more robust. For example, [pillutla2019robust] with $T=1$ uses the following aggregation rule:
$$\operatorname{Aggr}(g_1, \dots, g_n) := \frac{1}{n} \sum_{i=1}^{n} \frac{g_i}{\|g_i\|} .$$
Other methods such as RSA [li2019rsa] or Signum [bernstein2018signsgd] normalize entries coordinate-wise before taking a majority vote, i.e. the server model $x$ is updated using the local model $x_i$ from node $i$ (not the gradient) via
$$x \leftarrow x - \eta \Big( \nabla f_0(x) + \lambda \sum_{i=1}^{n} \operatorname{sign}(x - x_i) \Big),$$
where $f_0$ is a strongly convex penalty term and $\lambda$ is a relaxation parameter. However, a Byzantine worker can still craft an "omniscient" attack to foil such robust aggregation rules, using an approach similar to the negative-sum attack against the arithmetic mean [blanchard2017machine, li2019rsa]:
() 
On the right side of fig:failure:nmean, we can see that this attack significantly lowers the accuracy of $T=1$ as the number of Byzantine workers increases. Compared to its iid counterpart, the normalized mean attack is even more damaging in the non-iid setting.
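As an illustration of why normalization alone does not help, the following sketch (our own simplification, not the exact rule or attack of the cited papers) averages unit-normalized gradients and lets colluding Byzantine workers send the direction opposite to the good workers' aggregate:

```python
import numpy as np

def normalized_mean(grads):
    """Average of unit-normalized gradients (a simplified T=1-style rule)."""
    return np.mean([g / np.linalg.norm(g) for g in grads], axis=0)

def negative_sum_attack(good_grads, f):
    """Each of the f Byzantine workers sends the unit vector opposite to
    the sum of the good workers' normalized gradients."""
    s = np.sum([g / np.linalg.norm(g) for g in good_grads], axis=0)
    return [-s / np.linalg.norm(s)] * f

# Example: 8 good workers roughly agree on a direction, 2 attackers oppose it.
good = [np.array([1.0, 0.0])] * 8
byz = negative_sum_attack(good, f=2)
clean = normalized_mean(good)
attacked = normalized_mean(good + byz)
```

Because the attack vectors have unit norm, they pass the normalization step untouched, yet they shrink the aggregate towards zero and can stall training.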
Attacks on median-based schemes
The geometric median and its variants are popular in robust learning research [blanchard2017machine, Chen_2017, pillutla2019robust, yin2018byzantinerobust, mhamdi2018hidden]. Given gradients $g_1, \dots, g_n$, we use the estimator
$$\operatorname{GM}(g_1, \dots, g_n) := \operatorname*{argmin}_{v} \sum_{i=1}^{n} \|v - g_i\| . \tag{eq:gm}$$
If the vectors $g_i$ are drawn independently from the same distribution, intuitively most of them concentrate around their mean. Then, even if there are some Byzantine outputs, the median ignores those as outliers and outputs a 'central' point close to the mean. However, when the $g_i$ are gradients over heterogeneous data, they may be vastly different from each other and need not concentrate around the mean. In such a scenario, a median such as eq:gm can be even less robust than simply taking the mean. Suppose that worker 0 is Byzantine and the remaining $2q$ workers are good. Now suppose that half the good workers have gradient $+1$ and the other half $-1$, while the Byzantine worker sends $+1$. The true mean of the good gradients is $0$; however, the median estimator eq:gm will output $1$. This motivates our mimic attack, in which all Byzantine workers collude and agree to always send the gradient of the same good worker. We define a specialized variant, called mimic2, where half of the good workers have identical datasets and send $g$ while the remaining good workers send $-g$; all Byzantine workers then send $g$, so that the geometric median of the gradients received by the server is always $g$. This attack therefore breaks geometric-median-based robust aggregation rules by leading them to wrong solutions. The left plot of fig:failure:mimic_nmean shows the impact of the mimic2 attack: test accuracies of both median variants drop drastically to around 50%.

[algo:resampling_sgd] Setup: $n$ workers, of which $f$ are Byzantine; resampling produces $n$ outputs, each averaging $s$ sampled gradients. Aggr is a robust aggregation rule on non-iid datasets; $\eta$ is the learning rate. Workers:

- Each good worker $i$ samples a datapoint $\xi_i$ and computes a stochastic gradient $g_i = \nabla F(x; \xi_i)$; each Byzantine worker sends an arbitrary vector instead.
- Send $g_i$ to the server.

Server:
- Receive $g_1, \dots, g_n$ from all workers.
- $(\bar g_1, \dots, \bar g_n)$ = Resampling($g_1, \dots, g_n$, $s$); see algo:rswor.
- Compute $x \leftarrow x - \eta \cdot \operatorname{Aggr}(\bar g_1, \dots, \bar g_n)$.
- Broadcast $x$ to all workers.
[algo:rswor] Input: gradients $g_1, \dots, g_n$ and replication limit $s$. For $k = 1, \dots, n$: select $s$ gradients uniformly at random among those selected fewer than $s$ times so far; if none remain, break; compute their average $\bar g_k$. Return $(\bar g_1, \dots, \bar g_n)$.
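The resampling step above can be sketched in a few lines of NumPy (a simplified variant of algo:rswor of our own: we replicate each index $s$ times, shuffle, and average consecutive groups of $s$, which enforces the at-most-$s$ constraint):

```python
import numpy as np

def resample(grads, s, rng):
    """s-fold resampling: each input gradient is used exactly s times;
    the shuffled pool is partitioned into n groups of s, whose averages
    form the more homogeneous output gradients."""
    n = len(grads)
    idx = np.repeat(np.arange(n), s)  # each index appears s times
    rng.shuffle(idx)
    return [np.mean([grads[j] for j in grp], axis=0)
            for grp in idx.reshape(n, s)]
```

Because every input appears exactly $s$ times in the pool, the empirical mean of the outputs equals that of the inputs, while averaging groups of $s$ can only reduce their spread.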
Robust aggregation
In sec:attacks we have demonstrated how existing robust aggregation rules can fail in realistic scenarios, with and without attackers (ssec:normalized_mean, ssec:median and ssec:representative respectively). To overcome this problem, we propose a simple new resampling-based aggregation rule for robust training, shown in algo:resampling_sgd. More specifically, we choose resampling without replacement in algo:rswor, where each gradient can be sampled at most $s$ times. The key property of our rule is that after resampling, the resulting set of averaged gradients is much more homogeneous (has lower variance). These averaged gradients are then fed to existing Byzantine-robust aggregation schemes such as Krum; see sec:analysis. Given an existing aggregation rule Aggr, we denote by Aggr∘Resampling the resulting new robust aggregation rule for the input gradients. In the following proposition, we list the desired properties of algo:rswor. Given a population $g_1, \dots, g_n$ with mean $\mu$ and variance $\sigma^2$, let $\bar g_1, \dots, \bar g_n$ be the output of algo:rswor on it. Then

- If there are no Byzantine workers, then the $\bar g_i$ are identically distributed with
$$\mathbb{E}[\bar g_i] = \mu, \qquad \operatorname{Var}(\bar g_i) \le \frac{\sigma^2}{s} .$$
- If $f$ of the inputs are Byzantine, then at least $n - fs$ gradients in the output are good; that is, each such good $\bar g_i$ is the average of $s$ good input gradients. These good $\bar g_i$ are identically distributed with
() where the corresponding mean and variance are computed over the good gradients only.
Since algo:rswor resamples gradients from a finite population of $n$ samples, we can use survey sampling theory [middleton1988mathematical, Ch. Survey Sampling] to compute the sample mean
() 
and the sample variance
() 
Since each gradient is sampled at most $s$ times, at most $fs$ out of the $n$ resampled gradients are affected by Byzantine workers. Their mean and variance can be calculated in the same way as shown above. For $s = 1$, resampling simply becomes a shuffling of the input elements, and the variance is unchanged. For $s > 1$, the resampling scheme reduces the heterogeneity (variance) by approximately a factor of $s$. Thus, increasing $s$ makes the resampled gradients a better estimator of the population mean, improving the convergence speed of training. On the other hand, increasing $s$ also increases the number of resampled gradients which can be affected by a Byzantine worker: if $f$ workers are Byzantine, then up to $fs$ resampled gradients can be incorrect, which has to be taken into account by the employed robust aggregation rule. In practice, we found that a small value such as $s = 2$ was already sufficient to overcome heterogeneity. A natural question is what happens if we resample with replacement, without limiting the number of times each gradient may be used; we discuss this additional variant of the algorithm in sec:rswr. Note that the $\bar g_i$ are identically distributed but not independent. This does not directly fit the original assumptions of Byzantine-robust algorithms like Krum, and hence robustness has to be re-proved for our more general setting.
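The overall Aggr∘Resampling pipeline can be sketched as follows, here composing the shuffling-based resampling with a coordinate-wise median as one example of an existing robust rule (both are simplifications of ours, not the paper's reference implementation):

```python
import numpy as np

def aggr_with_resampling(grads, s, aggr, rng):
    """Resample (each gradient used exactly s times, groups of s averaged),
    then hand the homogenized gradients to an existing robust aggregator."""
    n = len(grads)
    idx = np.repeat(np.arange(n), s)
    rng.shuffle(idx)
    mixed = [np.mean([grads[j] for j in grp], axis=0)
             for grp in idx.reshape(n, s)]
    return aggr(mixed)

def cwmed(gs):
    """Coordinate-wise median as the base robust aggregation rule."""
    return np.median(np.stack(gs), axis=0)
```

For $s = 1$ the resampling step is a pure permutation, so any permutation-invariant aggregator is unchanged; only for $s > 1$ does the pre-averaging reduce the heterogeneity seen by the base rule.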
Convergence analysis with Krum
In this section, we analyze the convergence of SGD with robust aggregation on non-iid data. Since the definition of robustness and other conditions vary from paper to paper, it is not possible to give a uniform proof that fits all methods perfectly. For example, [yin2018byzantinerobust] assume that the gradients have bounded variance and skewness, whereas others such as Krum and Bulyan do not. Thus we analyze only Krum, for its simplicity and popularity, and show that the analysis differs only slightly from the original version. For other algorithms, we show by experiments that resampling helps them achieve better performance on heterogeneous data; see sec:experiments. def:byz:krum generalizes the Byzantine resilience of [blanchard2017machine, Definition 1] to the case of non-iid data. Let $\hat g$ be an estimator of the good gradients. [Byzantine Resilience.] Let $0 \le \alpha < \pi/2$ be any angular value, and $0 \le f \le n$ any integer. Let $B$ denote the indices of the Byzantine workers. Let $g_1, \dots, g_n$ be independent random vectors in $\mathbb{R}^d$. Let $G$ be an independent random variable which randomly selects a good worker
and samples a vector from its gradient distribution. Let the Byzantine vectors $g_j$, $j \in B$, be arbitrary vectors in $\mathbb{R}^d$, possibly dependent on the good $g_i$'s. An aggregation rule is said to be $(\alpha, f)$-Byzantine resilient if $\hat g$ satisfies conditions (i) and (ii) of [blanchard2017machine, Definition 1] with respect to $G$. Then we can conclude almost sure convergence similar to [blanchard2017machine, Proposition 2]. [Resampling + Krum] We assume that (i) the cost function is three times differentiable with continuous derivatives and non-negative; (ii) the learning rates satisfy $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. Let the good workers have stochastic gradients $g_i$ for $i \notin B$. We assume that, for a uniformly chosen good worker, (iii) the stochastic gradient is unbiased and its moments are bounded by some constants; (iv) there exists a constant bounding the deviation of the gradients relative to the norm of the true gradient; (v) finally, beyond a certain horizon, the norm of the gradient is bounded below and the gradient is sufficiently aligned with the iterate. If $\alpha$ and $f$ satisfy the resilience condition below, then

- Krum∘Resampling is Byzantine resilient, where $\alpha$ is defined by

()

- the sequence of gradients converges almost surely to zero.
We defer the proof to sec:proof:multikrum. The above convergence result for heterogeneous data is nearly identical to [blanchard2017machine, Proposition 2] for iid data, except for a slightly stronger restriction on the number of Byzantine workers $f$.
Experiments
In this section, we demonstrate the effect of resampling on datasets distributed in a non-iid fashion. Throughout the section, we train an MLP on the MNIST dataset [lecun1998gradient]. For the non-iid setup, the dataset is sorted by labels and sequentially divided equally between the fixed number of good workers; the Byzantine workers have access to the entire dataset. Implementations are based on PyTorch [paszke2019pytorch] and will be made publicly available.
Resampling against the attacks on non-iid data
In sec:attacks we have shown how heterogeneous data can lead to the failure of existing robust aggregation rules. Here we apply our proposed resampling with $s = 2$ to the same aggregation rules, showing that resampling overcomes the described failures. Results are presented in fig:success. In fig:success:noniid_f0_all, we show that using resampling helps Krum achieve better test accuracy on non-iid data. Since resampling with $s = 2$ averages 2 gradients, we compare it with Multi-Krum averaging 2 selected gradients. The middle of fig:success:noniid_f0_all shows that this Multi-Krum variant performs better than Krum, but Krum with resampling is better still, which suggests that the resampling step improves performance on non-iid data. The selection histogram on the rightmost part of fig:success:noniid_f0_all shows that after resampling, Krum's selection is much more evenly distributed between the good workers. In fig:success:nmean_all, we show that resampling fixes [pillutla2019robust] with $T = 1$ and allows it to defend against the normalized mean attack. The resampling-based aggregation reaches almost the same accuracy in both the iid and non-iid setups. In fig:success:mimic_nmean, while the mimic attack does not work against median-based rules in the iid setting, resampling still slightly improves the performance due to variance reduction. In the non-iid setting, resampling drastically improves the accuracy to the level of the iid setting.
Resampling against general Byzantine attacks
In fig:s1w10f2, we present thorough experiments on non-iid data over 10 workers, 2 of which are Byzantine. In each subfigure, we compare an aggregation rule with its resampling variant. Three aggregation rules are compared; in particular, we include [pillutla2019robust] with both $T = 1$ (normalized mean) and $T = 8$ (geometric median). Attacks. 5 different kinds of attacks are applied (one per row in the figure): bit-flipping, label-flipping, Gaussian attack, as well as the mimic and mimic2 attacks.

- Bit-flipping: a Byzantine worker flips the sign bits and sends $-g$ instead of $g$, modeling problems like hardware failures.
- Label-flipping: the datasets on Byzantine workers have corrupted labels; for the MNIST dataset, we let the Byzantine workers transform the labels.
- Gaussian: a Byzantine worker sends a Gaussian random vector with mean 0 and isotropic covariance matrix with standard deviation 200 [xie2018generalized].
- mimic & mimic2: explained in ssec:median.
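The first three attacks are straightforward to state in code. A hedged sketch (helper names are ours; the paper's exact label transform is elided above, so a common label reversal is used as a stand-in):

```python
import numpy as np

def bit_flip(grad):
    """Bit-flipping: send -g instead of g (sign bits flipped)."""
    return -grad

def label_flip(label, num_classes=10):
    """Label-flipping stand-in: reverse the label order (9 - y for MNIST).
    The exact transform used in the paper is not specified here."""
    return (num_classes - 1) - label

def gaussian_attack(dim, rng):
    """Gaussian attack: a zero-mean isotropic Gaussian vector, std 200."""
    return 200.0 * rng.standard_normal(dim)
```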
From fig:s1w10f2 we can see that resampling improves the accuracy for most of the tasks. The final accuracies achieved vary with the aggregation rule used. Notice that $T = 1$ is more robust to the mimic attack than $T = 8$ in fig:s1w10f2: more inner iterations yield a better approximation of the geometric median, which is more vulnerable to the mimic attack, while $T = 1$ is instead vulnerable to the normalized mean attack, as discussed in ssec:normalized_mean.
Conclusion
In this paper, we initiated the study of the robust distributed learning problem under realistic heterogeneous data. We showed that many existing Byzantine-robust aggregation rules fail under simple new attacks, or sometimes even without any Byzantine workers. As a solution, we proposed a resampling scheme which effectively adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We believe robustness under heterogeneous conditions has been an overlooked direction of research thus far, and we hope to inspire more work on this topic. Extending to the decentralized setting, handling stronger Byzantine adversaries, and obtaining optimal algorithms are further challenging directions for future work.