# Generalized Byzantine-tolerant SGD

We propose three new robust aggregation rules for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience properties of these aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.


## 1 Introduction

The failure resilience of distributed machine-learning systems has attracted increasing attention (Blanchard et al., 2017; Chen et al., 2017) in the community. Larger clusters can accelerate training. However, this makes the distributed system more vulnerable to different kinds of failures or even attacks, including crashes, computation errors, stalled processes, and compromised sub-systems (Harinath et al., 2017). Thus, failure/attack resilience is becoming more and more important for distributed machine-learning systems, especially for large-scale deep learning (Dean et al., 2012; McMahan et al., 2017).

In this paper, we consider the most general failure model, Byzantine failures (Lamport et al., 1982), where the attackers can access any information of the other processes and corrupt any value in transmission. More specifically, the data transmitted between the machines can be replaced by arbitrary values. Under such a model, there are no constraints on the failures or attackers.

The distributed training framework studied in this paper is the Parameter Server (PS). The PS architecture is composed of the server nodes and the worker nodes. The server nodes maintain a global copy of the model, aggregate the gradients from the workers, apply the gradients to the model, and broadcast the latest model to the workers. The worker nodes pull the latest model from the server nodes, compute the gradients according to the local portion of the training data, and send the gradients to the server nodes. The entire dataset and the corresponding workload is distributed to multiple worker nodes, thus parallelizing the computation via partitioning the dataset. There exist several distributed machine learning systems using the PS architecture. For instance, Tensorflow (Abadi et al., 2016), CNTK (Seide & Agarwal, 2016), and MXNet (Chen et al., 2015) implement internal PS's.

In this paper, we study the Byzantine resilience of synchronous Stochastic Gradient Descent (SGD), a popular class of learning algorithms using the PS architecture. Its variants are widely used in training deep neural networks (Kingma & Ba, 2014; Mukkamala & Hein, 2017). Such algorithms always wait to collect gradients from all the worker nodes before moving on to the next iteration.

The failure model can be described by an $n \times d$ matrix consisting of the $d$-dimensional gradients produced by the $n$ workers, as visualized in Figure 1. A previous work (Blanchard et al., 2017) discusses a special case of our failure model, where the Byzantine values must lie in the same rows (workers), as shown in Figure 1(a). Our failure model generalizes the classic Byzantine failure model by placing the Byzantine values anywhere in the matrix, without any constraint.

There are many possible types of attacks. In general, the attackers want to disturb the model training, i.e., make SGD converge slowly or converge to a bad solution. We list some of the possible attacks in the following three paragraphs.

We name the most general type of attack gambler. The attackers can change a portion of the data on the communication media, such as the wires or the network interfaces. The attackers randomly pick the data and maliciously change them (e.g., multiply them by a large negative value). As a result, on the server nodes, the collected gradients are partially replaced by arbitrary values.

Another possible type of attack is called omniscient. The attackers are supposed to know the gradients sent by all the workers, and use the sum of all the gradients, scaled by a large negative value, to replace some of the gradient vectors. The goal is to mislead SGD into the opposite direction with a large step size.

There are also some weaker attacks, such as the Gaussian attack, where some of the gradient vectors are replaced by random vectors sampled from a Gaussian distribution with large variance. Such attackers do not require any information from the workers.
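As a rough illustration, the omniscient and Gaussian attacks described above can be simulated on a matrix of collected gradients (one row per worker). This is a sketch under our own assumptions; the function names, shapes, and constants are illustrative, not the paper's experimental settings.

```python
import numpy as np

def gaussian_attack(grads, q, sigma=200.0, rng=None):
    """Replace q of the n gradient rows with zero-mean Gaussian noise of
    large standard deviation (an attack needing no worker information)."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = grads.copy()
    idx = rng.choice(len(grads), size=q, replace=False)
    out[idx] = rng.normal(0.0, sigma, size=(q, grads.shape[1]))
    return out

def omniscient_attack(grads, q, scale=1e20):
    """Replace q rows with the sum of all gradients scaled by a large
    negative value, pushing the aggregate in the opposite direction."""
    out = grads.copy()
    out[:q] = -scale * grads.sum(axis=0)
    return out
```

Under `omniscient_attack`, plain averaging of the returned matrix points opposite to the true gradient whenever the scale dominates the correct rows.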

With the generalized Byzantine failure model, we ask: with which aggregation rules, and under what conditions, can synchronous SGD still converge to good solutions? We propose novel median-based aggregation rules, with which SGD is Byzantine resilient under a certain condition: for each dimension, among the $n$ values provided by the workers, the number of Byzantine values must be less than half of $n$. Such a Byzantine resilience property is called "dimensional Byzantine resilience". The main contributions of this paper are listed below:

• We propose three aggregation rules for synchronous SGD with provable convergence to critical points: geometric median (Definition 6), marginal median (Definition 7), and “mean around median” (Definition 8). As far as we know, this paper is the first to theoretically and empirically study median-based aggregation rules under non-convex settings.

• We show that the three proposed robust aggregation rules have low computation cost. The time complexities are nearly linear, of the same order as the default choice for non-Byzantine aggregation, i.e., averaging.

• We formulate the dimensional Byzantine resilience property, and prove that marginal median and “mean around median” are dimensional Byzantine-resilient (Definition 5). As far as we know, this paper is the first one to study generalized Byzantine failures and dimensional Byzantine resilience for synchronous SGD.

## 2 Model

We consider the Parameter Server architecture consisting of $n$ workers. The goal is to find the minimizer of the following problem:

$$\min_{x} \mathbb{E}[f(x, \xi)],$$

where the expectation is taken with respect to the random variable $\xi$. The PS executes synchronous SGD for distributed training. In the $t$-th round, the server nodes collect and aggregate the gradients from the $n$ workers, and broadcast the updated parameters to the workers. Let $\tilde{v}^t_i$ denote the vector sent by the $i$-th worker in the $t$-th round, which is potentially Byzantine. Using aggregation rule $\mathrm{Aggr}(\cdot)$, the server nodes update the parameters as follows:

$$x^{t+1} \leftarrow x^t - \gamma^t \, \mathrm{Aggr}(\{\tilde{v}^t_i : i \in [n]\}),$$

where $\gamma^t$ is the learning rate. The worker nodes pull the latest parameters from the server nodes, compute the gradients according to the local portion of the training data, and send the gradients to the server nodes. Without Byzantine failures, the $i$-th worker computes $v^t_i = \nabla f(x^t, \xi^t_i)$, where $\mathbb{E}[v^t_i] = \nabla \mathbb{E}[f(x^t, \xi)]$. With Byzantine failures, the $v^t_i$'s are partially replaced by arbitrary values, which results in the received vectors $\tilde{v}^t_i$.
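The update rule above can be sketched as a single server-side step with a pluggable aggregation rule. This is a minimal illustration with our own names (`ps_step`, `aggr`), not code from any particular PS implementation.

```python
import numpy as np

def ps_step(x, worker_grads, aggr, lr):
    """One synchronous round: stack the (possibly Byzantine) worker
    vectors into an n x d matrix, aggregate, and apply the SGD update
    x <- x - lr * Aggr({v_i})."""
    v = np.stack(worker_grads)      # n x d matrix of received vectors
    return x - lr * aggr(v)

# Plain averaging: the default, non-Byzantine-resilient aggregation rule.
mean_aggr = lambda v: v.mean(axis=0)
```

Swapping `mean_aggr` for a robust rule (GeoMed, MarMed, MeaMed) leaves the rest of the loop unchanged, which is why the paper can treat the aggregation rule as a drop-in component.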

Since the Byzantine failure model assumes the worst case, the attackers may have full knowledge of the entire system, including the gradients generated by all the workers and the aggregation rule $\mathrm{Aggr}(\cdot)$. The malicious processes can even collaborate with each other (Lynch, 1996).

## 3 Byzantine Resilience

In this section, we formally define the classic Byzantine resilience property and its generalized version: dimensional Byzantine resilience.

Suppose that in a specific round, the correct vectors $v_i$ are i.i.d. samples drawn from the random variable $G$, where $\mathbb{E}[G] = g$ is an unbiased estimator of the gradient. Thus, $\mathbb{E}[v_i] = g$ for any correct $i$. We simplify the notation by ignoring the round index $t$.

We first introduce the classic Byzantine model proposed by Blanchard et al. (2017). With the Byzantine workers, the actual vectors received by the server nodes are as follows:

###### Definition 1 (Classic Byzantine Model).
$$\tilde{v}_i = \begin{cases} v_i, & \text{if the } i\text{th worker is correct}, \\ \text{arbitrary}, & \text{if the } i\text{th worker is Byzantine}. \end{cases} \tag{1}$$

Note that the indices of Byzantine workers can change throughout different rounds. Furthermore, the server nodes are not aware of which workers are Byzantine. The only information given is the number of Byzantine workers, if necessary.

We directly use the same definition of classic Byzantine resilience proposed in (Blanchard et al., 2017).

###### Definition 2.

(Classic $(\alpha, q)$-Byzantine Resilience). Let $0 \le \alpha < \pi/2$ be any angular value, and $0 \le q \le n$ any integer. Let $v_1, \ldots, v_n$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\{\tilde{v}_i : i \in [n]\}$ be the set of vectors, of which up to $q$ are replaced by arbitrary vectors in $\mathbb{R}^d$, while the others still equal the corresponding $v_i$. Aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be classic $(\alpha, q)$-Byzantine resilient if $\mathrm{Aggr} = \mathrm{Aggr}(\{\tilde{v}_i : i \in [n]\})$ satisfies (i) $\langle \mathbb{E}[\mathrm{Aggr}], g \rangle \ge (1 - \sin\alpha)\|g\|^2 > 0$, and (ii) for $r = 2, 3, 4$, $\mathbb{E}\|\mathrm{Aggr}\|^r$ is bounded above by a linear combination of the terms $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-1}}$ with $r_1 + \cdots + r_{n-1} = r$.

The baseline algorithm Krum (Blanchard et al., 2017) is defined as follows:

###### Definition 3.
$$\mathrm{Krum}(\{\tilde{v}_i : i \in [n]\}) = \tilde{v}_k, \quad k = \operatorname*{argmin}_{i \in [n]} \sum_{i \to j} \|\tilde{v}_i - \tilde{v}_j\|^2,$$

where $i \to j$ indexes the $n - q - 2$ nearest neighbours of $\tilde{v}_i$ in $\{\tilde{v}_j : j \in [n]\} \setminus \{\tilde{v}_i\}$, measured by Euclidean distance.
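A direct $O(n^2 d)$ sketch of this rule (not an optimized implementation) is below; it assumes the Byzantine count $q$ is known and $n - q - 2 \ge 1$.

```python
import numpy as np

def krum(v, q):
    """Return the row of v with the smallest Krum score: the sum of
    squared Euclidean distances to its n - q - 2 nearest neighbours."""
    n = len(v)
    k = n - q - 2
    dist2 = ((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)  # n x n
    np.fill_diagonal(dist2, np.inf)          # exclude self-distance
    scores = np.sort(dist2, axis=1)[:, :k].sum(axis=1)
    return v[np.argmin(scores)]
```

Because an outlier's nearest neighbours are still far away, its score blows up and it is never selected, as long as fewer than $q$ whole vectors are corrupted.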

The Krum aggregation is classic $(\alpha, q)$-Byzantine resilient under certain assumptions:

###### Lemma 1 (Blanchard et al. (2017)).

Let $v_1, \ldots, v_n$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. $q$ of $\{\tilde{v}_i : i \in [n]\}$ are replaced by arbitrary $d$-dimensional vectors. If $2q + 2 < n$ and $\eta_0(n, q)\sqrt{d}\,\sigma < \|g\|$, where

$$\eta_0^2(n, q) = 2\left(n - q + \frac{q(n - q - 2) + q^2(n - q - 1)}{n - 2q - 2}\right),$$

then the Krum function is classic $(\alpha_0, q)$-Byzantine resilient, where $0 \le \alpha_0 < \pi/2$ is defined by $\sin\alpha_0 = \frac{\eta_0(n, q)\sqrt{d}\,\sigma}{\|g\|}$.

The generalized Byzantine model is defined as follows:

###### Definition 4 (Generalized Byzantine Model).

$$(\tilde{v}_i)_j = \begin{cases} (v_i)_j, & \text{if the } j\text{th dimension of } v_i \text{ is correct}, \\ \text{arbitrary}, & \text{otherwise}, \end{cases} \tag{2}$$

where $(v_i)_j$ is the $j$th dimension of the vector $v_i$.

Based on the Byzantine model above, we introduce a generalized Byzantine resilience property, dimensional $(\alpha, q)$-Byzantine resilience, which is defined as follows:

###### Definition 5.

(Dimensional $(\alpha, q)$-Byzantine Resilience). Let $0 \le \alpha < \pi/2$ be any angular value, and $0 \le q \le n$ any integer. Let $v_1, \ldots, v_n$ be any i.i.d. random vectors in $\mathbb{R}^d$, $v_i \sim G$, with $\mathbb{E}[G] = g$. Let $\{\tilde{v}_i : i \in [n]\}$ be the set of vectors. For each dimension, up to $q$ of the $n$ values are replaced by arbitrary values, i.e., for dimension $j \in [d]$, $q$ of $\{(\tilde{v}_i)_j : i \in [n]\}$ are Byzantine, where $(\tilde{v}_i)_j$ is the $j$th dimension of the vector $\tilde{v}_i$. Aggregation rule $\mathrm{Aggr}(\cdot)$ is said to be dimensional $(\alpha, q)$-Byzantine resilient if $\mathrm{Aggr} = \mathrm{Aggr}(\{\tilde{v}_i : i \in [n]\})$ satisfies (i) $\langle \mathbb{E}[\mathrm{Aggr}], g \rangle \ge (1 - \sin\alpha)\|g\|^2 > 0$, and (ii) for $r = 2, 3, 4$, $\mathbb{E}\|\mathrm{Aggr}\|^r$ is bounded above by a linear combination of the terms $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-1}}$ with $r_1 + \cdots + r_{n-1} = r$.

Note that classic $(\alpha, q)$-Byzantine resilience is a special case of dimensional $(\alpha, q)$-Byzantine resilience. For classic Byzantine resilience defined in Definition 2, all the Byzantine values must lie in the same subset of $q$ workers, as shown in Figure 1(a).

In the following theorems, we show that Mean and Krum are not dimensional Byzantine resilient. The proofs are provided in the appendix.

###### Theorem 1.

Averaging is not dimensional Byzantine resilient.

###### Theorem 2.

Any aggregation rule that outputs $\mathrm{Aggr} \in \{\tilde{v}_i : i \in [n]\}$ (i.e., selects one of the received vectors) is not dimensional Byzantine resilient.

Note that Krum chooses the vector with the minimal score. Thus, based on the theorem above, we obtain the following corollary.

###### Corollary 1.

Krum is not dimensional Byzantine resilient.

If an aggregation rule is dimensional/classic $(\alpha, q)$-Byzantine resilient and the corresponding assumptions are satisfied, SGD converges to critical points almost surely, by reusing Proposition 2 of Blanchard et al. (2017). We provide the following lemma without proof.

###### Lemma 2 (Blanchard et al. (2017)).

Assume that (i) the cost function $F$ is three times differentiable with continuous derivatives, and is non-negative, $F(x) \ge 0$; (ii) the learning rates satisfy $\sum_t \gamma^t = \infty$ and $\sum_t (\gamma^t)^2 < \infty$; (iii) the gradient estimator satisfies $\mathbb{E}[G] = \nabla F(x)$ and $\mathbb{E}\|G\|^r \le A_r + B_r \|x\|^r$ for $r = 2, 3, 4$ and some constants $A_r$, $B_r$; (iv) there exists a constant $0 \le \alpha < \pi/2$ such that for all $x$, $\eta(n, q)\sqrt{d}\,\sigma(x) \le \|\nabla F(x)\| \sin\alpha$; (v) finally, beyond a certain horizon, $\|x\|^2 \ge D$, there exist $\epsilon > 0$ and $0 \le \beta < \pi/2 - \alpha$ such that $\|\nabla F(x)\| \ge \epsilon > 0$ and $\frac{\langle x, \nabla F(x) \rangle}{\|x\| \cdot \|\nabla F(x)\|} \ge \cos\beta$. Then the sequence of gradients $\nabla F(x^t)$ converges almost surely to zero, if the aggregation rule satisfies $(\alpha, q)$-Byzantine Resilience as defined in Definition 2 or 5.

## 4 Median-based Aggregation

With the Byzantine failure models defined in Equations (1) and (2), we propose three median-based aggregation rules, which are Byzantine resilient under certain conditions.

### 4.1 Geometric Median

The geometric median is used as a robust estimator of the mean (Chen et al., 2017).

###### Definition 6.

The geometric median of $\{\tilde{v}_i : i \in [n]\}$, denoted by $\mathrm{GeoMed}(\{\tilde{v}_i : i \in [n]\})$, is defined as

$$\lambda = \mathrm{GeoMed}(\{\tilde{v}_i : i \in [n]\}) = \operatorname*{argmin}_{v \in \mathbb{R}^d} \sum_{i=1}^{n} \|v - \tilde{v}_i\|.$$

The following theorem shows the classic $(\alpha, q)$-Byzantine resilience of the geometric median. A proof is provided in the appendix.

###### Theorem 3.

Let $v_1, \ldots, v_n$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. $q$ of $\{\tilde{v}_i : i \in [n]\}$ are replaced by arbitrary $d$-dimensional vectors. If $2q < n$ and $\eta_1(n, q)\sqrt{d}\,\sigma < \|g\|$, where $\eta_1(n, q) = \frac{2(n-q)}{n-2q}\sqrt{n-q}$, then the GeoMed function is classic $(\alpha_1, q)$-Byzantine resilient, where $0 \le \alpha_1 < \pi/2$ is defined by $\sin\alpha_1 = \frac{\eta_1(n, q)\sqrt{d}\,\sigma}{\|g\|}$.
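Since GeoMed has no closed form, it is approximated iteratively in practice. The sketch below uses the classical Weiszfeld fixed-point iteration, not the nearly-linear-time solver of Cohen et al. (2016); iteration counts and tolerances are our own choices.

```python
import numpy as np

def geo_med(v, iters=100, eps=1e-8):
    """Approximate the geometric median of the rows of v via Weiszfeld's
    iteration: z <- sum(w_i * v_i) / sum(w_i), with w_i = 1/||z - v_i||."""
    z = v.mean(axis=0)                      # start from the mean
    for _ in range(iters):
        d = np.linalg.norm(v - z, axis=1)
        w = 1.0 / np.maximum(d, eps)        # guard against division by zero
        z_new = (w[:, None] * v).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z
```

With $2q < n$, a handful of huge outliers moves the output only slightly, unlike the mean, which is dragged arbitrarily far.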

### 4.2 Marginal Median

The marginal median is another generalization of the one-dimensional median.

###### Definition 7.

We define the marginal median aggregation rule as

$$\mu = \mathrm{MarMed}(\{\tilde{v}_i : i \in [n]\}),$$

where for any $j \in [d]$, the $j$th dimension of $\mu$ is $\mu_j = \mathrm{med}\{(\tilde{v}_1)_j, \ldots, (\tilde{v}_n)_j\}$, $(\tilde{v}_i)_j$ is the $j$th dimension of the vector $\tilde{v}_i$, and $\mathrm{med}(\cdot)$ is the one-dimensional median.

The following theorem claims that MarMed is dimensional $(\alpha, q)$-Byzantine resilient. A proof is provided in the appendix.

###### Theorem 4.

Let $v_1, \ldots, v_n$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. For any dimension $j \in [d]$, $q$ of $\{(\tilde{v}_i)_j : i \in [n]\}$ are replaced by arbitrary values, where $(\tilde{v}_i)_j$ is the $j$th dimension of the vector $\tilde{v}_i$. If $2q < n$ and $\eta_2(n, q)\sqrt{d}\,\sigma < \|g\|$, where $\eta_2(n, q)$ is a constant depending only on $n$ and $q$, then the MarMed function is dimensional $(\alpha_2, q)$-Byzantine resilient, where $0 \le \alpha_2 < \pi/2$ is defined by $\sin\alpha_2 = \frac{\eta_2(n, q)\sqrt{d}\,\sigma}{\|g\|}$.
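MarMed is exactly a coordinate-wise median, which in NumPy is a one-liner; a minimal sketch:

```python
import numpy as np

def mar_med(v):
    """Marginal median: the coordinate-wise one-dimensional median of
    the n received vectors (the rows of v)."""
    return np.median(v, axis=0)
```

Because each coordinate is aggregated independently, a Byzantine value in one dimension of one worker cannot corrupt the output in any other dimension, which is the source of the dimensional resilience.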

### 4.3 Beyond Median

We can also utilize more values for each dimension along with the median, if $q$ is given or easily estimated. To be more specific, for each dimension, we take the average of the $n - q$ values nearest to the median (including the median itself). We call the resulting aggregation rule "mean around median", which is defined as follows:

###### Definition 8.

We define the mean-around-median aggregation rule as

$$\rho = \mathrm{MeaMed}(\{\tilde{v}_i : i \in [n]\}),$$

where for any $j \in [d]$, the $j$th dimension of $\rho$ is $\rho_j = \frac{1}{n-q} \sum_{i \in \mathcal{I}_j} (\tilde{v}_i)_j$, $\mathcal{I}_j$ is the set of indices of the top-$(n-q)$ values in $\{(\tilde{v}_i)_j : i \in [n]\}$ nearest to the median $\mu_j$, and $(\tilde{v}_i)_j$ is the $j$th dimension of the vector $\tilde{v}_i$.

We show that MeaMed is dimensional $(\alpha, q)$-Byzantine resilient.

###### Theorem 5.

Let $v_1, \ldots, v_n$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. For any dimension $j \in [d]$, $q$ of $\{(\tilde{v}_i)_j : i \in [n]\}$ are replaced by arbitrary values, where $(\tilde{v}_i)_j$ is the $j$th dimension of the vector $\tilde{v}_i$. If $2q < n$ and $\eta_3(n, q)\sqrt{d}\,\sigma < \|g\|$, where $\eta_3(n, q)$ is a constant depending only on $n$ and $q$, then the MeaMed function is dimensional $(\alpha_3, q)$-Byzantine resilient, where $0 \le \alpha_3 < \pi/2$ is defined by $\sin\alpha_3 = \frac{\eta_3(n, q)\sqrt{d}\,\sigma}{\|g\|}$.

The mean-around-median aggregation can be viewed as a trimmed average centering at the median, which filters out the values far away from the median.
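A sketch of MeaMed, assuming the (estimated) Byzantine count $q$ is given: per coordinate, it averages the $n - q$ values closest to that coordinate's median.

```python
import numpy as np

def mea_med(v, q):
    """Mean-around-median: per dimension, average the n - q values
    nearest to the coordinate-wise median (a trimmed mean centred at
    the median)."""
    n = v.shape[0]
    med = np.median(v, axis=0)                # per-dimension medians
    dist = np.abs(v - med)                    # distance to the median
    idx = np.argsort(dist, axis=0)[: n - q]   # n - q closest per column
    return np.take_along_axis(v, idx, axis=0).mean(axis=0)
```

Compared to `mar_med`, averaging over $n - q$ values reduces the variance of the aggregate while the trimming still excludes the far-away Byzantine values.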

### 4.4 Time Complexity

For the geometric median GeoMed, there is no closed-form solution. An $\epsilon$-approximate geometric median can be computed in nearly linear time $O(nd \log^3 \frac{n}{\epsilon})$ (Cohen et al., 2016). To compute the marginal median MarMed, we only need to compute the median value of each dimension. The simplest way is to apply any sorting algorithm to each dimension, which yields time complexity $O(dn \log n)$. To obtain the $d$ median values, there also exists the selection algorithm (Blum et al., 1973), with linear time complexity $O(n)$ per dimension. Thus, we can get the marginal median with time complexity $O(dn)$, which is of the same order as using the mean value for aggregation. For MeaMed, the computation additional to the marginal median takes linear time $O(dn)$; thus, its time complexity is the same as MarMed's. Note that for Krum and Multi-Krum, the time complexity is $O(n^2 d)$ (Blanchard et al., 2017).

## 5 Experiments

In this section, we evaluate the convergence and Byzantine resilience properties of the proposed algorithms. We consider two image classification tasks: handwritten digit classification on the MNIST dataset using a multi-layer perceptron (MLP) with two hidden layers, and object recognition on the CIFAR10 dataset (Krizhevsky & Hinton, 2009) using a convolutional neural network (CNN) with five convolutional layers and two fully-connected layers. The details of these two neural networks can be found in the appendix. There are $n = 20$ worker processes. We repeat each experiment ten times and take the average. To make the conditions as fair as possible for all the algorithms, we ensure that all of them are run with the same set of random seeds. The details of the datasets and the default hyperparameters of the corresponding models are listed in Table 1. We use top-1 or top-3 accuracy on the testing sets (disjoint from the training sets) as evaluation metrics.

The baseline aggregation rules are Mean, Medoid, Krum (Definition 3), and Multi-Krum. Medoid, defined as follows, is a computation-efficient version of the geometric median.

###### Definition 9.

The medoid of $\{\tilde{v}_i : i \in [n]\}$, denoted by $\mathrm{Medoid}(\{\tilde{v}_i : i \in [n]\})$, is defined as

$$\mathrm{Medoid}(\{\tilde{v}_i : i \in [n]\}) = \operatorname*{argmin}_{v \in \{\tilde{v}_i : i \in [n]\}} \sum_{i=1}^{n} \|v - \tilde{v}_i\|.$$

Multi-Krum is a variant of Krum defined in Blanchard et al. (2017), which takes the average on several vectors selected by multiple rounds of Krum. We compare these baseline algorithms with the proposed algorithms: geometric median (GeoMed defined in Definition 6), marginal median (MarMed defined in Definition 7), and “mean around median” (MeaMed defined in Definition 8) under different settings in the following subsections.
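Medoid restricts the GeoMed search to the received vectors themselves, which makes it cheap but weaker. A pairwise-distance sketch (fine for moderate $n$):

```python
import numpy as np

def medoid(v):
    """Return the received vector minimizing the total Euclidean
    distance to all received vectors (GeoMed restricted to the inputs)."""
    dist = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)  # n x n
    return v[dist.sum(axis=1).argmin()]
```

Because the output is always one of the inputs, Medoid inherits the weakness stated in Theorem 2: a per-dimension attack can corrupt every candidate it might return.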

Note that all the experiments of CNN on CIFAR10 show similar results with the experiments of MLP on MNIST. Thus, we only show the results of CNN in Section 5.5 as an example. The remaining results are provided in the appendix.

### 5.1 Convergence without Byzantine Failures

First, we evaluate the convergence without Byzantine failures. The goal is to empirically evaluate the bias and variance caused by the robust aggregation rules.

In Figure 10, we show the top-1 accuracy on the testing set of MNIST. The gaps between different algorithms are tiny. Among all the algorithms, Multi-Krum, GeoMed, and MeaMed have the least bias; they act just the same as averaging. MarMed converges slightly slower. Medoid and Krum have the slowest convergence.

### 5.2 Gaussian Attack

We test classic Byzantine resilience in this experiment. We consider attackers that replace some of the gradient vectors with Gaussian random vectors with zero mean and an isotropic covariance matrix with standard deviation 200. We refer to this kind of attack as the Gaussian attack. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 3; within the figure, we also include averaging without Byzantine failures as a baseline. As expected, averaging is not Byzantine resilient. The gaps between all the other algorithms are still tiny. GeoMed and MeaMed perform as if there were no Byzantine failures at all. Multi-Krum and MarMed converge slightly slower. Medoid and Krum perform worst. Although Medoid is not Byzantine resilient, the Gaussian attack is weak enough that Medoid remains effective.

### 5.3 Omniscient Attack

We test classic Byzantine resilience in this experiment. This kind of attacker is assumed to know all the correct gradients. For each Byzantine gradient vector, the gradient is replaced by the negative sum of all the correct gradients, scaled by a large constant ($10^{20}$ in the experiments). Roughly speaking, this attack tries to make the parameter server go into the opposite direction with a long step. 6 out of the 20 gradient vectors are Byzantine. The results are shown in Figure 4. MeaMed still performs as if there were no failures. Multi-Krum is not as good as MeaMed, but the gap is small. Krum converges slower but still converges to the same accuracy. However, GeoMed and MarMed converge to bad solutions. Mean and Medoid do not tolerate this attack.

### 5.4 Bit-flip Attack

We test dimensional Byzantine resilience in this experiment. Knowing the information of other workers can be difficult in practice, so we use a more realistic scenario: the attacker only manipulates some individual floating-point numbers by flipping their 22nd, 30th, 31st, and 32nd bits. For each of the first 1000 dimensions, 1 of the 20 floating-point numbers is manipulated using this bit-flip attack. The results are shown in Figure 5. As expected, only MarMed and MeaMed are dimensional Byzantine resilient.
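The bit-flip attack can be reproduced by viewing float32 values as raw 32-bit integers and XOR-ing the chosen bit positions. The sketch below is our own; the default positions are the 0-indexed counterparts of the 1-indexed bits listed above, and the sign/exponent mapping follows IEEE 754 single precision.

```python
import numpy as np

def bit_flip(v, bits=(21, 29, 30, 31)):
    """Flip the given bit positions (0-indexed from the least significant
    bit) of each float32 value; in IEEE 754 single precision, bit 31 is
    the sign and bits 23-30 form the exponent."""
    raw = np.asarray(v, dtype=np.float32).view(np.uint32)
    mask = np.uint32(sum(1 << b for b in bits))
    return (raw ^ mask).view(np.float32)
```

Flipping the sign and high exponent bits turns ordinary gradient entries into huge values of the wrong sign, which is why a single corrupted worker per dimension suffices to break mean aggregation.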

Note that for Krum and Multi-Krum, their assumption requires the number of Byzantine vectors $q$ to satisfy $2q + 2 < n$, which means $q \le 8$ in our experiments. However, because each gradient is partially manipulated, all the vectors are Byzantine, which breaks the assumption of the Krum-based algorithms. Furthermore, to compute the distances to the $n - q - 2$ nearest neighbours, $n - q - 2$ must be positive. To test the performance of Krum and Multi-Krum, we set $q = 8$ for these two algorithms so that they can still be executed, and we also test whether tuning $q$ makes a difference. The results are shown in Figure 8. Whatever $q$ we use, the Krum-based algorithms get stuck around bad solutions.

### 5.5 General Attack with Multiple Servers

We test general Byzantine resilience in this experiment, evaluating the robust aggregation rules under a more general and realistic type of attack. It is popular to partition the parameters into disjoint subsets and use multiple server nodes to store and aggregate them (Li et al., 2014a, b; Ho et al., 2013). We assume that the parameters are evenly partitioned and assigned to the server nodes. The attacker picks one single server and, with some fixed probability, multiplies each individual floating-point number stored on that server by a large negative constant. We call this attack gambler, because the attacker randomly manipulates the values, gambling that in some rounds the assumptions/prerequisites of the robust aggregation rules are broken, which crashes the training. Such an attack requires less global information and can be concentrated on one single server, which makes it more realistic and easier to implement.
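The gambler attack on one server's shard can be sketched as follows; the shard layout, corruption probability, and scaling factor are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def gambler_attack(grads, n_servers, p=0.05, factor=-1e20, rng=None):
    """Attack the shard held by one server: with probability p, multiply
    each individual float on that shard by a large negative factor."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = grads.shape
    shard = d // n_servers             # parameters per server (even partition)
    out = grads.copy()
    mask = rng.random((n, shard)) < p  # which floats on server 0 to corrupt
    out[:, :shard] = np.where(mask, factor * out[:, :shard], out[:, :shard])
    return out
```

Since the corrupted entries are scattered across workers rather than confined to $q$ rows, this attack falls under the generalized model of Equation (2) rather than the classic one.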

In Figures 6 and 7, we evaluate the performance of all the robust aggregation rules under the gambler attack. The parameters are partitioned across multiple server nodes as described above. For Krum, Multi-Krum, and MeaMed, the estimated Byzantine number is set to $q = 8$. We also show the performance of averaging without Byzantine values as the benchmark. Only the marginal median MarMed and "mean around median" MeaMed survive under this attack. Their convergence is slightly slower than averaging without Byzantine values, but the gaps are small.

### 5.6 Discussion

As expected, mean aggregation is not Byzantine resilient. Although the medoid is not Byzantine resilient, as proved by Blanchard et al. (2017), it can still make reasonable progress under some attacks, such as the Gaussian attack. Krum, Multi-Krum, and GeoMed are classic Byzantine resilient but not dimensional Byzantine resilient. MarMed and MeaMed are dimensional Byzantine resilient. However, under the omniscient attack, MarMed suffers from larger variance, which slows down the convergence.

The gambler attack shows the true advantage of dimensional Byzantine resilience: a higher probability of survival. Under such an attack, chances are that the assumptions/prerequisites of MarMed and MeaMed may still be broken. However, their probability of crashing is lower than that of the other algorithms, because dimensional Byzantine resilience generalizes classic Byzantine resilience. An interesting observation is that MarMed is slightly better than MeaMed under the gambler attack. That is because the estimation of $q$ is not accurate, which causes unpredictable behavior for MeaMed. We choose $q = 8$ because it is the maximal value we can take for Krum and Multi-Krum.

It is obvious that MeaMed performs best in almost all cases. Multi-Krum is also good, except that it is not dimensional Byzantine resilient. The reason why MeaMed and Multi-Krum have better performance is that they utilize the extra information of the number of Byzantine values $q$. Note that MeaMed not only performs as well as or even better than Multi-Krum, but also has lower time complexity.

The marginal median MarMed has the cheapest computation. Its worst case, the omniscient attack, is hard to implement in reality. Thus, for most applications, we suggest MarMed as an easy-to-implement aggregation rule with robust performance, which (importantly) does not require knowledge of the number of Byzantine values.

## 6 Related Works

There are few papers studying Byzantine resilience of machine learning algorithms. Our work is closely related to Blanchard et al. (2017). Another work (Chen et al., 2017) proposed a grouped geometric median for Byzantine resilience, assuming strongly convex cost functions.

Our approach offers the following important advantages over the previous work.

• Cheaper computation compared to Krum. The geometric median has nearly linear time complexity, approximately $O(nd \log^3 \frac{n}{\epsilon})$ (Cohen et al., 2016). The marginal median and "mean around median" have linear time complexity $O(nd)$ on average (Blum et al., 1973), while the time complexity of Krum is $O(n^2 d)$.

• Less prior knowledge required. Neither the geometric median nor the marginal median requires $q$, the number of Byzantine workers, to be given, while Krum needs $q$ to determine the $n - q - 2$ nearest neighbours whose Euclidean distances are summed. Furthermore, when $q$ is known or well estimated, MeaMed shows better robustness than Krum and Multi-Krum in most cases.

• Dimensional Byzantine resilience. Marginal median and “mean around median” tolerate a more general type of Byzantine failures described in Equation (2) and Definition 5, while Krum and geometric median can only tolerate the classic Byzantine failures described in Equation (1) and Definition 2.

• Better support for multiple server nodes. If the entire set of parameters is disjointly partitioned and stored on multiple server nodes, the marginal median and "mean around median" need no additional communication, while Krum and the geometric median require communication among the server nodes.

## 7 Conclusion

We investigated the generalized Byzantine resilience of the parameter server architecture and proposed three novel median-based aggregation rules for synchronous SGD. The algorithms have low time complexity and provable convergence to critical points. Our empirical results show good performance in practice.

## References

• Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, Kudlur, Manjunath, Levenberg, Josh, Monga, Rajat, Moore, Sherry, Murray, Derek Gordon, Steiner, Benoit, Tucker, Paul A., Vasudevan, Vijay, Warden, Pete, Wicke, Martin, Yu, Yuan, and Zhang, Xiaoqiang. Tensorflow: A system for large-scale machine learning. In OSDI, 2016.
• Blanchard et al. (2017) Blanchard, Peva, Guerraoui, Rachid, Stainer, Julien, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 118–128, 2017.
• Blum et al. (1973) Blum, Manuel, Floyd, Robert W, Pratt, Vaughan, Rivest, Ronald L, and Tarjan, Robert E. Time bounds for selection. Journal of computer and system sciences, 7(4):448–461, 1973.
• Chen et al. (2015) Chen, Tianqi, Li, Mu, Li, Yutian, Lin, Min, Wang, Naiyan, Wang, Minjie, Xiao, Tianjun, Xu, Bing, Zhang, Chiyuan, and Zhang, Zheng. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
• Chen et al. (2017) Chen, Yudong, Su, Lili, and Xu, Jiaming. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.
• Cohen et al. (2016) Cohen, Michael B, Lee, Yin Tat, Miller, Gary, Pachocki, Jakub, and Sidford, Aaron. Geometric median in nearly linear time. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pp. 9–21. ACM, 2016.
• Dean et al. (2012) Dean, Jeffrey, Corrado, Gregory S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc’Aurelio, Senior, Andrew W., Tucker, Paul A., Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.
• Harinath et al. (2017) Harinath, Depavath, Satyanarayana, P, and Murthy, MV Ramana. A review on security issues and attacks in distributed systems. Journal of Advances in Information Technology, 8(1), 2017.
• Ho et al. (2013) Ho, Qirong, Cipar, James, Cui, Henggang, Lee, Seunghak, Kim, Jin Kyu, Gibbons, Phillip B., Gibson, Garth A., Ganger, Gregory R., and Xing, Eric P. More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems, 2013:1223–1231, 2013.
• Kingma & Ba (2014) Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
• Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
• Lamport et al. (1982) Lamport, Leslie, Shostak, Robert E., and Pease, Marshall C. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4:382–401, 1982.
• Li et al. (2014a) Li, Mu, Andersen, David G., Park, Jun Woo, Smola, Alexander J., Ahmed, Amr, Josifovski, Vanja, Long, James, Shekita, Eugene J., and Su, Bor-Yiing. Scaling distributed machine learning with the parameter server. In OSDI, 2014a.
• Li et al. (2014b) Li, Mu, Andersen, David G., Smola, Alexander J., and Yu, Kai. Communication efficient distributed machine learning with the parameter server. In NIPS, 2014b.
• Loosli et al. (2007) Loosli, Gaëlle, Canu, Stéphane, and Bottou, Léon. Training invariant support vector machines using selective sampling. Large scale kernel machines, pp. 301–320, 2007.
• Lynch (1996) Lynch, Nancy A. Distributed algorithms. Morgan Kaufmann, 1996.
• McMahan et al. (2017) McMahan, H. Brendan, Moore, Eider, Ramage, Daniel, Hampson, Seth, and y Arcas, Blaise Aguera. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
• Minsker et al. (2015) Minsker, Stanislav et al. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.
• Mukkamala & Hein (2017) Mukkamala, Mahesh Chandra and Hein, Matthias. Variants of RMSProp and Adagrad with logarithmic regret bounds. In ICML, 2017.
• Seide & Agarwal (2016) Seide, Frank and Agarwal, Amit. Cntk: Microsoft’s open-source deep-learning toolkit. In KDD, 2016.

## 8 Appendix

In the appendix, we introduce several useful lemmas and use them to derive the detailed proofs of the theorems in this paper.

### 8.1 Dimensional Byzantine Resilience

###### Theorem 1.

Averaging is not dimensional Byzantine resilient.

###### Proof.

We demonstrate a counterexample. Consider the case where

$$\tilde{v}_i = \begin{cases} v_i, & \forall i \in [n-1], \\ -g - \sum_{i=1}^{n-1} v_i, & i = n, \end{cases} \qquad (3)$$

where $g$ is the correct gradient and worker $n$ is Byzantine. The resulting aggregation is $\frac{1}{n}\sum_{i=1}^{n}\tilde{v}_i = -\frac{g}{n}$. The inner product $\left\langle -\frac{g}{n}, g \right\rangle = -\frac{1}{n}\|g\|^2$ is always negative under this Byzantine attack. Thus, SGD is not expectedly descendant, which means it will not converge to critical points. Note that in this counterexample, the number of Byzantine values in each dimension is $1$.

Hence, averaging is not dimensional $(\alpha, q)$-Byzantine resilient for any $q \ge 1$. ∎
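The counterexample above is easy to reproduce numerically. The NumPy sketch below (with hypothetical worker count and gradient dimension) shows a single Byzantine worker forcing the average into the opposite direction of the true gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 5                              # hypothetical worker count and dimension
g = rng.normal(size=d)                    # true gradient g
v = g + 0.1 * rng.normal(size=(n, d))     # honest gradient estimates v_1, ..., v_n

# The single Byzantine worker (worker n) sends -g minus the sum of the
# other vectors, so the sum of all n vectors collapses to -g.
v[-1] = -g - v[:-1].sum(axis=0)

aggregated = v.mean(axis=0)               # equals -g / n
print(float(aggregated @ g))              # negative: an ascent direction
```

One arbitrarily bad worker is thus enough to make the averaged update point away from the descent direction.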

###### Theorem 2.

Any aggregation rule that outputs one of the candidate vectors, i.e., $\mathrm{Aggr}(\{\tilde{v}_i\}) \in \{\tilde{v}_i : i \in [n]\}$, is not dimensional Byzantine resilient.

###### Proof.

We demonstrate a counterexample. Consider the case where the $i$th dimension of the $i$th vector is manipulated by the malicious workers (e.g., multiplied by an arbitrarily large negative value), for every $i \in [n]$, assuming $d \ge n$. Thus, at most $1$ value in each dimension is Byzantine. However, no matter which vector is chosen, as long as the aggregation is chosen from $\{\tilde{v}_i : i \in [n]\}$, the inner product $\langle \mathrm{Aggr}(\{\tilde{v}_i\}), g \rangle$ can be an arbitrarily large negative value under the Byzantine attack. Thus, SGD is not expectedly descendant, which means it will not converge to critical points.

Hence, any aggregation rule that outputs a vector in $\{\tilde{v}_i : i \in [n]\}$ is not dimensional $(\alpha, q)$-Byzantine resilient for any $q \ge 1$. ∎
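This attack is also easy to simulate. The sketch below (NumPy, hypothetical sizes with $d \ge n$) poisons one distinct coordinate per candidate vector, so every vector a selection-based rule could output yields a large negative inner product with $g$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 8                      # d >= n: one poisoned coordinate per vector
g = np.ones(d)                   # true gradient (all-ones for illustration)
v = g + 0.1 * rng.normal(size=(n, d))

# The i-th dimension of the i-th vector is Byzantine; each dimension
# contains at most one Byzantine value.
for i in range(n):
    v[i, i] = -1e6

# Whichever single candidate vector a selection rule outputs,
# the inner product with g is dominated by the poisoned coordinate.
print(max(float(v[i] @ g) for i in range(n)))
```

Even the best possible selection is hugely negative, although only one value per dimension is corrupted.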

### 8.2 Geometric Median

We use the following lemma (Minsker et al., 2015; Cohen et al., 2016) without proof to bound the geometric median.

###### Lemma 3.

Let $z_1, \ldots, z_n$ denote $n$ points in a Hilbert space. Let $z_*$ denote a $(1+\epsilon)$-approximation of their geometric median, i.e., $\sum_{i \in [n]} \|z_* - z_i\| \le (1+\epsilon) \min_z \sum_{i \in [n]} \|z - z_i\|$, for $\epsilon \ge 0$. For any $q$ such that $2q < n$ and given $r > 0$, if $\sum_{i \in [n]} \mathbb{1}\left(\|z_i\| \le r\right) \ge n - q$, then

$$\|z_*\| \le c_q r + \epsilon c_z,$$

where $c_q = \frac{2n-2q}{n-2q}$, and $c_z$ is a constant determined by the points $\{z_i\}$, independent of $r$.

Ideally, the exact geometric median ($\epsilon = 0$) ignores the second term $\epsilon c_z$.
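For intuition, the bound with $\epsilon = 0$ can be checked numerically. The sketch below approximates the geometric median with Weiszfeld's fixed-point iteration (a standard scheme, not the solver of Cohen et al. (2016)) and verifies $\|z_*\| \le c_q r$ on synthetic points:

```python
import numpy as np

def geometric_median(z, iters=200, eps=1e-9):
    """Weiszfeld fixed-point iteration for the geometric median of the rows of z."""
    x = z.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(z - x, axis=1)
        w = 1.0 / np.maximum(dist, eps)           # guard against division by zero
        x = (w[:, None] * z).sum(axis=0) / w.sum()
    return x

rng = np.random.default_rng(2)
n, q, d = 20, 4, 3                                # 2q < n
z = rng.normal(size=(n, d))                       # n - q points near the origin ...
z[:q] += 100.0                                    # ... and q arbitrary outliers
r = np.linalg.norm(z[q:], axis=1).max()           # the good points satisfy ||z_i|| <= r
c_q = (2 * n - 2 * q) / (n - 2 * q)
z_star = geometric_median(z)
print(np.linalg.norm(z_star) <= c_q * r)          # True
```

The outliers can be moved arbitrarily far away without violating the bound, as long as they remain a minority with $2q < n$.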

Using the lemma above, we can prove the classic Byzantine resilience of geometric median.

###### Theorem 3.

Let $v_1, \ldots, v_n$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. $q$ of $\{v_i\}$ are replaced by arbitrary $d$-dimensional vectors. If $2q < n$ and $\eta_1(n,q)\sqrt{d}\,\sigma < \|g\|$, where $\eta_1(n,q) = \frac{2n-2q}{n-2q}\sqrt{n-q}$, then the function $\mathrm{GeoMed}$ is classic $(\alpha_1, q)$-Byzantine resilient, where $0 \le \alpha_1 < \pi/2$ is defined by $\sin\alpha_1 = \frac{\eta_1(n,q)\sqrt{d}\,\sigma}{\|g\|}$.

###### Proof.

We only need to prove that $\mathrm{GeoMed}$ satisfies the two conditions of classic $(\alpha_1, q)$-Byzantine resilience defined in Definition 2.

Condition (i):
Let the sequence $\{\tilde{v}_j\}$ be defined as

$$\tilde{v}_j = \begin{cases} v_j, & \text{for correct } j, \\ \text{arbitrary}, & \text{for Byzantine } j. \end{cases}$$

Let $\lambda$ denote the geometric median of $\{\tilde{v}_j\}$. Thus, $\lambda - g$ is the geometric median of $\{\tilde{v}_j - g\}$, since the geometric median is translation-equivariant. Using Lemma 3 with $z_j = \tilde{v}_j - g$, $r = \max_{\text{correct } j}\|\tilde{v}_j - g\|$, and taking $\epsilon = 0$, under the assumption $2q < n$, we obtain

$$\|\lambda - g\| \le \frac{2n-2q}{n-2q} \max_{\text{correct } j} \|\tilde{v}_j - g\|.$$

Now, we can bound $\|\mathbb{E}[\lambda] - g\|^2$ as follows:

$$\begin{aligned} \|\mathbb{E}[\lambda] - g\|^2 &\le \mathbb{E}\|\lambda - g\|^2 \quad \text{(Jensen's inequality)} \\ &\le \mathbb{E}\left[\left(\frac{2n-2q}{n-2q}\right)^2 \max_{\text{correct } j} \|\tilde{v}_j - g\|^2\right] \\ &\le \mathbb{E}\left[\left(\frac{2n-2q}{n-2q}\right)^2 \sum_{\text{correct } j} \|\tilde{v}_j - g\|^2\right] \\ &= \left(\frac{2n-2q}{n-2q}\right)^2 (n-q)\, d\sigma^2 = \eta_1^2(n,q)\, d\sigma^2. \end{aligned}$$

By the assumption $\eta_1(n,q)\sqrt{d}\,\sigma < \|g\|$, we have $\|\mathbb{E}[\lambda] - g\| \le \eta_1(n,q)\sqrt{d}\,\sigma = \|g\|\sin\alpha_1$, i.e., $\mathbb{E}[\lambda]$ belongs to a ball centered at $g$ with radius $\|g\|\sin\alpha_1$. This implies

$$\langle \mathbb{E}[\lambda], g \rangle \ge (1 - \sin\alpha_1)\|g\|^2 > 0,$$

where $0 \le \alpha_1 < \pi/2$ is defined by $\sin\alpha_1 = \frac{\eta_1(n,q)\sqrt{d}\,\sigma}{\|g\|}$.

Condition (ii):
We re-use Lemma 3 by taking $z_j = \tilde{v}_j$, $r = \max_{\text{correct } j}\|\tilde{v}_j\|$, and $\epsilon = 0$. Thus, we have

$$\|\lambda\| \le c_q \max_{\text{correct } j} \|\tilde{v}_j\| \le c_q \sum_{\text{correct } j} \|\tilde{v}_j\|.$$

Without loss of generality, we denote the $n-q$ correct vectors by $v_1, \ldots, v_{n-q}$. Thus, there exists a constant $c_0$ such that

$$\|\lambda\|^r \le c_0 \sum_{r_1 + \cdots + r_{n-q} = r} \|v_1\|^{r_1} \cdots \|v_{n-q}\|^{r_{n-q}}.$$

Since the $v_i$'s are i.i.d., we obtain that $\mathbb{E}\|\lambda\|^r$ is bounded above by a linear combination of terms of the form $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-q}}$ with $r_1 + \cdots + r_{n-q} = r$, which completes the proof of condition (ii). ∎
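The key inequality of this proof, $\|\lambda - g\| \le \frac{2n-2q}{n-2q}\max_{\text{correct } j}\|\tilde{v}_j - g\|$, can be sanity-checked with a Weiszfeld approximation of the geometric median (a sketch under hypothetical parameters, not the paper's experimental setup):

```python
import numpy as np

def geometric_median(z, iters=200, eps=1e-9):
    """Weiszfeld fixed-point iteration (approximate geometric median)."""
    x = z.mean(axis=0)
    for _ in range(iters):
        w = 1.0 / np.maximum(np.linalg.norm(z - x, axis=1), eps)
        x = (w[:, None] * z).sum(axis=0) / w.sum()
    return x

rng = np.random.default_rng(3)
n, q, d = 25, 5, 10                         # 2q < n
g = rng.normal(size=d)                      # true gradient
v = g + rng.normal(size=(n, d))             # correct gradient estimates
v[:q] = 1e3 * rng.normal(size=(q, d))       # q Byzantine vectors
lam = geometric_median(v)                   # the aggregation lambda
bound = (2 * n - 2 * q) / (n - 2 * q) * np.linalg.norm(v[q:] - g, axis=1).max()
print(np.linalg.norm(lam - g) <= bound)     # True
```

The aggregated vector stays close to $g$ although the $q$ Byzantine vectors are three orders of magnitude larger than the correct ones.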

### 8.3 Marginal Median

We use the following lemma to bound the one-dimensional median.

###### Lemma 4.

For a scalar sequence composed of $q$ Byzantine values and $n-q$ correct values $\{v_i\}$, if $2q < n$ (the correct values dominate the sequence), then the median value $med$ of this sequence satisfies $\min_{\text{correct } i} v_i \le med \le \max_{\text{correct } i} v_i$.

###### Proof.

If $med$ comes from a correct value, then the result is trivial. Thus, we only need to consider the case where $med$ comes from a Byzantine value.

If $n$ is odd, then in the sorted sequence there are $\frac{n-1}{2}$ values on each side of $med$. However, the number of correct values is $n - q > \frac{n}{2} > \frac{n-1}{2}$, so the correct values cannot all lie strictly on one side of $med$. Thus, on each side of $med$ there is at least one correct value, which yields the desired result.

Furthermore, if $n$ is even, the median is the average of the two middle values, and we can re-use the same technique above to prove the bound. ∎
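A short numerical illustration of Lemma 4 (hypothetical values): a minority of arbitrarily bad values cannot push the median outside the range of the correct values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 9, 4                                   # 2q < n
correct = rng.normal(size=n - q)
byzantine = np.full(q, 1e9)                   # arbitrarily large Byzantine values
med = np.median(np.concatenate([correct, byzantine]))
print(correct.min() <= med <= correct.max())  # True
```

Replacing `1e9` with any other values, however extreme, leaves the check `True` as long as the Byzantine values remain a minority.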

###### Theorem 4.

Let $v_1, \ldots, v_n$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. For any dimension $j \in [d]$, $q$ of $\{(v_i)_j : i \in [n]\}$ are replaced by arbitrary values, where $(v_i)_j$ is the $j$th dimension of the vector $v_i$. If $2q < n$ and $\eta_2(n,q)\sqrt{d}\,\sigma < \|g\|$, where $\eta_2(n,q) = \sqrt{n-q}$, then the function $\mathrm{MarMed}$ is dimensional $(\alpha_2, q)$-Byzantine resilient, where $0 \le \alpha_2 < \pi/2$ is defined by $\sin\alpha_2 = \frac{\eta_2(n,q)\sqrt{d}\,\sigma}{\|g\|}$.

###### Proof.

We only need to prove that $\mathrm{MarMed}$ satisfies the two conditions of dimensional $(\alpha_2, q)$-Byzantine resilience defined in Definition 5.

Condition (i):
For each dimension $j$, denote $\sigma_j^2 = \mathbb{E}[(G_j - g_j)^2]$, so that $\sum_{j=1}^{d}\sigma_j^2 = d\sigma^2$. For any dimension $j$, let the sequence $\{(\tilde{v}_i)_j\}$ be defined as

$$(\tilde{v}_i)_j = \begin{cases} (v_i)_j, & \text{for correct } i, \\ \text{arbitrary}, & \text{for Byzantine } i. \end{cases}$$

For the $j$th dimension, by Lemma 4, the median value $\mu_j$ satisfies $\min_{\text{correct } i}(\tilde{v}_i)_j \le \mu_j \le \max_{\text{correct } i}(\tilde{v}_i)_j$, whence $|\mu_j - g_j| \le \max_{\text{correct } i}|(\tilde{v}_i)_j - g_j|$.

Thus, we have

$$\begin{aligned} \mathbb{E}[\mu_j - g_j]^2 &\le \mathbb{E}\left[\max_{\text{correct } i} ((\tilde{v}_i)_j - g_j)^2\right] \le \mathbb{E}\left[\sum_{\text{correct } i} ((\tilde{v}_i)_j - g_j)^2\right] \\ &= \sum_{\text{correct } i} \mathbb{E}\left[((\tilde{v}_i)_j - g_j)^2\right] = (n-q)\,\mathbb{E}[G_j - g_j]^2 \quad \text{(i.i.d. over } i\text{)} \\ &= (n-q)\,\sigma_j^2. \end{aligned}$$

Now, we can bound $\|\mathbb{E}[\mu] - g\|^2$ as follows:

$$\begin{aligned} \|\mathbb{E}[\mu] - g\|^2 &\le \mathbb{E}\|\mu - g\|^2 \quad \text{(Jensen's inequality)} \\ &= \mathbb{E}\left[\sum_{j=1}^{d}(\mu_j - g_j)^2\right] = \sum_{j=1}^{d}\mathbb{E}\left[(\mu_j - g_j)^2\right] \\ &\le \sum_{j=1}^{d}(n-q)\,\sigma_j^2 = (n-q)\, d\sigma^2 = \eta_2^2(n,q)\, d\sigma^2. \end{aligned}$$

By the assumption $\eta_2(n,q)\sqrt{d}\,\sigma < \|g\|$, we have $\|\mathbb{E}[\mu] - g\| \le \eta_2(n,q)\sqrt{d}\,\sigma = \|g\|\sin\alpha_2$, i.e., $\mathbb{E}[\mu]$ belongs to a ball centered at $g$ with radius $\|g\|\sin\alpha_2$. This implies

$$\langle \mathbb{E}[\mu], g \rangle \ge (1 - \sin\alpha_2)\|g\|^2 > 0,$$

where $0 \le \alpha_2 < \pi/2$ is defined by $\sin\alpha_2 = \frac{\eta_2(n,q)\sqrt{d}\,\sigma}{\|g\|}$.

Condition (ii):
By Lemma 4 and the equivalence of norms in finite dimension, there exists a constant $c_1$ such that

$$\begin{aligned} \|\mu\| &= \sqrt{\sum_{j=1}^{d}\mu_j^2} \le \sqrt{\sum_{j=1}^{d}\max_{\text{correct } i}(\tilde{v}_i)_j^2} \\ &\le \sqrt{\sum_{j=1}^{d}\sum_{\text{correct } i}(\tilde{v}_i)_j^2} = \sqrt{\sum_{\text{correct } i}\|\tilde{v}_i\|^2} \\ &\le c_1 \sum_{\text{correct } i}\|\tilde{v}_i\|. \quad \text{(equivalence between } \ell_2\text{-norm and } \ell_1\text{-norm)} \end{aligned}$$

Without loss of generality, we denote the $n-q$ correct vectors by $v_1, \ldots, v_{n-q}$. Thus, there exists a constant $c_2$ such that

$$\|\mu\|^r \le c_2 \sum_{r_1 + \cdots + r_{n-q} = r} \|v_1\|^{r_1} \cdots \|v_{n-q}\|^{r_{n-q}}.$$

Since the $v_i$'s are i.i.d., we obtain that $\mathbb{E}\|\mu\|^r$ is bounded above by a linear combination of terms of the form $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-q}}$ with $r_1 + \cdots + r_{n-q} = r$, which completes the proof of condition (ii). ∎
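The dimensional resilience proved above can be illustrated end-to-end. In the sketch below (hypothetical sizes), every dimension has $q$ Byzantine entries coming from possibly different workers, which breaks averaging but not the marginal median:

```python
import numpy as np

rng = np.random.default_rng(5)
n, q, d = 15, 4, 20                     # 2q < n
g = rng.normal(size=d)                  # true gradient
v = g + rng.normal(size=(n, d))         # correct gradient estimates

# Per dimension, q entries (from possibly different workers) are Byzantine.
for j in range(d):
    rows = rng.choice(n, size=q, replace=False)
    v[rows, j] = 1e9

mar_med = np.median(v, axis=0)          # marginal (coordinate-wise) median
mean = v.mean(axis=0)
print(np.linalg.norm(mar_med - g) < np.linalg.norm(mean - g))   # True
```

Note that under this attack no single worker needs to be entirely Byzantine, which is exactly the generalized failure model that classic (vector-wise) resilience does not cover.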

### 8.4 Mean around Median

The following lemma bounds the one-dimensional mean around median.

###### Lemma 5.

For a scalar sequence composed of $q$ Byzantine values and $n-q$ correct values $\{v_i\}$, if $2q < n$ (the correct values dominate the sequence), then the mean-around-median value $\mu$ (defined in Definition 8) and the median $med$ (defined in Definition 7) of this sequence satisfy $|\mu - med| \le \max_{\text{correct } i}|v_i - med|$.

###### Proof.

According to the definition of the mean around median, $\mu$ is the mean over the top-$(n-q)$ values in the sequence nearest to the median $med$. Denote this set of nearest values by $S$. If a Byzantine value $b$ satisfies $|b - med| > \max_{\text{correct } i}|v_i - med|$, then it cannot be in $S$, because all $n-q$ correct values are nearer to $med$. Hence every $v \in S$ satisfies $|v - med| \le \max_{\text{correct } i}|v_i - med|$, and the average over $S$ must also satisfy $|\mu - med| \le \max_{\text{correct } i}|v_i - med|$. ∎
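A minimal numerical sketch of Lemma 5, assuming the Definition 8 rule averages the $n-q$ values nearest to the median (`mean_around_median` is a hypothetical helper name):

```python
import numpy as np

def mean_around_median(x, q):
    """Average the n - q values of x nearest to the median of x."""
    med = np.median(x)
    nearest = np.abs(x - med).argsort()[: len(x) - q]
    return x[nearest].mean()

rng = np.random.default_rng(6)
n, q = 9, 4                                      # 2q < n
correct = rng.normal(size=n - q)
x = np.concatenate([correct, np.full(q, -1e9)])  # q Byzantine values
med, mu = np.median(x), mean_around_median(x, q)
print(abs(mu - med) <= np.abs(correct - med).max())  # True
```

The Byzantine values are discarded by the nearest-to-median selection, so the averaged output stays within the spread of the correct values around the median.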

###### Theorem 5.

Let $v_1, \ldots, v_n$ be any i.i.d. random $d$-dimensional vectors s.t. $v_i \sim G$, with $\mathbb{E}[G] = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. For any dimension $j \in [d]$, $q$ of $\{(v_i)_j : i \in [n]\}$ are replaced by arbitrary values, where $(v_i)_j$ is the $j$th dimension of the vector $v_i$. If $2q < n$ and $\eta_3(n,q)\sqrt{d}\,\sigma < \|g\|$, then the function $\mathrm{MeanMed}$ is dimensional $(\alpha_3, q)$-Byzantine resilient, where $0 \le \alpha_3 < \pi/2$ is defined by $\sin\alpha_3 = \frac{\eta_3(n,q)\sqrt{d}\,\sigma}{\|g\|}$.

###### Proof.

We only need to prove that $\mathrm{MeanMed}$ satisfies the two conditions of dimensional $(\alpha_3, q)$-Byzantine resilience defined in Definition 5.