1 Introduction
Distributed computing is becoming increasingly important in many modern data-intensive applications such as computer vision, natural language processing and recommendation systems. Federated Learning ([1, 2, 3]) is a recently proposed distributed computing paradigm that aims to fully utilize on-device machine intelligence; in such systems, data are stored on end users’ own devices such as mobile phones and personal computers. Many statistical and computational challenges arise in Federated Learning, due to the highly decentralized system architecture. In this paper, we aim to tackle two challenges in Federated Learning: Byzantine robustness and heterogeneous data distribution.
In Federated Learning, robustness has become one of the major concerns since individual computing units (worker machines) may exhibit abnormal behavior owing to corrupted data, faulty hardware, crashes, unreliable communication channels, stalled computation, or even malicious and coordinated attacks. It is well known that the overall performance of such a system can be arbitrarily skewed even if a single machine behaves in a Byzantine way. Hence it is necessary to develop distributed learning algorithms that are provably robust against Byzantine failures. This is considered in a few recent works, and much progress has been made (see [4, 5, 6, 7, 8]).
In practice, since worker nodes are end users’ personal devices, the issue of data heterogeneity naturally arises in Federated Learning. Exploiting data heterogeneity is particularly crucial in recommendation systems and personalized advertisement placement, where it benefits both the users and the enterprises. For example, mobile phone users who read news articles may be interested in different categories of news like politics, sports or fashion; advertisement platforms might need to send different categories of ads to different groups of customers. These examples indicate that leveraging cluster structures among the users is of potential interest: each machine by itself may not have enough data, and thus we need to better utilize the similarity among the users in the same cluster. This problem has recently received attention in [9] in a non-statistical multi-task setting.
We believe that more effort is needed in this area in order to achieve better statistical guarantees and robustness against Byzantine failures. In this paper, we aim to tackle the data heterogeneity and Byzantine-robustness problems simultaneously. We propose a statistical model, along with a three-stage algorithm that solves the aforementioned problem and yields an estimation error that is optimal in the dimension and the number of data points. The crux of our approach lies in analyzing a clustering algorithm in the presence of adversarial data points. In particular, we study the classical Lloyd’s algorithm augmented with robust estimation. Specifically, we show that the number of misclustered points with the robust Lloyd algorithm decays at an exponential rate when initialized properly. Furthermore, we leverage a few properties of robust Principal Component Analysis (PCA) to obtain a provable initialization. We now summarize the contributions of the paper.
1.1 Our contributions
We propose a general and flexible statistical model and a general algorithmic framework to address the heterogeneous Federated Learning problem in the presence of Byzantine machines. Our algorithmic framework consists of three stages: finding local solutions, performing centralized robust clustering and doing joint robust distributed optimization. The error incurred by our algorithm is optimal in several problem parameters. Furthermore, our framework allows for flexible choices of algorithms in each stage, and can be easily implemented in a modular manner.
Moreover, as a byproduct, we analyze an outlier-robust clustering scheme, which may be considered as Lloyd’s algorithm with robust estimation. The idea of robustifying Lloyd’s algorithm is not new (e.g., see [10, 11] and the references therein) and several robust Lloyd algorithms are empirically well studied. However, to the best of our knowledge, this is the first work that analyzes and proves guarantees for such algorithms in a statistical setting, which might be of independent interest.
We validate our theoretical results via simulations on both synthetic and real-world data. For synthetic experiments, using a mixture of regressions model, we find that our proposed algorithm drastically outperforms the non-Byzantine-robust algorithms. Further, using the Yahoo! Learning to Rank dataset, we demonstrate that our proposed algorithm is practical, easy to implement and dominates the standard non-robust algorithms.
1.2 Related work
Distributed and Federated Learning:
Learning with a distributed computing framework has been studied extensively in various settings [12, 13, 14, 15, 16]. Since the paradigm of Federated Learning was presented by [1, 3], several recent works have focused on different applications of the problem, such as deep learning [2], predicting health events from wearable devices, and detecting burglaries in smart homes [17, 18]. While [19] deals with fairness in Federated Learning, [20, 21] deal with non-i.i.d. data. A few recent works study heterogeneity under different settings in Federated Learning; for example, see [9, 22, 23, 24] and the references therein. However, none of these papers explicitly utilizes the cluster structure of the problem in the presence of Byzantine machines. Also, in most cases, the objective is to learn a single optimal parameter for the whole problem, instead of learning optimal parameters for each cluster. In contrast, the MOCHA algorithm [9] considers a multi-task learning setting and forms an optimization problem with the correlation matrix of the users being a regularization term. Our work differs from MOCHA since we consider a statistical setting and Byzantine-robustness.
Byzantine-robustness:
The robustness and security issues in distributed learning have received much attention ([25, 26]). In particular, one recent work by [27] studies Byzantine-robust distributed learning from heterogeneous datasets. However, the basic goal of this work differs from ours, since we aim to optimize different prediction rules for different users, whereas [27] tries to find a single optimal solution.
Clustering and mixture models:
In the centralized setting, outlier-robust clustering and mixture models have been extensively studied. Robust clustering has been studied in many previous works [28, 29, 30]. One recent work [31] considers a statistical model for robust clustering, similar to ours. However, their algorithm is computationally heavy and hard to implement, whereas the robust clustering algorithm in our paper is more intuitive and straightforward to implement. Our work is also related to learning mixture models, such as mixture of experts [32] and mixture of regressions [33, 34].
2 Problem setup
We consider a standard statistical setting of empirical risk minimization (ERM). Our goal is to learn several parametric models by minimizing some (convex) loss functions defined by the data. Suppose we have compute nodes, of which are Byzantine nodes, i.e., nodes that are arbitrarily corrupted by some adversary. Out of the non-Byzantine compute nodes, we assume that there are different data distributions, , and that the machines are partitioned into clusters, . Suppose that every node contains i.i.d. data points drawn from . We also assume that we have no control over the data distribution of the corrupt nodes. Let be the loss function associated with data point , where is the parameter space. Our goal is to find the minimizers of all the population risk functions. For the th cluster, the minimizer is .
The challenges in learning are: (i) we need a clustering scheme that works in the presence of adversaries. Since we have no control over the corrupted nodes, it is not possible to cluster all the nodes perfectly. Hence we need a robust distributed optimization algorithm. (ii) We want our algorithm to minimize the uplink communication cost ([3]). Throughout, we use for universal constants, whose value may vary from line to line. Also, denotes the norm.
3 A modular algorithm for robust Federated Learning in a heterogeneous environment
In this section, we present a modular algorithm that consists of three stages: (1) compute local empirical risk minimizers (ERMs) and send them to the center machine; (2) run an outlier-robust clustering algorithm on these local ERMs; and (3) run a communication-efficient, robust, distributed optimization on each cluster (Algorithm 1, also see Figure 1).
3.1 Stage I: compute ERMs
In this step, each compute node calculates the local empirical risk minimizer (ERM) associated with its risk function and sends it to the center machine. Since machine is associated with the local risk function, defined as , the local ERM is . We assume the loss function is convex with respect to its first argument, and so the compute node can run a convex optimization program to solve for .
Instead of minimizing the local risk function directly, the compute node can run an “online-to-batch conversion” routine. Each compute node runs an online optimization algorithm such as Online Gradient Descent [35]. At iteration , the compute node picks , and incurs a loss of . After episodes with the sequence of functions , the compute node sets the predictor as the average of the online choices made over instances. This predictor has properties similar to those of ERMs; however, with online optimization there is no need to store all data points a priori, and the entire operation runs in a streaming setup.
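The online-to-batch routine described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function names, the squared loss, the decaying step size and the projection radius are all assumptions made for the example.

```python
import numpy as np

def online_to_batch(stream, grad, dim, step=0.1, radius=10.0):
    """Online gradient descent followed by averaging of the online choices.

    `stream` yields data points one at a time, so nothing is stored.
    `step` and `radius` are illustrative choices, not values from the paper.
    """
    w = np.zeros(dim)
    running_sum = np.zeros(dim)
    count = 0
    for t, z in enumerate(stream, start=1):
        running_sum += w                  # record the choice made before seeing z
        count += 1
        w = w - (step / np.sqrt(t)) * grad(w, z)
        norm = np.linalg.norm(w)
        if norm > radius:                 # project back onto a bounded parameter set
            w *= radius / norm
    return running_sum / max(count, 1)    # average of the online choices

def squared_loss_grad(w, z):
    """Gradient of (x @ w - y)^2 with respect to w (assumed loss)."""
    x, y = z
    return 2.0 * (x @ w - y) * x

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(5000, 3))
y = X @ w_true + 0.1 * rng.normal(size=5000)
w_hat = online_to_batch(zip(X, y), squared_loss_grad, dim=3)
```

Each data point is consumed once and discarded, which is the streaming property the paragraph refers to.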
3.2 Stage II: cluster the ERMs
The second step of the modular algorithm deals with clustering the compute nodes based on their local ERMs. All compute nodes send local ERMs, , for to the center machine, and the center machine runs a clustering algorithm on these data points to find clusters . (Footnote 1: For integer , denotes the set of integers .) Since compute nodes can be Byzantine, the clustering algorithm should be outlier-robust.
We show (in Section C) that if the amount of data in each worker node, , is reasonably large, a simple threshold-based clustering rule is sufficient. This scheme uses the fact that the local ERMs of machines belonging to the same cluster are close, whereas they are far apart for different clusters. However, if is small (which is the practical regime in Federated Learning), the aforementioned scheme fails to work. An alternative is to use a robust version of Lloyd’s algorithm (K-means). In particular: (i) at each iteration, assign each data point to its closest center; (ii) compute a robust estimate of the mean of the points assigned to each cluster and use these estimates as the new centers; and (iii) iterate until convergence.
The first step is identical to the data point assignment step of the K-means algorithm. There are a few options for robust estimation of the mean. Among them, the most common estimates are the geometric median [36], the coordinate-wise median, and the trimmed mean. Although these mean estimates are robust, the estimation error is ( being the dimension), which is prohibitive in large dimension. There is a recent line of work on robust mean estimation that adapts nicely to high dimension [37, 38]. In these results, the mean estimation error is either dimension-independent or very weakly dependent on the dimension. In Section A, we analyze this clustering scheme rigorously in both moderate and high dimension.
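The three-step robust Lloyd iteration and two of the robust mean estimates mentioned above can be sketched as follows. This is an illustrative sketch under assumptions: the function names, the trimming fraction `beta`, the toy data and the initial centers are hypothetical, and the estimator is passed in as a plug-in rather than fixed to the paper's choice.

```python
import numpy as np

def coordinatewise_median(x):
    return np.median(x, axis=0)

def coordinatewise_trimmed_mean(x, beta=0.1):
    """Drop the beta-fraction smallest and largest values in every coordinate."""
    n = x.shape[0]
    k = int(np.floor(beta * n))
    if 2 * k >= n:
        return np.median(x, axis=0)       # too few points to trim; fall back
    return np.sort(x, axis=0)[k:n - k].mean(axis=0)

def robust_lloyd(points, init_centers, robust_mean, n_iter=50):
    centers = np.array(init_centers, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for it in range(n_iter):
        # (i) assign every point to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # (ii) robust center update (empty clusters keep their old center)
        for j in range(len(centers)):
            members = points[new_labels == j]
            if len(members) > 0:
                centers[j] = robust_mean(members)
        # (iii) stop once the assignment no longer changes
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return centers, labels

rng = np.random.default_rng(0)
good = np.vstack([rng.normal(loc=-5.0, size=(100, 2)),
                  rng.normal(loc=5.0, size=(100, 2))])
outliers = np.full((10, 2), 200.0)        # adversarial points far from both clusters
points = np.vstack([good, outliers])
init = np.array([[-1.0, 0.0], [1.0, 0.0]])

centers_med, labels_med = robust_lloyd(points, init, coordinatewise_median)
centers_tm, _ = robust_lloyd(points, init, coordinatewise_trimmed_mean)
centers_mean, _ = robust_lloyd(points, init, lambda x: x.mean(axis=0))
```

In this toy run the sample-mean update lets the outliers drag one center away from its cluster, while both robust updates keep the centers near the two inlier groups.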
Since we are dealing with the case where workers are corrupted, and since we do not have control over the corrupt machines, no clustering algorithm can cluster all the compute nodes correctly. Hence we need a robust optimization algorithm that takes care of the adversarially corrupt (i.e., Byzantine) nodes. This is precisely what the third stage of the modular algorithm does.
3.3 Stage III: outlier-robust distributed optimization
After clustering, we run an outlier-robust distributed algorithm on each cluster. Each cluster can be thought of as an instance of a homogeneous distributed learning problem with possibly Byzantine machines. Hence, we can use the trimmed mean algorithm of [7] (since it has optimal statistical rate) for low to moderate dimension and the iterative filtering algorithm of [8] for high dimension. These algorithms are communication-efficient; the number of parallel iterations needed matches the standard results for the gradient descent algorithm.
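For a single cluster, the trimmed-mean approach can be sketched as gradient descent in which the center aggregates worker gradients with a coordinate-wise trimmed mean. This is a simplified illustration in the spirit of [7], not a reproduction of that algorithm: the least-squares loss, the step size, the number of rounds and the Byzantine behavior (random large vectors) are assumptions.

```python
import numpy as np

def trimmed_mean(vectors, beta):
    """Coordinate-wise mean after dropping the beta-fraction extremes on each side."""
    n = vectors.shape[0]
    k = int(np.floor(beta * n))
    return np.sort(vectors, axis=0)[k:n - k].mean(axis=0)

def robust_distributed_gd(worker_data, byzantine_ids, beta, dim,
                          step=0.5, n_rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(n_rounds):
        messages = []
        for i, (X, y) in enumerate(worker_data):
            if i in byzantine_ids:
                # Assumption: a Byzantine worker sends an arbitrary vector.
                messages.append(rng.normal(scale=100.0, size=dim))
            else:
                # Local gradient of the loss ||X w - y||^2 / (2 n).
                messages.append(X.T @ (X @ w - y) / len(y))
        # One round of communication: aggregate robustly, then take a step.
        w = w - step * trimmed_mean(np.array(messages), beta)
    return w

rng = np.random.default_rng(1)
dim, n_workers, n_points = 4, 20, 50
w_true = rng.normal(size=dim)
worker_data = []
for _ in range(n_workers):
    X = rng.normal(size=(n_points, dim))
    worker_data.append((X, X @ w_true + 0.1 * rng.normal(size=n_points)))

w_robust = robust_distributed_gd(worker_data, {0, 1, 2}, beta=0.2, dim=dim)
w_naive = robust_distributed_gd(worker_data, {0, 1, 2}, beta=0.0, dim=dim)  # plain averaging
```

Setting `beta=0.0` recovers sample-mean aggregation, which gives a simple baseline for comparing the two aggregation rules on the same data.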
4 Main results
We now present the main results of the paper. Recall the problem setup of Section 2. Our goal is to learn the optimal weights . By running the modular algorithm described in the previous section, we compute the final output of the learned weights as . All the proofs of this section are deferred to Section A. We start with the following set of assumptions.
Assumption 1.
The loss function is Lipschitz: for all .
Assumption 2.
is strongly convex: for all and ,
Assumption 3.
is strongly convex, smooth (i.e., ).
Assumption 4.
The function is smooth. For any the partial derivative of with respect to the th coordinate, is Lipschitz and sub-exponential for all .
Note that, as illustrated in [7], the above structural assumptions on the partial derivative of the loss function are satisfied in several learning problems.
Assumption 5.
are separated: and .
Remark 1.
If is Lipschitz, , and hence can be . Also could be potentially small in many applications. Hence Assumption 5 enforces a strict requirement on .
Let the size of the th cluster be and . Furthermore, let .
Theorem 1.
Suppose Assumptions hold. If Algorithm 1 is run with the “Edge cutting” (Section C) algorithm for stage II and the trimmed mean algorithm (of [7]) for iterations with constant stepsize of ) in stage III, then provided , for all , we obtain
with probability at least .
Remark 2.
As shown in Section C, given the above assumptions, “Edge cutting” perfectly clusters the non-Byzantine machines with high probability. In the worst case, all the Byzantine machines may belong to a particular cluster, say the th one (). So, the fraction of Byzantine machines for would be at most .
Comparison with an Oracle: We compare the above bound with an Oracle inequality. We assume that the oracle knows the cluster identity for all the non-Byzantine machines. Since with high probability, the modular algorithm makes no mistake in clustering the non-Byzantine machines, the bound we get perfectly matches the oracle bound.
We now move to the setting where we have no restriction on , and hence may be potentially much smaller than . This setting is more realistic since data arising from applications (like images and video) are high dimensional, and the amount of data in data owners’ device may be small ([1]). We start with the following assumption.
Assumption 6.
The empirical risk minimizers, , corresponding to non-Byzantine machines are sampled from a mixture of sub-Gaussian distributions.
We emphasize that several learning problems satisfy Assumption 6. We now exhibit one such setting where the empirical risk minimizer is Gaussian. We assume that machine belongs to cluster . Recall that denote the data points for machine .
Proposition 1.
Suppose the data are sampled from a parametric class of generative model: with covariate and i.i.d noise . Then, with quadratic loss, the distribution of the empirical risk minimizer is Gaussian with mean .
In general, sub-Gaussian distributions form a huge class, including all bounded distributions. For non-Byzantine machines, we assume the observation model: where are unknown labels and are the unknown means of the sub-Gaussian distributions. We denote as independent and zero mean sub-Gaussian noise with parameter . We propose and analyze a robust clustering algorithm presented in Algorithm 2. At iteration , let be the label of the th data point, and for be the estimate of the centers.
In Algorithm 2, we retain the nearest neighbor assignment of the Lloyd algorithm but change the sample mean estimate to a robust mean estimate using geometric median-based trimming.
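One plausible reading of geometric median-based trimming is: compute the geometric median of the points assigned to a cluster, discard the fraction of points farthest from it, and average the rest. The sketch below follows that reading and is not a transcription of Algorithm 2; the Weiszfeld iteration, the function names and the trimming fraction are assumptions made for illustration.

```python
import numpy as np

def geometric_median(x, n_iter=200, tol=1e-9):
    """Weiszfeld iterations for the minimizer of sum_i ||x_i - m||."""
    m = x.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(x - m, axis=1)
        d = np.maximum(d, 1e-12)          # guard against division by zero
        weights = 1.0 / d
        m_new = (weights[:, None] * x).sum(axis=0) / weights.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m

def trimmed_center(x, trim_frac):
    """Average the points closest to the geometric median (assumed trimming rule)."""
    gm = geometric_median(x)
    d = np.linalg.norm(x - gm, axis=1)
    n_keep = max(1, int(np.ceil((1.0 - trim_frac) * len(x))))
    keep = np.argsort(d)[:n_keep]         # indices of the retained points
    return x[keep].mean(axis=0)

rng = np.random.default_rng(0)
mu = np.array([3.0, -1.0])
inliers = mu + rng.normal(size=(190, 2))
adversarial = np.full((10, 2), 500.0)     # hypothetical corrupted points
cluster_points = np.vstack([inliers, adversarial])

center = trimmed_center(cluster_points, trim_frac=0.1)
plain_mean = cluster_points.mean(axis=0)
```

Skipping the trimming step and returning `geometric_median(x)` directly corresponds to the geometric-median variant of the center update.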
We now introduce some new notation. Let denote the minimum separation between clusters. The worst-case error in the centers is determined by . Consequently we define as the maximum fraction of misclustered points in a cluster (maximized over all clusters). In Section 5.2 and A.3 (of the supplementary material), these quantities are formally defined along with the initialization condition, .
Recall that and note that from Theorem 7, when Algorithm 2 is run for a constant number of iterations, we get with high probability. Also, let . Since denotes the fraction of non-Byzantine machines that are misclustered, denotes the worst-case fraction of Byzantine machines in cluster . We assume .
Theorem 2.
Suppose Assumptions 2, 3, 4 and 6 hold along with the separation and initialization conditions (Assumption 8 of Section 5). Furthermore, suppose Algorithm 1 is run with “Trimmed K-means” (Algorithm 2) for stage II for a constant number of iterations, and the trimmed mean algorithm (of [7]) for stage III for iterations with constant step-size of . Then, provided , for all , we have
with probability at least .
Remark 3.
Like before, we can remove Assumption 2 and obtain a guarantee on for all .
Comparison with the oracle: Recall that the oracle knows the cluster labels of all the non-Byzantine machines. Hence, the worst-case fraction of Byzantine machines will be . Consequently, we observe that not knowing the cluster identities hurts by a factor of in the precision of learning the weight . A few remarks are in order.
Remark 4.
As seen in Section D.2, if and , we show that if “Trimmed K-means” is run for at least iterations provided . Hence our precision bound matches perfectly with the oracle bound.
Remark 5.
The dependence on can be improved if the iterative filtering algorithm ([8]) is used in stage III of the modular algorithm. We get with high probability.
4.1 Oracle optimality
In the presence of the oracle, our problem decomposes to homogeneous ones. We study the dependence of the estimation error of Theorem 2 on , and under such a setting.
Dependence on :
We compare our results with the lower bounds presented in [7, Observation 1] assuming is constant. It is immediate that the dependence on and is optimal. To see the dependence on , we first consider the special case of with centers . Here . Typically, and hence . Comparing with the bound in [7, Observation 1], the dependence on is near optimal in this case. However for a cluster setting, may not be linear in in general (since is not proportional to ).
Dependence on dimension :
In this setting, instead of running the trimmed mean algorithm as the distributed optimization subroutine, we run the iterative filtering algorithm of [8], and as shown in Remark 3, the dependence on when compared with the lower bound of [7, Observation 1] is optimal. Note that in this case, the dependence on becomes suboptimal.
5 Robust clustering
In Stage II of the modular algorithm, we cluster the local ERMs, in the presence of Byzantine machines. To ease notation, we write . Recall that for non-Byzantine data points, we have , with unknown labels , unknown centers and sub-Gaussian noise . For Byzantine data points, is arbitrary. It is worth mentioning here that the classical Lloyd algorithm can be arbitrarily bad, since the adversary may put the data points far away, thus causing the sample mean-based subroutine of the algorithm to fail. As a performance measure, we define the fraction of misclustered non-Byzantine data points at iteration as , where denotes the set of non-Byzantine data points with . We first concentrate on the special case where with centers and , and hence . With slight abuse of notation, the labels are and hence , where . This can be thought of as estimating from samples .
5.1 Symmetric clusters with Gaussian mixture
We analyze Algorithm 2 in the above-mentioned setting. The performance depends on the normalized signal-to-noise ratio, , where . At iteration , let be the fraction of data points being trimmed by Algorithm 2 and let be the estimate of .
Assumption 7.
(i) (SNR) We have and (ii) (Initialization) , where , are sufficiently large constants and is a sufficiently small constant.
Hence we require a constant SNR and needs to be slightly better than a random guess.
Theorem 3.
with probability at least . Furthermore, for , with high probability.
Hence, if , then after steps, implying , which matches the oracle bound () mentioned after Theorem 2. Also, here we can tolerate , which can be prohibitive for large . In the general cluster case, we improve the tolerance level from to (Theorem 4), and in Section 5.4 we completely remove the dependence on .
5.2 clusters with sub-Gaussian mixture
We now analyze the general cluster setting with sub-Gaussian noise. The details of this section are deferred to Section A.3 of the Appendix. Similar to , we define a cluster-wise misclustering fraction and the trimmed cluster-wise misclustering fraction as at iteration . Recall the definition of and from Section 4 and denote the minimum cluster size at iteration as . Also define and as the fraction of adversaries and trimmed points respectively for the th cluster. Furthermore, let be the maximum adversarial fraction (after trimming) in a cluster and be the normalized SNR.
Assumption 8.
We have: (a) ; (b) (SNR) ; and (c) (Initialization) , for a small constant .
Hence the separation (of means) is , which matches the standard separation condition for non-adversarial clustering ([39]). Let , where and . We have the following result:
5.3 Initialization
We see that in Theorems 3 and 4, the convergence guarantees require proper initialization. One possible way to achieve these guarantees is to initialize via spectral methods like the classical K-means algorithm. Spectral methods first project the data onto a dimensional space [39, 40], then use some heuristic scheme to cluster in the low dimensional space, and finally use the obtained labels for initialization. Since a fraction of the data points are corrupted, we need to run a Principal Component Analysis (PCA) that is outlier-robust ([41]). The initialization algorithm can be summarized as:
Step I: Split the data points in partitions: and . Run the robust PCA algorithm of [41] on to obtain . Denote the first columns of as .
Step II: Project the data points, , onto , i.e., obtain for all .
Step III: Run the pairwise distance-based clustering algorithm (Algorithm 4 of Appendix B) and use the labels as .
Recall that denotes the initial fraction of misclustered good points. For a mixture of sub-Gaussian distributions, we prove an upper bound on , which in turn yields initialization for Theorems 3 and 4. We need the following assumption, which is slightly more restrictive than Assumption 8.
Assumption 9.
We have: for sufficiently large and .
Lemma 1.
Suppose Assumption 9 holds and we use the labels given by the initialization algorithm as initial labels . Then, the initial fraction of misclustered points , with probability at least , where is a function of and .
5.4 Robust clustering in high dimension
In Sections 5.1 and 5.2, we see that the tolerable fraction of adversarial data points decays fast with , which makes Algorithm 2 unsuitable for large . Here we analyze the symmetric cluster setting only. However given 1, our analysis can be extended to general cluster setting. We adopt a slightly different observation model: are drawn i.i.d. from the following Huber contamination model: with probability , , where is a Rademacher random variable and is a sub-Gaussian random vector with zero mean, and is independent of ; with probability , is drawn from an arbitrary distribution. We assume that the maximum eigenvalue of the covariance matrix of is bounded. More specifically, we let . We denote the distribution of and by and , respectively. Intuitively, with probability , is an inlier, i.e., drawn from a mixture of two symmetric distributions, and with probability , is an outlier. The goal is to estimate and find the correct labels (i.e., ) of the inliers. We propose Algorithm 3 where the total number of data points is an integer multiple of the number of iterations , and the algorithm uses the Iterative Filtering algorithm [42, 43, 44], denoted by , as a subroutine. The intuition of the iterative filtering algorithm is to use higher order statistics, such as the sample covariance, to iteratively remove outliers.
The convergence guarantee of the algorithm is in Theorem 5. We start with the following assumption:
Assumption 10.
We assume: (a) (Initialization) (b) (SNR) and (c) (Sample complexity) .
We emphasize that the SNR requirement is standard and the initialization condition is slightly stronger than a random guess. Armed with the above assumption, we have the following result.
Theorem 5.
Note that the tolerable level of has no dependence on dimension, which is an improvement over Theorem 3.
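The Huber contamination model used in this section can be simulated with a short sampler, which is handy for testing a clustering routine on controlled data. This is only a sketch: the symbol names (`mu`, `eps`), the Gaussian choice for the sub-Gaussian noise and the uniform outlier distribution are assumptions, since the model allows an arbitrary outlier distribution.

```python
import numpy as np

def sample_huber_mixture(n, mu, eps, noise_std=1.0, outlier_sampler=None, rng=None):
    """Draw n points: inliers are label * mu + noise, outliers are arbitrary.

    Returns the points, the Rademacher labels (0 for outliers) and the outlier mask.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = len(mu)
    if outlier_sampler is None:
        # Assumed outlier distribution; any other sampler can be plugged in.
        outlier_sampler = lambda size: rng.uniform(-50.0, 50.0, size=(size, d))
    is_outlier = rng.random(n) < eps                 # contaminated with probability eps
    z = rng.choice([-1, 1], size=n)                  # Rademacher labels
    y = z[:, None] * mu[None, :] + noise_std * rng.normal(size=(n, d))
    y[is_outlier] = outlier_sampler(int(is_outlier.sum()))
    z = np.where(is_outlier, 0, z)
    return y, z, is_outlier

mu = np.array([3.0, 0.0, -3.0])
y, z, is_outlier = sample_huber_mixture(4000, mu, eps=0.1)

# With the true center known, the sign of the inner product recovers inlier labels.
recovered = np.sign(y[~is_outlier] @ mu)
accuracy = np.mean(recovered == z[~is_outlier])
```

The last two lines only illustrate what recovering the correct labels means when the center is known; an actual algorithm has to estimate the center from the contaminated sample.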
6 Experiments
We perform extensive experiments on synthetic and real data and compare the performance of our algorithm to several non-robust clustering and/or optimization-based algorithms.
6.1 Synthetic data
For synthetic experiments, we use a mixture of linear regressions model. For each cluster, a dimensional regression coefficient vector, , is generated elementwise by a distribution. Then machines are uniformly assigned to the clusters, and machines are considered adversarial machines. For each good machine, (belonging to cluster, ), data points are generated independently according to: , for all , where and . For adversarial machines, the regression coefficients are sampled from , resulting in outliers. We initialize the cluster assignments with percent correct assignments for the good machines. We test the performance of Lloyd (K-means), Trimmed K-means (Algorithm 2) and K-geomedians (where the sample mean step of Lloyd is replaced by the geometric median; note that this is Algorithm 2 excluding the trimming step). We set and . In Figure 1(a), we see that the fraction of misclustered points (which we call the misclustering rate) indeed diminishes with iteration at a fast rate, which validates Theorem 4, whereas for K-means, it converges to a misclustering rate of .
Figure 1: A comparison of K-means (KM), K-geomedians (KGM), and Trimmed K-means (TKM) in conjunction with Trimmed Mean robust optimization (TM), Sample Mean optimization (SM) and Federated Averaging (FA). In Figure 1(a), we choose . The error bars in Figure 1(b) show the standard deviation over 20 trials.
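A synthetic setup of this kind can be generated along the following lines. All numerical values and sampling distributions here (Gaussian coefficients, Gaussian covariates, the scale of the adversarial coefficients, and the sizes) are placeholders chosen for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def make_mixture_of_regressions(n_machines, n_clusters, n_points, dim,
                                n_adversarial, noise_std=1.0, seed=0):
    """Generate per-cluster coefficients and the local ERMs sent by each machine."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=(n_clusters, dim))          # assumed coefficient distribution
    cluster_of = rng.integers(n_clusters, size=n_machines)
    local_erms = np.empty((n_machines, dim))
    for i in range(n_machines):
        X = rng.normal(size=(n_points, dim))
        y = X @ w_star[cluster_of[i]] + noise_std * rng.normal(size=n_points)
        # Local ERM for the quadratic loss: ordinary least squares.
        local_erms[i] = np.linalg.lstsq(X, y, rcond=None)[0]
    adversarial = rng.choice(n_machines, size=n_adversarial, replace=False)
    # Assumption: adversarial machines report coefficients from a wide distribution.
    local_erms[adversarial] = rng.normal(scale=10.0, size=(n_adversarial, dim))
    return w_star, cluster_of, local_erms, adversarial

w_star, cluster_of, local_erms, adversarial = make_mixture_of_regressions(
    n_machines=60, n_clusters=3, n_points=200, dim=5, n_adversarial=6)

good = np.setdiff1d(np.arange(60), adversarial)
erm_errors = np.linalg.norm(local_erms[good] - w_star[cluster_of[good]], axis=1)
```

The array `local_erms` plays the role of the data points clustered in Stage II, and `erm_errors` shows how far each good machine's local estimate is from its cluster's coefficient vector.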
We compare our algorithm consisting of robust clustering (using Trimmed K-means or K-geomedians) and robust distributed optimization with algorithms without robust subroutines in the clustering or the optimization step. In particular, we use the classical K-means as a non-robust clustering, and a naive sample averaging-based scheme (instead of the robust trimmed mean-based scheme of [7]) as a non-robust distributed algorithm. Also, in the robust optimization stage, we compare with a robust version of the Federated Averaging algorithm of [1] with iterations of gradient descent in each worker node before the global model gets updated (by taking the trimmed mean of the local models in the worker nodes).
We first observe that the estimation error () for non-robust clustering schemes (KM in Figure 1(b)) is higher than that using Trimmed K-means (TKM) and K-geomedians (KGM). Furthermore, trimmed mean-based distributed optimization (TM) strictly outperforms the sample mean-based (SM) optimization routine by even with robust clustering. Federated averaging (FA) does orders of magnitude worse in estimation, likely due to the poor gradient updates provided by individual machines. Hence, matching our theoretical intuition, robust clustering and robust optimization have the best performance in the presence of adversaries.
6.2 Yahoo! Learning to Rank dataset
The performance of the modular algorithm is evaluated on the Yahoo! Learning to Rank dataset [45]. We use the set2.test.txt file for our experiment. We choose to treat the data as unsupervised, ignoring the labels for this simulation. Starting with queries and features, we adopt the following thresholding rule: we draw an edge between the queries with distance less than (which we optimize at ). We then run a tree-search algorithm to detect the connected graphs, which produce our true cluster assignments. Small groups are removed from the dataset. This results in large clusters. Next, we take the mean of the features in each cluster to obtain . The data points in each cluster are then split randomly in batches of (hence, ). In addition, respecting , adversarial splits are incorporated via sampling points randomly from the unused data and adding a vector to the ERM. Note that we synthetically perturb the data points primarily since it is hard to find datasets with explicit adversaries. We then compute the mean in each split (these can be thought of as analogous to local ERMs), and perform clustering on them using the K-means, Trimmed K-means, and K-geomedians algorithms with fully random initialization. Then, we use trimmed mean, sample mean, or Federated Averaging optimization to estimate the on each of the cluster assignment estimates with mean squared loss.
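The thresholding rule used to build the reference clusters (connect points closer than a threshold, take connected components, drop small groups) can be sketched as below. The function names, the toy data, the threshold and the minimum group size are hypothetical; the depth-first search stands in for whichever tree-search routine was actually used.

```python
import numpy as np

def threshold_clusters(points, threshold):
    """Connect points closer than `threshold`; return connected-component labels."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    adjacent = dists < threshold
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = current
        while stack:                              # depth-first search over the graph
            u = stack.pop()
            for v in np.flatnonzero(adjacent[u]):
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def keep_large_groups(labels, min_size):
    """Boolean mask selecting points whose component has at least `min_size` members."""
    sizes = np.bincount(labels)
    return np.isin(labels, np.flatnonzero(sizes >= min_size))

rng = np.random.default_rng(0)
blobs = [c + 0.1 * rng.normal(size=(30, 2))
         for c in (np.array([0.0, 0.0]), np.array([5.0, 0.0]), np.array([0.0, 5.0]))]
isolated = np.array([[20.0, 20.0], [-20.0, 20.0]])
data = np.vstack(blobs + [isolated])

labels = threshold_clusters(data, threshold=1.0)
mask = keep_large_groups(labels, min_size=5)
n_clusters = len(np.unique(labels[mask]))
```

The pairwise distance matrix costs quadratic memory in the number of points, so for a large dataset a neighbor search structure would be a more practical choice than this dense version.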
The results of the real data experiments are shown in Figure 1(c). We see that Trimmed K-means in conjunction with trimmed mean optimization outperforms the other methods with an estimation error of . This algorithm is easy to implement and learns the optimal weights efficiently. On the other hand, the estimation error of the K-means algorithm with sample mean optimization is , which is roughly two times worse than the robust algorithms. Also, Trimmed K-means and K-geomedians have similar final estimation error, which suggests that the trimming step after computing the geometric median may be redundant. Thus, we once again emphasize that our robust algorithm performs better than standard non-robust algorithms.
7 Conclusion and future work
We tackle the problem of robust Federated Learning in a heterogeneous environment. We propose a three-step modular solution to the problem. For the second step, we analyze the classical Lloyd algorithm with a robust subroutine, together with a provable initialization scheme. We observe that for the theoretical guarantees, we need the data points to be sub-Gaussian with a mean separation of . Weakening the sub-Gaussian assumption and devising a better initialization scheme are left as future work. We would also like to develop robust clustering algorithms that adapt well to the dimension.
Acknowledgments
The authors would like to thank Swanand Kadhe and Prof. Peter Bartlett for helpful discussions.
References

[1] Brendan McMahan and Daniel Ramage. Federated learning: Collaborative machine learning without centralized training data. https://research.googleblog.com/2017/04/federatedlearningcollaborative.html, 2017.
[2] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
[3] Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
[4] Jiashi Feng, Huan Xu, and Shie Mannor. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.
[5] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Byzantine-tolerant machine learning. arXiv preprint arXiv:1703.02757, 2017.
[6] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.
[7] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 5650–5659. PMLR, 10–15 Jul 2018.
[8] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Defending against saddle point attack in Byzantine-robust distributed learning. arXiv preprint arXiv:1806.05358, 2018.
[9] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017.

 [10] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 291–300. ACM, 2004.
 [11] Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65(1):129–149, 2002.

 [12] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
 [13] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
 [14] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate newton-type method. CoRR, abs/1312.7853, 2013.
 [15] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I. Jordan, and Martin Jaggi. Cocoa: A general framework for communication-efficient distributed optimization. CoRR, abs/1611.02189, 2016.

 [16] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007, 2018.
 [17] Alexandros Pantelopoulos and Nikolaos G Bourbakis. A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1):1–12, 2010.
 [18] Parisa Rashidi and Diane J. Cook. Keeping the resident in the loop: Adapting the smart home to the user. Trans. Sys. Man Cyber. Part A, 39(5):949–959, September 2009.
 [19] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. CoRR, abs/1902.00146, 2019.
 [20] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. CoRR, abs/1806.00582, 2018.
 [21] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. CoRR, abs/1903.02891, 2019.
 [22] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
 [23] Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. On the convergence of federated optimization in heterogeneous networks. CoRR, abs/1812.06127, 2018.
 [24] Liping Li, Wei Xu, Tianyi Chen, Georgios B. Giannakis, and Qing Ling. RSA: byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. CoRR, abs/1811.03761, 2018.
 [25] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.
 [26] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116, 2018.
 [27] Liping Li, Wei Xu, Tianyi Chen, Georgios B Giannakis, and Qing Ling. Rsa: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. arXiv preprint arXiv:1811.03761, 2018.
 [28] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 826–835. Citeseer, 2008.
 [29] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proceedings of the VLDB Endowment, 10(7):757–768, 2017.
 [30] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 646–659. ACM, 2018.
 [31] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60. ACM, 2017.
 [32] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
 [33] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pages 613–621, 2014.
 [34] Dong Yin, Ramtin Pedarsani, Yudong Chen, and Kannan Ramchandran. Learning mixtures of sparse linear regressions using sparse graph codes. IEEE Transactions on Information Theory, 65(3):1430–1451, 2018.
 [35] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pages 928–936, 2003.
 [36] Stanislav Minsker. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 11 2015.
 [37] Kevin Lai, Anup Rao, and S Vempala. Agnostic estimation of mean and covariance. arXiv preprint arXiv:1604.06968, 2016.
 [38] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. arXiv preprint arXiv:1703.00893, 2018.
 [39] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 299–308. IEEE, 2010.

 [40] Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In Anupam Gupta, Klaus Jansen, José Rolim, and Rocco Servedio, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 37–49, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
 [41] Yeshwanth Cherapanamjeri, Prateek Jain, and Praneeth Netrapalli. Thresholding based efficient outlier robust pca. arXiv preprint arXiv:1702.05571, 2017.
 [42] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
 [43] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. arXiv preprint arXiv:1703.00893, 2017.
 [44] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
 [45] Yahoo Learning to Rank Challenge (C14) (https://webscope.sandbox.yahoo.com/).
 [46] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
 [47] Yu Lu and Harrison H Zhou. Statistical and computational guarantees of lloyd’s algorithm and its variants. arXiv preprint arXiv:1612.02099, 2016.
 [48] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
 [49] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009.
Appendix
Appendix A Theoretical guarantees for Algorithm 2
A.1 Proof of Proposition 1
Given the parametric form of data generation, we first stack the covariates to form the matrix where . Also we form the vectors and . The objective is to estimate
. We run ordinary least squares, i.e., we compute the following:
From standard calculations, the ERM is given by
Hence the distribution of is Gaussian. Since for all , .
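The Gaussianity of the least-squares ERM can be checked numerically. The sketch below is illustrative only and rests on assumptions not taken from the proof above: a generic linear model with a fixed design matrix `X`, a true parameter `theta_star`, and i.i.d. Gaussian noise of standard deviation `sigma`; all names and sizes are hypothetical. Under that textbook model, the ordinary least squares estimate is Gaussian with mean `theta_star` and covariance `sigma**2 * inv(X.T @ X)`, and the simulation compares empirical moments against these values.

```python
import numpy as np

def ols_estimate(X, y):
    """Least-squares solution of min_theta ||y - X theta||^2."""
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5                  # illustrative sizes, not from the paper
theta_star = np.array([1.0, -2.0, 0.5])    # hypothetical true parameter
X = rng.normal(size=(n, d))                # design matrix, held fixed across trials

# Redraw only the Gaussian noise and collect the resulting estimates.
trials = 2000
estimates = np.empty((trials, d))
for t in range(trials):
    y = X @ theta_star + sigma * rng.normal(size=n)
    estimates[t] = ols_estimate(X, y)

# Fixed-design theory: theta_hat ~ N(theta_star, sigma^2 (X^T X)^{-1}).
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)
empirical_mean = estimates.mean(axis=0)
empirical_cov = np.cov(estimates, rowvar=False)
```

Comparing `empirical_mean` with `theta_star` and `empirical_cov` with `theory_cov` gives a quick sanity check of the claim; the agreement tightens as `trials` grows.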
A.2 Symmetric cluster: proof of Theorem 3
Suppose that after geometric-median-based trimming on both the centers at iteration , we retain data points.
Let be the estimate of at iteration . Let us fix some notation. At step , we denote as the set of data points that are not trimmed. denotes the set of trimmed points and denotes the set of adversarially corrupted data points. We have
(1)  
where and . Consequently,
Since , where is the true label of the th data point, we have the following relation:
Let . Plugging in, we get
(2)  
We need a few definitions to proceed further. Recall that denotes the average error rate over good samples. Also define and . With the above definitions, we get
As a result
The extra term can be controlled using the Cauchy-Schwarz inequality in the following way: