# Robust Federated Learning in a Heterogeneous Environment

We study a recently proposed large-scale distributed learning paradigm, namely Federated Learning, where the worker machines are end users' own devices. Statistical and computational challenges arise in Federated Learning, particularly in the presence of heterogeneous data distributions (i.e., data points on different devices belong to different distributions signifying different clusters) and Byzantine machines (i.e., machines that may behave abnormally, or even exhibit arbitrary and potentially adversarial behavior). To address these challenges, we first propose a general statistical model for this problem which takes both the cluster structure of the users and the Byzantine machines into account. Then, leveraging the statistical model, we solve the robust heterogeneous Federated Learning problem optimally; in particular, our algorithm matches the lower bound on the estimation error in dimension and the number of data points. Furthermore, as a by-product, we prove statistical guarantees for an outlier-robust clustering algorithm, which can be considered as the Lloyd algorithm with robust estimation. Finally, we show via synthetic as well as real data experiments that the estimation error obtained by our proposed algorithm is significantly better than that of the non-Byzantine-robust algorithms; in particular, we gain at least 53% and 33% for synthetic and real data experiments, respectively, in typical settings.


## 1 Introduction

Distributed computing is becoming increasingly important in many modern data-intensive applications like computer vision, natural language processing and recommendation systems. Federated Learning ([1, 2, 3]) is one recently proposed distributed computing paradigm that aims to fully utilize on-device machine intelligence; in such systems, data are stored in end users’ own devices such as mobile phones and personal computers. Many statistical and computational challenges arise in Federated Learning, due to the highly decentralized system architecture. In this paper, we aim to tackle two challenges in Federated Learning: Byzantine robustness and heterogeneous data distribution.

In Federated Learning, robustness has become one of the major concerns since individual computing units (worker machines) may exhibit abnormal behavior owing to corrupted data, faulty hardware, crashes, unreliable communication channels, stalled computation, or even malicious and coordinated attacks. It is well known that the overall performance of such a system can be arbitrarily skewed even if a single machine behaves in a Byzantine way. Hence it is necessary to develop distributed learning algorithms that are provably robust against Byzantine failures. This is considered in a few recent works, and much progress has been made (see [4, 5, 6, 7, 8]).

In practice, since worker nodes are end users’ personal devices, the issue of data heterogeneity naturally arises in Federated Learning. Exploiting data heterogeneity is particularly crucial in recommendation systems and personalized advertisement placement, which benefits both the users and the enterprises. For example, mobile phone users who read news articles may be interested in different categories of news like politics, sports or fashion; advertisement platforms might need to send different categories of ads to different groups of customers. These examples indicate that leveraging cluster structures among the users is of potential interest: each machine itself may not have enough data, and thus we need to better utilize the similarity among the users in the same cluster. This problem has recently received attention in [9] in a non-statistical multi-task setting.

We believe that more effort is needed in this area in order to achieve better statistical guarantees and robustness against Byzantine failures. In this paper, we aim to tackle the data heterogeneity and Byzantine-robustness problems simultaneously. We propose a statistical model, along with a three-stage algorithm that solves the aforementioned problem, yielding an estimation error which is optimal in dimension and number of data points. The crux of our approach lies in analyzing a clustering algorithm in the presence of adversarial data points. In particular, we study the classical Lloyd’s algorithm augmented with robust estimation. Specifically, we show that the number of misclustered points with the robust Lloyd algorithm decays at an exponential rate when initialized properly. Furthermore, we leverage a few properties of robust Principal Component Analysis (PCA) to obtain a provable initialization. We now summarize the contributions of the paper.

### 1.1 Our contributions

We propose a general and flexible statistical model and a general algorithmic framework to address the heterogeneous Federated Learning problem in the presence of Byzantine machines. Our algorithmic framework consists of three stages: finding local solutions, performing centralized robust clustering and doing joint robust distributed optimization. The error incurred by our algorithm is optimal in several problem parameters. Furthermore, our framework allows for flexible choices of algorithms in each stage, and can be easily implemented in a modular manner.

Moreover, as a by-product, we analyze an outlier-robust clustering scheme, which may be considered as Lloyd’s algorithm with robust estimation. The idea of robustifying Lloyd’s algorithm is not new (e.g., see [10, 11] and the references therein) and several robust Lloyd algorithms are empirically well studied. However, to the best of our knowledge, this is the first work that analyzes and proves guarantees for such algorithms in a statistical setting, and it might be of independent interest.

We validate our theoretical results via simulations on both synthetic and real world data. For synthetic experiments, using a mixture of regressions model, we find that our proposed algorithm drastically outperforms the non-Byzantine-robust algorithms. Further, using the Yahoo! Learning to Rank dataset, we demonstrate that our proposed algorithm is practical, easy to implement and dominates the standard non-robust algorithms.

### 1.2 Related work

##### Distributed and Federated Learning:

Learning with a distributed computing framework has been studied extensively in various settings [12, 13, 14, 15, 16]. Since the paradigm of Federated Learning was presented by [1, 3], several recent works focus on different applications of the problem, such as deep learning [2], predicting health events from wearable devices, and detecting burglaries in smart homes [17, 18]. While [19] deals with fairness in Federated Learning, [20, 21] deal with non-iid data. A few recent works study heterogeneity under different settings in Federated Learning; for example, see [9, 22, 23, 24] and the references therein. However, none of these papers explicitly utilizes the cluster structure of the problem in the presence of Byzantine machines. Also, in most cases, the objective is to learn a single optimal parameter for the whole problem, instead of learning optimal parameters for each cluster. In contrast, the MOCHA algorithm [9] considers a multi-task learning setting and forms an optimization problem with the correlation matrix of the users being a regularization term. Our work differs from MOCHA since we consider a statistical setting and Byzantine-robustness.

##### Byzantine-robustness:

The robustness and security issues in distributed learning have received much attention ([25, 26]). In particular, one recent work by [27] studies Byzantine-robust distributed learning from heterogeneous datasets. However, the basic goal of this work differs from ours, since we aim to optimize different prediction rules for different users, whereas [27] tries to find a single optimal solution.

##### Clustering and mixture models:

In the centralized setting, outlier-robust clustering and mixture models have been extensively studied. Robust clustering has been studied in many previous works [28, 29, 30]. One recent work [31] considers a statistical model for robust clustering, similar to ours. However, their algorithm is computationally heavy and hard to implement, whereas the robust clustering algorithm in our paper is more intuitive and straightforward to implement. Our work is also related to learning mixture models, such as mixture of experts [32] and mixture of regressions [33, 34].

## 2 Problem setup

We consider a standard statistical setting of empirical risk minimization (ERM). Our goal is to learn several parametric models by minimizing some (convex) loss functions defined by the data. Suppose we have $m$ compute nodes, an $\alpha$ fraction of which are Byzantine nodes, i.e., nodes that are arbitrarily corrupted by some adversary. Among the $(1-\alpha)m$ non-Byzantine compute nodes, we assume that there are $K$ different data distributions, and that the machines are partitioned into $K$ clusters accordingly. Suppose that every node contains $n$ i.i.d. data points drawn from the distribution of its cluster. We also assume that we have no control over the data distribution of the corrupt nodes. Let $f(w, x)$ be the loss function associated with data point $x$, where $w$ belongs to the parameter space. Our goal is to find the minimizers of all the population risk functions. For the $i$-th cluster, the minimizer is $w_i^* = \operatorname{argmin}_w F_i(w)$, where $F_i$ denotes the population risk of that cluster.

The challenges in learning are: (i) we need a clustering scheme that works in the presence of adversaries. Since we have no control over the corrupted nodes, it is not possible to cluster all the nodes perfectly. Hence we need a robust distributed optimization algorithm. (ii) We want our algorithm to minimize uplink communication cost ([3]). Throughout, we use $C, C_1, C_2, \ldots$ for universal constants, whose values may vary from line to line. Also, $\|\cdot\|$ denotes the $\ell_2$ norm.

## 3 A modular algorithm for robust Federated Learning in a heterogeneous environment

In this section, we present a modular algorithm that consists of three stages: (1) compute local empirical risk minimizers (ERMs) and send them to the center machine, (2) run an outlier-robust clustering algorithm on these local ERMs, and (3) run a communication-efficient, robust, distributed optimization on each cluster (Algorithm 1, also see Figure 1).

### 3.1 Stage I- compute ERMs

In this step, each compute node calculates the local empirical risk minimizer (ERM) associated with its risk function and sends it to the center machine. Machine $i$ is associated with its local empirical risk, i.e., the average of the loss $f(w, x)$ over its $n$ local data points, and its local ERM is a minimizer of this function over the parameter space. We assume the loss function is convex with respect to its first argument, and so the compute node can run a convex optimization program to solve for the local ERM.

Instead of minimizing the local risk function directly, the compute node can run an “online-to-batch conversion” routine. Each compute node runs an online optimization algorithm like Online Gradient Descent [35]. At each iteration, the compute node picks a parameter and incurs the loss of the current data point at that parameter. After processing the whole sequence of loss functions, the compute node sets the predictor as the average of the online choices made over all instances. This predictor has properties similar to those of ERMs; however, in the case of online optimization, there is no need to store all data points a priori, and the entire operation runs in a streaming setup.
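The online-to-batch routine can be sketched as follows. This is a minimal illustration under assumptions not fixed by the text: a squared loss, a step size decaying as $1/\sqrt{t}$, and the hypothetical names `online_to_batch`, `w_true` and `w_bar`.

```python
import numpy as np

def online_to_batch(X, y, step=0.1):
    """Run online gradient descent on squared loss and average the iterates."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(n):
        w_sum += w                          # record the choice made before seeing point t
        grad = (X[t] @ w - y[t]) * X[t]     # gradient of 0.5 * (<x_t, w> - y_t)^2
        w = w - step / np.sqrt(t + 1) * grad
    return w_sum / n                        # average of the online choices

# Hypothetical streaming data for one compute node.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(2000, 3))
y = X @ w_true + 0.1 * rng.normal(size=2000)
w_bar = online_to_batch(X, y)
```

Only the current data point and a running sum are kept in memory, which is the streaming property the paragraph refers to.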

### 3.2 Stage II- cluster the ERMs

The second step of the modular algorithm deals with clustering the compute nodes based on their local ERMs. All compute nodes $i \in [m]$ send their local ERMs to the center machine (for an integer $N$, $[N]$ denotes the set of integers $\{1, 2, \ldots, N\}$), and the center machine runs a clustering algorithm on these data points to find $K$ clusters. Since some compute nodes can be Byzantine, the clustering algorithm should be outlier-robust.

We show (in Section C) that if the amount of data in each worker node, $n$, is reasonably large, a simple threshold-based clustering rule is sufficient. This scheme uses the fact that the local ERMs of machines belonging to the same cluster are close, whereas they are far apart for different clusters. However, if $n$ is small (which is realistic in Federated Learning), the aforementioned scheme fails to work. An alternative is to use a robust version of the Lloyd algorithm ($K$-means). In particular: (i) at each iteration, assign each data point to its closest center, (ii) compute a robust estimate of the mean of the points assigned to each cluster and use these estimates as the new centers, and (iii) iterate until convergence.
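Steps (i) to (iii) can be sketched as below. The coordinate-wise median is used as the robust estimate purely for illustration; the function name `robust_lloyd` and the toy data are assumptions, not the paper's Algorithm 2.

```python
import numpy as np

def robust_lloyd(points, centers, n_iter=20):
    """Lloyd iterations with a coordinate-wise median replacing the sample mean."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # (i) assign each point to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (ii) robust center update for each non-empty cluster
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:
                new_centers[k] = np.median(members, axis=0)
        # (iii) stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two clusters of local ERMs plus a few far-away Byzantine points.
rng = np.random.default_rng(0)
good_a = rng.normal(loc=0.0, scale=0.5, size=(40, 2))
good_b = rng.normal(loc=5.0, scale=0.5, size=(40, 2))
byzantine = rng.normal(loc=100.0, scale=1.0, size=(5, 2))
points = np.vstack([good_a, good_b, byzantine])
labels, centers = robust_lloyd(points, centers=[[1.0, 1.0], [4.0, 4.0]])
```

With a sample mean in step (ii), the five far-away points would drag the second center far from its cluster; the median keeps it close.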

The first step is identical to the data point assignment of the $K$-means algorithm. There are a few options for robust estimation of the mean. Among them, the most common estimates are the geometric median [36], the coordinate-wise median, and the trimmed mean. Although these mean estimates are robust, their estimation error grows with the dimension $d$, which is prohibitive in high dimension. There is a recent line of work on robust mean estimation that adapts nicely to high dimension [37, 38]. In these results, the mean estimation error is either dimension-independent or very weakly dependent on dimension. In Section A, we analyze this clustering scheme rigorously both in moderate and high dimension.
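Two of the estimators named above can be sketched in a few lines. The trimming fraction `beta` and the contaminated sample are hypothetical choices made only for this illustration.

```python
import numpy as np

def coordinate_median(points):
    """Median taken independently in every coordinate."""
    return np.median(points, axis=0)

def trimmed_mean(points, beta):
    """Per coordinate, drop the beta fraction of smallest and largest values, then average."""
    n = points.shape[0]
    k = int(np.floor(beta * n))
    ordered = np.sort(points, axis=0)
    return ordered[k:n - k].mean(axis=0)

# 90 inliers around the origin and 10 gross outliers (hypothetical data).
rng = np.random.default_rng(0)
inliers = rng.normal(size=(90, 5))
outliers = np.full((10, 5), 1000.0)
sample = np.vstack([inliers, outliers])

naive = sample.mean(axis=0)
med = coordinate_median(sample)
trim = trimmed_mean(sample, beta=0.1)
```

The sample mean is pulled toward the outliers, while both robust estimates stay near the inlier mean, at the cost of a small bias.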

Since we are dealing with the case where workers are corrupted, and since we do not have control over the corrupt machines, no clustering algorithm can cluster all the compute nodes correctly. Hence we need a robust optimization algorithm that takes care of the adversarially corrupted (i.e., Byzantine) nodes. This is precisely what is done in the third stage of the modular algorithm.

### 3.3 Stage III- outlier-robust distributed optimization

After clustering, we run an outlier-robust distributed algorithm on each cluster. Each cluster can be thought of as an instance of a homogeneous distributed learning problem with possibly Byzantine machines. Hence, we can use the trimmed mean algorithm of [7] (since it has optimal statistical rate) for low to moderate dimension and the iterative filtering algorithm of [8] for high dimension. These algorithms are communication-efficient; the number of parallel iterations needed matches the standard results for the gradient descent algorithm.
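A minimal sketch of trimmed-mean gradient aggregation within one cluster is given below. It assumes a squared loss, a fixed step size, and Byzantine workers that always send the same large vector; it is meant to illustrate the idea and is not the exact procedure of [7].

```python
import numpy as np

def trimmed_mean(vectors, beta):
    """Coordinate-wise trimmed mean over the rows of `vectors`."""
    n = vectors.shape[0]
    k = int(np.floor(beta * n))
    ordered = np.sort(vectors, axis=0)
    return ordered[k:n - k].mean(axis=0)

def local_gradient(w, X, y):
    """Gradient of the local average squared loss."""
    return X.T @ (X @ w - y) / len(y)

def robust_gd(workers, byzantine_ids, dim, beta, step=0.5, rounds=100):
    w = np.zeros(dim)
    for _ in range(rounds):
        grads = []
        for i, (X, y) in enumerate(workers):
            if i in byzantine_ids:
                grads.append(np.full(dim, 1e3))       # arbitrary corrupted message
            else:
                grads.append(local_gradient(w, X, y))
        # the center aggregates with a trimmed mean instead of a plain average
        w = w - step * trimmed_mean(np.array(grads), beta)
    return w

# Hypothetical cluster: 10 workers, 2 of them Byzantine.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -1.0, 2.0])
workers = []
for _ in range(10):
    X = rng.normal(size=(200, 3))
    y = X @ w_true + 0.1 * rng.normal(size=200)
    workers.append((X, y))
w_hat = robust_gd(workers, byzantine_ids={0, 1}, dim=3, beta=0.2)
```

Each round needs one gradient upload per worker, so the communication pattern is the same as in plain distributed gradient descent.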

## 4 Main results

We now present the main results of the paper. Recall the problem set-up of Section 2. Our goal is to learn the optimal weights $w_1^*, \ldots, w_K^*$. By running the modular algorithm described in the previous section, we compute the final output of the learned weights as $\hat{w}_1, \ldots, \hat{w}_K$. All the proofs of this section are deferred to Section A. We start with the following set of assumptions.

###### Assumption 1.

The loss function is Lipschitz: for all .

###### Assumption 2.

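$f(w, x)$ is $\lambda$-strongly convex: for all $w, w'$ in the parameter space and all data points $x$,

$$f(w,x) - f(w',x) - \langle \nabla f(w',x),\, w - w' \rangle \ge \frac{\lambda}{2}\,\|w - w'\|^2.$$
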
###### Assumption 3.

is strongly convex, smooth (i.e., ).

###### Assumption 4.

The function is smooth. For any the partial derivative of with respect to the -th coordinate, is Lipschitz and -sub exponential for all .

Note that, as illustrated in [7], the above structural assumptions on the partial derivative of the loss function are satisfied in several learning problems.

###### Assumption 5.

are separated: and .

###### Remark 1.

If is Lipschitz, , and hence can be . Also could be potentially small in many applications. Hence Assumption 5 enforces a strict requirement on .

Let the size of -th cluster is and . Furthermore, let .

###### Theorem 1.

Suppose Assumptions hold. If Algorithm 1 is run with the “Edge cutting” (Section C) algorithm for stage II and the trimmed mean algorithm (of [7]) for iterations with constant step-size of ) in stage III, then provided , for all , we obtain

$$\|\hat{w}_i - w_i^*\| \le \tilde{O}\left(\hat{\alpha}_i \frac{d}{\sqrt{n}} + \frac{d}{\sqrt{n M_i}}\right)$$

with probability at least

.

###### Remark 2.

We can remove the assumption of the strong convexity of (Assumption 2). In that case, under the setting of Theorem 1, for all , we obtain

$$F_i(\hat{w}_i) - F_i(w_i^*) \le \tilde{O}\left(\hat{\alpha}_i \frac{d}{\sqrt{n}} + \frac{d}{\sqrt{n M_i}}\right)$$

with high probability.

As shown in Section C, given the above assumptions, “Edge-cutting” perfectly clusters the non-Byzantine machines with high probability. In the worst case, all the Byzantine machines may belong to a particular cluster, say the -th one (). So, the fraction of Byzantine machines for would be at most .

Comparison with an Oracle: We compare the above bound with an Oracle inequality. We assume that the oracle knows the cluster identity for all the non-Byzantine machines. Since with high probability, the modular algorithm makes no mistake in clustering the non-Byzantine machines, the bound we get perfectly matches the oracle bound.

We now move to the setting where we have no restriction on $n$, and hence $n$ may be potentially much smaller than $d$. This setting is more realistic since data arising from applications (like images and video) are high dimensional, and the amount of data in data owners’ devices may be small ([1]). We start with the following assumption.

###### Assumption 6.

The empirical risk minimizers corresponding to non-Byzantine machines are sampled from a mixture of $K$ sub-Gaussian distributions.

We emphasize that several learning problems satisfy Assumption 6. We now exhibit one such setting where the empirical risk minimizer is Gaussian. We assume that machine belongs to cluster . Recall that denote the data points for machine .

###### Proposition 1.

Suppose the data are sampled from a parametric class of generative model: with covariate and i.i.d noise . Then, with quadratic loss, the distribution of the empirical risk minimizer is Gaussian with mean .
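As a hedged illustration of this proposition, assume for concreteness a linear model $y = \langle x, w^* \rangle + \epsilon$ with Gaussian noise of variance $\sigma^2$ and a design matrix $X$ with full column rank. The symbols $X$, $y$, $\epsilon$ and $\sigma$ are introduced here for the sketch and may differ from the notation of the proposition. With quadratic loss, the local ERM is the least-squares solution:

$$\hat{w} = \operatorname{argmin}_w \|y - Xw\|^2 = (X^\top X)^{-1} X^\top y = w^* + (X^\top X)^{-1} X^\top \epsilon .$$

Conditioned on $X$, the second term is a linear map of a Gaussian vector, so $\hat{w} \sim \mathcal{N}\big(w^*, \sigma^2 (X^\top X)^{-1}\big)$, i.e., the ERM is Gaussian and centered at the true parameter of the machine's cluster.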

In general, sub-Gaussian distributions form a huge class, including all bounded distributions. For non-Byzantine machines, we assume the observation model: where are unknown labels and are the unknown means of the sub-gaussian distribution. We denote as independent and zero mean sub-gaussian noise with parameter . We propose and analyze a robust clustering algorithm presented in Algorithm 2. At iteration , let be the label of the -th data point, and for be the estimate of the centers.

In Algorithm 2, we retain the nearest neighbor assignment of the Lloyd algorithm but change the sample mean estimate to a robust mean estimate using geometric median-based trimming.
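One plausible reading of the center update is sketched below: compute the geometric median of the assigned points, discard the points farthest from it, and average the rest. The exact trimming rule of Algorithm 2 is not reproduced in this section, so `trimmed_center`, `trim_frac` and the Weiszfeld-based `geometric_median` are assumptions made for illustration.

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-8):
    """Approximate the geometric median with Weiszfeld iterations."""
    z = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(points - z, axis=1), eps)
        weights = 1.0 / dist
        z_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

def trimmed_center(points, trim_frac):
    """Drop the trim_frac fraction of points farthest from the geometric median, then average."""
    gm = geometric_median(points)
    dist = np.linalg.norm(points - gm, axis=1)
    keep = max(1, int(np.ceil((1 - trim_frac) * len(points))))
    kept = points[np.argsort(dist)[:keep]]
    return kept.mean(axis=0)

# Hypothetical cluster: 45 good points near (3, 3) and 5 adversarial points.
rng = np.random.default_rng(0)
good = rng.normal(loc=3.0, scale=0.5, size=(45, 2))
bad = np.tile(np.array([50.0, -50.0]), (5, 1))
center = trimmed_center(np.vstack([good, bad]), trim_frac=0.1)
```

The nearest-neighbor assignment step is unchanged from the Lloyd algorithm; only this update replaces the sample mean.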

We now introduce a few new notations. Let denote the minimum separation between clusters. The worst case error in the centers are determined by . Consequently we define as the maximum fraction of misclustered points in a cluster (maximized over all clusters). In Section 5.2 and A.3 (of the supplementary material), these quantities are formally defined along with the initialization condition, .

Recall that and note that from Theorem 7, when Algorithm 2 is run for a constant number of iterations, we get with high probability. Also, let . Since denotes fraction of non-Byzantine machines that are misclustered, denotes the worst case fraction of Byzantine machines in cluster . We assume .

###### Theorem 2.

Suppose Assumptions 2, 3, 4 and 6 hold along with the separation and initialization conditions (Assumption 8 of Section 5). Furthermore, suppose Algorithm 1 is run with “Trimmed means” (Algorithm 2) for stage II for a constant number of iterations; and the trimmed mean algorithm (of [7]) for stage III for iterations with constant step-size of . Then, provided , for all , we have

$$\|\hat{w}_i - w_i^*\| \le \tilde{O}\left(\tilde{\alpha}_i \frac{d}{\sqrt{n}} + \frac{d}{\sqrt{n M_i}}\right)$$

with probability at least .

###### Remark 3.

As before, we can remove Assumption 2 and obtain a guarantee on $F_i(\hat{w}_i) - F_i(w_i^*)$ for all $i$.

Comparison with the oracle: Recall that the oracle knows the cluster labels of all the non-Byzantine machines. Hence, the worst case fraction of Byzantine machines will be . Consequently, we observe that the obliviousness of the clustering identity hurts by a factor of in the precision of learning weight . A few remarks are in order.

###### Remark 4.

As seen in Section D.2, if and , we show that if “Trimmed means” is run for at least iterations provided . Hence our precision bound matches perfectly with the oracle bound.

###### Remark 5.

The dependence on can be improved if iterative filtering algorithm ([8]) is used in stage III of the modular algorithm. We get with high probability.

### 4.1 Oracle optimality

In the presence of the oracle, our problem decomposes to homogeneous ones. We study the dependence of the estimation error of Theorem 2 on , and under such a setting.

##### Dependence on $(n, M_i, \alpha)$:

We compare our results with the lower bounds presented in [7, Observation 1] assuming is constant. It is immediate that the dependence on and is optimal. To see the dependence on , we first consider the special case of with centers . Here . Typically, and hence . Comparing with the bound in [7, Observation 1], the dependence on is near optimal in this case. However for a cluster setting, may not be linear in in general (since is not proportional to ).

##### Dependence on dimension d:

In this setting, instead of running the trimmed mean algorithm as the distributed optimization subroutine, we run the iterative filtering algorithm of [8], and as shown in Remark 5, the dependence on $d$, when compared with the lower bound of [7, Observation 1], is optimal. Note that in this case, the dependence on becomes sub-optimal.

## 5 Robust clustering

In Stage II of the modular algorithm, we cluster the local ERMs in the presence of Byzantine machines. To ease notation, we write . Recall that for non-Byzantine data points, we have , with unknown labels , unknown centers and sub-Gaussian noise . For Byzantine data points is arbitrary. It is worth mentioning here that the classical Lloyd algorithm can be arbitrarily bad since the adversary may put the data points far away, thus causing the sample mean-based subroutine of the algorithm to fail. As a performance measure, we define the fraction of misclustered non-Byzantine data points at iteration as, , where denotes the set of non-Byzantine data points with . We first concentrate on the special case where $K = 2$ with centers $\theta^*$ and $-\theta^*$, and hence . With slight abuse of notation, the labels are in $\{-1, 1\}$ and hence, , where . This can be thought of as estimating $\theta^*$ from samples .

### 5.1 Symmetric 2 clusters with Gaussian mixture

We analyze Algorithm 2 in the above-mentioned setting. The performance depends on the normalized signal-to-noise ratio, , where . At iteration , let be the fraction of data-points being trimmed by Algorithm 2 and let be the estimate of .

###### Assumption 7.

(i) (SNR) We have and (ii) (Initialization) , where , are sufficiently large and is sufficiently small constants.

Hence we require a constant SNR and needs to be slightly better than a random guess.

###### Theorem 3.

Suppose Assumptions 6 and  7 hold. For and for , satisfies

$$A_{s+1} \le A_s\left(A_s + \frac{8}{r^2}\right) + \frac{2}{r^2} + \sqrt{\frac{4\log((1-\alpha)m)}{(1-\alpha)m}}$$

with probability at least . Furthermore, for , with high probability.

Hence, if , then after steps, implying , which matches the oracle bound () mentioned after Theorem 2. Also, here we can tolerate , which can be prohibitive for large . In the general -cluster case, we improve the tolerance level from to (Theorem 4), and in Section 5.4 we completely remove the dependence on .
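To see how the recursion of Theorem 3 behaves, the sketch below iterates its right-hand side as an upper bound. It reads the flattened fractions in the theorem as $8/r^2$ and $2/r^2$, which is an assumption, and the values of `a0`, `r` and `m_good` (standing for $(1-\alpha)m$) are hypothetical.

```python
import math

def theorem3_bound(a0, r, m_good, n_steps):
    """Iterate A_{s+1} = A_s (A_s + 8/r^2) + 2/r^2 + sqrt(4 log(m_good) / m_good)."""
    floor = 2 / r**2 + math.sqrt(4 * math.log(m_good) / m_good)
    a = a0
    history = [a0]
    for _ in range(n_steps):
        a = min(1.0, a * (a + 8 / r**2) + floor)   # a misclustering fraction cannot exceed 1
        history.append(a)
    return history

# Hypothetical values: initial misclustering 0.4, SNR r = 10, 10000 good points.
history = theorem3_bound(a0=0.4, r=10.0, m_good=10_000, n_steps=10)
```

Under these values the bound shrinks quickly during the first few iterations and then flattens at a level set by the additive terms, in line with the statement that a small number of steps suffices.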

### 5.2 K clusters with sub-Gaussian mixture

We now analyze the general -cluster setting and with sub-Gaussian noise. The details of this section are deferred to Section A.3 of the Appendix. Similar to , we define a cluster-wise misclustering fraction and the trimmed cluster-wise misclustering fraction as at iteration . Recall the definition of and from Section 4 and denote the minimum cluster size at iteration as . Also define and as the fraction of adversaries and trimmed points respectively for the -th cluster. Furthermore, let be the maximum adversarial fraction (after trimming) in a cluster and be the normalized SNR.

###### Assumption 8.

We have: (a) ; (b) (SNR) ; and (c) (Initialization) , for a small constant .

Hence the separation (of means) is , which matches the standard separation condition for non-adversarial clustering ([39]). Let , where and . We have the following result:

###### Theorem 4.

With Assumption 8 and , the cluster-wise misclustering fraction satisfies

$$G^U_{s+1} \le \frac{C_5}{r_1^2} + \frac{2C_2}{r_1^2}\, G^U_s + \frac{C_2 C_4}{r_1^2}\, G^U_s + \sqrt{\frac{5K\log((1-\alpha)m)}{\gamma_1^2 (1-\alpha)m}}$$

with probability exceeding . Furthermore, if Algorithm 2 is run for a constant iterations, with high probability.

### 5.3 Initialization

We see that in Theorems 3 and 4, the convergence guarantees require proper initialization. One possible way to achieve these guarantees is to initialize via spectral methods, as for the classical $K$-means algorithm. Spectral methods first project the data onto a lower-dimensional space [39, 40], then use some heuristic scheme to cluster in the low dimensional space, and finally use the obtained labels for initialization. Since a fraction of the data points are corrupted, we need to run a Principal Component Analysis (PCA) that is outlier-robust ([41]). The initialization algorithm can be summarized as:
Step-I: Split the data points in partitions: and . Run the robust PCA algorithm of [41] on to obtain . Denote the first columns of as .
Step-II: Project the data points, , onto , i.e., obtain for all .
Step-III: Run pairwise distance-based clustering algorithm (Algorithm 4 of Appendix B) and use the labels as .

Recall that denotes the initial fraction of misclustered good points. For a mixture of sub-Gaussian distributions, we prove an upper bound on , which in turn yields initialization for Theorem 3 and 4. We have the following assumption which is slightly worse than Assumption 8.

###### Assumption 9.

We have: for sufficiently large and .

###### Lemma 1.

Suppose Assumption 9 holds and we use the labels given by the initialization algorithm as initial labels . Then, the initial fraction of misclustered points , with probability at least , where is a function of and .

Changing the constant in , we can tailor to our requirement for Theorems 3 and 4. The details of the initialization subroutine are deferred to Section B of the supplementary material.

### 5.4 Robust clustering in high dimension

In Sections 5.1 and 5.2, we see that the tolerable fraction of adversarial data points decays fast with the dimension $d$, which makes Algorithm 2 unsuitable for large $d$. Here we analyze the symmetric $2$-cluster setting only. However given 1, our analysis can be extended to the general cluster setting. We adopt a slightly different observation model: the data points are drawn i.i.d. from the following Huber contamination model: with probability $1-\alpha$, a data point $y$ equals $\nu\theta^*$ plus noise, where $\nu$ is a Rademacher random variable and the noise is a sub-Gaussian random vector with zero mean that is independent of $\nu$; with probability $\alpha$, $y$ is drawn from an arbitrary distribution. We assume that the maximum eigenvalue of the covariance matrix of the noise is bounded. More specifically, we let . We denote the distribution of and by and , respectively. Intuitively, with probability $1-\alpha$, $y$ is an inlier, i.e., drawn from a mixture of two symmetric distributions, and with probability $\alpha$, $y$ is an outlier. The goal is to estimate $\theta^*$ and find the correct labels (i.e., $\nu$) of the inliers. We propose Algorithm 3, where the total number of data points is an integer multiple of the number of iterations $T$, and the algorithm uses the Iterative Filtering algorithm [42, 43, 44], denoted by , as a subroutine. The intuition of the iterative filtering algorithm is to use higher order statistics, such as the sample covariance, to iteratively remove outliers.
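The contamination model can be simulated as follows. The Gaussian noise and the uniform outlier distribution are placeholder assumptions (the model allows any outlier distribution), and `sample_huber_mixture` is a hypothetical helper name.

```python
import numpy as np

def sample_huber_mixture(m, theta, alpha, noise_std, rng):
    """Draw m points: inliers are nu * theta + noise, outliers come from a placeholder law."""
    d = len(theta)
    is_outlier = rng.random(m) < alpha              # each point is an outlier w.p. alpha
    nu = rng.choice([-1, 1], size=m)                # Rademacher labels
    y = nu[:, None] * theta[None, :] + noise_std * rng.normal(size=(m, d))
    # placeholder adversary: uniform points in a large box
    y[is_outlier] = rng.uniform(-50, 50, size=(int(is_outlier.sum()), d))
    return y, nu, is_outlier

rng = np.random.default_rng(0)
theta_star = np.array([2.0, -1.0, 0.5, 3.0])
y, nu, is_outlier = sample_huber_mixture(5000, theta_star, alpha=0.1, noise_std=1.0, rng=rng)

# With the true labels known, the inliers recover theta_star by a signed average.
inlier_estimate = (nu[~is_outlier, None] * y[~is_outlier]).mean(axis=0)
```

The clustering task is harder than this check suggests, since neither the labels `nu` nor the outlier mask are observed by the algorithm.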

The convergence guarantee of the algorithm is in Theorem 5. We start with the following assumption:

###### Assumption 10.

We assume: (a) (Initialization) (b) (SNR) and (c) (Sample complexity) .

We emphasize that the SNR requirement is standard and the initialization condition is slightly stronger than a random guess. Armed with the above assumption, we have the following result.

###### Theorem 5.

Suppose that , , and let . With Assumption 10 and running Algorithm 3 for iterations, with probability at least , we have

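$$\|\hat{\theta}^{(T)} - \theta^*\|_2 \le C_3(\tilde{\sigma} + \zeta)\left(\sqrt{\alpha} + \exp\left(-\frac{\|\theta^*\|^2}{4\zeta^2}\right)\right),$$

and

$$\mathbb{P}\left\{\operatorname{argmin}_{\hat{\nu} \in \{-1,1\}} \|\hat{\nu} y - \hat{\theta}^{(T)}\|_2^2 \ne \nu \,\middle|\, y \text{ is inlier}\right\} \le C_4\left(\alpha + \exp\left(-\frac{\|\theta^*\|^2}{2\zeta^2}\right)\right).$$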

Note that the tolerable level of $\alpha$ has no dependence on dimension, which is an improvement over Theorem 3.

## 6 Experiments

We perform extensive experiments on synthetic and real data and compare the performance of our algorithm to several non-robust clustering and/or optimization-based algorithms.

### 6.1 Synthetic data

For synthetic experiments, we use a mixture of linear regressions model. For each cluster, a -dimensional regression coefficient vector, , is generated element-wise by a distribution. Then machines are uniformly assigned to the clusters, and machines are considered adversarial machines. For each good machine, (belonging to cluster, ), data points are generated independently according to: , for all , where and . For adversarial machines, the regression coefficients are sampled from , resulting in outliers. We initialize the cluster assignments with percent correct assignments for the good machines. We test the performance of Lloyd ($K$-means), Trimmed $K$-means (Algorithm 2) and $K$-geomedians (where the sample mean step of Lloyd is replaced by the geometric median; note that this is Algorithm 2 excluding the trimming step). We set and . In Figure 1(a), we see that the fraction of misclustered points (which we call the misclustering rate) indeed diminishes with iteration at a fast rate, which validates Theorem 4, whereas for $K$-means, it converges to a misclustering rate of .
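The data generation described above can be sketched as follows. The specific sizes, noise level and distributions did not survive in this copy of the text, so every number and distribution below is a hypothetical stand-in chosen only to make the sketch runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 3, 5, 100            # hypothetical: clusters, dimension, points per machine
m_good, m_bad = 30, 3          # hypothetical: good and adversarial machines

# One regression coefficient vector per cluster (standard normal entries assumed).
coeffs = rng.normal(size=(K, d))
labels = rng.integers(K, size=m_good)   # good machines assigned uniformly to clusters

def local_erm(w, n, noise_std, rng):
    """Generate n regression samples around w and return the least-squares ERM."""
    X = rng.normal(size=(n, len(w)))
    y = X @ w + noise_std * rng.normal(size=n)
    return np.linalg.lstsq(X, y, rcond=None)[0]

good_erms = np.array([local_erm(coeffs[k], n, 1.0, rng) for k in labels])
# Adversarial machines use coefficients from a much wider distribution (assumed).
bad_erms = np.array([local_erm(rng.normal(scale=10.0, size=d), n, 1.0, rng)
                     for _ in range(m_bad)])
erms = np.vstack([good_erms, bad_erms])   # points handed to the clustering stage
```

The rows of `erms` play the role of the local ERMs sent to the center machine, with the last `m_bad` rows acting as outliers.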

We compare our algorithm consisting of robust clustering (using Trimmed $K$-means or $K$-geomedians) and robust distributed optimization with algorithms without robust subroutines in the clustering or the optimization step. In particular, we use the classical $K$-means as a non-robust clustering algorithm, and a naive sample averaging-based scheme (instead of the robust trimmed mean-based scheme of [7]) as a non-robust distributed algorithm. Also, in the robust optimization stage, we compare with a robust version of the Federated Averaging algorithm of [1] with iterations of gradient descent in each worker node before the global model gets updated (by taking the trimmed mean of the local models in the worker nodes).

We first observe that the estimation error () for non-robust clustering schemes (KM in Figure 1(b)) is higher than that using Trimmed $K$-means (TKM) and $K$-geomedians (KGM). Furthermore, trimmed mean-based distributed optimization (TM) strictly outperforms the sample mean-based (SM) optimization routine by even with robust clustering. Federated averaging (FA) does orders of magnitude worse in estimation, likely due to the poor gradient updates provided by individual machines. Hence, matching our theoretical intuition, robust clustering and robust optimization have the best performance in the presence of adversaries.

### 6.2 Yahoo! Learning to Rank dataset

The performance of the modular algorithm is evaluated on the Yahoo! Learning to Rank dataset [45]. We use the set2.test.txt file for our experiment. We treat the data as unsupervised, ignoring the labels for this simulation. Starting with queries and features, we adopt the following thresholding rule: we draw an edge between queries whose distance is less than (which we optimize at ). We then run a tree-search algorithm to detect the connected components, which give our true cluster assignments. Small groups are removed from the dataset. This results in large clusters. Next, we take the mean of the features in each cluster to obtain . The data points in each cluster are then split randomly into batches of (hence, ). In addition, respecting , adversarial splits are incorporated by sampling points randomly from the unused data and adding a vector to the ERM. Note that we synthetically perturb the data points primarily because it is hard to find datasets with explicit adversaries. We then compute the mean in each split (these can be thought of as analogous to local ERMs), and cluster them using the $K$-means, Trimmed $K$-means, and $K$-geomedians algorithms with fully random initialization. Then, we use trimmed mean, sample mean, or Federated Averaging optimization to estimate the on each of the cluster assignment estimates with mean squared loss.
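The thresholding rule and the subsequent search for connected groups can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a breadth-first search; the function name, the `radius` and `min_size` parameters, and the toy points are hypothetical and do not reflect the values used in the experiment.

```python
import math
from collections import deque

def threshold_clusters(points, radius, min_size=1):
    """Connect two points when their distance is below `radius`, then return
    the connected components with at least `min_size` members as index lists."""
    n = len(points)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:  # breadth-first search over the implicit threshold graph
            i = queue.popleft()
            component.append(i)
            for j in range(n):
                if not seen[j] and math.dist(points[i], points[j]) < radius:
                    seen[j] = True
                    queue.append(j)
        if len(component) >= min_size:  # small groups are discarded
            clusters.append(sorted(component))
    return clusters

toy_points = [(0, 0), (0, 1), (1, 1), (10, 10), (10, 11), (50, 50)]
clusters = threshold_clusters(toy_points, radius=1.5, min_size=2)
```

Here the isolated point at index 5 forms a group of size one and is dropped, mirroring the removal of small groups described above. The quadratic neighbor scan is kept for readability; a spatial index would be the natural replacement on larger data.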

The results of the real data experiments are shown in Figure 1(c). We see that Trimmed $K$-means in conjunction with trimmed mean optimization outperforms the other methods with an estimation error of . This algorithm is easy to implement and learns the optimal weights efficiently. On the other hand, the estimation error of the $K$-means algorithm with sample mean optimization is , which is roughly two times worse than that of the robust algorithms. Also, Trimmed $K$-means and $K$-geomedians have similar final estimation error, which further suggests that the trimming step after computing the geometric median may be redundant. Thus, we once again emphasize that our robust algorithm performs better than standard non-robust algorithms.

## 7 Conclusion and future work

We tackle the problem of robust Federated Learning in a heterogeneous environment. We propose a -step modular solution to the problem. For the second step, we analyze the classical Lloyd algorithm with a robust subroutine, together with a provable initialization scheme. We observe that, for the theoretical guarantees, we need the data points to be sub-Gaussian with a mean separation of . Weakening the sub-Gaussian assumption and finding a better initialization scheme are left as future work. We would also like to develop robust clustering algorithms that adapt well to the dimension.

## Acknowledgments

The authors would like to thank Swanand Kadhe and Prof. Peter Bartlett for helpful discussions.

## Appendix A Theoretical guarantees for Algorithm 2

### A.1 Proof of Proposition 1

Given the parametric form of data generation, we first stack the covariates to form the matrix where . Also we form the vectors and . The objective is to estimate $w_k^*$. We run ordinary least squares, i.e., we compute

$$
\hat{w}^{(i)} = \operatorname*{argmin}_{w} \|X_j - Zw\|^2
$$

By a standard calculation, the ERM is given by

$$
\hat{w}^{(i)} = (Z^\top Z)^{-1} Z^\top (Z w_k^* + \Upsilon) = w_k^* + (Z^\top Z)^{-1} Z^\top \Upsilon
$$

Hence the distribution of is Gaussian. Since for all , .
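The decomposition of the ERM into the true coefficient vector plus a noise-driven term can be checked numerically. The sketch below is only an illustration under assumptions: it uses two features so that the normal equations can be solved in closed form, and the sample size, the value of `w_star`, and the Gaussian noise are all hypothetical choices.

```python
import random

def ols_2d(Z, x):
    """Solve the normal equations (Z^T Z) w = Z^T x for two features,
    using the closed-form inverse of a 2x2 matrix."""
    a = sum(z[0] * z[0] for z in Z)
    b = sum(z[0] * z[1] for z in Z)
    d = sum(z[1] * z[1] for z in Z)
    r0 = sum(z[0] * xi for z, xi in zip(Z, x))
    r1 = sum(z[1] * xi for z, xi in zip(Z, x))
    det = a * d - b * b
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]

rng = random.Random(0)
n = 50
w_star = [1.5, -0.5]                                          # hypothetical true coefficients
Z = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]    # stacked covariates
noise = [rng.gauss(0, 1) for _ in range(n)]                   # plays the role of Upsilon
x = [z[0] * w_star[0] + z[1] * w_star[1] + e for z, e in zip(Z, noise)]

w_hat = ols_2d(Z, x)           # (Z^T Z)^{-1} Z^T (Z w* + Upsilon)
noise_term = ols_2d(Z, noise)  # (Z^T Z)^{-1} Z^T Upsilon
gap = [w_hat[j] - (w_star[j] + noise_term[j]) for j in range(2)]
```

Up to floating-point error, `gap` is zero: the estimation error of the ERM is exactly the linear image of the noise vector, which is why it inherits a Gaussian distribution when the noise is Gaussian.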

### A.2 Symmetric 2 cluster: proof of Theorem 3

Suppose that, after geometric median-based trimming on both the centers at iteration $s$, we retain $(1-\beta)m$ data points.

Let $\hat{\theta}^{(s)}$ be the estimate of $\theta^*$ at iteration $s$. Let us fix some notation. At step $s$, we denote by $U$ the set of data points that are not trimmed, by $T$ the set of trimmed points, and by $B$ the set of adversarially corrupted data points. We have

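$$
\begin{aligned}
\hat{\theta}^{(s)} &= \frac{1}{(1-\beta)m}\sum_{i\in U} z_i^{(s)} y_i \\
&= \frac{1}{(1-\beta)m}\Big[\sum_{i\in M} z_i^{(s)} y_i - \sum_{i\in M\cap T} z_i^{(s)} y_i + \sum_{i\in B\cap U} z_i^{(s)} y_i\Big] \\
&= \frac{1}{(1-\beta)m}\sum_{i\in M} z_i^{(s)} y_i - T_1 + T_2
\end{aligned}
\tag{1}
$$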

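where $T_1 = \frac{1}{(1-\beta)m}\sum_{i\in M\cap T} z_i^{(s)} y_i$ and $T_2 = \frac{1}{(1-\beta)m}\sum_{i\in B\cap U} z_i^{(s)} y_i$. Consequently,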

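$$
\begin{aligned}
\hat{\theta}^{(s)} - \theta^* &= \frac{1}{(1-\beta)m}\sum_{i\in M} z_i^{(s)} y_i - T_1 + T_2 - \theta^* \\
&= \frac{1}{(1-\beta)m}\sum_{i\in M} z_i^{(s)} y_i - \frac{1}{(1-\beta)m}\sum_{i\in M} z_i y_i + \frac{1}{(1-\beta)m}\sum_{i\in M} z_i y_i - T_1 + T_2 - \theta^*.
\end{aligned}
$$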

Since $z_i^{(s)}, z_i \in \{-1, +1\}$, where $z_i$ is the true label of the $i$-th data point, we have the following relation:

$$
z_i^{(s)} - z_i = -2\,\mathbb{I}\{z_i^{(s)} \neq z_i\}\, z_i
$$
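A short case check makes this relation explicit, assuming (as the symmetric two-cluster setting suggests) that both the estimated and the true labels take values in $\{-1, +1\}$:

```latex
\begin{aligned}
z_i^{(s)} = z_i &\;\Longrightarrow\; z_i^{(s)} - z_i = 0
  = -2 \cdot 0 \cdot z_i, \\
z_i^{(s)} \neq z_i &\;\Longrightarrow\; z_i^{(s)} = -z_i
  \;\Longrightarrow\; z_i^{(s)} - z_i = -2 z_i
  = -2 \cdot 1 \cdot z_i .
\end{aligned}
```

In both cases the right-hand side equals $-2\,\mathbb{I}\{z_i^{(s)} \neq z_i\}\, z_i$, so only the misclustered points contribute to the difference between the two sums.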

Let $\gamma = \frac{1-\alpha}{1-\beta}$. Plugging in, we get

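$$
\begin{aligned}
\hat{\theta}^{(s)} - \theta^* &= \frac{1}{(1-\beta)m}\sum_{i\in M} -2\,\mathbb{I}\{z_i^{(s)} \neq z_i\}(\theta^* + \xi_i) + \frac{1}{(1-\beta)m}\sum_{i\in M} z_i y_i - T_1 + T_2 - \theta^* \\
&= \frac{1}{(1-\beta)m}\Big[\sum_{i\in M} -2\,\mathbb{I}\{z_i^{(s)} \neq z_i\}(\theta^* + \xi_i) + \sum_{i\in M}(\theta^* + \xi_i)\Big] - T_1 + T_2 - \theta^* \\
&= \gamma\,\frac{1}{(1-\alpha)m}\Big[\sum_{i\in M} -2\,\mathbb{I}\{z_i^{(s)} \neq z_i\}(\theta^* + \xi_i) + \sum_{i\in M}\xi_i\Big] - T_1 + T_2 + (\gamma - 1)\theta^*.
\end{aligned}
\tag{2}
$$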

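We need a few definitions to proceed further. Recall that $A_s = \frac{1}{(1-\alpha)m}\sum_{i\in M}\mathbb{I}\{z_i^{(s)} \neq z_i\}$ denotes the average error rate over good samples. Also define $R = \frac{1}{(1-\alpha)m}\sum_{i\in M}\mathbb{I}\{z_i^{(s)} \neq z_i\}\,\xi_i$ and $\bar{\tau} = \frac{1}{(1-\alpha)m}\sum_{i\in M}\xi_i$. With the above definitions, we get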

$$
\hat{\theta}^{(s)} - \theta^* = \gamma\big(-2A_s\theta^* - 2R + \bar{\tau}\big) - T_1 + T_2 + (\gamma - 1)\theta^*
$$

As a result

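$$
\langle \hat{\theta}^{(s)}, \theta^* + \xi_i \rangle = \gamma\,\langle \theta^* + \xi_i, (1 - 2A_s)\theta^* - 2R + \bar{\tau} \rangle - \langle \theta^* + \xi_i, T_1 - T_2 \rangle
$$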

The extra term can be controlled using the Cauchy-Schwarz inequality in the following way:

 ⟨θ∗+ξi,T1−T2⟩≤∥θ∗+ξi∥∥T1−T2∥≤(∥θ∗∥+<