# Byzantine Fault Tolerant Distributed Linear Regression

This paper considers the problem of Byzantine fault tolerant distributed linear regression. The system comprises a server and $n$ agents, where each agent $i$ holds some data points and responses. Up to $f$ of the $n$ agents in the system are Byzantine faulty, and the identity of the Byzantine faulty agents is a priori unknown to the server. The data points and responses of honest agents are related linearly through a common parameter, which is to be determined by the server. This seemingly simple problem is challenging to solve due to the Byzantine (or adversarial) nature of faulty agents. We propose a simple norm filtering technique that "robustifies" the original distributed gradient descent algorithm to solve the aforementioned regression problem when $f/n$ is less than a specified threshold value. The computational complexity of the proposed filtering technique is $O(n(d + \log n))$ and the resultant algorithm is shown to be partially asynchronous. Unlike existing algorithms for Byzantine fault tolerance in distributed statistical learning, the proposed algorithm does not rely on assumptions on the probability distribution of agents' data points. The proposed algorithm also solves a more general distributed multi-agent optimization problem under Byzantine faults.


## 1 Introduction

This paper considers the problem of Byzantine fault tolerant distributed linear regression in a multi-agent system. The proposed algorithms, however, are applicable to a more general class of distributed optimization problems (described in Section 5) that includes distributed linear regression. The system comprises a server and $n$ agents, where each agent $i$ holds $n_i$ data points and responses, stacked as a matrix $X_i \in \mathbb{R}^{n_i \times d}$ and a vector $Y_i \in \mathbb{R}^{n_i}$, respectively. Up to $f$ of the $n$ agents in the system are Byzantine faulty, and the identity of the Byzantine faulty agents is a priori unknown to the server [1, 2]. The server knows that if agent $i$ is honest (non-faulty) then its data points and responses satisfy $Y_i = X_i w^*$ for some unknown parameter value $w^* \in \mathbb{R}^d$. The objective of the server is to compute the parameter $w^*$, regardless of the identity of the Byzantine faulty agents. This seemingly simple problem is challenging to solve due to the adversarial nature of Byzantine faulty agents [3]. In fact, it is well known that the existing techniques in robust statistical learning (cf. [4]) are ineffective in solving the aforementioned problem unless certain assumptions on the probability distribution of agents’ data points are satisfied [3, 5, 6].

Existing solutions for Byzantine fault tolerant distributed statistical learning (ref. [5, 6, 7, 8, 9, 10, 11, 12]) rely on assumptions on the probability distribution of honest agents’ data points, and are accurate only in a probabilistic manner (even when there is no noise in the system). In contrast, we are interested in algorithms that can compute $w^*$ accurately (exactly in the absence of noise, and with reasonably bounded error in the presence of noise) in a deterministic manner, under certain conditions on the fraction of Byzantine faulty agents, regardless of the probability distribution of agents’ data points. We also note that all the prior works on Byzantine fault tolerance in distributed statistical learning assume synchronicity in the system, except [12, 7], where every agent has access to all the data points and responses. The proposed algorithms, in contrast, are partially asynchronous and therefore robust to bounded delays in the system.

It should be noted that the above Byzantine fault tolerant linear regression can be used to solve a wide range of engineering problems pertaining to fault-tolerance or security, such as secure distributed state estimation of control systems [13, 14, 15, 16], secure localization [17, 18] and secure pattern recognition [19].

## 2 Summary of Contributions

We propose two norm based filtering techniques, norm filtering and norm-cap filtering, that “robustify” the original distributed gradient descent algorithm to solve the aforementioned regression problem when $f/n$ is less than specified threshold values (refer to Section 7 and Section 9 for further details). The details of the algorithms are given in Sections 6 and 8. The proposed algorithms also solve a more general multi-agent optimization problem where the honest agents’ objective functions (or costs) satisfy certain assumptions, specified in Section 5. The computational complexity of the proposed filtering techniques is $O(n(d + \log n))$, and the resultant algorithms are shown to be partially asynchronous (refer to Section 7.2 for formal details).

Comparison of our paper with the existing related work is given in the following section.

## 3 Related Work

Existing related work can be broadly classified into four categories:

1. Regression with adversarial corruptions to data points or responses.

2. Byzantine fault tolerant distributed estimation.

3. Byzantine fault tolerant distributed learning.

4. Byzantine fault tolerant distributed multi-agent optimization.

### 3.1 Regression with adversarial corruptions

The aforementioned Byzantine fault-tolerant regression problem has been addressed for the centralized setting by many researchers in recent years (ref. [3, 20, 21, 22, 23, 24]), where the server has access to all the agents’ data points and responses. We are interested in a distributed setting, where the data points and responses are distributed amongst agents, and are inaccessible to the server.

#### 3.1.1 Challenges of distributed over centralized setting

The challenges of distributed setting over the centralized counterpart are as follows.

1. Both the data points and responses of Byzantine faulty agents can be corrupted. Some of the centralized techniques (cf. [3, 20]) assume only corrupted responses.

2. Agents could be holding a large volume of data points and responses, which would make sharing the entire data set with the server quite expensive in terms of communication cost. Most of the centralized techniques (cf. [3]) require the server to have access to all the agents’ data points and responses.

3. Server and the agents need not be synchronous. All the centralized techniques rely on synchronicity in the system [3, 21, 20, 23, 24].

Unlike the centralized techniques, our proposed algorithms do not require agents to share their data points or responses with the server, and they are partially asynchronous. While the spectral filters proposed in [23, 24] can be used in the distributed setting, they rely on singular value decomposition (SVD) of the agents’ costs’ gradients (in each iteration) and are therefore orders of magnitude more computationally complex than the proposed norm based filters. Also, unlike [23, 24], we are interested in computing $w^*$ precisely (exactly in the absence of noise and within a reasonably bounded error in the presence of noise) in a deterministic manner.

The ‘hard-thresholding’ based robust regression technique in [3], even for the centralized setting, is effective only if the data points satisfy a certain condition. This condition holds with “high probability” if the probability distribution of the data points is Gaussian with zero mean [3]. It should be noted that the efficacy of our proposed algorithms does not depend on any assumptions on the probability distribution of agents’ data points. Therefore, the proposed algorithms have a much wider applicability than the solutions proposed in [3], even for the centralized case.

### 3.2 Byzantine fault tolerant distributed estimation

In a closely related work, Su and Shahrampour [25] propose coordinate-wise trimmed mean filtering for “robustifying” the distributed gradient descent method in a peer-to-peer network. However, they do not provide an explicit bound on the number of Byzantine faulty agents that can be tolerated using their filtering technique. The convergence of their algorithm relies on a technical assumption (assumption 1 in [25]) that imposes constraints on agents’ data points beyond those required by our proposed algorithms. This point is reiterated by an example in Section 10. The resilient estimation technique proposed in [26] requires agents to commit (or share) their data points and responses to the server (or some central authority in their case), whereas we are interested in a distributed setting where agents do not share their data points or responses with the server or any other agent in the system. In recent years, there has been a significant amount of work on Byzantine fault-tolerant state estimation (both distributed and centralized) of linear time-invariant (LTI) dynamical systems [27, 14, 13, 15, 22]. However, it should be noted that Byzantine fault-tolerant state estimation (also known as secure state estimation) of LTI dynamical systems is a special case of the considered regression problem (ref. [27, 14, 13, 15, 22]). We also note that our proposed algorithms are significantly (orders of magnitude) simpler than some of the secure state estimation algorithms [13, 15], although they can handle relatively fewer Byzantine faulty agents.

### 3.3 Byzantine fault tolerant distributed statistical learning

In recent years, a significant amount of progress has been made on Byzantine fault-tolerant distributed statistical parameter learning [9, 7, 6, 8, 5, 10, 12, 28]. In [6, 28, 7, 8, 12, 9] the agents assume the role of workers in the parallelization of the (stochastic) gradient descent method and therefore have access to all the data points. In [12], the authors propose a data encoding scheme for tolerating Byzantine faulty workers, whereas [6, 28, 7, 8, 9] rely on filters to “robustify” the original distributed stochastic gradient descent method. In [5, 11, 10], the agents have distributed data points and responses; however, it is assumed that all the agents choose their data points and responses following a common probability distribution. Thus, the filtering (or encoding) techniques proposed in these papers are not guaranteed to be effective for the considered problem setting, where no assumptions are made on the probability distribution of agents’ data points. Moreover, we are interested in regression algorithms that compute $w^*$ in a deterministic manner. We also note that the computational complexity for the server in our proposed filtering techniques (both norm filtering and norm-cap filtering) is $O(n(d + \log n))$, which is significantly less than that of the filtering techniques proposed in [6, 5].

### 3.4 Byzantine fault tolerant distributed multi-agent optimization

Byzantine fault-tolerant distributed multi-agent optimization has also received considerable attention in recent years [29, 30, 31, 32, 33]. The objective in that case is to compute the point of minimum of the weighted average cost of the honest agents. If the agents’ costs are scalar (i.e. $d = 1$) then the server can achieve this objective with the weights of a guaranteed minimum number of honest agents bounded away from zero [29, 31]. This result is extended in [30] to multivariate cost functions, where the proposed technique relies on the assumption that agents’ costs are weighted linear combinations of a finite number of convex functions. In general, this assumption does not hold for the regression problem considered in this paper. Further, it is known that the weights cannot be uniform when there is a non-zero number of Byzantine faulty agents in the system if the costs are not correlated [32, 31, 29]. Interestingly, the necessary correlation between honest agents’ costs that would admit equal (positive) weights for all the honest agents in the Byzantine distributed multi-agent optimization problem remains an open problem. In this paper, we present a sufficient correlation between honest agents’ costs under which the weights associated with honest agents’ costs are equal and positive. Specifically, if there exists a common point of minimum for all the honest agents’ costs (refer to Section 5) then the minimizer of the average cost of honest agents can be computed in the presence of a limited number of Byzantine faulty agents (limits specified in Sections 7 and 9). Moreover, the proposed algorithms solve this multi-agent optimization problem efficiently, under the aforementioned sufficient correlation.

The authors of [34] extend the results of [32] to multivariate cost functions by assuming that the original optimization problem can be split into independent scalar sub-problems with strictly convex objective costs. This assumption is quite strong and, in general, does not hold for the considered regression problem setting. The authors of [35] solve the Byzantine fault-tolerant distributed optimization problem assuming that every agent’s cost is strongly convex, which implies that every honest agent can locally compute $w^*$ in the context of the considered regression problem. This assumption is quite strong (it basically trivializes the considered regression problem), and is not required for the effectiveness of our proposed algorithms.

### 3.5 Norm Clipping in Machine Learning

We note that norm clipping (or filtering) of gradients has been proposed before for solving other unrelated problems in machine learning, namely the gradient explosion problem in training of recurrent neural networks [36], and the privacy preservation problem in distributed stochastic gradient descent based training of deep feed-forward neural networks [37]. However, in these works the gradients are clipped based on a constant threshold value that needs to be carefully determined a priori, whereas our filtering techniques rely on the relative ranking of the gradients’ norms at each iteration and do not require computation of any additional threshold value.

## Paper Organization

The rest of the paper is organized as follows. In Section 4, we introduce the notation used throughout the paper. Section 5 presents a formal description of the problem addressed, along with the assumptions made in the paper. Section 6 presents the first filtering technique, referred to as norm filtering. Section 7 presents the convergence analysis of the resultant gradient descent algorithm with norm filtering. Section 8 presents the second filtering technique, referred to as norm-cap filtering. Section 9 presents the convergence analysis of the resultant gradient descent algorithm with norm-cap filtering. Section 10 presents a numerical example demonstrating the obtained convergence results for the proposed algorithm. Finally, concluding remarks are made in Section 11. Appendix A discusses the effect of system noise. Appendix B contains formal proofs of the results.

## 4 Notations

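$\mathbb{Z}$, $\mathbb{N}$, $\mathbb{R}$ and $\mathbb{R}^d$ denote the sets of integers, natural numbers, real numbers and $d$-dimensional real-valued vectors, respectively. $\mathbb{Z}_{\geq 0}$, $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_{>0}$ represent non-negative integers, non-negative reals and positive reals, respectively. Let $[n] = \{1, \ldots, n\}$. For a vector $v \in \mathbb{R}^d$, $v[k]$ denotes its $k$-th element, and $\|v\|$ denotes its Euclidean norm (or 2-norm), which is equal to $\sqrt{v^T v}$. The notation $[a, b]^d$ for $a, b \in \mathbb{R}$ denotes the set of $d$-dimensional vectors with each element belonging to the interval $[a, b]$. For a matrix $M$, $M^T$ denotes its transpose and $M_k$ denotes the column vector corresponding to its $k$-th row. In other words, $M_k$ is the $k$-th column of $M^T$. For a set of matrices $\{X_i\}_{i \in S}$ with $X_i \in \mathbb{R}^{n_i \times d}$, the notation $[X_i]_{i \in S}$ represents the row-wise concatenation of the matrices (stacking of the matrices). Thus, $[X_i]_{i \in S}$ is a matrix of dimensions $\left(\sum_{i \in S} n_i\right) \times d$. The inner product (or scalar product) of two vectors $u, v \in \mathbb{R}^d$ is denoted by $\langle u, v \rangle$ and is equal to $u^T v$. For a multivariate differentiable function $C: \mathbb{R}^d \to \mathbb{R}$, $\nabla C(w)$ denotes its gradient at a point $w \in \mathbb{R}^d$. For a finite set $S$, $|S|$ denotes its cardinality. For a real number $a$, $|a|$ denotes its absolute value.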

## 5 Optimization Framework

As mentioned earlier, we consider a system of $n$ agents and a server, with communication links between all the agents and the server. Agents do not communicate with each other. The system contains at most $f$ Byzantine faulty agents that can behave arbitrarily [2, 1]. The identity of the Byzantine faulty agents is a priori unknown to the server. However, the server knows the value of $f$. Let $\mathcal{H}$ and $\mathcal{B}$ denote the sets of honest (non-faulty) agents and Byzantine faulty agents, respectively.

In this paper, we propose an algorithm to solve a distributed multi-agent optimization problem where each agent $i$ is associated with a differentiable convex cost $C_i: \mathbb{R}^d \to \mathbb{R}$ that satisfies certain assumptions, mentioned below. The objective of the server is to compute a point of minimum of the average cost of the honest agents,

$$C_{\mathcal{H}}(w) = \frac{1}{|\mathcal{H}|} \sum_{i \in \mathcal{H}} C_i(w), \quad \forall w \in \mathbb{R}^d \tag{1}$$

In Section 5.1, we demonstrate the applicability of this optimization framework for the case of least squared-error distributed linear regression. In this optimization problem, we assume the following:

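(A1) Unique point of minimum and strong convexity of reduced average cost:
Assume that $C_{\mathcal{H}}$ has a unique point of minimum $w^*$ in a compact and convex set $\mathcal{W} \subset \mathbb{R}^d$. Further, for any $\widehat{\mathcal{H}} \subseteq \mathcal{H}$ of cardinality at least $n - f$, assume that the average cost of $\widehat{\mathcal{H}}$, i.e. $C_{\widehat{\mathcal{H}}}$, is strongly convex. Specifically,

$$\left\langle w - w', \nabla C_{\widehat{\mathcal{H}}}(w) - \nabla C_{\widehat{\mathcal{H}}}(w') \right\rangle \geq \lambda \|w - w'\|^2, \quad \forall w, w' \in \mathbb{R}^d$$

where $\lambda > 0$.

(A2) $C_i$ minimizes at $w^*$ and $\nabla C_i$ are Lipschitz continuous:
For every $i \in \mathcal{H}$, assume that $\nabla C_i(w^*) = 0$, and

$$\|\nabla C_i(w) - \nabla C_i(w')\| \leq \mu \|w - w'\|, \quad \forall w, w' \in \mathbb{R}^d$$

where $\mu > 0$.

(A3) Strength of Byzantine faulty agents is less than majority:
Assume that the maximum number of Byzantine faulty agents is less than half of the total number of agents, i.e.

$$f < \frac{n}{2}$$

It should be noted that, in general, it is impossible to compute $w^*$ if $f \geq n/2$ when no assumptions are made on the probability distribution of honest agents’ data points [3, 14, 13].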

### 5.1 Least Squared-Error Distributed Linear Regression

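Now, consider the distributed linear regression problem where each agent $i$ is associated with $n_i$ data points and responses, represented by $X_i \in \mathbb{R}^{n_i \times d}$ and $Y_i \in \mathbb{R}^{n_i}$, respectively. The server knows that for each agent $i \in \mathcal{H}$, $Y_i = X_i w^*$ for some parameter $w^* \in \mathbb{R}^d$. The parameter $w^*$ is unknown to the server and is common to all the honest agents (cf. [3]). The objective of the server is to learn a value of $w^*$ (which need not be unique). To solve this regression problem, each agent defines the following squared-error cost

$$C_i(w) = \frac{1}{2}\|Y_i - X_i w\|^2 = \frac{1}{2}\left(w^T X_i^T X_i w - 2 Y_i^T X_i w + \|Y_i\|^2\right), \quad \forall w \in \mathbb{R}^d, \ \forall i \in \mathcal{H}$$

As $w^T X_i^T X_i w = \|X_i w\|^2 \geq 0$ for all $w \in \mathbb{R}^d$, $X_i^T X_i$ is a positive semi-definite matrix. Thus, $C_i$ is convex for all $i \in \mathcal{H}$. Here,

$$\nabla C_i(w) = X_i^T (X_i w - Y_i), \quad \forall w \in \mathbb{R}^d, \ \forall i \in \mathcal{H}$$

As $Y_i = X_i w^*$, we get $\nabla C_i(w^*) = 0$. As the costs are convex, this implies that $w^*$ is a point of minimum of $C_i$ for all $i \in \mathcal{H}$. As $X_i^T X_i$ is positive semi-definite, therefore (cf. [38])

$$(w - w')^T \left(X_i^T X_i\right)^2 (w - w') \leq \bar{\nu}_i^2 \|w - w'\|^2, \quad \forall w, w' \in \mathbb{R}^d$$

where $\bar{\nu}_i$ is the largest eigenvalue of $X_i^T X_i$. This implies,

$$\|\nabla C_i(w) - \nabla C_i(w')\| = \|X_i^T X_i (w - w')\| = \sqrt{(w - w')^T \left(X_i^T X_i\right)^2 (w - w')} \leq \bar{\nu}_i \|w - w'\|$$

for all $w, w' \in \mathbb{R}^d$. Thus, for $\mu = \max_{i \in \mathcal{H}} \bar{\nu}_i$, we get

$$\|\nabla C_i(w) - \nabla C_i(w')\| \leq \mu \|w - w'\|, \quad \forall w, w' \in \mathbb{R}^d, \ \forall i \in \mathcal{H}$$

Hence, assumption (A2) holds naturally for the case of least squared-error linear regression. For any set $\widehat{\mathcal{H}} \subseteq \mathcal{H}$, the average cost is

$$C_{\widehat{\mathcal{H}}}(w) = \frac{1}{|\widehat{\mathcal{H}}|} \sum_{i \in \widehat{\mathcal{H}}} C_i(w) = \frac{1}{2|\widehat{\mathcal{H}}|} \left\|Y_{\widehat{\mathcal{H}}} - X_{\widehat{\mathcal{H}}} w\right\|^2$$

where $Y_{\widehat{\mathcal{H}}} = [Y_i]_{i \in \widehat{\mathcal{H}}}$ and $X_{\widehat{\mathcal{H}}} = [X_i]_{i \in \widehat{\mathcal{H}}}$ are the stacked responses and data points of all the agents in $\widehat{\mathcal{H}}$. Thus,

$$\nabla C_{\widehat{\mathcal{H}}}(w) = \frac{1}{|\widehat{\mathcal{H}}|} X_{\widehat{\mathcal{H}}}^T \left(X_{\widehat{\mathcal{H}}} w - Y_{\widehat{\mathcal{H}}}\right)$$

Therefore,

$$\left\langle w - w', \nabla C_{\widehat{\mathcal{H}}}(w) - \nabla C_{\widehat{\mathcal{H}}}(w') \right\rangle = \frac{1}{|\widehat{\mathcal{H}}|} (w - w')^T X_{\widehat{\mathcal{H}}}^T X_{\widehat{\mathcal{H}}} (w - w') \geq \frac{\underline{\nu}_{\widehat{\mathcal{H}}}}{|\widehat{\mathcal{H}}|} \|w - w'\|^2, \quad \forall w, w' \in \mathbb{R}^d$$

where $\underline{\nu}_{\widehat{\mathcal{H}}}$ is the smallest eigenvalue of $X_{\widehat{\mathcal{H}}}^T X_{\widehat{\mathcal{H}}}$. Thus, if the stacked matrix $X_{\widehat{\mathcal{H}}}$ has rank equal to $d$, i.e. $w^*$ can be uniquely computed from the responses and data points of the honest agents in $\widehat{\mathcal{H}}$, then not only is $w^*$ the unique point of minimum of $C_{\widehat{\mathcal{H}}}$, but $C_{\widehat{\mathcal{H}}}$ is also strongly convex as $\underline{\nu}_{\widehat{\mathcal{H}}} > 0$ (cf. [38]). In other words, if $w^*$ can be uniquely determined given the data points and responses of the agents in $\widehat{\mathcal{H}}$, for all $\widehat{\mathcal{H}} \subseteq \mathcal{H}$ of cardinality $n - f$, then assumption (A1) holds, and

$$\lambda = \frac{1}{n - f} \left( \min_{\widehat{\mathcal{H}} \subseteq \mathcal{H}, \ |\widehat{\mathcal{H}}| = n - f} \underline{\nu}_{\widehat{\mathcal{H}}} \right) > 0$$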

In the discussion above, we only consider the noiseless case. However, the proposed algorithms are effective even when there is (bounded) noise in the system, as discussed in Appendix A.
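As a minimal sketch of the squared-error cost and its gradient, the following plain-Python snippet uses lists in place of matrices and vectors. The helper names (`matvec`, `transpose`, `cost`, `gradient`) and the toy data are illustrative assumptions, not part of the paper. It checks that, in the noiseless case, the gradient of an honest agent's cost vanishes at the true parameter.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def cost(X, Y, w):
    """Squared-error cost C_i(w) = 0.5 * ||Y_i - X_i w||^2."""
    residual = [y - p for y, p in zip(Y, matvec(X, w))]
    return 0.5 * sum(r * r for r in residual)

def gradient(X, Y, w):
    """Gradient of the cost: X_i^T (X_i w - Y_i)."""
    residual = [p - y for p, y in zip(matvec(X, w), Y)]
    return matvec(transpose(X), residual)

# Hypothetical toy data: one honest agent with n_i = 3 data points in d = 2.
w_star = [2.0, -1.0]
X_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y_i = matvec(X_i, w_star)  # noiseless responses, Y_i = X_i w*

grad_at_opt = gradient(X_i, Y_i, w_star)        # zero vector
grad_elsewhere = gradient(X_i, Y_i, [0.0, 0.0])  # equals X_i^T X_i (0 - w*)
```

Each agent only ever sends such a gradient to the server, so the data `X_i`, `Y_i` never leave the agent.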

## 6 Algorithm-I: Gradient Descent with Norm Filtering

The algorithm follows the philosophy of gradient descent based optimization. The server starts with an arbitrary estimate of the parameter and updates it iteratively in two simple steps. In the first step, the server collects the gradients of all the agents’ costs (at the current estimated value of the parameter) and sorts them in increasing order of their 2-norms (breaking ties arbitrarily). In the second step, the server filters out the gradients with the $f$ largest 2-norms, and uses the (vector) sum of the remaining gradients as the update direction. Therefore, the filtering scheme is referred to as norm filtering. The algorithm is formally described as follows.

The server begins with an arbitrary estimate $w^0$ of the parameter and iteratively updates it using the following steps. We let $w^t$ denote the parameter estimate at time $t \in \mathbb{Z}_{\geq 0}$.

S1. At each time $t$, the server requests from each agent the gradient of its cost at the current estimate $w^t$, and sorts the received gradients by their norms. Let,

$$\left\|g^t_{i_1}\right\| \leq \ldots \leq \left\|g^t_{i_{n-f}}\right\| \leq \ldots \leq \left\|g^t_{i_n}\right\|$$

where $\{i_1, \ldots, i_n\} = [n]$ and $g^t_i$ denotes the gradient reported by agent $i$ at time $t$. Note that if agent $i$ is Byzantine faulty then $g^t_i$ is arbitrary, and if agent $i$ is honest and the system is synchronous then $g^t_i = \nabla C_i(w^t)$ (the asynchronous case is discussed in Section 7.2). Let,

$$\mathcal{F}_t = \{i_1, \ldots, i_{n-f}\} \tag{2}$$

be the set of $n - f$ agents with the smallest gradient norms at time $t$.

S2. The server updates its parameter estimate as

$$w^{t+1} = \left[ w^t - \eta_t \cdot \sum_{\sigma \in \mathcal{F}_t} g^t_\sigma \right]_{\mathcal{W}}, \quad \forall t \in \mathbb{Z}_{\geq 0} \tag{3}$$

where $\{\eta_t\}$ is a sequence of bounded positive real values and $[\cdot]_{\mathcal{W}}$ denotes projection onto $\mathcal{W}$ w.r.t. the Euclidean norm, i.e. $[w]_{\mathcal{W}} = \arg\min_{v \in \mathcal{W}} \|w - v\|$.
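The two steps can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `norm_filter_step` is hypothetical, and the compact convex set is assumed here to be a box `[lower, upper]^d`, for which Euclidean projection reduces to coordinate-wise clipping.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(x * x for x in v))

def norm_filter_step(w, reported, f, eta, lower, upper):
    """One iteration of update (3), assuming W is the box [lower, upper]^d."""
    n = len(reported)
    # Step S1: sort agent indices by gradient norm (ties keep list order).
    order = sorted(range(n), key=lambda i: norm(reported[i]))
    kept = order[:n - f]  # the n - f agents with the smallest norms
    # Step S2: sum the kept gradients, take a gradient step, then project.
    direction = [sum(reported[i][k] for i in kept) for k in range(len(w))]
    stepped = [w_k - eta * d_k for w_k, d_k in zip(w, direction)]
    projected = [min(max(x, lower), upper) for x in stepped]
    return projected, kept

# Hypothetical usage: n = 4 agents, f = 1; the last report has a huge norm.
reports = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, -100.0]]
new_w, kept = norm_filter_step([0.0, 0.0], reports, f=1, eta=0.1,
                               lower=-10.0, upper=10.0)
```

In this usage the report with the largest norm is discarded, so the update direction is the sum of the first three gradients only.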

### 6.1 Computational Complexity

In Step S1, the server computes the norms of all $n$ reported gradients in $O(nd)$ time. Sorting these norms takes an additional $O(n \log n)$ time. Thus, the net computational complexity of norm filtering (for the server) is $O(n(d + \log n))$. The computational cost for each agent, in contrast, is that of evaluating its own gradient once per iteration.

In Step S2, the server adds the vectors in the set $\mathcal{F}_t$ to update its parameter estimate, which takes $O(nd)$ time. The projection of the updated estimate onto a known compact convex set $\mathcal{W}$, defined using affine constraints (a bounded polygon), can be computed using the quadratic programming algorithm in [39]. Therefore, the net computational complexity of the algorithm (for the server) per iteration is the $O(n(d + \log n))$ cost of filtering and summation plus the cost of this projection.

### 6.2 Intuition

The principal factor behind the convergence of the proposed algorithm is consensus amongst all the honest agents on $w^*$. Norm filtering bounds the norms of all the gradients used for computing the update direction (even if they are Byzantine faulty gradients) by the norm of an honest agent’s gradient (as there could be at most $f$ Byzantine faulty agents). This has two-fold implications,

1. As the gradients of all the honest agents’ costs vanish at $w^*$ (cf. assumption (A2) and Claim 1), $w^*$ is ensured to be a fixed-point of the iterative algorithm (3).

2. As the gradients of all the honest agents’ costs are Lipschitz continuous (assumption (A2)), the magnitude of the contribution of the adversarial gradients (reported by Byzantine faulty agents) to the update direction is bounded above by the separation between the current estimate and $w^*$ (cf. Claim 1).

The proposed filtering allows contribution of at least $n - 2f$ honest agents’ gradients ($n - 2f > 0$ by assumption (A3)), which pushes the current estimate towards $w^*$ with a force that is also proportional to the separation between the current estimate and $w^*$ for small enough $f$, due to the strong convexity assumption (A1). This gives us an intuition that the effect of adversarial gradients can be overpowered by the honest agents’ gradients in Step S2 at all times if $f$ is small enough.

The insight above is conducive to the formal convergence results presented in the next section, for both synchronous (Section 7.1) and asynchronous (Section 7.2) cases.
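The intuition can be illustrated with a small simulation, given here as a sketch under stated assumptions rather than as the paper's numerical example. The toy data (diagonal data matrices), the constant step-size `eta = 0.08`, the box used for projection, and the adversary's strategy are all hypothetical choices. The adversary negates the smallest honest gradient, so its report always survives the filter and cancels one honest gradient, yet the remaining honest gradients still drive the estimate to the true parameter.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

W_STAR = [1.0, -2.0]  # hypothetical true parameter (d = 2)
# Hypothetical honest agents with X_i = diag(a_i, b_i), so that
# grad C_i(w) = X_i^T X_i (w - w*) = (a_i^2 (w_1 - w*_1), b_i^2 (w_2 - w*_2)).
HONEST_DIAGS = [(1.0, 1.0), (0.95, 1.05), (1.05, 0.95), (1.0, 1.05)]

def honest_gradient(diag, w):
    return [d * d * (w_k - s_k) for d, w_k, s_k in zip(diag, w, W_STAR)]

def byzantine_gradient(honest):
    """One possible attack: negate the honest gradient of smallest norm."""
    smallest = min(honest, key=norm)
    return [-x for x in smallest]

def run(num_steps=200, eta=0.08, f=1, box=10.0):
    w = [5.0, 5.0]
    for _ in range(num_steps):
        honest = [honest_gradient(diag, w) for diag in HONEST_DIAGS]
        reported = honest + [byzantine_gradient(honest)]  # n = 5 reports
        ranked = sorted(reported, key=norm)               # Step S1
        kept = ranked[:len(reported) - f]                 # drop f largest
        w = [w_k - eta * sum(g[k] for g in kept) for k, w_k in enumerate(w)]
        w = [min(max(x, -box), box) for x in w]           # project onto box
    return w

final_w = run()
error = norm([a - b for a, b in zip(final_w, W_STAR)])
```

With these choices each coordinate of the error shrinks by a constant factor below one in every iteration, so `error` ends up negligibly small.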

## 7 Convergence Analysis: Algorithm-I

Before we present the convergence results for Algorithm-I, let us note the following implications of assumptions (A1) and (A2).

###### Claim 1.

Assumptions (A1)-(A2) imply that

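$$\mu \geq \lambda. \tag{4}$$

Moreover, if $f/n < 1/(1 + \mu/\lambda)$ then for any $\mathcal{H}' \subseteq \mathcal{H}$ of cardinality $n - 2f$, we get

$$\nabla C_{\mathcal{H}'}(w) = 0 \ \text{in} \ \mathcal{W} \iff w = w^* \tag{5}$$

where $C_{\mathcal{H}'}(w) = \frac{1}{|\mathcal{H}'|} \sum_{i \in \mathcal{H}'} C_i(w)$ denotes the average cost of the agents in $\mathcal{H}'$.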

###### Proof.

Refer to Appendix B.1. ∎

We rely on the following sufficient criterion for the convergence of non-negative sequences.

###### Lemma 1 (Ref. Bottou, 1998 [40]).

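Consider a sequence of real values $\{u_t\}$. If $u_t \geq 0, \ \forall t$ then

$$\sum_{t=0}^{\infty} (u_{t+1} - u_t)_+ = S^+_\infty < \infty \implies \begin{cases} u_t \xrightarrow[t \to \infty]{} u_\infty < \infty \\ \sum_{t=0}^{\infty} (u_{t+1} - u_t)_- = S^-_\infty > -\infty \end{cases} \tag{6}$$

where the operators $(\cdot)_+$ and $(\cdot)_-$ are defined as follows ($x \in \mathbb{R}$),

$$(x)_+ = \begin{cases} x, & x > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (x)_- = \begin{cases} x, & x < 0 \\ 0, & \text{otherwise} \end{cases}$$

In other words, convergence of the infinite sum of positive variations of a non-negative sequence is sufficient for the convergence of the sequence and of the infinite sum of its negative variations.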

### 7.1 Convergence With Full Synchronism

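We now present the sufficient conditions under which the proposed algorithm converges to $w^*$ when the server and honest agents are synchronous, i.e. we assume:

(A4) Full Synchronism: $g^t_i = \nabla C_i(w^t)$ for all $i \in \mathcal{H}$ and $t \in \mathbb{Z}_{\geq 0}$.

###### Theorem 1.

Under assumptions (A1)-(A4), if $\sum_{t=0}^{\infty} \eta_t = \infty$, $\sum_{t=0}^{\infty} \eta_t^2 < \infty$, and

$$\frac{f}{n} < \frac{1}{1 + 2(\mu/\lambda)} \tag{7}$$

then the sequence of parameter estimates $\{w^t\}$, generated by (3), converges to $w^*$.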

###### Proof.

Refer to Appendix B.3. ∎

Theorem 1 states that if $f/n$ is less than $1/(1 + 2(\mu/\lambda))$ then the proposed algorithm will reach the point of minimum of $C_{\mathcal{H}}$ asymptotically under assumptions (A1)-(A4). As assumptions (A1)-(A3) also imply that $\mu \geq \lambda$ (cf. Claim 1), the right-hand side of (7) is at most $1/3$. Thus, $f$ (the maximum allowable number of Byzantine agents) should be less than one-third of $n$ (the total number of agents) for the proposed algorithm to converge to $w^*$.
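As a worked illustration of condition (7), the following sketch computes the largest number of Byzantine faulty agents that the condition admits for given constants. The function name and the example values of `mu` and `lam` are hypothetical. The condition is checked in the equivalent cross-multiplied form $f(\lambda + 2\mu) < n\lambda$ to avoid rounding at the boundary.

```python
def max_faults_condition_7(n, mu, lam):
    """Largest integer f with f/n < 1 / (1 + 2*mu/lam), i.e. f*(lam + 2*mu) < n*lam.

    Assumes mu >= lam > 0, as implied by Claim 1.
    """
    f = 0
    while (f + 1) * (lam + 2.0 * mu) < n * lam:
        f += 1
    return f

# Hypothetical constants: with mu == lam the threshold is 1/3.
best_case = max_faults_condition_7(n=10, mu=1.0, lam=1.0)    # 3 of 10 agents
worse_case = max_faults_condition_7(n=10, mu=2.0, lam=1.0)   # threshold 1/5
```

The second call shows how a larger ratio of the Lipschitz constant to the strong convexity constant shrinks the tolerable fraction of faulty agents.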

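If assumptions (A1)-(A2) and condition (7) are satisfied, then

$$\frac{f}{n} < \frac{1}{1 + 2(\mu/\lambda)} < \frac{1}{1 + (\mu/\lambda)}$$

and thus (cf. Claim 1),

$$\nabla C_{\mathcal{H}'}(w) = 0 \ \text{in} \ \mathcal{W} \iff w = w^*$$

for all $\mathcal{H}' \subseteq \mathcal{H}$ subject to $|\mathcal{H}'| = n - 2f$. In other words, the point of minimum of the average cost of any $n - 2f$ honest agents is the point of minimum of the average cost of all honest agents. Therefore, under condition (7) and assumptions (A1)-(A2), $C_{\mathcal{H}'}$ is indeed strongly convex for all $\mathcal{H}' \subseteq \mathcal{H}$ of cardinality $n - 2f$.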

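It is known from the control systems literature [14, 41, 13, 16] that the considered linear regression problem can be solved in the presence of at most $f$ Byzantine faulty agents only if the matrix

$$X_{\mathcal{H}'} = [X_i]_{i \in \mathcal{H}'} \in \mathbb{R}^{\left(\sum_{i \in \mathcal{H}'} n_i\right) \times d}$$

has rank equal to $d$ for every subset $\mathcal{H}' \subset \mathcal{H}$ of cardinality $n - 2f$. In light of this information, we make the following additional assumption on the costs to improve the tolerance bound on $f/n$.

(A5) Uniform $2f$-Redundancy:
For any $\mathcal{H}' \subset \mathcal{H}$ of cardinality $n - 2f$, we assume that

$$\left\langle w - w', \nabla C_{\mathcal{H}'}(w) - \nabla C_{\mathcal{H}'}(w') \right\rangle \geq \gamma \|w - w'\|^2, \quad \forall w, w' \in \mathbb{R}^d$$

where $\gamma > 0$ and $C_{\mathcal{H}'}(w) = \frac{1}{|\mathcal{H}'|} \sum_{i \in \mathcal{H}'} C_i(w)$.

For the case of least squared-error linear regression (refer to Section 5.1), similar to $\lambda$ in assumption (A1), we have

$$\gamma = \frac{1}{n - 2f} \left( \min_{\mathcal{H}' \subset \mathcal{H}, \ |\mathcal{H}'| = n - 2f} \underline{\nu}_{\mathcal{H}'} \right)$$

where $\underline{\nu}_{\mathcal{H}'}$ is the smallest eigenvalue of $X_{\mathcal{H}'}^T X_{\mathcal{H}'}$. We refer to the above redundancy as uniform because it is required to hold for all $\mathcal{H}' \subset \mathcal{H}$ of cardinality $n - 2f$. This $2f$-redundancy property of the regression problem is also referred to as $2f$-sparse observability in the control systems literature [16]. Also, note that assumption (A5) is meaningful only if assumption (A3) holds, i.e. $f < n/2$.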
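The rank condition behind this redundancy assumption can be checked by brute force on small instances, as in the sketch below. It is illustrative only: the function names and toy data are hypothetical, the set of honest agents is assumed known (which the server does not know in practice), and the number of subsets grows combinatorially with the number of agents.

```python
from itertools import combinations

def rank(rows, tol=1e-9):
    """Rank of a matrix (list of rows) via Gaussian elimination with pivoting."""
    A = [list(r) for r in rows]
    n_rows = len(A)
    n_cols = len(A[0]) if A else 0
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = max(range(found, n_rows), key=lambda r: abs(A[r][col]),
                    default=None)
        if pivot is None or abs(A[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        A[found], A[pivot] = A[pivot], A[found]
        for r in range(found + 1, n_rows):
            factor = A[r][col] / A[found][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[found])]
        found += 1
    return found

def has_uniform_redundancy(honest_data, n, f, d):
    """True if every n - 2f honest agents have stacked data of rank d."""
    for subset in combinations(sorted(honest_data), n - 2 * f):
        stacked = [row for agent in subset for row in honest_data[agent]]
        if rank(stacked) < d:
            return False
    return True

# Hypothetical data: n = 5 agents, f = 1, four honest agents, d = 2.
good = {0: [[1.0, 0.0]], 1: [[0.0, 1.0]], 2: [[1.0, 1.0]], 3: [[1.0, -1.0]]}
bad = {0: [[1.0, 0.0]], 1: [[2.0, 0.0]], 2: [[3.0, 0.0]], 3: [[0.0, 1.0]]}
good_ok = has_uniform_redundancy(good, n=5, f=1, d=2)
bad_ok = has_uniform_redundancy(bad, n=5, f=1, d=2)
```

In the `bad` data set, agents 0, 1 and 2 only observe the first coordinate, so that subset cannot determine the parameter and the check fails.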

Similar to Claim 1,

###### Claim 2.

Assumptions (A2)-(A3) and (A5) imply that

$$\mu \geq \gamma.$$

###### Proof.

Refer to Appendix B.2. ∎

With assumption (A5), we get the following alternate convergence result for the proposed algorithm.

###### Theorem 2.

Under assumptions (A1)-(A5), if $\sum_{t=0}^{\infty} \eta_t = \infty$, $\sum_{t=0}^{\infty} \eta_t^2 < \infty$, and

$$\frac{f}{n} < \frac{1}{2 + \mu/\gamma} \tag{8}$$

then the sequence of parameter estimates $\{w^t\}$, generated by (3), converges to $w^*$.

###### Proof.

Refer to Appendix B.4. ∎

Theorem 2 states that if $f/n$ is less than $1/(2 + \mu/\gamma)$ then the proposed algorithm reaches the point of minimum of $C_{\mathcal{H}}$ asymptotically under assumptions (A1)-(A5). Owing to Claim 2, the right-hand side in condition (8) is less than or equal to $1/3$.

Instead of using a diminishing step-size, we can use a small enough constant step-size in (3) to obtain linear convergence of the proposed algorithm as stated below.

###### Theorem 3.

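Under assumptions (A1)-(A5), if condition (8) is satisfied then for

$$\eta_t = \eta = \frac{n\gamma - f(2\gamma + \mu)}{\mu^2 (n - f)^2} > 0, \quad \forall t \in \mathbb{Z}_{\geq 0},$$

the sequence of parameter estimates $\{w^t\}$, generated by (3), converges linearly to $w^*$, with

$$\|w^{t+1} - w^*\| \leq \rho \|w^t - w^*\|, \quad \forall t \in \mathbb{Z}_{\geq 0}$$

where $\rho$ is a positive real number of value less than $1$.

###### Proof.

Refer to Appendix B.5. ∎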
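The constant step-size of Theorem 3 is positive exactly when condition (8) holds, since $n\gamma - f(2\gamma + \mu) > 0$ is equivalent to $f/n < 1/(2 + \mu/\gamma)$. A minimal sketch of the computation follows, with a hypothetical function name and hypothetical example constants.

```python
def constant_step_size(n, f, mu, gamma):
    """Step-size (n*gamma - f*(2*gamma + mu)) / (mu^2 * (n - f)^2) of Theorem 3."""
    numerator = n * gamma - f * (2.0 * gamma + mu)
    if numerator <= 0:
        # A non-positive numerator means f/n >= 1 / (2 + mu/gamma).
        raise ValueError("condition (8) is violated for these values")
    return numerator / (mu ** 2 * (n - f) ** 2)

# Hypothetical constants: n = 5 agents, f = 1, mu = gamma = 1.
eta = constant_step_size(n=5, f=1, mu=1.0, gamma=1.0)  # (5 - 3) / 16
```

For `n = 3` and `f = 1` with the same constants the numerator is zero, so the function raises, matching the fact that $1/3$ is not strictly below the threshold $1/3$.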

### 7.2 Convergence With Partial Asynchronism

In practice, the server and the agents need not synchronize. At any given time $t$, some of the honest agents might not be able to report the gradients of their costs at the current estimate $w^t$. This could occur due to various reasons, such as hardware malfunction or large communication delays. In order to cope with such irregularities, the server uses, in step S2, the last reported gradient of an agent that fails to report its cost’s gradient at the current estimate in step S1. Formally, for an agent $i$ that fails to report its gradient at time $t$, the server uses the gradient that agent $i$ last reported, which was evaluated at an earlier estimate. However, we assume the time passed since that last report to be bounded for all honest agents. In other words, we assume partial asynchronism, which is formally stated as follows (cf. Section 7.1 of Bertsekas and Tsitsiklis, 1998 [42]).

(A6) Partial Asynchronism:
For every , where .
Here, is a finite (unknown) positive integer. As the server uses the last available gradient at each time for each agent , thus .
If the server does not receive any gradient from an agent until time (i.e. ), then it assigns .
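The server-side bookkeeping for partial asynchronism can be sketched as a small cache of the most recent gradient from each agent. The class name `GradientCache` is hypothetical, and the value assigned to an agent that has never reported is assumed here to be the zero vector, which is an assumption of this sketch and not taken from the text above.

```python
class GradientCache:
    """Server-side store of the most recent gradient reported by each agent."""

    def __init__(self, num_agents, dim):
        # Assumption of this sketch: an agent that has never reported
        # contributes a zero vector.
        self.last = [[0.0] * dim for _ in range(num_agents)]
        # Iterations since each agent's last report; None means "never".
        self.age = [None] * num_agents

    def update(self, reports):
        """Record this iteration's reports (dict: agent index -> gradient).

        Returns the gradients to use for all agents in this iteration.
        """
        for i in range(len(self.last)):
            if i in reports:
                self.last[i] = list(reports[i])
                self.age[i] = 0
            elif self.age[i] is not None:
                self.age[i] += 1  # reuse the stale gradient
        return [list(g) for g in self.last]

# Hypothetical usage with 3 agents in d = 2.
cache = GradientCache(num_agents=3, dim=2)
first = cache.update({0: [1.0, 2.0], 1: [3.0, 4.0]})   # agent 2 is silent
second = cache.update({1: [5.0, 6.0], 2: [7.0, 8.0]})  # agent 0 is silent
```

The `age` list makes the delay explicit, so a bound on it corresponds to the bounded-delay requirement of partial asynchronism.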

When there are no delays, assumption (A6) is equivalent to assumption (A4), for which the sufficient conditions for convergence of $w^t$ to $w^*$ have already been stated in Theorems 1, 2 and 3. Therefore, in assumption (A6) the delay bound is taken to be positive. Before we state the convergence result under (A6), let us first establish that a certain infinite sum is finite (Lemma 2). This result is used later for showing convergence of $\{w^t\}$, generated by (3), to $w^*$ under the aforementioned partial asynchronism.

###### Lemma 2.

Consider the update law (3) under assumptions (A1)-(A3) and (A6). If and then

###### Proof.

Refer Appendix B.6. ∎

The result in Lemma 2 does not require the sequence $\{\eta_t\}$ to be monotonically decreasing as long as the remaining step-size condition of the lemma holds. However, the proof is simplified under this assumption, and a non-monotonic $\{\eta_t\}$ does not confer any additional advantage as far as asymptotic convergence of $w^t$ is concerned. Also, the commonly used diminishing step-size is indeed monotonically decreasing (cf. [43]).

###### Theorem 4.

Under assumptions (A1)-(A3), (A5) and (A6), if $\sum_{t=0}^{\infty} \eta_t = \infty$, $\sum_{t=0}^{\infty} \eta_t^2 < \infty$, $\eta_t \geq \eta_{t+1}$ for all $t$, and condition (8) holds then the sequence of parameter estimates $\{w^t\}$, generated by (3), converges to $w^*$.

###### Proof.

Refer to Appendix B.7. ∎

The convergence result stated in Theorem 4 is the same as that in Theorem 2, with the synchronicity assumption (i.e. (A4)) replaced by the partial asynchronicity assumption (i.e. (A6)). Similarly, the convergence result stated in Theorem 1 is also valid if assumption (A4) (full synchronism) in Theorem 1 is replaced by assumption (A6) (partial asynchronism).

## 8 Algorithm-II: Gradient Descent With Norm-Cap Filtering

The algorithm in essence is similar to Algorithm-I, only here instead of eliminating the largest agents’ gradients the server caps the largest gradients’ norms by the norm of -th largest reported gradient. Therefore, the filtering scheme is referred as norm-cap filtering. Expectedly, norm-cap filtering improves the sufficiency bound on with respect to (8). The steps of the algorithm are formally described as follows.

Server begins with an arbitrary estimate of the parameter and iteratively updates it using the following steps. We let denote the parameter estimate at time .

1. At each time , the server requests from each agent the gradient of its cost at the current estimate , and sorts the received gradients by their norms. Let,

 ∥∥gti1∥∥≤…≤∥∥gtin−f∥∥≤…≤∥∥gtin∥∥

where, and denotes the gradient reported by agent at time . Note that if then (arbitrary), and if and the system is synchronous then (asynchronous case is discussed in Assumption (A6) of Section 7.2). Let,

$$\mathcal{F}_t = \{i_1, \ldots, i_{n-f}\}$$

be the set of agents with smallest gradient norms at time .

2. The server caps the norms of the gradients reported by agents by as

 (9)

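$$w^{t+1} = \left[ w^t - \eta_t \cdot \left( \sum_{\sigma \in \mathcal{F}_t} g^t_{\sigma} + \sum_{\varrho \in [n] \setminus \mathcal{F}_t} \bar{g}^t_{\varrho} \right) \right]_{\mathcal{W}}, \quad \forall t \in \mathbb{Z}_{\ge 0} \tag{10}$$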

where, is a sequence of bounded positive real values and denotes projection onto w.r.t. Euclidean norm, i.e. .
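One iteration of the two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cap is assumed to be the largest norm among the n − f retained gradients, i.e. ‖g_{i_{n−f}}‖, and the names `norm_cap_update` and `project` are hypothetical. The projection onto W is passed in as a callable and defaults to the identity, which corresponds to assuming W is the whole space.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def norm_cap_update(w, gradients, f, eta, project=lambda v: v):
    """One iteration of gradient descent with norm-cap filtering (sketch)."""
    n = len(gradients)
    # Step 1: sort agent indices by the norms of their reported gradients.
    order = sorted(range(n), key=lambda i: norm(gradients[i]))
    kept, capped = order[: n - f], order[n - f:]
    # ASSUMPTION: the cap is the largest norm among the n - f kept gradients.
    cap = norm(gradients[kept[-1]])
    direction = [0.0] * len(w)
    for i in kept:
        direction = [d + g for d, g in zip(direction, gradients[i])]
    for i in capped:
        g_norm = norm(gradients[i])
        # Rescale so the capped gradient keeps its direction but has norm `cap`.
        scale = cap / g_norm if g_norm > 0 else 0.0
        direction = [d + scale * g for d, g in zip(direction, gradients[i])]
    # Step 2: gradient step followed by projection onto W.
    return project([wk - eta * dk for wk, dk in zip(w, direction)])


# Three agents, f = 1: the gradient of norm 50 is rescaled to norm 2.
print(norm_cap_update([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0], [30.0, 40.0]], f=1, eta=1.0))
```

Unlike norm filtering, every reported gradient contributes to the update direction here; only the magnitude of the f largest ones is limited.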

### 8.1 Modification (Informal): Normalizing Gradients

Instead of capping just the largest gradients, the server could scale the norms of all non-zero gradients to . In which case, the non-zero honest gradients in get amplified, whereas the maximum possible norm of Byzantine faulty agents’ gradients still remains bounded by . Therefore, intuitively, correctness of Algorithm-II implies correctness of this modified version of Algorithm-II, but the other way around need not be true. However, it might be possible to improve the sufficiency bound on by this modification of Algorithm-II. Note that modification of Algorithm-II in this manner is equivalent to normalizing all the agents’ gradients (that are non-zero), and then adding these normalized gradients to compute the update direction at each iteration. Thus, this modification replaces sorting of agents’ gradients in Step S1 with normalization of agents’ gradients.
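The normalization variant can be sketched as below. This is only an informal illustration of the idea in this subsection: it rescales every non-zero gradient to unit norm (any common positive norm would differ only by a factor absorbed into the step-size), and the function name `normalized_direction` is hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def normalized_direction(gradients):
    """Sum of the reported gradients after scaling each non-zero one to unit norm."""
    direction = [0.0] * len(gradients[0])
    for g in gradients:
        g_norm = norm(g)
        if g_norm > 0:  # zero gradients carry no direction and are skipped
            direction = [d + c / g_norm for d, c in zip(direction, g)]
    return direction


print(normalized_direction([[3.0, 4.0], [0.0, 0.0], [0.0, -2.0]]))
```

No sorting is needed in this variant, since each agent's contribution has the same norm regardless of what it reports.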

## 9 Convergence Analysis: Algorithm-II

In this section, we present the convergence of Algorithm-II for the synchronous case. The convergence result is however expected to hold even under partial asynchronism.

###### Theorem 5.

Under assumptions (A1)-(A5), if , , and

$$\frac{f}{n} < \frac{1}{2 + \mu/\gamma - \gamma/\mu} \tag{11}$$

then the sequence of parameter estimates , generated by update law (10), converges to .

###### Proof.

To be included in a revision of this manuscript. ∎

Evidently, the bound on given in (11) is better than the bound in (8), which was obtained for norm filtering given in Section 6. In fact, in an extreme case where is the unique minimizer of every honest agents’ cost, i.e. , then right-hand side of (11) is equal to . Thus, in this extreme case, Algorithm-II solves the regression problem if Byzantine faulty agents are less than the majority, which is in fact the necessary condition for solving the problem.
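To make the comparison concrete, the two thresholds can be evaluated for a few values of the ratio μ/γ. The sketch below assumes that condition (8) has the form f/n < 1/(2 + μ/γ), which is the expression evaluated in the numerical example of Section 10; the function names are hypothetical.

```python
def bound_norm_filter(mu, gamma):
    # ASSUMED form of condition (8): f/n < 1 / (2 + mu/gamma).
    return 1.0 / (2.0 + mu / gamma)


def bound_norm_cap(mu, gamma):
    # Condition (11): f/n < 1 / (2 + mu/gamma - gamma/mu).
    return 1.0 / (2.0 + mu / gamma - gamma / mu)


for ratio in (1.0, 2.0, 4.0):
    print(ratio, bound_norm_filter(ratio, 1.0), bound_norm_cap(ratio, 1.0))
```

For μ/γ = 1 the right-hand side of (11) evaluates to 1/2, while the assumed form of (8) gives 1/3. The gap between the two thresholds narrows as μ/γ grows, because the correction term γ/μ vanishes.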

## 10 Numerical Example

In this section, we present a small numerical example to demonstrate the convergence of norm filtering based gradient descent algorithm, as given by Theorem 2 for the synchronous case, i.e. under assumption (A4).

In this example, we choose , and . Note that assumption (A3) holds readily as . Each agent is associated with data point and a corresponding response , such that

$$Y_i = X_i w^*, \quad w^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \forall i \in [n]$$

The collective data points and responses are:

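$$X_{[n]} = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \\ X_5 \\ X_6 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0.8 & 0.5 \\ 0.5 & 0.8 \\ 0 & 1 \\ -0.5 & 0.8 \\ -0.8 & 0.5 \end{bmatrix}, \quad Y_{[n]} = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{bmatrix} = \begin{bmatrix} 1 \\ 1.3 \\ 1.3 \\ 1 \\ 0.3 \\ -0.3 \end{bmatrix}$$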
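As a quick sanity check, the listed responses can be recomputed from the data points and the common parameter (1, 1). The snippet below is only a verification aid using the six rows given above; the variable names are hypothetical.

```python
# Data points (one row per agent) and responses from the example.
X = [[1.0, 0.0], [0.8, 0.5], [0.5, 0.8], [0.0, 1.0], [-0.5, 0.8], [-0.8, 0.5]]
Y = [1.0, 1.3, 1.3, 1.0, 0.3, -0.3]
w_star = [1.0, 1.0]

# Residual X_i w* - Y_i for every agent; each should vanish up to rounding.
residuals = [sum(xk * wk for xk, wk in zip(x, w_star)) - y for x, y in zip(X, Y)]
print(max(abs(r) for r in residuals))
```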

For the above data points, we get the following:

1. Rank of is equal to for every of cardinality . This implies that assumption (A1) holds with , and is some positive real value whose exact value is not required (refer Section 5.1 for the procedure).

2. Assumption (A2) holds and (refer Section 5.1 for the procedure).

3. Assumption (A5) holds and (refer Section 7.1 for the procedure).

Therefore,

$$\frac{1}{2 + (\mu/\gamma)} \ge 0.17$$

As , thus condition (8) in Theorem 2 is satisfied for this example.

We also note that Assumption 1 in Su and Shahrampour [25], closest related work, does not hold for the given set of data points. Specifically, if and then

where, is the identity matrix, , , and is the -norm of any vector , i.e

$$\|v\|_1 = \sum_{k=1}^{d} |v[k]|$$

Thus, the proposed coordinate-wise trimmed mean filtering technique in [25] is not guaranteed to be effective for this particular case.

Omniscient Byzantine faulty agents: To simulate our proposed algorithm, described in Section 6, we randomly choose an agent to be Byzantine faulty. The chosen Byzantine faulty agent is assumed to have complete knowledge of honest agents’ gradients, and even knows the value of . At each time , the faulty agent reports a gradient that is directed opposite to ( being the parameter estimate at ), to maximize the damage, and has norm equal to the 2nd largest norm of honest agents’ gradients so that it passes through the filter (as in this particular example and so the filtering in step S1 eliminates the gradient with largest norm).

Expectedly (cf. Theorem 2), the proposed algorithm converges to for this example with and step-size , regardless of the identity of Byzantine faulty agent. Note that and (refer. [43]).

Convergence plot of the proposed (with norm filtering) gradient descent algorithm (plotted in ‘blue’) for (chosen randomly for the purpose of simulation) is shown in Figure 1. In the plot, the estimation error is equal to for each iteration (or time) . The initial estimate , Byzantine faulty agent is omniscient and chooses its gradients as described above.

Ill-informed Byzantine faulty agents: It may happen that Byzantine faulty agents are not omniscient, as mentioned above. They could just have access to information held by them. To simulate such faulty behavior, in this example, the Byzantine faulty agent simply reports randomly chosen gradient vectors to the server in step S1. The proposed norm filter converges to , as expected (shown in Figure 2). Whereas, the original gradient descent algorithm does not converge as expected, and often diverges away from as shown in Figure 2.
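The ill-informed fault scenario can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Figures 1 and 2. The following are assumptions made here rather than values from the paper: the step-size 1/(t + 3), the initial estimate (0, 0), the choice of the agent at index 0 as faulty, the range of the random gradients, and the omission of the projection onto W. The names `run`, `norm_filter` and `random_fault` are hypothetical.

```python
import math
import random

# Data from the numerical example (n = 6 agents, d = 2).
X = [[1.0, 0.0], [0.8, 0.5], [0.5, 0.8], [0.0, 1.0], [-0.5, 0.8], [-0.8, 0.5]]
Y = [1.0, 1.3, 1.3, 1.0, 0.3, -0.3]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def honest_gradient(x, y, w):
    # Gradient of C_i(w) = (1/2) (x . w - y)^2 for a single data point x.
    residual = sum(xk * wk for xk, wk in zip(x, w)) - y
    return [xk * residual for xk in x]


def norm_filter(gradients, f):
    # Keep the n - f gradients with the smallest norms (ties broken by index).
    order = sorted(range(len(gradients)), key=lambda i: norm(gradients[i]))
    return [gradients[i] for i in order[: len(gradients) - f]]


def run(byzantine, faulty_gradient, f=1, steps=200, w0=(0.0, 0.0)):
    w = list(w0)
    for t in range(steps):
        grads = [honest_gradient(x, y, w) for x, y in zip(X, Y)]
        grads[byzantine] = faulty_gradient(t, w)  # arbitrary report
        kept = norm_filter(grads, f)
        eta = 1.0 / (t + 3)  # ASSUMED diminishing step-size
        # Projection onto W is omitted (assumes W is the whole plane).
        w = [wk - eta * sum(g[k] for g in kept) for k, wk in enumerate(w)]
    return w


def random_fault(t, w, rng=random.Random(0)):
    # Ill-informed agent: a random vector unrelated to the honest gradients.
    return [rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0)]


w_final = run(byzantine=0, faulty_gradient=random_fault)
print("estimate:", w_final, "error:", norm([w_final[0] - 1.0, w_final[1] - 1.0]))
```

Passing a different callable as `faulty_gradient` lets other fault behaviours be tried with the same loop.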

## 11 Conclusion

This paper proposes two simple norm-based filtering techniques, norm filtering and norm-cap filtering, for “robustifying” the original distributed gradient descent algorithm so that it solves the distributed linear regression problem in the presence of Byzantine faulty agents, provided the maximum possible number of Byzantine faulty agents is less than a specified bound. The proposed “robustification” techniques also solve a more general multi-agent optimization problem with Byzantine faults. We note that the obtained bound on the number of faulty agents, which if satisfied guarantees correctness of the proposed algorithm, relates to the conditioning of the matrix constructed by stacking the data points of the honest agents.

Stopping Failures: Even though the proposed algorithm can handle any kind of fault, including stopping failures (when an agent crashes and stops responding), it is not yet optimal for handling such inadvertent crashes. However, the server can simply define an upper limit on the outdatedness (time passed since the last update) of an agent’s gradient and deem an agent ‘crashed’ if the outdatedness of its gradient exceeds the limit.

## Acknowledgements

Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, and by National Science Foundation award 1610543. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, National Science Foundation or the U.S. Government.

## References

• [1] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 4, no. 3, pp. 382–401, 1982.
• [2] N. A. Lynch, Distributed algorithms.   Elsevier, 1996.
• [3] K. Bhatia, P. Jain, and P. Kar, “Robust regression via hard thresholding,” in Advances in Neural Information Processing Systems, 2015, pp. 721–729.
• [4] P. J. Huber, Robust statistics.   Springer, 2011.
• [5] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 1, no. 2, p. 44, 2017.
• [6] P. Blanchard, R. Guerraoui, J. Stainer et al., “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 119–129.
• [7] G. Damaskinos, R. Guerraoui, R. Patra, M. Taziki et al., “Asynchronous Byzantine machine learning (the case of sgd),” in International Conference on Machine Learning, 2018, pp. 1153–1162.
• [8] X. Cao and L. Lai, “Distributed gradient descent algorithm robust to an arbitrary number of Byzantine attackers,” 2018.
• [9] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar, “signsgd with majority vote is communication efficient and Byzantine fault tolerant,” arXiv preprint arXiv:1810.05291, 2018.
• [10] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2018, pp. 4618–4628.
• [11] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in International Conference on Machine Learning, 2018, pp. 5636–5645.
• [12] D. Data, L. Song, and S. Diggavi, “Data encoding for Byzantine-resilient distributed gradient descent,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2018, pp. 863–870.
• [13] Y. Shoukry, P. Nuzzo, A. Puggelli, A. L. Sangiovanni-Vincentelli, S. A. Seshia, and P. Tabuada, “Secure state estimation for cyber-physical systems under sensor attacks: A satisfiability modulo theory approach,” IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 4917–4932, 2017.
• [14] H. Fawzi, P. Tabuada, and S. Diggavi, “Secure estimation and control for cyber-physical systems under adversarial attacks,” IEEE Transactions on Automatic control, vol. 59, no. 6, pp. 1454–1467, 2014.
• [15] M. Pajic, I. Lee, and G. J. Pappas, “Attack-resilient state estimation for noisy dynamical systems,” IEEE Transactions on Control of Network Systems, vol. 4, no. 1, pp. 82–92, 2017.
• [16] M. S. Chong, M. Wakaiki, and J. P. Hespanha, “Observability of linear systems under adversarial attacks,” in American Control Conference (ACC), 2015.   IEEE, 2015, pp. 2439–2444.
• [17] Z. Li, W. Trappe, Y. Zhang, and B. Nath, “Robust statistical methods for securing wireless localization in sensor networks,” in Proceedings of the 4th international symposium on Information processing in sensor networks.   IEEE Press, 2005, p. 12.
• [18] Y. Zeng, J. Cao, J. Hong, S. Zhang, and L. Xie, “Secure localization and location verification in wireless sensor networks: a survey,” The Journal of Supercomputing, vol. 64, no. 3, pp. 685–701, 2013.
• [19] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 2, pp. 210–227, 2009.
• [20] B. McWilliams, G. Krummenacher, M. Lucic, and J. M. Buhmann, “Fast and robust least squares estimation in corrupted linear models,” in Advances in Neural Information Processing Systems, 2014, pp. 415–423.
• [21] Y. Chen, C. Caramanis, and S. Mannor, “Robust sparse regression under adversarial corruption,” in International Conference on Machine Learning, 2013, pp. 774–782.
• [22] X. Ren, Y. Mo, J. Chen, and K. H. Johansson, “Secure state estimation with Byzantine sensors: A probabilistic approach,” arXiv preprint arXiv:1903.05698, 2019.
• [23] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar, “Robust estimation via robust gradient estimation,” arXiv preprint arXiv:1802.06485, 2018.
• [24] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart, “Sever: A robust meta-algorithm for stochastic optimization,” arXiv preprint arXiv:1803.02815, 2018.
• [25] L. Su and S. Shahrampour, “Finite-time guarantees for Byzantine-resilient distributed state estimation with noisy measurements,” arXiv preprint arXiv:1810.10086, 2018.
• [26] Y. Chen, S. Kar, and J. M. Moura, “Resilient distributed estimation through adversary detection,” IEEE Transactions on Signal Processing, vol. 66, no. 9, pp. 2455–2469, 2018.
• [27] A. Mitra and S. Sundaram, “Byzantine-resilient distributed observers for lti systems,” 2018.
• [28] C. Xie, O. Koyejo, and I. Gupta, “Generalized Byzantine-tolerant sgd,” arXiv preprint arXiv:1802.10116, 2018.
• [29] L. Su and N. H. Vaidya, “Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms,” in Proceedings of the 2016 ACM symposium on principles of distributed computing.   ACM, 2016, pp. 425–434.
• [30] ——, “Robust multi-agent optimization: coping with Byzantine agents with input redundancy,” in International Symposium on Stabilization, Safety, and Security of Distributed Systems.   Springer, 2016, pp. 368–382.
• [31] S. Sundaram and B. Gharesifard, “Distributed optimization under adversarial nodes,” IEEE Transactions on Automatic Control, 2018.
• [32] L. Su and N. Vaidya, “Multi-agent optimization in the presence of Byzantine adversaries: fundamental limits,” in 2016 American Control Conference (ACC).   IEEE, 2016, pp. 7183–7188.
• [33] F. Fanitabasi, “A review of adversarial behaviour in distributed multi-agent optimisation,” in 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion).   IEEE, 2018, pp. 53–58.
• [34] Z. Yang and W. U. Bajwa, “Byrdie: Byzantine-resilient distributed coordinate descent for decentralized learning,” 2017.
• [35] W. Xu, Z. Li, and Q. Ling, “Robust decentralized dynamic optimization at presence of malfunctioning agents,” Signal Processing, vol. 153, pp. 24–33, 2018.
• [36] R. Pascanu, T. Mikolov, and Y. Bengio, “Understanding the exploding gradient problem,” CoRR, abs/1211.5063, vol. 2, 2012.
• [37] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC conference on computer and communications security.   ACM, 2015, pp. 1310–1321.
• [38] R. A. Horn and C. R. Johnson, Matrix analysis.   Cambridge university press, 1990.
• [39] Y. Ye and E. Tse, “An extension of Karmarkar’s projective algorithm for convex quadratic programming,” Mathematical programming, vol. 44, no. 1-3, pp. 157–179, 1989.
• [40] L. Bottou, “Online learning and stochastic approximations,” On-line learning in neural networks, vol. 17, no. 9, p. 142, 1998.
• [41] M. Pajic, J. Weimer, N. Bezzo, P. Tabuada, O. Sokolsky, I. Lee, and G. J. Pappas, “Robustness of attack-resilient state estimators,” in ICCPS’14: ACM/IEEE 5th International Conference on Cyber-Physical Systems (with CPS Week 2014).   IEEE Computer Society, 2014, pp. 163–174.
• [42] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods.   Prentice hall Englewood Cliffs, NJ, 1989, vol. 23.
• [43] W. Rudin et al., Principles of mathematical analysis.   McGraw-hill New York, 1964, vol. 3.
• [44] S. Boyd and L. Vandenberghe, Convex optimization.   Cambridge university press, 2004.

## Appendix A: Noisy Gradients

In practice, honest agents might not report their costs’ gradients accurately due to reasons such as system noise or quantization errors. Specifically, in case of synchronous execution we assume the following.

• Noisy Gradients: For each honest agent , assume that

$$g^t_i = \nabla C_i(w^t) + D_i(w^t), \quad \forall t \in \mathbb{Z}_{\ge 0}$$

where, .

### A.1 Noisy Responses in Linear Regression

The above approximate gradient framework models the case of noisy responses in distributed linear regression, where

$$Y_i = X_i w^* + \xi_i, \quad \|\xi_i\| \le \xi < \infty, \quad \forall i \in \mathcal{H} \tag{12}$$

The actual error cost of an agent at an estimated parameter value is

$$C_i(w) = (1/2)\|X_i w - X_i w^*\|^2 \tag{13}$$

However, agent can only observe , and not . Therefore, the error cost observed by agent at an estimated parameter value is

$$\hat{C}_i(w) = (1/2)\|X_i w - Y_i\|^2$$

Thus, the reported gradient of an agent at any time , in Step S1 of the Algorithm given in Section 6, is given as follows (for the synchronous case).

$$g^t_i = \nabla \hat{C}_i(w^t) = X_i^T (X_i w^t - Y_i)$$

Substituting (12) above gives

$$g^t_i = X_i^T X_i (w^t - w^*) - X_i^T \xi_i$$

As (cf. (13)), thus for the synchronous case,

$$g^t_i = \nabla C_i(w^t) - X_i^T \xi_i, \quad \forall i \in \mathcal{H}, \ \forall t \in \mathbb{Z}_{\ge 0}$$

Note that the above gradient is a special case of the noisy gradient model in Assumption (A7), where . As , thus

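$$\|D_i(w^t)\| = \sqrt{\xi_i^T (X_i X_i^T) \xi_i} \le \sqrt{u_i}\,\|\xi_i\| \le \sqrt{u_i}\,\xi, \quad \forall t \in \mathbb{Z}_{\ge 0}, \ \forall i \in \mathcal{H}$$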

where, is the largest eigenvalue of positive semi-definite matrix . Let , then

$$\|D_i(w^t)\| \le u\,\xi < \infty, \quad \forall t \in \mathbb{Z}_{\ge 0}, \ \forall i \in \mathcal{H}$$
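The decomposition of the reported gradient into the true gradient and a bounded noise term can be checked numerically for a single agent. The sketch below assumes an agent holding one data point (so its data matrix is a single row and the product of that row with its transpose is a scalar); the noise value 0.1 and the test point are arbitrary illustrative choices.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


x = [0.8, 0.5]          # single data point X_i (one row)
w_star = [1.0, 1.0]     # true parameter
xi = 0.1                # hypothetical response noise
y = dot(x, w_star) + xi  # noisy response, as in (12)
w = [0.0, 0.0]          # arbitrary point at which gradients are evaluated

# Reported gradient X_i^T (X_i w - Y_i).
reported = [xk * (dot(x, w) - y) for xk in x]
# True gradient X_i^T X_i (w - w*), i.e. the gradient of the cost in (13).
true_grad = [xk * dot(x, [wk - sk for wk, sk in zip(w, w_star)]) for xk in x]
# Noise term D_i(w) = -X_i^T xi_i.
noise = [-xk * xi for xk in x]

u_i = dot(x, x)  # largest (here: only) eigenvalue of X_i X_i^T
noise_norm = math.sqrt(dot(noise, noise))
print(reported, [g + d for g, d in zip(true_grad, noise)])
print(noise_norm, math.sqrt(u_i) * abs(xi))
```

For a single-row data matrix the norm bound holds with equality; with several rows per agent it is in general a strict inequality.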

### A.2 Convergence Analysis: Algorithm-I With System Noise

Intuitively, it is impossible in general for any algorithm to compute accurately when none of the agents report gradients of their costs accurately. However, if the algorithm is robust enough then it can compute a point in the neighborhood of , whose size usually depends on the magnitude of inaccuracies (or noise) in the agents’ gradients. For the proposed algorithm with update law (3) in Section 6, we can guarantee convergence to a neighborhood of whose size, expectedly, depends on and also on the fraction of maximum possible Byzantine faulty agents .

###### Theorem 6.

Consider the update law (3) given in Section 6 under assumptions (A1)-(A3), (A5) and (A7). If , , and condition (8) holds then for

$$D^* = \frac{1}{\gamma} \left( \frac{1 - 2(f/n)}{1 - (f/n)(2 + \mu/\gamma)} \right) D$$

there exists a finite such that

$$\|w^t - w^*\| \le D^*, \quad \forall t \ge \tau$$

###### Proof.

Refer to Appendix B.8. ∎

Theorem 6 states that the final inaccuracy of the solution obtained by the server using the algorithm given in Section 6 can be at most w.r.t -norm. In case ,

$$D^* = \left(\frac{1}{\gamma}\right) D$$
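The dependence of the error radius on the fraction of faulty agents can be explored with a small helper. This is a sketch only: the function name `error_radius` and the sample values are hypothetical, and the validity check assumes that condition (8) has the form f/n < 1/(2 + μ/γ), which is equivalent to the denominator in the expression for D* being positive.

```python
def error_radius(D, gamma, mu, f, n):
    """Evaluate D* from Theorem 6 for a gradient-noise bound D (sketch)."""
    rho = f / n
    denom = 1.0 - rho * (2.0 + mu / gamma)
    if denom <= 0:
        # ASSUMED form of condition (8) is violated; D* is not defined.
        raise ValueError("f/n too large for the assumed condition (8)")
    return (1.0 / gamma) * ((1.0 - 2.0 * rho) / denom) * D


print(error_radius(D=0.5, gamma=2.0, mu=4.0, f=0, n=6))  # no faulty agents: D / gamma
print(error_radius(D=1.0, gamma=1.0, mu=2.0, f=1, n=6))  # one faulty agent out of six
```

With f = 0 the helper reduces to D/γ, and the radius grows as f/n approaches the threshold at which the denominator vanishes.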

For now, we have only considered the synchronous case. However, using similar arguments as in assumption (A6) and Theorem 4, the above convergence result is expected to hold even when there is partial asynchronicity in the system.

## Appendix B: Proofs

### B.1 Proof of Claim 1

As is convex for all , thus assumption (A2) implies

$$\nabla C_i(w^*) = 0, \quad \forall i \in \mathcal{H}$$

Lipschitz continuity (assumption (A2)) of further implies

$$\|\nabla C_i(w)\| \le \mu \|w - w^*\|, \quad \forall w \in \mathbb{R}^d, \ \forall i \in \mathcal{H}$$

Combining this inequality with the Cauchy-Schwarz inequality implies,

 (14)

From assumption (A1),

 (15)

as