1 Introduction
This paper considers the problem of Byzantine fault-tolerant distributed linear regression in a multi-agent system. The proposed algorithms, however, are applicable to a more general class of distributed optimization problems (described in Section 5) that includes distributed linear regression. The system comprises a server and $n$ agents, where each agent $i$ holds $n_i$ data points and responses, stacked as a matrix $A_i$ and a vector $B_i$, respectively. Up to $f$ of the $n$ agents in the system are Byzantine faulty, and the identity of the Byzantine faulty agents is a priori unknown to the server [1, 2]. The server knows that if agent $i$ is honest (non-faulty) then its data points and responses satisfy $B_i = A_i x^*$ for some unknown parameter value $x^*$. The objective of the server is to compute the parameter $x^*$, regardless of the identity of the Byzantine faulty agents. This seemingly simple problem is challenging to solve due to the adversarial nature of Byzantine faulty agents [3]. In fact, it is well known that the existing techniques in robust statistical learning (cf. [4]) are ineffective in solving the aforementioned problem unless certain assumptions on the probability distribution of agents’ data points are satisfied [3, 5, 6].

Existing solutions for Byzantine fault-tolerant distributed statistical learning (ref. [5, 6, 7, 8, 9, 10, 11, 12]) rely on assumptions on the probability distribution of honest agents’ data points for accuracy in a probabilistic manner (even when there is no noise in the system). In contrast, we are interested in algorithms that can accurately (in the absence of noise, and with reasonably bounded error in the presence of noise) compute $x^*$ in a deterministic manner, under certain conditions on $f$, regardless of the probability distribution of agents’ data points. We also note that all the prior works on Byzantine fault tolerance in distributed statistical learning assume synchronicity in the system, except [12, 7], where every agent has access to all the data points and responses. The proposed algorithms, in contrast, are partially asynchronous, and therefore robust to bounded delays in the system.
It should be noted that the above Byzantine fault-tolerant linear regression can be used to solve a wide range of engineering problems pertaining to fault tolerance or security, such as secure distributed state estimation of control systems [13, 14, 15, 16], secure localization [17, 18], and secure pattern recognition
[19].

2 Summary of Contributions
We propose two norm-based filtering techniques, norm filtering and norm-cap filtering, that “robustify” the original distributed gradient descent algorithm to solve the aforementioned regression problem when $f$ is less than specified threshold values (refer to Sections 7 and 9 for further details). The details of the algorithms are given in Sections 6 and 8. The proposed algorithms also solve a more general multi-agent optimization problem where the honest agents’ objective functions (or costs) satisfy certain assumptions, specified in Section 5. The computational complexity of the proposed filtering techniques is $O(n(d + \log n))$, and the resultant algorithms are shown to be partially asynchronous (refer to Section 7.2 for formal details).
Comparison of our paper with the existing related work is given in the following section.
3 Related Work
Existing related work can be broadly classified into four categories:

1. Regression with adversarial corruptions to data points or responses.

2. Byzantine fault-tolerant distributed estimation.

3. Byzantine fault-tolerant distributed learning.

4. Byzantine fault-tolerant distributed multi-agent optimization.
3.1 Regression with adversarial corruptions
The aforementioned Byzantine fault-tolerant regression problem has been addressed for the centralized setting by many researchers in recent years (ref. [3, 20, 21, 22, 23, 24]), where the server has access to all the agents’ data points and responses. We are interested in a distributed setting, where the data points and responses are distributed amongst the agents, and are inaccessible to the server.
3.1.1 Challenges of distributed over centralized setting
The challenges of distributed setting over the centralized counterpart are as follows.

Agents could hold a large volume of data points and responses, which would make sharing the entire data set with the server quite expensive in terms of communication cost. Most of the centralized techniques (cf. [3]) require the server to have access to all the agents’ data points and responses.
Unlike the centralized techniques, our proposed algorithms do not require agents to share their data points or responses with the server, and they are partially asynchronous. While the spectral filters proposed in [23, 24] can be used in the distributed setting, they rely on singular value decomposition (SVD) of the agents’ costs’ gradients (in each iteration), and are therefore orders of magnitude more computationally complex than the proposed norm-based filters. Also, unlike [23, 24], we are interested in computing $x^*$ precisely (in the absence of noise, and within a reasonably bounded error in the presence of noise) in a deterministic manner.

The ‘hard-thresholding’ based robust regression technique in [3], even for the centralized setting, is effective only if the data points satisfy a certain condition. This condition holds with “high probability” if the probability distribution of the data points is Gaussian with zero mean [3]. It should be noted that the efficacy of our proposed algorithms does not depend on any assumptions on the probability distribution of agents’ data points. Therefore, the proposed algorithms have a much wider applicability than the solutions proposed in [3], even for the centralized case.
3.2 Byzantine fault tolerant distributed estimation
In a closely related work, Su and Shahrampour [25] propose coordinate-wise trimmed-mean filtering for “robustifying” the distributed gradient descent method in a peer-to-peer network. However, they do not provide an explicit bound on the number of Byzantine faulty agents that can be tolerated using their filtering technique. The convergence of their algorithm relies on a technical assumption (Assumption 1 in [25]) that imposes additional constraints on the agents’ data points beyond those required by our proposed algorithms. This point is reiterated by an example in Section 10. The resilient estimation technique proposed in [26] requires agents to commit (or share) their data points and responses to the server (or some central authority in their case), whereas we are interested in a distributed setting where agents do not share their data points or responses with the server or any other agent in the system. In recent years, there has been a significant amount of work on Byzantine fault-tolerant state estimation (both distributed and centralized) of linear time-invariant (LTI) dynamical systems [27, 14, 13, 15, 22]. However, it should be noted that Byzantine fault-tolerant state estimation (a.k.a. secure state estimation) of LTI dynamical systems is a special case of the considered regression problem (ref. [27, 14, 13, 15, 22]). We also note that our proposed algorithms are significantly (orders of magnitude) simpler than some of the secure state estimation algorithms [13, 15], albeit they tolerate a relatively smaller number of Byzantine faulty agents.
3.3 Byzantine fault tolerant distributed statistical learning
In recent years, a significant amount of progress has been made on Byzantine fault-tolerant distributed statistical parameter learning [9, 7, 6, 8, 5, 10, 12, 28]. In [6, 28, 7, 8, 12, 9], the agents assume the role of workers in the parallelization of the (stochastic) gradient descent method, and therefore the agents have access to all the data points. In [12], the authors propose a data encoding scheme for tolerating Byzantine faulty workers, whereas [6, 28, 7, 8, 9] rely on filters to “robustify” the original distributed stochastic gradient descent method. In [5, 11, 10], the agents have distributed data points and responses; however, it is assumed that all the agents choose their data points and responses following a common probability distribution. Thus, the filtering (or encoding) techniques proposed in these papers are not guaranteed to be effective for the considered problem setting, where no assumptions are made on the probability distribution of agents’ data points. Moreover, we are interested in regression algorithms that compute $x^*$ in a deterministic manner. We also note that the computational complexity for the server in our proposed filtering techniques (both norm filtering and norm-cap filtering) is $O(n(d + \log n))$, which is significantly less than that of the filtering techniques proposed in [6, 5].

3.4 Byzantine fault tolerant distributed multi-agent optimization
Byzantine fault-tolerant distributed multi-agent optimization has also received considerable attention in recent years [29, 30, 31, 32, 33]. The objective in that case is to compute the point of minimum of the weighted average cost of the honest agents. If the agents’ costs are scalar (i.e., $d = 1$) then the server can achieve this objective with the weights of sufficiently many honest agents bounded away from zero [29, 31]. This result is extended in [30] to multivariate cost functions, where the proposed technique relies on the assumption that the agents’ costs are weighted linear combinations of a finite number of convex functions. In general, this assumption does not hold for the regression problem considered in this paper. Further, it is known that the weights cannot be uniform when there is a nonzero number of Byzantine faulty agents in the system if the costs are not correlated [32, 31, 29]. Interestingly, the necessary correlation between honest agents’ costs that would admit equal (positive) weights for all the honest agents in the Byzantine distributed multi-agent optimization problem remains an open problem. In this paper, we present a sufficient correlation between honest agents’ costs under which the weights associated with the honest agents’ costs are equal and positive. Specifically, if there exists a common point of minimum for all the honest agents’ costs (refer to Section 5) then the minimizer of the average cost of the honest agents can be computed in the presence of a limited (limits specified in Sections 7 and 9) number of Byzantine faulty agents. Moreover, the proposed algorithms solve this multi-agent optimization problem efficiently, under the aforementioned sufficient correlation.
The authors in [34] extend the results of [32] to multivariate cost functions by assuming that the original optimization problem can be split into independent scalar subproblems with strictly convex objective costs. This assumption is quite strong and, in general, does not hold for the considered regression problem setting. The authors in [35] solve the Byzantine fault-tolerant distributed optimization problem assuming that every agent’s cost is strongly convex, which implies that every honest agent can locally compute $x^*$ in the context of the considered regression problem. This assumption is quite strong (it basically trivializes the considered regression problem), and is not required for the effectiveness of our proposed algorithms.
3.5 Norm Clipping in Machine Learning
We note that norm clipping (or filtering) of gradients has been proposed before for solving other, unrelated problems in machine learning, namely the gradient explosion problem in the training of recurrent neural networks [36], and the privacy preservation problem in distributed stochastic gradient descent based training of deep feedforward neural networks [37]. However, in these works the gradients are clipped based on a constant threshold value that needs to be determined carefully a priori, whereas our filtering techniques rely on the relative ranking of the gradients’ norms at each iteration and do not require computation of any additional threshold value.

Paper Organization
The rest of the paper is organized as follows. In Section 4, we introduce the notation used throughout the paper. Section 5 presents a formal description of the problem addressed, along with the assumptions made in the paper. Section 6 presents the first filtering technique, referred to as norm filtering. Section 7 presents the convergence analysis of the resultant gradient descent algorithm with norm filtering. Section 8 presents the second filtering technique, referred to as norm-cap filtering. Section 9 presents the convergence analysis of the resultant gradient descent algorithm with norm-cap filtering. Section 10 presents a numerical example demonstrating the obtained convergence results for the proposed algorithm. Finally, concluding remarks are made in Section 11. Appendix A discusses the effect of system noise. Appendix B contains formal proofs of the results.
4 Notations
$\mathbb{Z}$, $\mathbb{N}$, $\mathbb{R}$ and $\mathbb{R}^d$ denote the sets of integers, natural numbers, real numbers and $d$-dimensional real-valued vectors, respectively. $\mathbb{Z}_{\geq 0}$, $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_{> 0}$ represent nonnegative integers, nonnegative reals and positive reals, respectively. Let $d \in \mathbb{N}$. For a vector $v \in \mathbb{R}^d$, $v(k)$ denotes its $k$th element, and $\lVert v \rVert$ denotes its Euclidean norm (or $2$-norm), which is equal to $\sqrt{\sum_{k=1}^{d} v(k)^2}$. The notation $[a, b]^d$ for $a \leq b$ denotes the set of $d$-dimensional vectors with each element belonging to the interval $[a, b]$. For a matrix $M$, $M^T$ denotes its transpose and $M_k$ denotes the column vector corresponding to its $k$th row. In other words, $M_k$ is the $k$th column of $M^T$. For a set of matrices $\{M_1, \ldots, M_m\}$, each with $d$ columns, the notation $[M_1; \ldots; M_m]$ represents the row-wise concatenation (stacking) of the matrices. Thus, if each $M_i$ has $r_i$ rows, then $[M_1; \ldots; M_m]$ is a matrix of dimensions $(\sum_{i} r_i) \times d$. The inner product (or scalar product) of two vectors $u, v \in \mathbb{R}^d$ is denoted by $\langle u, v \rangle$ and is equal to $u^T v$. For a multivariate differentiable function $F : \mathbb{R}^d \to \mathbb{R}$, $\nabla F(x)$ denotes its gradient at a point $x$. For a finite set $S$, $|S|$ denotes its cardinality. For a real number $a$, $|a|$ denotes its absolute value.
5 Optimization Framework
As mentioned earlier, we consider a system of $n$ agents and a server, with communication links between all the agents and the server. Agents do not communicate with each other. The system contains at most $f$ Byzantine faulty agents that can behave arbitrarily [2, 1]. The identity of the Byzantine faulty agents is a priori unknown to the server. However, the server knows the value of $f$. Let $\mathcal{H}$ and $\mathcal{B}$ denote the sets of honest (non-faulty) agents and Byzantine faulty agents, respectively.
In this paper, we propose an algorithm to solve a distributed multi-agent optimization problem where each agent $i$ is associated with a differentiable convex cost $F_i : \mathbb{R}^d \to \mathbb{R}$ that satisfies certain assumptions mentioned below. The objective of the server is to compute a point of minimum $x^*$ of the average cost of the honest agents,

$$x^* \in \arg\min_{x} \; \frac{1}{|\mathcal{H}|} \sum_{i \in \mathcal{H}} F_i(x). \tag{1}$$
In Section 5.1, we demonstrate the applicability of this optimization framework for the case of least squared-error distributed linear regression. In this optimization problem, we assume the following:

(A1) Unique point of minimum and strong convexity of the reduced average cost:
Assume that $\sum_{i \in \mathcal{H}} F_i(x)$ has a unique point of minimum $x^*$ in a compact and convex set $\mathcal{W}$. Further, for any $H \subseteq \mathcal{H}$ of cardinality at least $|\mathcal{H}| - f$, assume that the average cost of $H$, i.e. $\frac{1}{|H|} \sum_{i \in H} F_i(x)$, is strongly convex. Specifically,
$$\Big\langle x - x^*, \, \frac{1}{|H|} \sum_{i \in H} \nabla F_i(x) \Big\rangle \geq \gamma \, \lVert x - x^* \rVert^2, \quad \forall x \in \mathcal{W},$$
where $\gamma > 0$.

(A2) Each $F_i$ minimizes at $x^*$ and the gradients $\nabla F_i$ are Lipschitz continuous:
For every $i \in \mathcal{H}$, assume that $\nabla F_i(x^*) = 0$, and
$$\lVert \nabla F_i(x) - \nabla F_i(x') \rVert \leq \mu \, \lVert x - x' \rVert, \quad \forall x, x' \in \mathcal{W},$$
where $\mu > 0$.

(A3) Strength of Byzantine faulty agents is less than a majority:
Assume that the maximum number of Byzantine faulty agents is less than half of the total number of agents, i.e. $f < n/2$. It should be noted that, in general, it is impossible to compute $x^*$ if $f \geq n/2$ when no assumptions are made on the probability distribution of the honest agents’ data points [3, 14, 13].
5.1 Least SquaredError Distributed Linear Regression
Now, consider the distributed linear regression problem where each agent $i$ is associated with $n_i$ data points and responses, represented by $A_i \in \mathbb{R}^{n_i \times d}$ and $B_i \in \mathbb{R}^{n_i}$, respectively. The server knows that for each honest agent $i$, $B_i = A_i x^*$ for some parameter $x^*$. The parameter $x^*$ is unknown to the server and is common for all the honest agents (cf. [3]). The objective of the server is to learn a value of $x^*$ (which need not be unique). To solve this regression problem, each agent $i$ defines the following squared-error cost
$$F_i(x) = \frac{1}{2} \lVert B_i - A_i x \rVert^2.$$
As $\nabla^2 F_i(x) = A_i^T A_i$, which is a positive semi-definite matrix, $F_i(x)$ is convex for all $i$. Here,
$$\nabla F_i(x) = A_i^T (A_i x - B_i).$$
As $B_i = A_i x^*$, thus $\nabla F_i(x^*) = 0$. As the costs are convex, this implies that $x^*$ is a point of minimum of $F_i(x)$ for all $i$. As $A_i^T A_i$ is positive semi-definite, therefore (cf. [38])
$$\lVert \nabla F_i(x) - \nabla F_i(x') \rVert = \lVert A_i^T A_i (x - x') \rVert \leq \lambda_{\max}\big(A_i^T A_i\big) \, \lVert x - x' \rVert,$$
where $\lambda_{\max}(A_i^T A_i)$ is the largest eigenvalue of $A_i^T A_i$. Thus, for $\mu = \max_{i \in \mathcal{H}} \lambda_{\max}(A_i^T A_i)$, we get
$$\lVert \nabla F_i(x) - \nabla F_i(x') \rVert \leq \mu \, \lVert x - x' \rVert, \quad \forall i \in \mathcal{H}.$$
Hence, assumption (A2) holds naturally for the case of least squared-error linear regression. For any set $H \subseteq \mathcal{H}$, the average cost is
$$\frac{1}{|H|} \sum_{i \in H} F_i(x) = \frac{1}{2 |H|} \lVert B_H - A_H x \rVert^2,$$
where $B_H$ and $A_H$ are the stacked responses and data points of all the agents in $H$. Thus,
$$\frac{1}{|H|} \sum_{i \in H} \nabla F_i(x) = \frac{1}{|H|} A_H^T A_H \, (x - x^*).$$
Therefore,
$$\Big\langle x - x^*, \, \frac{1}{|H|} \sum_{i \in H} \nabla F_i(x) \Big\rangle \geq \frac{\lambda_{\min}\big(A_H^T A_H\big)}{|H|} \, \lVert x - x^* \rVert^2,$$
where $\lambda_{\min}(A_H^T A_H)$ is the smallest eigenvalue of $A_H^T A_H$. Thus, if the stacked matrix $A_H$ has rank equal to $d$, i.e. $x^*$ can be uniquely computed from the responses and data points of the honest agents in $H$, then not only is $x^*$ the unique point of minimum of $\frac{1}{|H|} \sum_{i \in H} F_i(x)$, but the cost is also strongly convex, as $\lambda_{\min}(A_H^T A_H) > 0$ (cf. [38]). In other words, if $x^*$ can be uniquely determined given the data points and responses of the agents in $H$, for all $H \subseteq \mathcal{H}$ of cardinality at least $|\mathcal{H}| - f$, then assumption (A1) holds, and
$$\gamma = \min_{\substack{H \subseteq \mathcal{H} \\ |H| \geq |\mathcal{H}| - f}} \; \frac{\lambda_{\min}\big(A_H^T A_H\big)}{|H|}.$$
In the discussion above, we only consider the noiseless case. However, the proposed algorithms are effective even when there is (bounded) noise in the system, as discussed in Appendix A.
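The construction above can be sketched numerically. The data below are randomly generated stand-ins (not the paper's example), and `mu` and `gamma` correspond to the Lipschitz and strong-convexity constants derived in this section:

```python
import numpy as np

# Randomly generated stand-in data (noiseless case): agent i holds (A_i, B_i)
# with B_i = A_i x*, for a common unknown parameter x*.
rng = np.random.default_rng(0)
d, n_agents = 2, 5
x_star = rng.normal(size=d)
A = [rng.normal(size=(3, d)) for _ in range(n_agents)]  # data points A_i
B = [Ai @ x_star for Ai in A]                           # responses B_i

def grad(i, x):
    """Gradient of the squared-error cost F_i(x) = ||B_i - A_i x||^2 / 2."""
    return A[i].T @ (A[i] @ x - B[i])

# Lipschitz constant of the honest gradients: max largest eigenvalue of A_i^T A_i.
mu = max(np.linalg.eigvalsh(Ai.T @ Ai).max() for Ai in A)

# Strong convexity of the average cost over the full honest set:
# rank(A_H) = d, so the smallest eigenvalue of A_H^T A_H is positive.
A_H = np.vstack(A)
gamma = np.linalg.eigvalsh(A_H.T @ A_H).min() / n_agents
```

Note that `grad(i, x_star)` vanishes for every agent, which is exactly the common-minimizer property used in assumption (A2).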
6 Algorithm I: Gradient Descent with Norm Filtering
The algorithm follows the philosophy of gradient descent based optimization. The server starts with an arbitrary estimate of the parameter and updates it iteratively in two simple steps. In the first step, the server collects the gradients of all the agents’ costs (at the current estimated value of the parameter) and sorts them in increasing order of their norms (breaking ties arbitrarily in the order). In the second step, the server filters out the $f$ gradients with the largest norms, and uses the (vector) sum of the remaining gradients as the update direction. Therefore, the filtering scheme is referred to as norm filtering. The algorithm is formally described as follows.
The server begins with an arbitrary estimate of the parameter and iteratively updates it using the following steps. We let $x^t$ denote the parameter estimate at time $t$.

At each time $t$, the server requests from each agent the gradient of its cost at the current estimate $x^t$, and sorts the received gradients by their norms. Let
$$\lVert g_{i_1}^t \rVert \leq \lVert g_{i_2}^t \rVert \leq \cdots \leq \lVert g_{i_n}^t \rVert,$$
where $g_i^t$ denotes the gradient reported by agent $i$ at time $t$. Note that if $i \in \mathcal{B}$ then $g_i^t$ is arbitrary, and if $i \in \mathcal{H}$ and the system is synchronous then $g_i^t = \nabla F_i(x^t)$ (the asynchronous case is discussed in Section 7.2). Let
$$\mathcal{F}^t = \{i_1, \ldots, i_{n-f}\} \tag{2}$$
be the set of $n - f$ agents with the smallest gradient norms at time $t$.

The server updates $x^t$ as
$$x^{t+1} = \Big[\, x^t - \alpha_t \sum_{j \in \mathcal{F}^t} g_j^t \,\Big]_{\mathcal{W}}, \tag{3}$$
where $\{\alpha_t\}$ is a sequence of bounded positive real values and $[\,\cdot\,]_{\mathcal{W}}$ denotes projection onto $\mathcal{W}$ w.r.t. the Euclidean norm, i.e. $[x]_{\mathcal{W}} = \arg\min_{y \in \mathcal{W}} \lVert y - x \rVert$.
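Steps S1 and S2 can be sketched as follows. This is a minimal Python illustration (not the paper's implementation), with the projection supplied as a callable:

```python
import numpy as np

def norm_filter_step(x, gradients, f, alpha, project):
    """One iteration of gradient descent with norm filtering.

    `gradients` holds the n vectors reported at the current estimate x
    (Byzantine agents may report anything). The server keeps the n - f
    gradients of smallest norm (step S1), descends along their sum, and
    projects back onto the constraint set W (step S2)."""
    order = np.argsort([np.linalg.norm(g) for g in gradients])
    kept = order[: len(gradients) - f]            # discard the f largest norms
    direction = np.sum([gradients[j] for j in kept], axis=0)
    return project(x - alpha * direction)
```

For instance, if one faulty agent among three reports a huge gradient, the filter drops that report and the update is driven entirely by the honest ones.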
6.1 Computational Complexity
In step S1, the server computes the norms of all the reported gradients in $O(nd)$ time. Sorting these norms takes an additional $O(n \log n)$ time. Thus, the net computational complexity of norm filtering (for the server) is $O(n(d + \log n))$. The computational complexity for each agent $i$ is that of evaluating its gradient, i.e. $O(n_i d)$ for the regression costs.

In step S2, the server adds all the vectors in the set $\{g_j^t : j \in \mathcal{F}^t\}$ to update its parameter estimate in $O(nd)$ time. The projection of the updated estimate onto a known compact convex set $\mathcal{W}$ defined using affine constraints (a bounded polytope) can be done in polynomial time using the quadratic programming algorithm in [39]. Therefore, the net computational complexity of the algorithm (for the server) is $O(n(d + \log n))$, plus the cost of the projection, per iteration.
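When $\mathcal{W}$ is chosen as an axis-aligned box (one simple compact convex set, an assumption made here for illustration), the Euclidean projection decouples per coordinate and reduces to clipping, which costs only $O(d)$ and avoids a general quadratic program:

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box W = [lo, hi]^d.

    For a box, argmin_{y in W} ||y - x|| decouples coordinate-wise,
    so the projection is simply clipping each coordinate to [lo, hi]."""
    return np.clip(x, lo, hi)
```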
6.2 Intuition
The principal factor behind the convergence of the proposed algorithm is the consensus amongst all the honest agents on $x^*$. Norm filtering bounds the norms of all the gradients used for computing the update direction (even the Byzantine faulty gradients) by the norm of an honest agent’s gradient (as there can be at most $f$ Byzantine faulty agents). This has twofold implications:

As the gradients of all the honest agents’ costs are Lipschitz continuous (assumption (A2)), the magnitude of the contribution of the adversarial gradients (reported by Byzantine faulty agents) to the update direction is bounded above in proportion to the separation between the current estimate and $x^*$ (cf. Claim 1).

The proposed filtering admits the contribution of at least $n - 2f$ honest agents’ gradients ($n - 2f > 0$ by assumption (A3)), which pushes the current estimate towards $x^*$ with a force that is also proportional to the separation between the current estimate and $x^*$, due to the strong convexity assumption (A1). This gives us the intuition that the effect of the adversarial gradients can be overpowered by the honest agents’ gradients in step S2 at all times if $\alpha_t$ is small enough.
7 Convergence Analysis: Algorithm I
Before we present the convergence results for Algorithm I, let us note the following implications of assumptions (A1) and (A2).
Claim 1.
Assumptions (A1)–(A2) imply that
$$\lVert \nabla F_i(x) \rVert \leq \mu \, \lVert x - x^* \rVert, \quad \forall x \in \mathcal{W}, \; \forall i \in \mathcal{H}. \tag{4}$$
Moreover, if $f < n/2$ then for any $H \subseteq \mathcal{H}$ of cardinality $|H| \geq n - 2f$, we get
$$\Big\langle x - x^*, \, \sum_{i \in H} \nabla F_i(x) \Big\rangle \geq |H| \, \gamma \, \lVert x - x^* \rVert^2, \quad \forall x \in \mathcal{W}, \tag{5}$$
where $\gamma > 0$ is the constant in assumption (A1).
Proof.
Refer to Appendix B.1. ∎
We rely on the following sufficient criterion for the convergence of nonnegative sequences.
Lemma 1 (Ref. Bottou, 1998 [40]).
Consider a sequence of nonnegative real values $\{u_t\}_{t \geq 0}$. If $\sum_{t=0}^{\infty} (u_{t+1} - u_t)_{+} < \infty$, then
$$u_t \xrightarrow[t \to \infty]{} u_{\infty} < \infty \quad \text{and} \quad \sum_{t=0}^{\infty} (u_{t+1} - u_t)_{-} > -\infty, \tag{6}$$
where the operators $(\cdot)_{+}$ and $(\cdot)_{-}$ are defined as follows ($\forall a \in \mathbb{R}$):
$$(a)_{+} = \max\{a, 0\}, \qquad (a)_{-} = \min\{a, 0\}.$$
In other words, convergence of the infinite sum of the positive variations of a nonnegative sequence is sufficient for the convergence of the sequence and of the infinite sum of its negative variations.
7.1 Convergence With Full Synchronism
We now present sufficient conditions under which the proposed algorithm converges to $x^*$ when the server and the honest agents are synchronous, i.e. we assume:

(A4) Full Synchronism: $g_i^t = \nabla F_i(x^t)$ for all $i \in \mathcal{H}$ and all $t$.
Theorem 1.
Under assumptions (A1)–(A4), if $\alpha_t > 0$ for all $t$, $\sum_{t} \alpha_t = \infty$, $\sum_{t} \alpha_t^2 < \infty$, and $f$ satisfies
(7) 
then the sequence of parameter estimates $\{x^t\}$, generated by (3), converges to $x^*$.
Proof.
Refer to Appendix B.3. ∎
Theorem 1 states that if $f$ is less than the threshold in (7) then the proposed algorithm reaches the point of minimum $x^*$ of the honest agents’ average cost asymptotically under assumptions (A1)–(A4). As assumptions (A1)–(A3) also imply that $\gamma \leq \mu$ (cf. Claim 1), $f$ (the maximum allowable number of Byzantine faulty agents) should be less than one-third of $n$ (the total number of agents) for the proposed algorithm to converge to $x^*$.
If assumptions (A1)–(A2) and condition (7) are satisfied, then $f < n/2$, and thus (cf. Claim 1), inequality (5) holds for all $H \subseteq \mathcal{H}$ subject to $|H| \geq n - 2f$. In other words, the point of minimum of the average cost of any $n - 2f$ honest agents is the point of minimum of the average cost of all the honest agents. Therefore, under condition (7) and assumptions (A1)–(A2), the average cost $\frac{1}{|H|} \sum_{i \in H} F_i$ is indeed strongly convex for all $H \subseteq \mathcal{H}$ of cardinality at least $n - 2f$.
It is known from the control systems literature [14, 41, 13, 16] that the considered linear regression problem can be solved in the presence of at most $f$ Byzantine faulty agents only if the stacked data matrix $A_H$ has rank equal to $d$ for every subset $H$ of agents of cardinality $n - 2f$. In light of this information, we make the following additional assumption on the costs to improve the tolerance bound on $f$.

(A5) Uniform Redundancy:
For any subset $H$ of agents of cardinality $n - 2f$, we assume that
$$\Big\langle x - x^*, \, \sum_{i \in H} \nabla F_i(x) \Big\rangle \geq \gamma_r \, \lVert x - x^* \rVert^2, \quad \forall x \in \mathcal{W},$$
where $\gamma_r > 0$ and $x^*$ is as in assumption (A1).

For the case of least squared-error linear regression (refer to Section 5.1), similar to $\gamma$ in assumption (A1), we have
$$\gamma_r = \min_{|H| = n - 2f} \lambda_{\min}\big(A_H^T A_H\big),$$
where $\lambda_{\min}(A_H^T A_H)$ is the smallest eigenvalue of $A_H^T A_H$. We refer to the above redundancy as uniform because it is required to hold for all $H$ of cardinality $n - 2f$. This redundancy property of the regression problem is also referred to as sparse observability in the control systems literature [16]. Also, note that assumption (A5) is meaningful only if assumption (A3) holds, i.e. $f < n/2$.
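For small instances, the uniform redundancy (sparse observability) rank condition can be checked directly by enumerating subsets. A brute-force sketch, with the hypothetical helper name `sparse_observable` (exponential in the number of agents, so suitable only for small examples):

```python
import numpy as np
from itertools import combinations

def sparse_observable(A_list, f, d):
    """Brute-force check of sparse observability: the stacked data matrix
    of EVERY subset of n - 2f agents must have rank d. Enumerates all
    such subsets, so the cost grows exponentially with n."""
    n = len(A_list)
    return all(
        np.linalg.matrix_rank(np.vstack([A_list[i] for i in S])) == d
        for S in combinations(range(n), n - 2 * f)
    )
```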
Similar to Claim 1, we have the following claim.

Claim 2.
Assumptions (A2)–(A3) and (A5) imply that $\gamma_r \leq (n - 2f)\,\mu$.

Proof.
Refer to Appendix B.2. ∎
With assumption (A5), we get the following alternate convergence result for the proposed algorithm.
Theorem 2.
Under assumptions (A1)–(A5), if $\alpha_t > 0$ for all $t$, $\sum_{t} \alpha_t = \infty$, $\sum_{t} \alpha_t^2 < \infty$, and $f$ satisfies
(8) 
then the sequence of parameter estimates $\{x^t\}$, generated by (3), converges to $x^*$.
Proof.
Refer to Appendix B.4. ∎
Theorem 2 states that if $f$ is less than the threshold in (8) then the proposed algorithm reaches the point of minimum $x^*$ of the honest agents’ average cost asymptotically under assumptions (A1)–(A5). Owing to Claim 2, the right-hand side in condition (8) is less than or equal to $n/2$.
Instead of using a diminishing step-size, we can use a small enough constant step-size in (3) to obtain linear convergence of the proposed algorithm, as stated below.
Theorem 3.
Proof.
Refer to Appendix B.5. ∎
7.2 Convergence With Partial Asynchronism
In practice, the server and the agents need not be synchronized. At any given time $t$, some of the honest agents might not be able to report the gradients of their costs at the current estimate $x^t$. This could occur due to various reasons, such as hardware malfunction or large communication delays. In order to cope with such irregularities, the server uses, in step S2, the last reported gradient of an agent that fails to report its cost’s gradient at the current estimate in step S1. Formally, for an agent $i$ that fails to report its gradient at time $t$, the server uses the last reported gradient $\nabla F_i(x^{t - \tau_i(t)})$, where $\tau_i(t)$ is the time passed since agent $i$ last reported its gradient. However, we assume $\tau_i(t)$ to be bounded for all $i$ and $t$. In other words, we assume partial asynchronism, which is formally stated as follows (cf. Section 7.1 of Bertsekas and Tsitsiklis, 1998 [42]).

(A6) Partial Asynchronism:
For every $i \in \mathcal{H}$ and every $t$, $\tau_i(t) \leq \tau$, where $\tau \in \mathbb{Z}_{\geq 0}$.
Here, $\tau$ is a finite (unknown) positive integer. As the server uses the last available gradient at each time $t$ for each agent $i$, we have $g_i^t = \nabla F_i(x^{t - \tau_i(t)})$ for every honest agent $i$.
If the server does not receive any gradient from an agent $i$ until time $t$, then it assigns $g_i^t = 0$ for that agent.
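The server-side bookkeeping just described can be sketched as a small cache. The class name is hypothetical, and the zero-vector default before any report is an assumption of this sketch:

```python
import numpy as np

class GradientCache:
    """Keeps each agent's last reported gradient. An agent that has not
    yet reported contributes the zero vector; otherwise the server falls
    back on its most recent (possibly stale) report."""
    def __init__(self, n, d):
        self.last = [np.zeros(d) for _ in range(n)]
    def report(self, i, g):
        self.last[i] = np.asarray(g, dtype=float)
    def gradient(self, i):
        return self.last[i]
```

Under assumption (A6), the staleness of each stored gradient is bounded by $\tau$, which is what the convergence analysis below relies on.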
If $\tau = 0$ then assumption (A6) is equivalent to assumption (A4), in which case the sufficient conditions for the convergence of $x^t$ to $x^*$ have already been stated in Theorems 1, 2 and 3. Therefore, in assumption (A6) we consider $\tau > 0$. Before we state the convergence result under (A6), let us first establish that a certain infinite sum involving the step-sizes over the delay window is finite. This result is used later for showing convergence of $x^t$, generated by (3), to $x^*$ under the aforementioned partial asynchronism.
Lemma 2.
Consider the update law (3) under assumptions (A1)–(A3) and (A6). If $\sum_{t} \alpha_t^2 < \infty$ and $\{\alpha_t\}$ is monotonically decreasing, then the aforementioned infinite sum is finite.
Proof.
Refer to Appendix B.6. ∎
The result in Lemma 2 does not require the sequence $\{\alpha_t\}$ to be monotonically decreasing as long as $\sum_{t} \alpha_t^2 < \infty$. However, the proof is simplified under this assumption, and a non-monotonic $\{\alpha_t\}$ does not confer any additional advantage as far as the asymptotic convergence of $x^t$ is concerned. Also, the commonly used diminishing step-size $\alpha_t \propto 1/t$ is indeed monotonically decreasing (cf. [43]).
Theorem 4.
Proof.
Refer to Appendix B.7. ∎
The convergence result stated in Theorem 4 is the same as that in Theorem 2, with the partial asynchronism assumption (A6) replacing the synchronism assumption (A4). Similarly, the convergence result stated in Theorem 1 remains valid if assumption (A4) (full synchronism) in Theorem 1 is replaced by assumption (A6) (partial asynchronism).
8 Algorithm II: Gradient Descent with Norm-Cap Filtering
The algorithm is in essence similar to Algorithm I; only here, instead of eliminating the $f$ largest agents’ gradients, the server caps the $f$ largest gradients’ norms by the norm of the $(f+1)$-th largest reported gradient. Therefore, the filtering scheme is referred to as norm-cap filtering. Expectedly, norm-cap filtering improves the sufficiency bound on $f$ with respect to (8). The steps of the algorithm are formally described as follows.
The server begins with an arbitrary estimate of the parameter and iteratively updates it using the following steps. We let $x^t$ denote the parameter estimate at time $t$.

At each time $t$, the server requests from each agent the gradient of its cost at the current estimate $x^t$, and sorts the received gradients by their norms. Let
$$\lVert g_{i_1}^t \rVert \leq \lVert g_{i_2}^t \rVert \leq \cdots \leq \lVert g_{i_n}^t \rVert,$$
where $g_i^t$ denotes the gradient reported by agent $i$ at time $t$. Note that if $i \in \mathcal{B}$ then $g_i^t$ is arbitrary, and if $i \in \mathcal{H}$ and the system is synchronous then $g_i^t = \nabla F_i(x^t)$ (the asynchronous case is discussed in assumption (A6) of Section 7.2). Let
$$\mathcal{F}^t = \{i_1, \ldots, i_{n-f}\}$$
be the set of $n - f$ agents with the smallest gradient norms at time $t$.

The server caps the norms of the gradients reported by the remaining $f$ agents by $\lVert g_{i_{n-f}}^t \rVert$ as
$$\tilde{g}_j^t = \begin{cases} g_j^t, & j \in \mathcal{F}^t, \\[4pt] \dfrac{\lVert g_{i_{n-f}}^t \rVert}{\lVert g_j^t \rVert} \, g_j^t, & j \notin \mathcal{F}^t, \end{cases} \tag{9}$$
and updates $x^t$ as
$$x^{t+1} = \Big[\, x^t - \alpha_t \sum_{j=1}^{n} \tilde{g}_j^t \,\Big]_{\mathcal{W}}, \tag{10}$$
where $\{\alpha_t\}$ is a sequence of bounded positive real values and $[\,\cdot\,]_{\mathcal{W}}$ denotes projection onto $\mathcal{W}$ w.r.t. the Euclidean norm, i.e. $[x]_{\mathcal{W}} = \arg\min_{y \in \mathcal{W}} \lVert y - x \rVert$.
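One iteration of this scheme can be sketched as follows (a minimal Python illustration, not the paper's implementation):

```python
import numpy as np

def norm_cap_step(x, gradients, f, alpha, project):
    """One iteration of gradient descent with norm-cap filtering.

    Instead of dropping the f largest-norm gradients, rescale each of
    them so its norm equals the (n - f)-th smallest gradient norm, then
    descend along the sum of all n (possibly rescaled) gradients."""
    norms = np.array([np.linalg.norm(g) for g in gradients])
    order = np.argsort(norms)
    cap = norms[order[len(gradients) - f - 1]]    # (n - f)-th smallest norm
    capped = [
        g if norms[j] <= cap else g * (cap / norms[j])
        for j, g in enumerate(gradients)
    ]
    return project(x - alpha * np.sum(capped, axis=0))
```

Compared with the elimination in Algorithm I, an outlier gradient here is shrunk to the cap rather than discarded, so every agent still contributes a bounded term to the update direction.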
8.1 Modification (Informal): Normalizing Gradients
Instead of capping just the $f$ largest gradients, the server could scale the norms of all the nonzero gradients to $\lVert g_{i_{n-f}}^t \rVert$. In that case, the nonzero honest gradients in $\mathcal{F}^t$ get amplified, whereas the maximum possible norm of the Byzantine faulty agents’ gradients still remains bounded by $\lVert g_{i_{n-f}}^t \rVert$. Therefore, intuitively, the correctness of Algorithm II implies the correctness of this modified version of Algorithm II, but the converse need not be true. However, it might be possible to improve the sufficiency bound on $f$ by this modification of Algorithm II. Note that modifying Algorithm II in this manner is equivalent to normalizing all the (nonzero) agents’ gradients, and then adding these normalized gradients to compute the update direction at each iteration. Thus, this modification replaces the sorting of the agents’ gradients in step S1 with normalization of the agents’ gradients.
9 Convergence Analysis: Algorithm II
In this section, we present the convergence of Algorithm II for the synchronous case. The convergence result is, however, expected to hold even under partial asynchronism.
Theorem 5.
Under assumptions (A1)–(A5), if $\alpha_t > 0$ for all $t$, $\sum_{t} \alpha_t = \infty$, $\sum_{t} \alpha_t^2 < \infty$, and $f$ satisfies
(11) 
then the sequence of parameter estimates $\{x^t\}$, generated by update law (10), converges to $x^*$.
Proof.
To be included in a revision of this manuscript. ∎
Evidently, the bound on $f$ given in (11) is better than the bound in (8), which was obtained for the norm filtering given in Section 6. In fact, in an extreme case where $x^*$ is the unique minimizer of every honest agent’s cost, the right-hand side of (11) is equal to $n/2$. Thus, in this extreme case, Algorithm II solves the regression problem if the Byzantine faulty agents are fewer than the majority, which is in fact the necessary condition for solving the problem.
10 Numerical Example
In this section, we present a small numerical example to demonstrate the convergence of norm filtering based gradient descent algorithm, as given by Theorem 2 for the synchronous case, i.e. under assumption (A4).
In this example, assumption (A3) holds readily as $f < n/2$. Each agent $i$ is associated with a single data point $A_i$ and a corresponding response $B_i$, such that $B_i = A_i x^*$.
The collective data points and responses are:
For the above data points, we get the following:

The rank of $A_H$ is equal to $d$ for every $H$ of cardinality at least $|\mathcal{H}| - f$. This implies that assumption (A1) holds, and $\gamma$ is some positive real value whose exact value is not required (refer to Section 5.1 for the procedure).

Assumption (A2) holds, with $\mu$ computed as described in Section 5.1.

Assumption (A5) holds, with $\gamma_r$ computed as described in Section 7.1.

Therefore, the threshold on $f$ in condition (8) can be evaluated for this data set. As $f = 1$ is less than this threshold, condition (8) in Theorem 2 is satisfied for this example.
We also note that Assumption 1 in Su and Shahrampour [25], the closest related work, does not hold for the given set of data points; the matrix condition required by that assumption is violated for a particular choice of its parameters. Thus, the coordinate-wise trimmed-mean filtering technique proposed in [25] is not guaranteed to be effective for this particular case.
Omniscient Byzantine faulty agents: To simulate our proposed algorithm, described in Section 6, we randomly choose an agent to be Byzantine faulty. The chosen Byzantine faulty agent is assumed to have complete knowledge of the honest agents’ gradients, and even knows the value of $x^*$. At each time $t$, the faulty agent reports a gradient that is directed opposite to $x^* - x^t$ ($x^t$ being the parameter estimate at time $t$), to maximize the damage, and has norm equal to the 2nd largest norm of the honest agents’ gradients, in order to pass through the filter (as $f = 1$ in this particular example, the filtering in step S1 eliminates only the gradient with the largest norm).
Expectedly (cf. Theorem 2), the proposed algorithm converges to $x^*$ for this example with a diminishing step-size $\alpha_t$, regardless of the identity of the Byzantine faulty agent. Note that $\sum_{t} \alpha_t = \infty$ and $\sum_{t} \alpha_t^2 < \infty$ (ref. [43]).
The convergence plot of the proposed gradient descent algorithm with norm filtering (plotted in blue) for a value of $x^*$ chosen randomly for the purpose of simulation is shown in Figure 1. In the plot, the estimation error is equal to $\lVert x^t - x^* \rVert$ for each iteration (or time) $t$. The initial estimate $x^0$ is arbitrary, and the Byzantine faulty agent is omniscient and chooses its gradients as described above.
Ill-informed Byzantine faulty agents: It may happen that the Byzantine faulty agents are not omniscient, as described above; they may only have access to the information held by them. To simulate such faulty behavior in this example, the Byzantine faulty agent simply reports randomly chosen gradient vectors to the server in step S1. The proposed algorithm with norm filtering converges to $x^*$, as expected (shown in Figure 2), whereas the original gradient descent algorithm (without filtering) does not converge, and often diverges away from $x^*$, as shown in Figure 2.
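The omniscient-adversary simulation described above can be sketched end to end. The scalar data, step-size $0.9/t$, and horizon below are illustrative stand-ins chosen for this sketch (a $d = 1$ instance with $n = 5$, $f = 1$), not the paper's exact example:

```python
import numpy as np

# Illustrative scalar instance: five agents, one Byzantine (f = 1);
# honest agent i holds data a_i with response b_i = a_i * x_star.
a = np.array([1.0, 1.0, 1.0, 2.0, 1.5])
x_star = 3.0
n, f, byz = 5, 1, 4          # agent 4 is the Byzantine faulty agent

x = 0.0                      # initial estimate
for t in range(1, 2001):
    grads = [ai**2 * (x - x_star) for ai in a]   # honest gradients a_i^2 (x - x*)
    # Omniscient fault: push away from x_star with the 2nd-largest honest
    # norm, so the malicious report survives the norm filter.
    honest = sorted(abs(g) for j, g in enumerate(grads) if j != byz)
    if x != x_star:
        grads[byz] = -np.sign(x - x_star) * honest[-2]
    kept = np.argsort([abs(g) for g in grads])[: n - f]   # norm filtering
    x -= (0.9 / t) * sum(grads[j] for j in kept)

print(abs(x - x_star))       # estimation error, shrinking toward 0
```

In this instance the filter always drops the single largest-norm report, and the three small honest gradients outweigh the adversarial pull, so the estimation error contracts at every step after the first.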
11 Conclusion
This paper proposes two simple norm-based filtering techniques, norm filtering and norm-cap filtering, for “robustifying” the original distributed gradient descent algorithm when solving the distributed linear regression problem in the presence of Byzantine faulty agents in the multiagent system, provided that the maximum possible number of Byzantine faulty agents is less than a specified bound. The proposed “robustification” techniques also solve a more general multiagent optimization problem with Byzantine faults. We note that the obtained bound on the number of faulty agents, which if satisfied guarantees correctness of the proposed algorithm, relates to the conditioning of the matrix constructed by stacking the data points of the honest agents.
Stopping Failures: Even though the proposed algorithm can handle any kind of fault, including stopping failures (where an agent crashes and stops responding), it is not yet optimal for handling such crashes. However, the server can simply impose an upper limit on the outdatedness (time elapsed since the last update) of an agent’s gradient, and deem an agent ‘crashed’ if the outdatedness of its gradient exceeds that limit.
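The staleness rule suggested above might be sketched as follows; this is a minimal illustration, and the dictionary representation and the limit `max_staleness` are assumptions made here.

```python
def active_agents(last_update, now, max_staleness):
    # Keep only agents whose most recent gradient report is not older
    # than max_staleness; the rest are deemed 'crashed' by the server.
    return [agent for agent, t in last_update.items()
            if now - t <= max_staleness]
```

The server would then exclude ‘crashed’ agents from the aggregation step until they report again.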
Acknowledgements
Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, and by National Science Foundation award 1610543. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, the National Science Foundation, or the U.S. Government.
References
 [1] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 4, no. 3, pp. 382–401, 1982.
 [2] N. A. Lynch, Distributed algorithms. Elsevier, 1996.
 [3] K. Bhatia, P. Jain, and P. Kar, “Robust regression via hard thresholding,” in Advances in Neural Information Processing Systems, 2015, pp. 721–729.
 [4] P. J. Huber, Robust statistics. Springer, 2011.
 [5] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 1, no. 2, p. 44, 2017.
 [6] P. Blanchard, R. Guerraoui, J. Stainer et al., “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 119–129.
 [7] G. Damaskinos, R. Guerraoui, R. Patra, M. Taziki et al., “Asynchronous Byzantine machine learning (the case of SGD),” in International Conference on Machine Learning, 2018, pp. 1153–1162.
 [8] X. Cao and L. Lai, “Distributed gradient descent algorithm robust to an arbitrary number of Byzantine attackers,” 2018.
 [9] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar, “signSGD with majority vote is communication efficient and Byzantine fault tolerant,” arXiv preprint arXiv:1810.05291, 2018.
 [10] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2018, pp. 4618–4628.
 [11] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in International Conference on Machine Learning, 2018, pp. 5636–5645.
 [12] D. Data, L. Song, and S. Diggavi, “Data encoding for Byzantine-resilient distributed gradient descent,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2018, pp. 863–870.
 [13] Y. Shoukry, P. Nuzzo, A. Puggelli, A. L. Sangiovanni-Vincentelli, S. A. Seshia, and P. Tabuada, “Secure state estimation for cyber-physical systems under sensor attacks: A satisfiability modulo theory approach,” IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 4917–4932, 2017.
 [14] H. Fawzi, P. Tabuada, and S. Diggavi, “Secure estimation and control for cyber-physical systems under adversarial attacks,” IEEE Transactions on Automatic Control, vol. 59, no. 6, pp. 1454–1467, 2014.
 [15] M. Pajic, I. Lee, and G. J. Pappas, “Attack-resilient state estimation for noisy dynamical systems,” IEEE Transactions on Control of Network Systems, vol. 4, no. 1, pp. 82–92, 2017.
 [16] M. S. Chong, M. Wakaiki, and J. P. Hespanha, “Observability of linear systems under adversarial attacks,” in American Control Conference (ACC), 2015. IEEE, 2015, pp. 2439–2444.
 [17] Z. Li, W. Trappe, Y. Zhang, and B. Nath, “Robust statistical methods for securing wireless localization in sensor networks,” in Proceedings of the 4th international symposium on Information processing in sensor networks. IEEE Press, 2005, p. 12.
 [18] Y. Zeng, J. Cao, J. Hong, S. Zhang, and L. Xie, “Secure localization and location verification in wireless sensor networks: a survey,” The Journal of Supercomputing, vol. 64, no. 3, pp. 685–701, 2013.

 [19] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
 [20] B. McWilliams, G. Krummenacher, M. Lucic, and J. M. Buhmann, “Fast and robust least squares estimation in corrupted linear models,” in Advances in Neural Information Processing Systems, 2014, pp. 415–423.
 [21] Y. Chen, C. Caramanis, and S. Mannor, “Robust sparse regression under adversarial corruption,” in International Conference on Machine Learning, 2013, pp. 774–782.
 [22] X. Ren, Y. Mo, J. Chen, and K. H. Johansson, “Secure state estimation with Byzantine sensors: A probabilistic approach,” arXiv preprint arXiv:1903.05698, 2019.
 [23] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar, “Robust estimation via robust gradient estimation,” arXiv preprint arXiv:1802.06485, 2018.
 [24] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart, “Sever: A robust metaalgorithm for stochastic optimization,” arXiv preprint arXiv:1803.02815, 2018.
 [25] L. Su and S. Shahrampour, “Finite-time guarantees for Byzantine-resilient distributed state estimation with noisy measurements,” arXiv preprint arXiv:1810.10086, 2018.
 [26] Y. Chen, S. Kar, and J. M. Moura, “Resilient distributed estimation through adversary detection,” IEEE Transactions on Signal Processing, vol. 66, no. 9, pp. 2455–2469, 2018.
 [27] A. Mitra and S. Sundaram, “Byzantine-resilient distributed observers for LTI systems,” 2018.
 [28] C. Xie, O. Koyejo, and I. Gupta, “Generalized Byzantine-tolerant SGD,” arXiv preprint arXiv:1802.10116, 2018.
 [29] L. Su and N. H. Vaidya, “Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms,” in Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing. ACM, 2016, pp. 425–434.
 [30] ——, “Robust multi-agent optimization: coping with Byzantine agents with input redundancy,” in International Symposium on Stabilization, Safety, and Security of Distributed Systems. Springer, 2016, pp. 368–382.
 [31] S. Sundaram and B. Gharesifard, “Distributed optimization under adversarial nodes,” IEEE Transactions on Automatic Control, 2018.
 [32] L. Su and N. Vaidya, “Multi-agent optimization in the presence of Byzantine adversaries: fundamental limits,” in 2016 American Control Conference (ACC). IEEE, 2016, pp. 7183–7188.
 [33] F. Fanitabasi, “A review of adversarial behaviour in distributed multiagent optimisation,” in 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion). IEEE, 2018, pp. 53–58.
 [34] Z. Yang and W. U. Bajwa, “ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning,” 2017.
 [35] W. Xu, Z. Li, and Q. Ling, “Robust decentralized dynamic optimization at presence of malfunctioning agents,” Signal Processing, vol. 153, pp. 24–33, 2018.
 [36] R. Pascanu, T. Mikolov, and Y. Bengio, “Understanding the exploding gradient problem,” CoRR, abs/1211.5063, vol. 2, 2012.

 [37] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 1310–1321.
 [38] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.
 [39] Y. Ye and E. Tse, “An extension of Karmarkar’s projective algorithm for convex quadratic programming,” Mathematical Programming, vol. 44, no. 1–3, pp. 157–179, 1989.
 [40] L. Bottou, “Online learning and stochastic approximations,” Online learning in neural networks, vol. 17, no. 9, p. 142, 1998.
 [41] M. Pajic, J. Weimer, N. Bezzo, P. Tabuada, O. Sokolsky, I. Lee, and G. J. Pappas, “Robustness of attackresilient state estimators,” in ICCPS’14: ACM/IEEE 5th International Conference on CyberPhysical Systems (with CPS Week 2014). IEEE Computer Society, 2014, pp. 163–174.
 [42] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice hall Englewood Cliffs, NJ, 1989, vol. 23.
 [43] W. Rudin et al., Principles of Mathematical Analysis. McGraw-Hill, New York, 1964, vol. 3.
 [44] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
Appendix A Appendix: Noisy Gradients
In practice, honest agents might not report the gradients of their costs accurately, due to reasons such as system noise or quantization errors. Specifically, in the case of synchronous execution we assume the following.

Noisy Gradients: For each honest agent , assume that
where, .
a.1 Noisy Responses in Linear Regression
The above approximate gradient framework models the case of noisy responses in distributed linear regression, where
(12) 
The actual error cost of an agent at an estimated parameter value is
(13) 
However, agent can only observe , and not . Therefore, the error cost observed by agent at an estimated parameter value is
Thus, the reported gradient of an agent at any time , in Step S1 of the algorithm given in Section 6, is as follows (for the synchronous case).
Substituting (12) above gives
Since (cf. (13)), for the synchronous case,
Note that the above gradient is a special case of the noisy gradient model in Assumption (A7), where . As ,
where is the largest eigenvalue of the positive semidefinite matrix . Letting , we obtain
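A quick numerical check of this eigenvalue bound, using made-up data (the matrix `A_i`, the noise, and the estimate below are arbitrary): with noisy responses, the gradient an agent computes differs from the true gradient by a term whose norm is at most the square root of the largest eigenvalue of the agent's Gram matrix times the noise norm.

```python
import numpy as np

rng = np.random.default_rng(1)
A_i = rng.normal(size=(5, 3))      # agent i's data points (made up)
x_true = rng.normal(size=3)
noise = rng.normal(size=5)
b_clean = A_i @ x_true             # noiseless responses
b_noisy = b_clean + noise          # responses agent i actually observes

x = rng.normal(size=3)             # an arbitrary parameter estimate
g_true = A_i.T @ (A_i @ x - b_clean)   # gradient of the actual cost
g_noisy = A_i.T @ (A_i @ x - b_noisy)  # gradient computable from observations

# The gradient error equals -A_i.T @ noise, and is bounded via the largest
# eigenvalue of the positive semidefinite matrix A_i.T @ A_i.
err = np.linalg.norm(g_noisy - g_true)
bound = np.sqrt(np.linalg.eigvalsh(A_i.T @ A_i).max()) * np.linalg.norm(noise)
assert err <= bound + 1e-9
```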
a.2 Convergence Analysis: AlgorithmI With System Noise
Intuitively, it is impossible in general for any algorithm to compute accurately when none of the agents reports the gradient of its cost accurately. However, a sufficiently robust algorithm can compute a point in a neighborhood of , whose size usually depends on the magnitude of the inaccuracies (or noise) in the agents’ gradients. For the proposed algorithm with update law (3) in Section 6, we can guarantee convergence to a neighborhood of whose size, expectedly, depends on and also on the maximum possible fraction of Byzantine faulty agents .
Theorem 6.
Proof.
Refer to Appendix B.8. ∎
Theorem 6 states that the final inaccuracy of the solution obtained by the server using the algorithm given in Section 6 is at most with respect to the norm. In case ,
So far, we have only considered the synchronous case. However, using arguments similar to those in assumption (A6) and Theorem 4, the above convergence result is expected to hold even when there is partial asynchronicity in the system.
Appendix B Appendix: Proofs
b.1 Proof of Claim 1
Since is convex for all , assumption (A2) implies
Lipschitz continuity (assumption (A2)) of further implies
Combining this inequality with the Cauchy–Schwarz inequality implies
(14) 
From assumption (A1),
(15) 
as