1 Introduction
Learning a model that minimizes the statistical risk is one of the fundamental goals of machine learning. A typical technique that accomplishes this task is empirical risk minimization (ERM) [1, 2, 3, 4]. In this case, the model is learned by applying optimization tools on a training dataset that is traditionally assumed to be available at a centralized location. However, in many recent applications (e.g., the Internet of Things), training data are distributed over a network, while in some other applications the dataset cannot be processed by a single machine due to its size (e.g., social network data) or privacy concerns (e.g., smartphone data). Such applications require that the model be learned over the network. When the learning algorithm requires a central server directly connected to all the nodes in the network, we term it a distributed learning algorithm. Other algorithms can accomplish learning tasks without a central server; we call these decentralized learning algorithms.
While learning over a network has a rich history in the literature, a significant fraction of that work has focused on faultless networks [5, 6]. Real-world networks, on the other hand, are bound to undergo failures because of malfunctioning equipment, cyber attacks, etc. [7, 8], and when failures happen, learning algorithms designed for faultless networks completely break down. Among the different types of failures in a network, a Byzantine failure is considered the most general, as it allows the faulty/compromised node to arbitrarily deviate from the agreed-upon protocol [9]. Byzantine failures are the hardest to safeguard against and can easily jeopardize the operation of the entire network [10, 11, 12]. In [13], for example, it has been shown that a single Byzantine node with a simple strategy can lead to the failure of decentralized learning algorithms. The overarching goal of this paper is to develop an efficient decentralized learning algorithm that is provably resilient against Byzantine failures.
1.1 Related work
The machine learning task can be accomplished by defining and then minimizing a (regularized) loss function on the training data over a network. Several types of decentralized optimization methods can solve the resulting problem. One of the most commonly used classes is gradient-based methods such as distributed gradient descent (DGD) [14, 15, 16]; methods in this class have low local computational complexity. Augmented Lagrangian-based methods are also broadly adopted for decentralized optimization [17, 18, 19]; these require each node to solve an optimization subproblem locally. A third class comprises second-order methods [20, 21], which usually have high computational and/or communication costs. Although any of the algorithms mentioned above can be applied to solve decentralized learning problems, these algorithms have been developed under the assumption that there are no failures in the network.
1.2 Our contributions
This paper focuses on solving a decentralized vector-valued learning problem under Byzantine settings. Compared with recent works on Byzantine-resilient distributed learning algorithms [22, 23, 24, 25, 26, 27, 28, 13, 29, 30, 31, 32, 33, 34, 35], this paper makes contributions in two respects. The first is that we combine a dimension-wise trimmed mean with decentralized gradient descent. While a similar idea has been studied for the distributed setting [27, 28], the consensus and convergence behavior of applying a trimmed mean to gradient descent in the decentralized setting was completely unknown before this paper. As we show later in the paper, the analysis is very different from the distributed setting, and the guarantees are also different. The main reason is that decentralized learning algorithms require consensus, which is usually not required in the distributed setting. As a result, Byzantine failures are much more dangerous and harder to deal with in decentralized settings. Previous work [13] has shown that one Byzantine node with a simple strategy is enough to crash the whole network, whereas in distributed settings it usually takes a fraction of the nodes undergoing failures, or some extremely large broadcast value, to produce a noticeable performance difference. In the decentralized setting, a nonfaulty node cannot distinguish Byzantine neighbors from nonfaulty neighbors due to its lack of knowledge about most of the nodes in the network, so any given node has to engage with Byzantine nodes during consensus as well. In the distributed setting, by contrast, nodes can always trust the server. For this reason, translating a Byzantine-resilient algorithm from distributed settings to decentralized settings is highly nontrivial.
There do exist previous works that focus on Byzantine resilience in the decentralized setting [22, 23, 24, 25, 26, 13, 29]. But these works either do not trivially translate into a general learning problem [22, 23, 24, 25, 36] or lack generalization from scalar-valued problems to vector-valued ones [13, 37]. The coordinate descent-based algorithm introduced in [38] is, to the best of our knowledge, the only vector-valued decentralized Byzantine-resilient algorithm. But because that algorithm requires a one-coordinate-at-a-time process (and cannot use block coordinates to accelerate), it is not preferable when computing even a single coordinate of the gradient is expensive (e.g., deep neural networks). In this paper, we develop an efficient Byzantine-resilient algorithm and show that it solves decentralized vector-valued learning problems under Byzantine settings. We provide theoretical guarantees for strongly convex problems, and we show the usefulness of our algorithm on nonconvex learning problems using numerical experiments.
1.3 Notation
All vectors are taken to be column vectors, while and denote the th element of vector and the th element of matrix , respectively. We use to denote norm of and to denote the vector of all ones, while denotes the transpose operation. Given a set, denotes its cardinality. Finally, we use to denote the gradient of a function with respect to . We use to denote the inner product. For a given vector and constant , we denote the ball of radius centered around as .
2 Problem formulation
The goal of this paper is the following: when given a network in which each node has access to some local training data, we want to learn a machine learning model in a decentralized fashion, even in the presence of Byzantine failures. In this section, we first describe the basic problem with a mathematical model. Then we introduce the Byzantine failure model.
2.1 Decentralized learning model
We consider a network of nodes, expressed as a directed, static graph . Here, the set represents the nodes in the network, while the set of edges represents communication links between different nodes. Specifically, if and only if node can receive information from node and vice versa. Each node has access only to a local training set . Training samples belong to some Hilbert space and are independent and identically distributed (i.i.d.), drawn from an unknown distribution , i.e., . For simplicity, we assume that the cardinalities of the local training sets are the same, i.e., . The generalization to the case in which the local training sets are not equal-sized is trivial.
Machine learning tasks are usually accomplished by defining and statistically minimizing a risk function with respect to a variable . For simplicity, we use in the following to denote the statistical risk function. We denote the true minimizer of the risk function as , i.e., . In learning problems, the distribution is usually unknown, so the statistical risk cannot be minimized directly. One way of completing the task in the decentralized setting is to employ empirical risk minimization (ERM), i.e.,
(1) 
It can be shown that the minimizer of (1) converges to
with high probability as
increases [39]. In this paper, we focus on finite-valued, strongly convex risk functions with Lipschitz gradients. We now state these assumptions formally.
Assumption 1
The risk function is bounded almost surely over all training samples, i.e., , .
Assumption 2
The risk function is strongly convex, i.e., satisfying .
Assumption 3
The gradient of is Lipschitz, i.e., satisfying .
Note that Assumption 3 implies that the risk function itself is also Lipschitz, i.e., for some [40].
In decentralized learning, each node maintains a local variable . Then the ERM problem can be solved in the decentralized fashion, i.e.,
(2) 
To accomplish the decentralized ERM task, all nodes need to cooperate with each other by communicating with their neighbors over edges. We define the neighborhood of node as . If , then is a neighbor of node . Classic decentralized learning algorithms proceed iteratively. A node is expected to accomplish two tasks during each iteration: update the local variable according to some rule and broadcast a message to all its neighbors. Note that node can receive values from node only if .
2.2 Byzantine failure model
When there is no failure in the network, decentralized learning is well understood [19, 41]. The main assumption in this paper is that some of the nodes in the network can arbitrarily deviate from intended behavior. We model this behavior as Byzantine failure, formally defined as follows.
Definition 1
A node is said to be Byzantine if, during any iteration, it either updates its local variable using an update rule other than the prescribed one or broadcasts some value other than its intended update to its neighbors.
We use to denote the set of nonfaulty nodes and we assume that there are at most Byzantine nodes in the network. We label the nonfaulty nodes from 1 to without loss of generality. We now provide some definitions and assumptions that are common in the literature, e.g., [13, 38].
Definition 2
A subgraph of is called a reduced graph if it is generated from graph by () removing all Byzantine nodes along with all their incoming and outgoing edges, and () additionally removing up to incoming edges from each nonfaulty node. A source component of graph is a collection of nodes such that each node in the source component has a directed path to every other node in .
Assumption 4
All reduced graphs generated from contain a source component of cardinality at least .
Assumption 4 describes the redundancy of a graph. It ensures that after removing a certain number of edges from nonfaulty nodes, each nonfaulty node can still receive information from a few other nonfaulty nodes. While checking this assumption efficiently remains an open problem, we do understand how to generate graphs that satisfy it. One way of generating a resilient graph was introduced in the literature during the study of Byzantine-resilient consensus techniques, e.g., in [26]. We have also observed empirically that in an Erdős–Rényi graph, when the degree of the least-connected node is larger than , the assumption is often satisfied. This is also the technique we use to generate graphs in our numerical experiments. We also emphasize that we do not need to know the exact number of Byzantine nodes. Constructing a graph with a chosen will enable our algorithm to tolerate at most Byzantine nodes, and the algorithm still has competitive performance even if there are actually no failures in the network. In the next section, we will introduce a Byzantine fault-tolerant algorithm for decentralized learning and theoretically analyze it under Assumptions 1, 2, 3, and 4. The algorithm is expected to accomplish the following tasks: () achieve consensus, i.e., as the number of iterations ; and () learn a as the sample size .
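As an illustration of the graph-generation heuristic described above, the following sketch builds an Erdős–Rényi graph and checks a minimum-degree condition. The function names and the `threshold` parameter are ours (a stand-in for the construction-dependent bound discussed above), not the paper's notation:

```python
import random

def erdos_renyi(n, p, seed=0):
    """Generate an undirected Erdos-Renyi graph as an adjacency list:
    each of the n*(n-1)/2 possible edges is included with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def min_degree_exceeds(adj, threshold):
    """Heuristic redundancy check: every node's degree must exceed the
    given threshold. This is a cheap proxy, not a certificate, for the
    reduced-graph condition of Assumption 4."""
    return all(len(neighbors) > threshold for neighbors in adj.values())
```

A graph that fails this check can simply be redrawn with a larger edge probability; since checking Assumption 4 exactly is open, this kind of proxy is what a practitioner would use in experiments.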
3 Byzantine-resilient decentralized gradient descent
It is shown in [42] that the exact global optimum of (2) is not achievable when . In this section, we introduce an algorithm called Byzantine-resilient decentralized gradient descent (BRIDGE) that pursues the minimum of the statistical risk in the presence of Byzantine failures, given that the training data are i.i.d.
3.1 Algorithm
When the network is fault-free, one way of solving (2) is to let each node update its local variable as
(3) 
where is the weight and is a positive sequence satisfying , and . One way of choosing is as follows: find a and set so that and is . Note that (3) is a special case of the distributed gradient descent (DGD) algorithm [14]. The main difference between the proposed algorithm and the classic DGD method is that there is a screening step before each update, which is the key idea that makes BRIDGE Byzantine resilient. The complete process at each node is shown in Algorithm 1.
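For concreteness, the fault-free DGD update (3) can be sketched as follows. This is a minimal illustration: the weight list, step size, and gradient oracle are generic placeholders rather than the paper's exact notation.

```python
import numpy as np

def dgd_step(x_local, neighbor_vals, weights, grad_fn, rho):
    """One fault-free DGD update: a convex combination of the local and
    neighbor iterates, minus a scaled local gradient step.
    weights[0] weighs x_local; weights[1:] weigh neighbor_vals."""
    # Consensus part: weighted average over self and neighbors.
    avg = weights[0] * x_local
    for w, x_j in zip(weights[1:], neighbor_vals):
        avg = avg + w * x_j
    # Gradient part: rho is a diminishing step size (e.g., rho(t) = 1/(t+1)),
    # so that the step-size sum diverges while its squared sum converges.
    return avg - rho * grad_fn(x_local)
```

BRIDGE differs from this update only in that the neighbor values are screened before being averaged, as described next.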
When initializing the algorithm, it is necessary to specify , the maximum number of Byzantine nodes that the algorithm can tolerate. Each node initializes at or some arbitrary vector. During each iteration , node first broadcasts and then receives from all . Next, node performs a screening among all ’s. The screening is performed with respect to each dimension separately. At dimension , node separates into three groups, defined as follows:
(4) 
(5) 
and
(6) 
Then node updates the th element of as
(7) 
The idea of performing screening before updating is to eliminate the largest and smallest values in each dimension. This screening method is called the coordinate-wise trimmed mean. The screening can easily be realized by a simple sorting process. Since the update of the th element does not depend on other coordinates of or , the update of each dimension (steps 5 to 10) can be done in parallel or sequentially in any order. Note that is likely to be different for different at each iteration, so only some elements of for any may be used in the update at node while other elements are dropped. However, the size of each is the same in all dimensions. We now give theoretical guarantees for this algorithm.
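The screening-then-update step (7) can be sketched as follows. This is a minimal illustration under our own naming (with the local gradient passed in precomputed, and `b` denoting the number of extremes trimmed on each side), not the paper's exact pseudocode:

```python
import numpy as np

def bridge_screen_update(x_local, neighbor_vals, b, local_grad, rho):
    """One BRIDGE-style update with coordinate-wise trimmed-mean screening:
    in each dimension, drop the b largest and b smallest received values,
    then average the survivors with the local iterate and take a gradient
    step. Requires len(neighbor_vals) > 2*b."""
    X = np.stack(neighbor_vals)           # shape: (num_neighbors, d)
    X_sorted = np.sort(X, axis=0)         # sorting each coordinate realizes the screening
    trimmed = X_sorted[b:X.shape[0] - b]  # discard b extremes on each side
    consensus = (trimmed.sum(axis=0) + x_local) / (trimmed.shape[0] + 1)
    return consensus - rho * local_grad
```

Note that, as stated in the text, the sort is independent per dimension, so the set of surviving neighbors may differ from coordinate to coordinate even though the number of survivors is the same in every dimension.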
Theorem 1
The theorem shows that the BRIDGE algorithm can learn a good model even when there are Byzantine failures in the network. To achieve this goal, the algorithm needs to accomplish two tasks: consensus and optimality. Consensus requires that all nonfaulty nodes agree on the same variable () despite the existence of Byzantine failures in the network while optimality requires that the globally agreed model indeed minimizes the statistical risk (). In the next section, we will prove the theorem for consensus and optimality, respectively.
4 Theoretical analysis
While gradient descent is well understood in the literature, we observe that BRIDGE does not take a regular gradient step at each iteration. The main idea behind proving Theorem 1 is to take advantage of the convergence property of gradient descent and bound the distance between one gradient descent step and one BRIDGE step. The proof can be briefly described as follows. The BRIDGE local update sequence is described in (7). We first define a vector sequence and show that as , which is the proof of consensus. We then consider three sequences , , and that will be defined later. We define the following distances: , , , and . Observing that , we then show that , , and all go to 0. This is the proof of optimality.
4.1 Consensus analysis
Recall that the update is done in parallel for all coordinates. So we pick one coordinate and prove that all nodes achieve consensus in this coordinate. Since is arbitrarily picked, we can then conclude that consensus is achieved for all coordinates. In this section, we drop the index for all variables for simplicity; it should be understood that the variables still depend on this index.
Define a vector whose elements are the th elements of from nonfaulty nodes only, i.e., . We first show that the update can be written in a matrix form which only involves nonfaulty nodes, i.e.,
(8) 
where is formed as . The formation of matrix can be described as follows. Let denote the nonfaulty nodes in the neighborhood of node , i.e., . The set of Byzantine neighbors can be defined as . One of two cases can happen during each iteration: () or () . To make the expressions clear, we drop the iteration indicator for the rest of this discussion; it should be understood that the variables still depend on the iteration. For case (), since and , we know that . Similarly, . Then and satisfying for any . So for each , satisfying . In this way, we can express the update using only messages from nonfaulty nodes. The elements of matrix can be written as
(9) 
For case (), since all nodes in are already nonfaulty, we keep only the first, second, and last rows of (9). Note that since the choices of and are generally not unique, the formation of matrix is also not unique. So far, we have expressed the update of the nonfaulty nodes in matrix form involving only nonfaulty nodes.
Next, define a transition matrix to represent the product of ,
(10) 
Let be the total number of reduced graphs we can generate from . Let . Denote by . Let . Then it is known from previous work [42, 43] that
(11) 
where satisfies and . It can also be expressed as
(12) 
Taking as the starting point, we can express the iterations as
(13) 
Let us consider a scenario in which all nodes stop computing local gradients after iteration , so that when . Define a vector under this scenario, i.e.,
(14)  
Observe that has identical elements in all dimensions. Let scalar sequence denote one element of . Next, we show that as . From (14),
(15) 
Then recall from the update of that
(16) 
If Assumption 3 holds and we initialize the algorithm from some vector with finite norm, we can always find two scalars and satisfying , and . Then we have
(17) 
as . Since is arbitrarily picked, the convergence is true for all dimensions. Define a vector satisfying for . Then as ,
(18) 
The convergence in (18) can also be interpreted as as . The proof of consensus is complete.
Remark 1
It follows from (18) that the rate of consensus convergence is . Specifically, choosing to be gives us .
4.2 Optimality analysis
From (18), we have an upper bound for . We then bound the other distances to show . Note that the sequence is not truly kept at any node, so we first describe the “update” of . Considering (14) for all coordinates together, the update for the full vector can be written in the form
(19) 
where satisfies for . Define another vector satisfying . We define a new sequence as
(20) 
Recalling that
(21) 
from (4.1) and Assumption 3 we have
(22) 
Next, defining a new sequence as
(23) 
we have
(24) 
Here we give a lemma to show the relationship between and the gradient of the statistical risk.
Lemma 1 shows that converges to the gradient of the statistical risk in probability. The proof of Lemma 1 is given in Appendix A.
Remark 2
Lemma 1 shows that the difference between the gradient step of and the true gradient step is of order . If there are no failures in the network, the gradient step of a non-resilient algorithm such as DGD usually has an error rate of . If each node runs a centralized algorithm on its own samples, the error rate of the gradient step is usually . Since is a stochastic vector, . The lemma thus shows that, by cooperating over a network, BRIDGE improves the sample complexity at each node by a factor of .
Now we focus on . Note that is obtained by taking a regular gradient descent step from with step size . When Assumptions 2 and 3 are satisfied, it is well understood [44, Ch. 9] that the gradient descent step satisfies
(26) 
Then we have
(27) 
Now we can write the property of sequence for some as
(28) 
It follows from (18), (22), and Lemma 1 that with probability at least ,
(29) 
Remark 3
5 Numerical analysis
The numerical experiments are separated into two parts. In the first part, we run experiments on the MNIST dataset using a linear classifier with squared hinge loss, a case that fully satisfies all our assumptions for the theoretical guarantees and is broadly adopted in solving real-world machine learning problems. In the second part, we run experiments on the MNIST dataset with a convolutional neural network, which does not fully satisfy the assumptions of Theorem 1. The purpose is to demonstrate the usefulness of our Byzantine-resilient technique on a more general (nonconvex) class of machine learning problems.
5.1 Linear classifier on MNIST
The first set of experiments is performed to demonstrate two facts: BRIDGE can maintain good performance under Byzantine settings while classic distributed learning methods fail; and, compared with an existing Byzantine-resilient method, ByRDiE [38], BRIDGE is more efficient in terms of communication cost. We choose one of the most well-understood machine learning tools, the linear classifier with squared hinge loss, to learn the model. Note that this method satisfies the assumption of a strongly convex loss function with Lipschitz gradient.
The MNIST dataset is a set of 60,000 training images and 10,000 test images of handwritten digits from ‘0’ to ‘9’. Each image is converted to a 784-dimensional vector, and we distribute the 60,000 images equally across 100 nodes. We then connect each pair of nodes with probability . Ten of the nodes are randomly picked to be Byzantine nodes, which broadcast random vectors to all their neighbors during each iteration. We check and make sure the network satisfies Assumption 4 with . The classifiers are trained using the “one vs. all” strategy. We run five sets of experiments: () classic distributed gradient descent (DGD) with no Byzantine nodes; () classic DGD with 10 Byzantine nodes; () centralized gradient descent with only local data; () BRIDGE with 10 Byzantine nodes; and () ByRDiE with 10 Byzantine nodes. The performance is evaluated by two metrics: classification accuracy on the 10,000 test images and whether consensus is achieved. When comparing ByRDiE and BRIDGE, we compare accuracy with respect to the number of communication iterations.
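For reference, the squared hinge loss used in these experiments, together with an L2 regularizer that makes it strongly convex with a Lipschitz gradient, can be sketched as follows for a binary one-vs-all subproblem. The function name and the regularization weight are our own illustrative choices; labels are assumed to be in {-1, +1}:

```python
import numpy as np

def squared_hinge_loss_grad(w, X, y, reg=1e-3):
    """Regularized squared hinge loss and its gradient for a binary
    linear classifier. X: (n, d) features, y: (n,) labels in {-1, +1}."""
    margins = 1.0 - y * (X @ w)
    active = np.maximum(margins, 0.0)   # only violated margins contribute
    loss = np.mean(active ** 2) + 0.5 * reg * (w @ w)
    grad = (-2.0 / len(y)) * (X.T @ (active * y)) + reg * w
    return loss, grad
```

In a one-vs-all setup, ten such classifiers are trained, each treating one digit as the positive class; at each BRIDGE iteration a node would evaluate this gradient on its local samples.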
The results are shown in Table 1. When there are no failures in the network, the performance of DGD is consistent with prior work [45]. However, in the presence of Byzantine failures, DGD fails in the sense that it can neither learn a good classifier nor achieve consensus. Two aspects of the BRIDGE results are worth considering. The first is that the gap between BRIDGE under failures and DGD under no failures is small, while the gap between BRIDGE and local gradient descent is large. This shows the necessity of a robust decentralized learning method: by cooperating with more nodes in the network, one can achieve better performance than by using only local data. The second aspect is that, compared with ByRDiE, BRIDGE has better communication efficiency. This is primarily because BRIDGE updates the variables in all dimensions with each message exchange, while ByRDiE only updates one dimension at a time.
Algorithm  Failures  Accuracy (%)  Consensus

DGD  0  89.8  Yes
Local GD  N/A  84.3  N/A
DGD  10  10.3  No
ByRDiE  10  89.2  Yes
BRIDGE  10  89.3  Yes
5.2 Convolutional neural network on MNIST
In Section 4, we gave theoretical guarantees under the assumption of strongly convex risk functions. However, there is a wide class of modern machine learning problems that are nonconvex but have very good performance (e.g., deep neural networks). In this set of experiments, we demonstrate the usefulness of the screening technique in BRIDGE on a smaller scale but for a highly nonconvex problem: a distributed convolutional neural network. We create a network with 10 nodes and 500 training samples on each node. Each local neural network is constructed from two convolution layers, each followed by a max-pooling layer, and two fully connected layers before the output. The label of each sample is represented in the one-hot format. We randomly pick one node to be a Byzantine node, which broadcasts random values to its neighbors during each iteration. The distribution of the random values is identical to the random initialization of each layer. The way we generate the training sets and the network topology is identical to the previous test. We pick the Adam optimizer
[46] as the local update method. Adam is an extended version of stochastic gradient descent and is known to perform well in the setting of these experiments. The algorithm proceeds as follows: each node takes a local Adam step with a batch size of 50 and then broadcasts the network weights to its neighbors; after receiving the weights from its neighbors, each node averages its neighbors’ weights (with BRIDGE screening if required); each node repeats this process for 1000 epochs. We run four rounds of experiments and average each round over 100 independent trials: () Adam with no screening and no failure; () Adam with no screening under failure; () BRIDGE (Adam with screening); and () Adam with only local data. The results are shown in Table 2.
The results show that Adam with no screening fails when there are failures in the network. The small gap between the performance of BRIDGE and that of faultless Adam indicates that the Byzantine-resilient technique can also be applied to nonconvex machine learning tools. The performance difference between BRIDGE and training with only local data shows that BRIDGE can benefit from cooperation with other nodes even when there are failures in the network. We emphasize again that in the fully distributed setting, one Byzantine node broadcasting values of reasonable norm is enough to bring down the whole network. In contrast, in federated settings [33, 34], it takes a sizable fraction of faulty nodes or extremely large values (e.g., gradients with 100-times-larger elements) to cause an obvious performance difference. This is because a Byzantine node can bias the normal nodes through the consensus process, which is not part of federated-setting algorithms. This shows that Byzantine failures are much more dangerous and harder to safeguard against in fully distributed settings. Thus, Byzantine-resilient algorithms for fully distributed settings are of great interest in real-world applications.
Algorithm  Failures  Accuracy (%)  Consensus

DisAdam  0  96.2  Yes
Local Adam  N/A  92.3  N/A
DisAdam  1  10.3  No
BRIDGE  1  95.8  Yes
6 Conclusion
This paper has introduced a new decentralized machine learning algorithm called Byzantine-resilient decentralized gradient descent (BRIDGE). The algorithm is designed to solve machine learning problems when the training set is distributed over a network in the presence of Byzantine failures. Theoretical analysis and numerical results have been given to show that the algorithm can learn good models while tolerating a certain number of Byzantine nodes in the network. This is in contrast to classic decentralized learning algorithms, which fail under Byzantine failures.
Appendix A Proof of Lemma 1
First, we observe that at some dimension ,
(30) 
Since is arbitrarily picked, it is also true that
(31) 
Note that depends on and depends on both and . We need to show that the convergence is simultaneously true for all and . We fix one coordinate and drop the index for simplicity. We define a vector as . Then . We know from Hoeffding’s inequality [47]:
(32) 
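For the reader's convenience, Hoeffding's inequality in its standard form reads as follows (generic symbols; the paper instantiates it with its own quantities):

```latex
% Hoeffding's inequality: Z_1, ..., Z_n i.i.d. with a <= Z_i <= b almost surely
\Pr\left[\,\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1]\right| \ge t\,\right]
\le 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right), \qquad t > 0.
```

The almost-sure boundedness required here is supplied by Assumption 1.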
Further, since the dimensional vector is an arbitrary element of the standard simplex, defined as
(33) 
the probability bound in (32) also holds for any , i.e.,
(34) 
We now define the set . Our next goal is to leverage (34) and derive a probability bound similar to (32) that uniformly holds for all . To this end, let
(35) 
denote an covering of in terms of the norm and define . It then follows from (34) and the union bound that
(36) 
In addition, we have
(37) 
where () is due to the triangle and Cauchy–Schwarz inequalities. Trivially, from the definition of , while from the definition of and Assumption 3. Combining (36) and (37), we get
(38) 
We now define