BRIDGE: Byzantine-resilient Decentralized Gradient Descent

08/21/2019 ∙ by Zhixiong Yang, et al. ∙ Rutgers University

Decentralized optimization techniques are increasingly being used to learn machine learning models from data distributed over multiple locations without gathering the data at any one location. Unfortunately, methods that are designed for faultless networks typically fail in the presence of node failures. In particular, Byzantine failures—corresponding to the scenario in which faulty/compromised nodes are allowed to arbitrarily deviate from an agreed-upon protocol—are the hardest to safeguard against in decentralized settings. This paper introduces a Byzantine-resilient decentralized gradient descent (BRIDGE) method for decentralized learning that, when compared to existing works, is more efficient and scalable in higher-dimensional settings and that is deployable in networks having topologies that go beyond the star topology. The main contributions of this work include theoretical analysis of BRIDGE for strongly convex learning objectives and numerical experiments demonstrating the efficacy of BRIDGE for both convex and nonconvex learning tasks.


1 Introduction

Learning a model that minimizes the statistical risk is one of the fundamental goals of machine learning. A typical technique that accomplishes this task is empirical risk minimization (ERM) [1, 2, 3, 4]. In this case, the model is learned by applying optimization tools on a training dataset that is traditionally assumed to be available at a centralized location. However, in many recent applications (e.g., the Internet of Things), training data are distributed over a network, while in some other applications, the dataset cannot be processed by a single machine due to its size (e.g., social network data) or privacy concerns (e.g., smartphone data). Such applications require that the model be learned over the network. When the learning algorithm requires a central server directly connected to all the nodes in the network, we refer to it as a distributed learning algorithm. Other algorithms can accomplish learning tasks without a central server; we call these decentralized learning algorithms.

While learning over a network has a rich history in the literature, a significant fraction of that work has focused on faultless networks [5, 6]. On the other hand, real-world networks are bound to undergo failures because of malfunctioning equipment, cyber attacks, etc. [7, 8]. And when failures happen, learning algorithms designed for faultless networks completely break down. Among different types of failures in the network, a Byzantine failure is considered the most general as it allows the faulty/compromised node to arbitrarily deviate from the agreed-upon protocol [9]. Byzantine failures are the hardest to safeguard against and can easily jeopardize the operation of the entire network [10, 11, 12]. In [13], for example, it has been shown that a single Byzantine node in the network can lead to the failure of decentralized learning algorithms with a simple strategy. The overarching goal of this paper is to develop an efficient decentralized learning algorithm that is provably resilient against Byzantine failures in decentralized settings.

1.1 Related works

The machine learning task can be accomplished by defining and then minimizing a (regularized) loss function on the training data over a network. Several types of decentralized optimization methods can solve the resulting problem. One of the most commonly used classes is gradient-based methods such as distributed gradient descent (DGD) [14, 15, 16]; methods in this class have low local computational complexity. Augmented Lagrangian-based methods, which require each node to solve an optimization subproblem locally, are also broadly adopted for decentralized optimization [17, 18, 19]. A third type of decentralized optimization methods includes second-order methods [20, 21], which usually have high computational and/or communication cost. Although any of the algorithms mentioned above can be applied to solve decentralized learning problems, these algorithms have been developed under the assumption that there are no failures in the network.

1.2 Our contributions

This paper focuses on solving a decentralized vector-valued learning problem under Byzantine settings. Compared to recent works on Byzantine-resilient distributed learning algorithms [22, 23, 24, 25, 26, 27, 28, 13, 29, 30, 31, 32, 33, 34, 35], this paper makes contributions in two respects. The first is that we combine dimension-wise trimmed mean with decentralized gradient descent. While a similar idea has been studied in the distributed setting [27, 28], the consensus and convergence behavior of applying trimmed mean to gradient descent in the decentralized setting was unknown before this paper. As we show later in the paper, the analysis is very different from the distributed setting and the guarantees are also different. The main reason is that a decentralized learning algorithm requires consensus, which usually is not required in the distributed setting. As a result, Byzantine failures are much more dangerous and harder to deal with in decentralized settings. It is shown in previous work [13] that one Byzantine node with a simple strategy is enough to crash the whole network, whereas in distributed settings it usually takes a portion of the nodes undergoing failures, or some extremely large values, to show a performance difference. In the decentralized setting, a nonfaulty node cannot distinguish Byzantine neighbors from nonfaulty neighbors due to the lack of knowledge about most of the nodes in the network, so any given node has to engage with Byzantine nodes during consensus too. In the distributed setting, by contrast, nodes can always trust the server. For this reason, translating Byzantine-resilient algorithms from distributed settings to decentralized settings is highly nontrivial.

There do exist previous works that focus on Byzantine resilience in the decentralized setting [22, 23, 24, 25, 26, 13, 29]. But these works either do not trivially translate into a general learning problem [22, 23, 24, 25, 36] or lack generalization from scalar-valued problems to vector-valued ones [13, 37]. The coordinate descent-based algorithm introduced in [38] is the only vector-valued decentralized Byzantine-resilient algorithm to the best of our knowledge. But because that algorithm relies on a one-coordinate-at-a-time process (and cannot use block coordinate updates to accelerate), it is not preferable when computing the gradient with respect to a single coordinate is not cheap (e.g., deep neural networks). In this paper, we develop an efficient Byzantine-resilient algorithm and show that the algorithm solves decentralized vector-valued learning problems under Byzantine settings. We provide theoretical guarantees for strongly convex problems, while we show the usefulness of our algorithm on nonconvex learning problems using numerical experiments.

1.3 Notations

All vectors are taken to be column vectors, while $[a]_k$ and $[A]_{ij}$ denote the $k$-th element of vector $a$ and the $(i,j)$-th element of matrix $A$, respectively. We use $\|a\|$ to denote the $\ell_2$-norm of $a$ and $\mathbf{1}$ to denote the vector of all ones, while $(\cdot)^T$ denotes the transpose operation. Given a set, $|\cdot|$ denotes its cardinality. Finally, we use $\nabla f(w, z)$ to denote the gradient of a function $f(w, z)$ with respect to $w$. We use $\langle a_1, a_2 \rangle$ to denote the inner product. For a given vector $a$ and constant $\gamma$, we denote the $\ell_2$-ball of radius $\gamma$ centered around $a$ as $\mathbb{B}(a, \gamma)$.

2 Problem formulation

The goal of this paper is the following: when given a network in which each node has access to some local training data, we want to learn a machine learning model in a decentralized fashion, even in the presence of Byzantine failures. In this section, we first describe the basic problem with a mathematical model. Then we introduce the Byzantine failure model.

2.1 Decentralized learning model

We consider a network of $M$ nodes, expressed as a directed, static graph $\mathcal{G}(\mathcal{J}, \mathcal{E})$. Here, the set $\mathcal{J} := \{1, \dots, M\}$ represents nodes in the network, while the set of edges $\mathcal{E}$ represents communication links between different nodes. Specifically, $(j, i) \in \mathcal{E}$ if and only if node $i$ can receive information from node $j$, and vice versa. Each node $j$ has access only to a local training set $\mathcal{Z}_j = \{z_{jn}\}_{n=1}^{|\mathcal{Z}_j|}$. The training samples $z_{jn}$, which belong to some Hilbert space, are independent and identically distributed (i.i.d.) and drawn from an unknown distribution $\mathcal{P}$, i.e., $z_{jn} \sim \mathcal{P}$. For simplicity, we assume that the cardinalities of the local training sets are the same, i.e., $|\mathcal{Z}_j| = N$. The generalization to the case when the $\mathcal{Z}_j$'s are not equal sized is trivial.

Machine learning tasks are usually accomplished by defining and statistically minimizing a risk function $\mathbb{E}_{z \sim \mathcal{P}}[f(w, z)]$ with respect to a variable $w \in \mathbb{R}^d$. For simplicity, we use $\mathbb{E}[f(w)]$ in the following to denote the statistical risk function. We denote the true minimizer of the risk function as $w^*$, i.e., $w^* = \arg\min_{w \in \mathbb{R}^d} \mathbb{E}[f(w)]$. In learning problems, the distribution $\mathcal{P}$ is usually unknown. Therefore $w^*$ cannot be solved for directly. One way of completing the task in the decentralized setting is to employ empirical risk minimization (ERM), i.e.,

$$\min_{w \in \mathbb{R}^d} \frac{1}{M} \sum_{j=1}^{M} \frac{1}{N} \sum_{n=1}^{N} f(w, z_{jn}) \triangleq \min_{w \in \mathbb{R}^d} \frac{1}{M} \sum_{j=1}^{M} f_j(w). \tag{1}$$

It can be shown that the minimizer of (1) converges to $w^*$ with high probability as the total number of training samples $MN$ increases [39]. In this paper, we focus on finite valued strongly convex risk functions with Lipschitz gradient. Here we make the assumptions formally.

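Assumption 1

The risk function $f(w, z)$ is bounded almost surely over all training samples, i.e., $f(w, z) \leq C < \infty$, $\forall z \in \bigcup_{j \in \mathcal{J}} \mathcal{Z}_j$.

Assumption 2

The risk function $f(w, z)$ is $\lambda$-strongly convex, i.e., satisfying $f(w_1, z) \geq f(w_2, z) + \langle \nabla f(w_2, z), w_1 - w_2 \rangle + \frac{\lambda}{2} \|w_1 - w_2\|^2$.

Assumption 3

The gradient of $f(w, z)$ is $L'$-Lipschitz, i.e., satisfying $\|\nabla f(w_1, z) - \nabla f(w_2, z)\| \leq L' \|w_1 - w_2\|$.

Note that Assumption 3 implies that the risk function itself is also Lipschitz, i.e., $\|f(w_1, z) - f(w_2, z)\| \leq L \|w_1 - w_2\|$ for some $L$ [40].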

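In decentralized learning, each node $j$ maintains a local variable $w_j$. Then the ERM problem can be solved in the decentralized fashion, i.e.,

$$\min_{\{w_1, \dots, w_M\}} \frac{1}{M} \sum_{j=1}^{M} f_j(w_j) \quad \text{subject to} \quad w_i = w_j \ \ \forall i, j, \tag{2}$$

where $f_j$ denotes the local empirical risk at node $j$. To accomplish the decentralized ERM task, all nodes need to cooperate with each other by communicating with their neighbors over edges. We define the neighborhood of node $j$ as $\mathcal{N}_j := \{i \in \mathcal{J} : (i, j) \in \mathcal{E}\}$. If $i \in \mathcal{N}_j$, then $i$ is a neighbor of node $j$. Classic decentralized learning algorithms proceed iteratively. A node is expected to accomplish two tasks during each iteration: update the local variable according to some rule and broadcast a message to all its neighbors. Note that node $j$ can receive values from node $i$ only if $i \in \mathcal{N}_j$.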

2.2 Byzantine failure model

When there is no failure in the network, decentralized learning is well understood [19, 41]. The main assumption in this paper is that some of the nodes in the network can arbitrarily deviate from intended behavior. We model this behavior as Byzantine failure, formally defined as follows.

Definition 1

A node is said to be Byzantine if, during any iteration, it either updates its local variable using an update function other than the intended one or broadcasts some value other than the intended update to its neighbors.

We use $\mathcal{R}$ to denote the set of nonfaulty nodes and we assume that there are at most $b$ Byzantine nodes in the network. We label the nonfaulty nodes from 1 to $|\mathcal{R}|$ without loss of generality. We now provide some definitions and assumptions that are common in the literature, e.g., [13, 38].

Definition 2

A subgraph $\mathcal{G}_r$ of $\mathcal{G}$ is called a reduced graph if it is generated from graph $\mathcal{G}$ by (i) removing all Byzantine nodes along with all their incoming and outgoing edges, and (ii) removing additionally up to $b$ incoming edges from each nonfaulty node. A source component of graph $\mathcal{G}_r$ is a collection of nodes such that each node in the source component has a directed path to every other node in $\mathcal{G}_r$.
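The notion of a source component in Definition 2 can be illustrated with a short reachability check. This is only a sketch for a single, explicitly given directed graph: the function names and the adjacency-list format (each node mapped to the nodes that receive from it) are assumptions made here, and the sketch does not enumerate reduced graphs.

```python
from collections import deque


def reachable_from(adj_out, start):
    """Return the set of nodes reachable from `start` along directed edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj_out[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def source_component(adj_out):
    """Return the nodes that have a directed path to every other node."""
    nodes = set(adj_out)
    return {u for u in nodes if reachable_from(adj_out, u) == nodes}


# Hypothetical 4-node graph: 0 -> 1 -> 2 -> 0 is a cycle that also feeds node 3.
graph = {0: [1], 1: [2], 2: [0, 3], 3: []}
print(sorted(source_component(graph)))  # [0, 1, 2]
```

Node 3 is excluded because it has no outgoing path to the rest of the graph, while every node on the cycle can reach all others.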

Assumption 4

All reduced graphs $\mathcal{G}_r$ generated from $\mathcal{G}$ contain a source component of cardinality at least $b + 1$.

Assumption 4 describes the redundancy of a graph. What it ensures is that after removing a certain number of edges from nonfaulty nodes, each nonfaulty node can still receive information from a few other nonfaulty nodes. While checking this assumption efficiently remains an open problem, we do understand the generation of graphs that satisfy this assumption. One way of generating a resilient graph was introduced in the literature during the study of Byzantine-resilient consensus techniques, e.g., in [26]. We have also observed empirically that in an Erdős–Rényi graph, when the degree of the least connected node is larger than $2b$, the assumption is often satisfied. This is also the technique we use to generate graphs in our numerical experiments. We also emphasize that we do not need to know the exact number of Byzantine nodes. Constructing a graph with a chosen $b$ will enable our algorithm to tolerate at most $b$ Byzantine nodes, and the algorithm still has competitive performance even if there is actually no failure in the network. In the next section, we will introduce a Byzantine fault-tolerant algorithm for decentralized learning and theoretically analyze it under Assumptions 1, 2, 3, and 4. The algorithm is expected to accomplish the following tasks: (i) achieve consensus, i.e., $w_j(t) \to w_i(t)$ for all $i, j \in \mathcal{R}$ as the number of iterations $t \to \infty$; and (ii) learn a $w_j(t) \to w^*$ for all $j \in \mathcal{R}$ as sample size $N \to \infty$.
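The graph-generation heuristic described above can be sketched as follows. The sketch samples an undirected Erdős–Rényi graph and resamples until the smallest node degree exceeds a caller-supplied threshold. The function names, the example parameter values, and the retry loop are assumptions made for illustration; passing the degree check is only the empirical heuristic mentioned in the text and does not by itself verify Assumption 4.

```python
import numpy as np


def sample_erdos_renyi(num_nodes, edge_prob, rng):
    """Sample a symmetric 0/1 adjacency matrix with no self-loops."""
    upper = rng.random((num_nodes, num_nodes)) < edge_prob
    upper = np.triu(upper, k=1)            # keep each unordered pair once
    return (upper | upper.T).astype(int)   # symmetrize


def sample_graph_with_min_degree(num_nodes, edge_prob, min_degree,
                                 seed=0, max_tries=1000):
    """Resample until every node has degree strictly larger than min_degree."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        adj = sample_erdos_renyi(num_nodes, edge_prob, rng)
        if adj.sum(axis=1).min() > min_degree:
            return adj
    raise RuntimeError("no sampled graph met the degree threshold")


# Hypothetical settings, chosen only to make the example run quickly.
adj = sample_graph_with_min_degree(num_nodes=20, edge_prob=0.5, min_degree=4)
print(adj.sum(axis=1).min())
```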

3 Byzantine-resilient decentralized gradient descent

It is shown in [42] that the exact global optimum of (2) is not achievable when there is at least one Byzantine node ($b \geq 1$). In this section, we introduce an algorithm called Byzantine-resilient decentralized gradient descent (BRIDGE) that pursues the minimum of the statistical risk in the presence of Byzantine failures given that the training data are i.i.d.

3.1 Algorithm

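Input:  $b$, $\mathcal{Z}_j$, and step sizes $\{\rho(t)\}$ at node $j \in \mathcal{R}$
1:  $t \leftarrow 0$, $w_j(0) \leftarrow 0$
2:  for $t = 0, 1, 2, \dots$ do
3:     Broadcast $w_j(t)$
4:     Receive $w_i(t)$ from $i \in \mathcal{N}_j$
5:     for $k = 1, 2, \dots, d$ do
6:        $\underline{\mathcal{N}}_j^k(t) \leftarrow \arg\min_{\mathcal{X} \subset \mathcal{N}_j,\ |\mathcal{X}| = b} \sum_{i \in \mathcal{X}} [w_i(t)]_k$
7:        $\overline{\mathcal{N}}_j^k(t) \leftarrow \arg\max_{\mathcal{X} \subset \mathcal{N}_j,\ |\mathcal{X}| = b} \sum_{i \in \mathcal{X}} [w_i(t)]_k$
8:        $\mathcal{N}_j^k(t) \leftarrow \mathcal{N}_j \setminus \left( \underline{\mathcal{N}}_j^k(t) \cup \overline{\mathcal{N}}_j^k(t) \right)$
9:        $[w_j(t+1)]_k \leftarrow \frac{1}{|\mathcal{N}_j| - 2b + 1} \sum_{i \in \mathcal{N}_j^k(t) \cup \{j\}} [w_i(t)]_k - \rho(t) [\nabla f_j(w_j(t))]_k$
10:     end for
11:  end for
Output:  $w_j(t)$
Algorithm 1 Byzantine-resilient decentralized gradient descent (BRIDGE)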

When the network is fault free, one way of solving (2) is to let each node update its local variable as

$$w_j(t+1) = \sum_{i \in \mathcal{N}_j \cup \{j\}} a_{ji} w_i(t) - \rho(t) \nabla f_j(w_j(t)), \tag{3}$$

where is the weight and is a positive sequence satisfying , and . One way of choosing is to use the sequence as the following. Find a and set so that and is . Note that (3) is a special case of the distributed gradient descent (DGD) algorithm [14]. The main difference between the proposed algorithm and the classic DGD method is that there is a screening step before each update, which is the key idea to make BRIDGE Byzantine resilient. The complete process at each node is as shown in Algorithm 1.
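The fault-free update (3) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the setup used in the paper: the local losses are hypothetical quadratics $f_j(w) = \frac{1}{2}\|w - c_j\|^2$, the weights are uniform over a fully connected 3-node network, and the step size $1/(t+2)$ is just one diminishing sequence chosen for the example.

```python
import numpy as np


def dgd_step(W, A, grads, step):
    """One synchronous DGD iteration.

    W:     (M, d) array, row j is node j's local variable.
    A:     (M, M) row-stochastic weights, A[j, i] > 0 only if i is j or a neighbor of j.
    grads: (M, d) array, row j is node j's local gradient evaluated at W[j].
    """
    return A @ W - step * grads


# Hypothetical toy problem: node j minimizes 0.5 * ||w - C[j]||^2,
# so the average objective is minimized at the mean of the rows of C.
C = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
A = np.full((3, 3), 1.0 / 3.0)     # uniform weights, fully connected
W = np.zeros((3, 2))
for t in range(200):
    W = dgd_step(W, A, W - C, step=1.0 / (t + 2))
print(W)  # every row ends up close to the mean of C, i.e. [1, 1]
```

With no screening, every received vector enters the average, which is why a single arbitrary message can bias this update.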

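When initializing the algorithm, it is necessary to specify $b$, the maximum number of Byzantine nodes that the algorithm can tolerate. Each node $j$ initializes at $w_j(0) = 0$ or some arbitrary vector. During each iteration $t$, node $j$ first broadcasts $w_j(t)$ and then receives $w_i(t)$ from all $i \in \mathcal{N}_j$. Next, node $j$ performs a screening among all $w_i(t)$'s. The screening is with respect to each dimension separately. At dimension $k$, node $j$ separates $\mathcal{N}_j$ into three groups defined as follows:

$$\underline{\mathcal{N}}_j^k(t) = \arg\min_{\mathcal{X} \subset \mathcal{N}_j,\ |\mathcal{X}| = b} \sum_{i \in \mathcal{X}} [w_i(t)]_k, \tag{4}$$

$$\overline{\mathcal{N}}_j^k(t) = \arg\max_{\mathcal{X} \subset \mathcal{N}_j,\ |\mathcal{X}| = b} \sum_{i \in \mathcal{X}} [w_i(t)]_k, \tag{5}$$

and

$$\mathcal{N}_j^k(t) = \mathcal{N}_j \setminus \left( \underline{\mathcal{N}}_j^k(t) \cup \overline{\mathcal{N}}_j^k(t) \right). \tag{6}$$

Then node $j$ updates the $k$-th element of $w_j(t)$ as

$$[w_j(t+1)]_k = \frac{1}{|\mathcal{N}_j| - 2b + 1} \sum_{i \in \mathcal{N}_j^k(t) \cup \{j\}} [w_i(t)]_k - \rho(t) [\nabla f_j(w_j(t))]_k, \tag{7}$$

where $\rho(t)$ is the step size and $f_j$ is the local empirical risk at node $j$. The idea of performing screening before updating is to eliminate the $b$ largest and the $b$ smallest values in each dimension. This screening method is called coordinate-wise trimmed mean. The screening can be easily realized by a simple sorting process. Since the update of the $k$-th element does not depend on the updates of the other coordinates, the update of each dimension (steps 5 to 10) can be done in parallel or sequentially in any order. Note that $\mathcal{N}_j^k(t)$ is likely to be different for different $k$ at each iteration, so that only some elements of $w_i(t)$ for any $i \in \mathcal{N}_j$ may be taken for update at node $j$ while other elements are dropped. But the size of each $\mathcal{N}_j^k(t)$ is the same in all dimensions. We now give theoretical guarantees for this algorithm.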
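The screening and update described above can be sketched with a per-coordinate sort. This is a minimal sketch for a single node and a single iteration, assuming the received vectors are stacked into a NumPy array; the function name, argument names, and example values are hypothetical, and ties among equal values are broken arbitrarily by the sort.

```python
import numpy as np


def bridge_update(w_self, w_neighbors, grad_self, b, step):
    """One screened update at a single node.

    w_self:      (d,) the node's own variable.
    w_neighbors: (n, d) vectors received from the n neighbors.
    grad_self:   (d,) local gradient evaluated at w_self.
    b:           number of values trimmed from each end, per coordinate.
    """
    n, _ = w_neighbors.shape
    if n <= 2 * b:
        raise ValueError("need more than 2*b neighbors to trim b per side")
    sorted_vals = np.sort(w_neighbors, axis=0)   # sort each coordinate independently
    kept = sorted_vals[b:n - b]                  # drop the b smallest and b largest
    avg = (kept.sum(axis=0) + w_self) / (n - 2 * b + 1)
    return avg - step * grad_self


# Hypothetical example: two of the five received vectors are wild outliers.
neighbors = np.array([[0.0, 2.0], [2.0, 0.0], [100.0, -100.0],
                      [1.0, 1.0], [-50.0, 50.0]])
w_next = bridge_update(np.array([1.0, 1.0]), neighbors,
                       grad_self=np.zeros(2), b=1, step=0.1)
print(w_next)  # [1. 1.]
```

Because the sort is applied column by column, different neighbors can survive the trimming in different coordinates, which matches the remark that the retained set may differ across dimensions while its size stays the same.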

Theorem 1

If Assumptions 1, 2, 3, and 4 are satisfied, BRIDGE can achieve consensus on all nonfaulty nodes, i.e., $w_j(t) = w_i(t)$ for all $i, j \in \mathcal{R}$, as $t \to \infty$. Further, as $N \to \infty$, the output of BRIDGE converges sublinearly in $t$ to the minimum of the global statistical risk at each nonfaulty node, i.e., $w_j(t) \to w^*$ for all $j \in \mathcal{R}$, with probability going to 1.

The theorem shows that the BRIDGE algorithm can learn a good model even when there are Byzantine failures in the network. To achieve this goal, the algorithm needs to accomplish two tasks: consensus and optimality. Consensus requires that all nonfaulty nodes agree on the same variable despite the existence of Byzantine failures in the network, while optimality requires that the globally agreed model indeed minimizes the statistical risk. In the next section, we will prove the theorem for consensus and optimality, respectively.

4 Theoretical analysis

While gradient descent is well understood in the literature, we observe that BRIDGE does not take a regular gradient step at each iteration. The main idea of proving Theorem 1 is to take advantage of the convergence property of gradient descent and try to bound the distance between one gradient descent step and one BRIDGE step. The proof can be briefly described as the following. The BRIDGE local update sequence is described in (7). We first define a vector sequence and show that as , which is the proof for consensus. We then consider three sequences , , and that will be defined later. We define the following distances: , , , and . Observe that , we then show that , , and all go to 0. This is the proof for optimality.

4.1 Consensus analysis

Recall that the update is done in parallel for all coordinates. So we pick one coordinate $k$ and prove that all nodes achieve consensus in this coordinate. Since $k$ is arbitrarily picked, we then conclude that consensus is achieved for all coordinates. In this section, we drop the index $k$ for all variables for simplicity. It should be straightforward that the variables are $k$-dependent.

Define a vector whose elements are the -th elements of from nonfaulty nodes only, i.e., . We first show that the update can be written in a matrix form which only involves nonfaulty nodes, i.e.,

(8)

where is formed as . The formulation of matrix can be described as following. Let denote the nonfaulty nodes in the neighborhood of node , i.e., . The set of Byzantine neighbors can be defined as . One of two cases can happen during each iteration, () or () . To make the expression clear, we drop the iteration indicator for the rest of this discussion. It should be straightforward that the variables are -dependent. For case (), since and , we know that . Similarly, . Then and satisfying for any . So that for each , satisfying . In this way, we can express the update with only messages from nonfaulty nodes. The elements of matrix can be written as

(9)

For case (), since all nodes in are already nonfaulty, we keep only the first, second and last rows of (9). Note that since the choices of and are generally not unique, the formulation of matrix is also not unique. So far, we have expressed the update of nonfaulty nodes within matrix form involving only nonfaulty nodes.

Next, define a transition matrix to represent the product of ,

(10)

Let be the total number of reduced graphs we can generate from . Let . Denote by . Let . Then it is known from previous work [42, 43] that

(11)

where satisfies and . It can also be expressed as

(12)

Taking as the starting point, we can express the iterations as

(13)

Let us create a scenario that all nodes stop computing local gradients after iteration so that when . Define a vector under this scenario, i.e.,

(14)

Observe that has identical elements in all dimensions. Let scalar sequence denote one element of . Next, we show that as . From (14),

(15)

Then recall from the update of that

(16)

If Assumption 3 holds and we initialize the algorithm from some vector with finite norm, we can always find two scalars and satisfying that , and . Then we have

(17)

as . Since is arbitrarily picked, the convergence is true for all dimensions. Define a vector satisfying for . Then as ,

(18)

The convergence in (18) can also be interpreted as as . The proof of consensus is complete.

Remark 1

It follows from (18) that the rate of consensus convergence is . Specifically, if choosing to be gives us .

4.2 Optimality analysis

From (18), we have an upper bound for . We then bound the other distances to show . Note that the sequence is not truly kept at any node, so we first describe the “update” of . Considering (14) for all coordinates together, the update for the full vector can be written in the form

(19)

where satisfies for . Define another vector satisfying . We define a new sequence as

(20)

Recalling that

(21)

from (4.1) and Assumption 3 we have

(22)

Next, defining a new sequence as

(23)

we have

(24)

Here we give a lemma to show the relationship between and the gradient of the statistical risk.

Lemma 1

If Assumptions 1 and 3 are satisfied, with probability at least ,

(25)

where satisfies and .

Lemma 1 shows that converges to the gradient of statistical risk in probability. The proof of Lemma 1 is in Appendix A.

Remark 2

Lemma 1 shows that the difference between the gradient step of and the true gradient step is of order . If there is no failure in the network, the gradient step for a non-resilient algorithm such as DGD usually has an error rate . If each node runs a centralized algorithm with the given samples, the error rate for the gradient step is usually . Since is a stochastic vector, . The lemma shows that BRIDGE improves the sample complexity by a factor of for each node by cooperating over a network.

Now we focus on . Note that is obtained by taking a regular gradient descent step of with step size . When Assumptions 2 and 3 are satisfied, it is well understood [44, Ch. 9] that the gradient descent step satisfies

(26)

Then we have

(27)

Now we can write the property of sequence for some as

(28)

It follows from (18), (22), and Lemma 1 that with probability at least ,

(29)

Remark 3

When choosing to be , the second term on the right hand side of (29) is . Note that the right hand side of (29) converges to as and . Given that , inequality (29) shows a sublinear convergence rate.

We have shown in the consensus analysis. We then show that in Appendix B. The analysis of Theorem 1 is complete.

5 Numerical analysis

The numerical experiments are separated into two parts. In the first part, we run experiments on the MNIST dataset using a linear classifier with squared hinge loss, which is a case that fully satisfies all our assumptions for the theoretical guarantees and is broadly adopted in solving real-world machine learning problems. In the second part, we run experiments on the MNIST dataset with a convolutional neural network, which does not fully satisfy the assumptions of Theorem 1. The purpose is to address the usefulness of our Byzantine-resilient technique on a more general (nonconvex) class of machine learning problems.

5.1 Linear classifier on MNIST

Figure 1: Classification accuracy on MNIST dataset for different learning methods. When there is no failure in the network, DGD does have the best performance in terms of both communication efficiency and final accuracy, but DGD completely fails where there are Byzantine nodes in the network. ByRDiE and BRIDGE have similar final accuracy under Byzantine settings and the accuracy gap to faultless DGD is small, which indicates that both algorithms are indeed Byzantine resilient. The difference between ByRDiE and BRIDGE is in the communication efficiency. Since BRIDGE updates the whole vector at each iteration, fewer communication iterations are required to reach the best accuracy.

The first set of experiments is performed to demonstrate two facts: BRIDGE can maintain good performance under Byzantine settings while classic distributed learning methods fail; and, compared with an existing Byzantine-resilient method, ByRDiE [38], BRIDGE is more efficient in terms of communication cost. We choose one of the most well-understood machine learning tools, the linear classifier with squared hinge loss, to learn the model. Note that this method satisfies the assumption of a strictly convex loss function with Lipschitz gradient.

The MNIST dataset is a set of 60,000 training images and 10,000 test images of handwritten digits from ‘0’ to ‘9’. Each image is converted to a 784-dimensional vector and we distribute the 60,000 training images equally over 100 nodes. Then we connect each pair of nodes with probability . Ten of the nodes are randomly picked to be Byzantine nodes, which broadcast random vectors to all their neighbors during each iteration. We check and make sure the network satisfies Assumption 4 with . The classifiers are trained using the “one vs all” strategy. We run five sets of experiments: (i) classic distributed gradient descent (DGD) with no Byzantine nodes; (ii) classic DGD with 10 Byzantine nodes; (iii) centralized gradient descent with only local data; (iv) BRIDGE with 10 Byzantine nodes; and (v) ByRDiE with 10 Byzantine nodes. The performance is evaluated by two metrics: classification accuracy on the 10,000 test images and whether consensus is achieved. When comparing ByRDiE and BRIDGE, we compare the accuracy with respect to the number of communication iterations.

The results are shown in Table 1. When there are no failures in the network, the performance of DGD is aligned with prior work [45]. However, in the presence of Byzantine failures, DGD fails in the sense that it can neither learn a good classifier nor achieve consensus. Two aspects are worth considering for the BRIDGE algorithm. The first is that the gap between the performance of BRIDGE under failure and DGD under no failure is small, whereas the gap between BRIDGE and local gradient descent is large. This shows the necessity of a robust distributed learning method: by cooperating with more nodes in the network, one can achieve better performance than by using only local data. The second is that, compared with ByRDiE, BRIDGE has better communication efficiency. This is primarily because BRIDGE updates the variables in all dimensions for each message exchange while ByRDiE only updates one dimension at a time.
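The qualitative contrast between unscreened averaging and screened averaging can be reproduced on a toy problem. The sketch below is not the MNIST experiment: it assumes a fully connected network, hypothetical quadratic local losses whose minimizers lie near the all-ones vector, Byzantine nodes that broadcast large Gaussian noise, and a $1/(t+2)$ step size. All names and parameter values are illustrative.

```python
import numpy as np


def run(num_nodes=10, num_byz=2, b=2, dim=3, iters=300, screen=True, seed=0):
    """Simulate the honest nodes; return their final variables, one row each."""
    rng = np.random.default_rng(seed)
    # Local minimizers c_j, clustered around the all-ones vector.
    targets = 1.0 + 0.1 * rng.standard_normal((num_nodes, dim))
    W = np.zeros((num_nodes, dim))
    honest = np.arange(num_byz, num_nodes)       # nodes 0..num_byz-1 are Byzantine
    for t in range(iters):
        sent = W.copy()
        sent[:num_byz] = 100.0 * rng.standard_normal((num_byz, dim))  # arbitrary broadcasts
        step = 1.0 / (t + 2)
        new_W = W.copy()
        for j in honest:
            others = np.delete(sent, j, axis=0)  # complete graph: all other nodes
            if screen:
                # Per coordinate, drop the b smallest and b largest received values.
                kept = np.sort(others, axis=0)[b:len(others) - b]
            else:
                kept = others
            avg = (kept.sum(axis=0) + W[j]) / (len(kept) + 1)
            new_W[j] = avg - step * (W[j] - targets[j])  # gradient of 0.5*||w - c_j||^2
        W = new_W
    return W[honest]


screened = run(screen=True)
plain = run(screen=False)
print("max deviation from 1, screened:", np.abs(screened - 1.0).max())
print("max deviation from 1, plain:   ", np.abs(plain - 1.0).max())
```

In this toy setting the screened run stays near the honest minimizers, while the unscreened run is dragged around by the Byzantine broadcasts at every iteration.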


| Algorithm | Failures | Accuracy (%) | Consensus |
|---|---|---|---|
| DGD | 0 | 89.8 | |
| Local GD | N/A | 84.3 | N/A |
| DGD | 10 | 10.3 | |
| ByRDiE | 10 | 89.2 | |
| BRIDGE | 10 | 89.3 | |

Table 1: Linear classifier “one vs all” on MNIST dataset

5.2 Convolutional neural network on MNIST

In Section 4, we have given theoretical guarantees under the assumption of strictly convex risk functions. However, there is a wide class of modern machine learning problems that are nonconvex but have very good performance (e.g., deep neural networks). In this set of experiments, we demonstrate the usefulness of the screening technique in BRIDGE on a smaller scale but for a highly nonconvex problem: a distributed convolutional neural network. We create a network with 10 nodes and 500 training samples on each node. Each local neural network consists of two convolution layers, each followed by a max pooling layer, and two fully connected layers before the output. The label of each sample is represented by a one-hot encoding. We randomly pick one node to be a Byzantine node, which broadcasts random values to its neighbors during each iteration. The distribution of the random values is identical to that used for the random initialization of each layer. The way we generate the training sets and the network topology is identical to the previous test. We pick the Adam optimizer [46] as the local update method. Adam is an extension of stochastic gradient descent and is known to have better performance in the setting of these experiments. The algorithm proceeds as follows: each node takes a local Adam step with a batch size of 50 and then broadcasts the network weights to its neighbors; after receiving the weights from its neighbors, each node takes the average of its neighbors’ weights (with BRIDGE screening if required); each node repeats the process for 1000 epochs. We run four rounds of experiments and average each round over 100 independent trials: (i) Adam with no screening and no failure; (ii) Adam with no screening under failure; (iii) BRIDGE (Adam with screening); and (iv) Adam with only local data. The results are shown in Table 2.

The results show that Adam with no screening fails when there are failures in the network. The small gap between BRIDGE performance and faultless Adam performance indicates that the Byzantine-resilient technique can also be applied to nonconvex machine learning tools. The performance difference between BRIDGE and training with only local data shows that BRIDGE can benefit from cooperation with other nodes even when there are failures in the network. We emphasize here again that in the fully distributed setting, one Byzantine node with a reasonable normed value is enough to bring down the whole network. In contrast to the federated settings [33, 34], it does not need a portion of nodes to be faulty or an extremely large value (gradient with 100 times larger elements) to show an obvious performance difference. This is because a Byzantine node can bias the normal nodes through consensus process, which is not a part of federated setting algorithms. This shows that Byzantine failures are much more dangerous and harder to safeguard against in fully distributed settings. Thus Byzantine-resilient algorithms in fully distributed settings are of great interest in real-world applications.


| Algorithm | Failures | Accuracy (%) | Consensus |
|---|---|---|---|
| Dis-Adam | 0 | 96.2 | |
| Local Adam | N/A | 92.3 | N/A |
| Dis-Adam | 1 | 10.3 | |
| BRIDGE | 1 | 95.8 | |

Table 2: Convolutional neural network on MNIST dataset

6 Conclusion

This paper has introduced a new decentralized machine learning algorithm called Byzantine-resilient decentralized gradient descent (BRIDGE). This algorithm is designed to solve machine learning problems when the training set is distributed over a network in the presence of Byzantine failures. Theoretical analysis and numerical results have been given to show that the algorithm can learn good models while being able to tolerate a certain number of Byzantine nodes in the network. This is in contrast to the fact that classic distributed learning algorithms fail under Byzantine failure.

Appendix A Proof of Lemma 1

First we observe at some dimension ,

(30)

Since is arbitrarily picked, it is also true that

(31)

Note that depends on and depends on both and . We need to show that the convergence is simultaneously true for all and . We fix one coordinate and drop the index for simplicity. We define a vector as . Then . We know from Hoeffding’s inequality [47]:

(32)
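For reference, the generic textbook form of Hoeffding's inequality for independent bounded random variables is recalled below. This is the standard statement, written with placeholder symbols ($X_i$, $a_i$, $b_i$, $n$, $\epsilon$) that are assumptions of this note; it is not necessarily the exact form or notation of (32).

```latex
% Hoeffding's inequality, generic form: X_1, ..., X_n independent,
% with X_i \in [a_i, b_i] almost surely, and any \epsilon > 0.
\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} \bigl( X_i - \mathbb{E}[X_i] \bigr) \right| \geq \epsilon \right)
\leq 2 \exp\left( - \frac{2 n^2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)
```

When all variables share the same range of width $c$, the bound simplifies to $2\exp(-2n\epsilon^2/c^2)$, which is the form in which the sample size appears in the exponent.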

Further, since the -dimensional vector is an arbitrary element of the standard simplex, defined as

(33)

the probability bound in (32) also holds for any , i.e.,

(34)

We now define the set . Our next goal is to leverage (34) and derive a probability bound similar to (32) that uniformly holds for all . To this end, let

(35)

denote an -covering of in terms of the norm and define . It then follows from (34) and the union bound that

(36)

In addition, we have

(37)

where () is due to triangle and Cauchy–Schwarz inequalities. Trivially, from the definition of , while from the definition of and Assumption 3. Combining (36) and (37), we get

(38)

We now define