Towards Byzantine-resilient Learning in Decentralized Systems

02/20/2020 · by Shangwei Guo, et al.

With the proliferation of IoT and edge computing, decentralized learning is becoming more promising. When designing a distributed learning system, one major challenge to consider is Byzantine Fault Tolerance (BFT). Past works have researched Byzantine-resilient solutions for centralized distributed learning. However, there are currently no satisfactory solutions with strong efficiency and security for decentralized systems. In this paper, we propose a novel algorithm, Mozi, to achieve BFT in decentralized learning systems. Specifically, Mozi provides a uniform Byzantine-resilient aggregation rule for benign nodes to select the useful parameter updates and filter out the malicious ones in each training iteration. It guarantees that each benign node in a decentralized system can train a correct model under very strong Byzantine attacks with an arbitrary number of faulty nodes. We perform theoretical analysis to prove the uniform convergence of our proposed algorithm. Experimental evaluations demonstrate the high security and efficiency of Mozi compared to all existing solutions.

I Introduction

The rapid development of edge computing and deep learning technologies leads to the era of Artificial Intelligence of Things. Nowadays, it is a trend to learn and update powerful Deep Learning (DL) models on edge devices [li2018learning, zhang2018analyzing, zhang2019deep, wang2019e2, sun2020stealthy]. In contrast to the traditional approaches where training data and tasks are outsourced to third-party cloud platforms, model training in edge computing provides greater reliability and privacy protection. One typical example is the federated learning system [mcmahan2016communication, bonawitz2019towards], where multiple edge devices collaborate to train a shared DL model.

However, federated learning introduces a centralized parameter server, which brings new security and efficiency drawbacks [lian2017can, bonawitz2019towards, xie2019diffchaser]. First, federated learning suffers from a single point of failure: the functionality of the system highly depends on the parameter server, and if the server crashes or is hacked, the entire system breaks down, affecting all the edge devices. Second, the centralized parameter server can become a performance bottleneck, particularly when a large number of edge devices are connected to it.

Due to the limitations of federated learning, there is a growing trend towards training a DL model in a decentralized fashion [lian2017can, dobbe2017fully, tang2018communication, lalitha2019decentralized]. Specifically, each node (e.g., edge device) in the decentralized network plays an equal role (both training and aggregating parameters) in learning the model [tsitsiklis1984problems, nedic2009distributed]. This decentralized mode exhibits promising applications in many scenarios: in the Internet of Vehicles [gulati2018deep], cars can collaboratively learn powerful traffic sign classification models for autonomous driving; in a social network, users can learn better recommendation models by communicating with their friends [bellet2018personalized].

A distributed system can be threatened by the famous Byzantine Generals Problem [lamport1982byzantine]: a portion of nodes inside the network can behave inappropriately and propagate wrong information, leading to the failure of the entire system. This is particularly dangerous in both centralized and decentralized distributed learning scenarios. Distributed learning requires the collaboration of thousands of edge devices from different domains and parties, so it is impossible to guarantee that every device is trusted and reliable. A single dishonest node can send wrong parameters/estimates to affect the entire network and the final results. Past works have focused on centralized distributed and federated learning with Byzantine-resilient Stochastic Gradient Descent (SGD) solutions [blanchard2017machine, xie2018generalized, mhamdi2018hidden, chen2018draco, xie2019zeno, xie2019slsgd]. The essential task for the parameter server is to distinguish between benign and malicious gradients and select the potentially benign ones for the model update. Based on the selection criteria, current solutions fall into one of two categories: (1) distance-based defenses: the parameter server selects the gradients closer to the mean or median values in terms of vector distance [xie2018generalized, blanchard2017machine, mhamdi2018hidden, chen2018draco, damaskinos2018asynchronous]; (2) performance-based defenses: the server measures the model performance of each gradient using an extra validation dataset and chooses the ones with better performance [xie2019zeno, fang2019local].

However, there are very few studies on BFT in decentralized learning systems. It is challenging to apply the above solutions to the decentralized scenario, for two reasons. First, existing defenses already have security and efficiency drawbacks when protecting centralized systems. It has been shown that distance-based defenses are vulnerable to elaborately designed Byzantine attacks [baruch2019little, fang2019local]. Performance-based defenses incur large computation overhead and scalability issues, since they conduct a performance evaluation for every update; besides, they require the parameter server to hold an extra validation dataset, which is not realistic. These limitations would persist if the defenses were extended to decentralized systems. The second reason lies in the huge differences between centralized and decentralized learning systems. Existing defenses are mainly designed for one parameter server, whereas each device in decentralized learning acts not only as a worker node, but also as a “parameter server”. In addition, the number of neighbors connected to each device varies dramatically, so some assumptions made in the centralized defenses no longer hold. To the best of our knowledge, there are currently only two research papers [yang2019byrdie, yang2019bridge] attempting to achieve Byzantine-resilient decentralized learning; both adopt the distance-based strategy, which is vulnerable to sophisticated Byzantine attacks, as demonstrated in the evaluation section of this paper.

In this paper, we propose Mozi¹, a novel and efficient Byzantine-resilient algorithm for decentralized learning systems. Mozi has several innovations. (1) Mozi integrates both distance-based and performance-based strategies to detect Byzantine parameters: the distance-based strategy reduces the scope of candidate nodes for better performance, and the performance-based strategy further removes all the abnormal nodes for strong Byzantine Fault Tolerance. (2) In the distance-based selection phase, each benign node adopts its own parameters as the baseline instead of the mean or median value of the received untrusted parameters. This enhances the quality of the selected nodes and defeats the sophisticated Byzantine attacks that attempt to manipulate the baseline value [xie2018generalized]. (3) In the performance-based selection phase, each benign node uses its own training samples to test the performance of the selected parameters. This relaxes the unrealistic assumption of centralized performance-based strategies that the server has a validation dataset available.

¹Mozi is the name of a famous ancient Chinese book on defensive warfare.

We theoretically analyze our proposed algorithm and prove that the learned estimates of all nodes in the decentralized system converge uniformly to an ideal one. Moreover, the convergence of decentralized systems still holds even in the presence of strong Byzantine attacks, i.e., with an arbitrary number of Byzantine nodes, as long as the connectivity assumption is satisfied. We conduct comprehensive experiments to show that Mozi is tolerant against both simple and sophisticated malicious behaviors, while all existing defense solutions fail. Mozi also achieves an 8-30X performance improvement over existing popular defenses.

The key contributions of this paper are:

  • We formally define the Byzantine Generals Problem in decentralized learning and analyze the vulnerabilities.

  • We propose Mozi, a uniform Byzantine-resilient aggregation rule, to defeat an arbitrary number of Byzantine nodes in decentralized systems.

  • We present theoretical analysis to establish the computational complexity and prove the convergence of Mozi.

  • We conduct extensive experiments to show that Mozi outperforms other solutions in both security and performance.

The rest of this paper is organized as follows. Related works are reviewed in Section II. Section III gives formal definitions of the Byzantine Generals Problem in decentralized systems and analyzes the vulnerabilities. Section IV presents our solution, Mozi, followed by the theoretical analysis. Section V shows the experimental results under various attacks and system settings. Section VI concludes the paper.

II Background and Related Work

II-A Byzantine-resilient Centralized Learning

A centralized distributed learning system consists of a parameter server and multiple distributed worker nodes, as shown in Figure 1 (a). Every worker node has its own training dataset, but adopts the same training algorithm. In each iteration, a worker node 1) pulls the gradient from the parameter server and 2) calculates the gradient based on its local data. Then, the worker node 3) uploads the new gradient to the parameter server. After receiving all gradients from the worker nodes, the server 4) aggregates them into one gradient vector. The nodes repeat the above steps from the new gradient, until the training process is terminated and a model is produced.
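To make the protocol concrete, here is a minimal NumPy sketch of one such iteration (illustrative only; the worker objects with a params array and a local_gradient method are our assumptions, not an interface from the paper):

import numpy as np

def ps_iteration(agg_grad, workers, lr=0.05):
    """One iteration of the centralized parameter-server protocol (sketch)."""
    grads = []
    for w in workers:
        w.params = w.params - lr * agg_grad  # 1) pull the aggregated gradient and apply it
        g = w.local_gradient(w.params)       # 2) compute a gradient on local data
        grads.append(g)                      # 3) upload the new gradient to the server
    return np.mean(grads, axis=0)            # 4) server aggregates into one gradient vector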

Fig. 1: Distributed learning systems in centralized (left) and decentralized (right) fashions.

Dishonest worker nodes in the system can compromise the training process and the final model by uploading wrong gradients [mhamdi2018hidden, baruch2019little, fang2019local]. It is necessary for the parameter server to detect such Byzantine nodes and discard their updates when aggregating the gradients. Existing defenses can be categorized into two classes, as described below.

Distance-based solutions.  This type of solution clusters the uploaded gradients and detects outliers based on vector distances. For instance, Blanchard et al. proposed Krum [blanchard2017machine], which chooses the gradient vector with the minimal sum of squared distances to its neighbors as the aggregated one. Xie et al. designed median-based aggregation rules [xie2018generalized], which inspect the gradient vectors and take the median value in each dimension to defeat Byzantine attacks. Mhamdi et al. introduced Bulyan [mhamdi2018hidden] to further enhance existing Byzantine-resilient aggregation rules by combining Krum with median-based aggregation. Although these defenses can defeat simple Byzantine attacks such as Gaussian and bit-flip attacks [blanchard2017machine, xie2018generalized], they were shown to be vulnerable to more sophisticated attacks [baruch2019little, fang2019local]. The reason lies in a fundamental weakness of distance-based strategies: a close distance between two gradients does not imply similar performance. Thus, these sophisticated attacks can craft gradients that are malicious but indistinguishable from benign gradients by distance.
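As a concrete illustration of the distance-based criterion, the following NumPy sketch scores gradients the way Krum does (a simplified rendering of the published rule; f is the assumed number of Byzantine workers):

import numpy as np

def krum_select(grads, f):
    """Select the gradient with the minimal Krum score (sketch).

    Each vector is scored by the sum of squared distances to its
    n - f - 2 closest neighbors; the lowest-scoring vector wins.
    """
    G = np.asarray(grads)
    n = len(G)
    scores = []
    for i in range(n):
        dists = np.sort(np.sum((G - G[i]) ** 2, axis=1))
        scores.append(np.sum(dists[1 : n - f - 1]))  # skip self (distance 0)
    return G[int(np.argmin(scores))]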

Performance-based solutions.  This type of solution selects the benign nodes by evaluating the performance of each uploaded gradient. For instance, Xie et al. proposed Zeno [xie2019zeno] and Zeno++ [xie2019zeno++] for synchronous and asynchronous learning systems, respectively. Both Zeno and Zeno++ calculate the prediction accuracy of each gradient on an extra validation dataset to identify Byzantine nodes. However, they require the parameter server to have a validation dataset, which is not realistic under some circumstances. Besides, evaluating the performance of every gradient incurs much more overhead than comparing parameter vectors. This can significantly increase the total training time and the computation burden of the parameter server, especially when the number of worker nodes is large.

II-B Byzantine-resilient Decentralized Learning

Decentralized learning systems remove the parameter server, and every node in the network is also responsible for model updates. The architecture of a decentralized learning system is illustrated in Figure 1 (b). Specifically, in each iteration of the training process, each worker node 1) broadcasts its parameter vector (estimate²) to its neighbor nodes and receives the estimates from them; 2) trains the model estimate using its local data; and 3) aggregates its estimate with the neighbor nodes' estimates and updates the model.

²“Parameter vector” and “estimate” are used interchangeably.

Compared to centralized learning, research on Byzantine-resilient decentralized learning is still at an early stage. Yang et al. proposed ByRDiE [yang2019byrdie] and BRIDGE [yang2019bridge], which directly apply the trimmed-median algorithm from centralized systems [xie2018generalized] to decentralized systems. While ByRDiE is designed for the coordinate descent optimization algorithm, BRIDGE targets decentralized learning systems with SGD. These two solutions follow the distance-based design strategy, and are thus vulnerable to certain Byzantine attacks [mhamdi2018hidden, baruch2019little]. In Section V, we demonstrate their incapability of defeating sophisticated Byzantine attacks.

III Problem Statement

In this section, we formally define a decentralized learning system under Byzantine faults and illustrate the vulnerability as well as the attack consequences.

III-A Decentralized Systems

A decentralized system is defined as an undirected graph $G = (V, E)$, where $V$ denotes a set of nodes and $E$ denotes a set of edges representing communication links. Specifically, we have

  • $(i, j) \in E$ if and only if node $i$ can receive information from node $j$;

  • $(j, i) \in E$ if $(i, j) \in E$.

Let $\mathcal{N}_i$ be the set of the neighbors of node $i$. We further assume that $b$ out of $n$ nodes are benign and the rest are malicious. We can define a subgraph that only contains the benign nodes:

Definition 1.

(Benign Induced Subgraph) The benign induced subgraph, $G^b = (V^b, E^b)$, is a subgraph of $G$, formed by all the benign nodes in $V$ and all the edges connecting those benign nodes. Specifically,

  • $i \in V^b$ if $i$ is a benign node and $i \in V$;

  • $(i, j) \in E^b$ if and only if $i, j \in V^b$ and $(i, j) \in E$;

  • $(j, i) \in E^b$ if $(i, j) \in E^b$.

Following the information exchange models in [tsitsiklis1984problems, nedic2009distributed], we assume the benign induced subgraph is connected, i.e., given two arbitrary benign nodes $i$ and $j$, there always exists at least one path that connects these two nodes. We formally state the assumption as below:

Assumption 1.

(Connectivity of Benign Induced Subgraph) There exists an integer $B$ such that for $\forall i, j \in V^b$, node $i$ can propagate its information to node $j$ through at most $B$ edges.

III-B Model Training under Byzantine Fault

In the decentralized learning system, benign nodes cooperatively train a model by optimizing the loss function with SGD and exchanging estimates with their neighbors, while Byzantine nodes attempt to tamper with this training process. Let $x \in \mathbb{R}^d$ be the $d$-dimensional estimate vector of a DL model and $F(x; \xi)$ be the loss function. Each node $i$ obtains a training dataset $S_i$, consisting of independent and identically distributed (i.i.d.) data samples from a distribution $\mathcal{D}$. These nodes train a shared model by solving the following optimization problem in the presence of Byzantine nodes:

$$\min_{x \in \mathbb{R}^d} \; \mathbb{E}_{\xi \sim \mathcal{D}}\left[F(x; \xi)\right],$$

where $\xi$ is a training data sample from $\mathcal{D}$.

At the $t$-th iteration, node $i$ has its local estimate denoted as $x_i^t$, and broadcasts it to its neighbors. When receiving the estimates from the neighbors, node $i$ updates its local estimate according to the General Update Function (GUF):

Definition 2.

(GUF) Let $x_i^t$, $g_i^t$ be the estimate and gradient of node $i$ in the $t$-th iteration, and let $\{x_j^t : j \in \mathcal{N}_i\}$ be the estimates from its neighbors. $\mathcal{A}(\cdot)$ is an aggregation rule. Node $i$ updates its estimate for the $(t+1)$-th iteration using the following general update function:

$$x_i^{t+1} = \alpha \, x_i^t + (1 - \alpha) \, \mathcal{A}\left(\{x_j^t : j \in \mathcal{N}_i\}\right) - \gamma \, g_i^t, \tag{1}$$

where $\gamma$ is the learning rate and $\alpha \in [0, 1]$ is a hyper-parameter that balances the weights of the estimates.

Without loss of generality, we assume all the nodes have the same learning rate. The stochastic gradient can be replaced with a mini-batch of stochastic gradients [lian2017can].

A straightforward choice is the average aggregation rule:

$$\mathcal{A}\left(\{x_j^t : j \in \mathcal{N}_i\}\right) = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} x_j^t, \tag{2}$$

with $\alpha$ in Equation 1 set to a fixed constant. However, because the average aggregation in Equation 2 does not consider BFT, this training process can be easily compromised by Byzantine attacks: an adversary can use just one malicious node to send wrong estimates to its neighbors and alter their aggregated estimates. More seriously, due to the connectivity of the benign induced subgraph, this fault will also be propagated to other nodes not directly connected to this Byzantine node after several iterations, until eventually all the nodes in the network are affected.
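For illustration, a minimal sketch of the GUF (Equation 1) with the average aggregation of Equation 2 (the constants alpha and lr are illustrative defaults):

import numpy as np

def average_update(x_i, neighbor_estimates, grad_i, alpha=0.5, lr=0.05):
    """GUF (Equation 1) with average aggregation (Equation 2) -- not Byzantine-resilient.

    Note: with m neighbors, a single Byzantine neighbor that broadcasts
    x + m * delta shifts the neighborhood mean by delta, and the fault
    then propagates through the network in later iterations.
    """
    aggregated = np.mean(np.asarray(neighbor_estimates), axis=0)
    return alpha * x_i + (1.0 - alpha) * aggregated - lr * grad_i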

Lemma 1.

Consider a decentralized system in which $k$ is a Byzantine node attempting to add a malicious vector $\delta$ to a benign node $j$'s estimate. Let the shortest distance (i.e., number of edges) between them be $m$, and let $v_1, \dots, v_{m-1}$ be the benign nodes on the shortest path between $k$ and $j$, where the distance between node $v_l$ and $j$ is $m - l$. Then at the $t$-th iteration, node $k$ can broadcast to its neighbors the following estimate to achieve this goal within $m$ iterations:

(3)

where $n_{v_l}$ is the number of neighbors of node $v_l$.

Proof.

See Appendix A. ∎

III-C Byzantine-resilient Aggregation Rule

It is necessary to design a robust aggregation rule in a decentralized system to defeat Byzantine nodes. This rule should guarantee that all benign nodes converge to the optimal estimate learned without Byzantine nodes. We define the Byzantine-resilient aggregation rule as below:

Definition 3.

(Uniform Byzantine-resilience) Consider a decentralized system under Byzantine Fault. Let $x^*$ be the optimal model. An aggregation rule $\mathcal{A}(\cdot)$ is said to be uniform Byzantine-resilient if, for all benign nodes, the distances between their estimates and $x^*$ are uniformly bounded. Formally, there exists a constant $\Delta_1$ such that for $\forall i \in V^b$, $\forall t$,

$$\mathbb{E}\left\|x_i^t - x^*\right\| \leq \Delta_1. \tag{4}$$

III-D Existing Byzantine Defenses

The aggregation rule must be designed to be secure against Byzantine attacks, and also guarantee convergence. We can extend existing aggregation rules from centralized systems to the decentralized case, as the estimates and gradients have the same dimensionality and network structure. We consider three popular Byzantine-resilient solutions: Krum [blanchard2017machine], marginal median [xie2018generalized], and Bulyan [mhamdi2018hidden]. Other solutions can be extended in similar ways.

It is worth noting that in a centralized distributed system, a maximal number of Byzantine worker nodes connected to the parameter server is usually assumed. This does not hold in a decentralized system, as the number of Byzantine nodes connected to each benign node varies greatly. To meet this assumption, we approximately calculate the number of Byzantine nodes allowed for each benign node in Equation 5. In this equation, $b_i$ is the number of Byzantine nodes connected to node $i$, $r$ is the maximal ratio of Byzantine nodes allowed in a centralized distributed defense, and $\lceil \cdot \rceil$ is the ceiling function.

$$b_i = \left\lceil r \cdot |\mathcal{N}_i| \right\rceil \tag{5}$$

DKrum.  Similar to Krum, let $j, k$ be two neighbors of node $i$, and let $j \rightarrow k$ denote that $x_k^t$ belongs to the $|\mathcal{N}_i| - b_i - 2$ estimate vectors closest to the estimate of node $j$. Then, we calculate the score of each neighbor:

$$s(j) = \sum_{j \rightarrow k} \left\|x_j^t - x_k^t\right\|^2.$$

We select the estimate with the minimal score as the aggregated estimate. Formally, the aggregation rule DKrum is defined as

$$\mathrm{DKrum}\left(\{x_j^t : j \in \mathcal{N}_i\}\right) = x_{j^*}^t, \qquad j^* = \arg\min_{j \in \mathcal{N}_i} s(j).$$

Dmedian.  To apply the marginal median solution [xie2018generalized] to decentralized systems, we only need to replace the gradients with the received estimates. Specifically, the aggregation rule is defined as

$$\mathrm{Dmedian}\left(\{x_j^t : j \in \mathcal{N}_i\}\right) = \mathrm{MarMed}\left(\{x_j^t : j \in \mathcal{N}_i\}\right), \tag{6}$$

where MarMed is the marginal median function defined in [xie2018generalized]: the $k$-th dimensional value of $\mathrm{MarMed}(\cdot)$ is the median of the $k$-th dimensional elements of all estimates in $\{x_j^t : j \in \mathcal{N}_i\}$.
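Since the marginal median is a coordinate-wise operation, Dmedian reduces to a single NumPy call (a sketch under the definitions above):

import numpy as np

def dmedian(neighbor_estimates):
    """Dmedian aggregation (sketch): the k-th coordinate of the result is
    the median of the k-th coordinates of all received estimates."""
    return np.median(np.asarray(neighbor_estimates), axis=0)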

DBulyan.  At each iteration, node $i$ first recursively uses DKrum to select $\theta$ estimates, i.e., $\mathcal{S}_i^t = \{x_{(1)}^t, \dots, x_{(\theta)}^t\}$, where $\theta \leq |\mathcal{N}_i| - 2b_i$. Specifically, node $i$ uses DKrum to select one estimate from its neighbors and deletes the corresponding node from its neighbors. Then, node $i$ recursively selects the remaining estimates using DKrum. Finally, it adopts a median-based method to aggregate the estimates in $\mathcal{S}_i^t$. Formally, the $k$-th coordinate of the aggregated estimate is calculated as

$$\left[\mathrm{DBulyan}(\cdot)\right]_k = \frac{1}{\beta} \sum_{x \in \mathcal{M}_k} [x]_k, \tag{7}$$

where $\beta = \theta - 2b_i$ and $\mathcal{M}_k \subseteq \mathcal{S}_i^t$ is the subset of size $\beta$ whose $k$-th elements' summed distance to their median is minimal among all subsets of $\mathcal{S}_i^t$ of size $\beta$.

IV Algorithm

IV-A Mozi

As we introduced in Section II, existing Byzantine defenses for centralized systems have certain vulnerabilities or limitations, and these design flaws persist when we extend the solutions to decentralized scenarios. Besides, a decentralized system has a stricter convergence requirement than a centralized one: convergence of the single parameter server enforced by those solutions cannot guarantee the convergence of all benign nodes in a decentralized system. Further, centralized Byzantine-resilient solutions usually assume a fixed number of faulty nodes connected to the parameter server, while in a decentralized system, the number of faulty nodes connected to each benign node varies significantly. As such, it is necessary to design a more robust Byzantine-resilient solution that can defeat an arbitrary number of Byzantine nodes and guarantee the convergence of every benign node in decentralized systems.

We propose Mozi to achieve this goal. Our solution consists of two stages in each training iteration. At the first stage, each benign node performs a distance-based strategy to select a candidate pool of potentially benign nodes from its neighbors. The selection is made by comparing the Euclidean distance between the estimate of each neighbor node and its own estimate. Although the candidate pool might still contain Byzantine nodes after this stage, since distance-based strategies are not strictly Byzantine-resilient, it does reduce the scope for further selection. At the second stage, each benign node performs a performance-based strategy to pick the final nodes from the candidate pool for the estimate update. It reuses its training sample as validation data to test the performance (i.e., loss value) of each estimate. It selects the estimates whose loss values are no larger than that of its own estimate, and takes the average of those estimates as the final updated value. Although such a performance-based strategy incurs a certain computation cost, the small candidate pool significantly reduces this impact compared to previous works. Mozi can be formally described as below:

Definition 4.

(Mozi) Let $x_i^t$ be the $t$-th estimate of node $i$; $\ell(x, \xi_i^t) = F(x; \xi_i^t)$ be the loss of an estimate $x$ on the $t$-th stochastically selected data sample $\xi_i^t$; and $\rho$ be the ratio of benign neighbors of node $i$. The proposed uniform Byzantine-resilient aggregation rule, Mozi, is defined as

$$\mathrm{Mozi}\left(\{x_j^t : j \in \mathcal{N}_i\}\right) = \frac{1}{|\mathcal{N}_i^{\star}|} \sum_{j \in \mathcal{N}_i^{\star}} x_j^t, \tag{8}$$

where

$$\mathcal{N}_i^{r} = \underset{\mathcal{N} \subset \mathcal{N}_i,\; |\mathcal{N}| = \lceil \rho |\mathcal{N}_i| \rceil}{\arg\min} \; \sum_{j \in \mathcal{N}} \left\|x_j^t - x_i^t\right\| \qquad \text{(Stage 1)}$$
$$\mathcal{N}_i^{\star} = \left\{ j \in \mathcal{N}_i^{r} : \ell(x_j^t, \xi_i^t) \leq \ell(x_i^t, \xi_i^t) \right\} \qquad \text{(Stage 2)}$$

Algorithm 1 details the training process of node $i$ using Mozi in a decentralized system. The algorithm begins with the initial estimate $x_i^0$. At the $t$-th iteration, node $i$ broadcasts its estimate to, and receives the estimates from, its neighbors. It stochastically selects a training data sample $\xi_i^t$ and calculates the loss and the gradient (Lines 4-6). Then it conducts the two-stage estimate selection. First, it calculates the Euclidean distances between $x_i^t$ and the estimates from its neighbors, and selects the $\lceil \rho |\mathcal{N}_i| \rceil$ neighbors with the lowest distances (Lines 7-10). Second, for each estimate $x_j^t$ in the candidate pool, node $i$ calculates the loss of $x_j^t$ on $\xi_i^t$ and chooses the estimates that have similar or better performance than that of $x_i^t$ (Lines 11-19). Finally, it calculates the average of the selected estimates and updates the final estimate using the GUF (Lines 20-21).

1:  Input: initial estimate $x_i^0$, learning rate $\gamma$, number of iterations $T$, ratio of benign neighbors $\rho$, weight $\alpha$
2:  for $t = 0, \dots, T-1$ do
3:     Broadcast $x_i^t$ and receive $x_j^t$ from $j \in \mathcal{N}_i$
4:     Stochastically sample $\xi_i^t$ from $S_i$
5:     $\ell_i \leftarrow \ell(x_i^t, \xi_i^t)$
6:     $g_i^t \leftarrow \nabla F(x_i^t; \xi_i^t)$  {Compute the local gradient}
7:     for $j$ in $\mathcal{N}_i$ do
8:        $d_j \leftarrow \|x_j^t - x_i^t\|$
9:     end for
10:     $\mathcal{N}_i^{r} \leftarrow$ the $\lceil \rho |\mathcal{N}_i| \rceil$ neighbors with the lowest $d_j$; $\mathcal{N}_i^{\star} \leftarrow \emptyset$
11:     for $j \in \mathcal{N}_i^{r}$ do
12:        $\ell_j \leftarrow \ell(x_j^t, \xi_i^t)$
13:        if $\ell_j \leq \ell_i$ then
14:           $\mathcal{N}_i^{\star} \leftarrow \mathcal{N}_i^{\star} \cup \{j\}$
15:        end if
16:     end for
17:     if $\mathcal{N}_i^{\star} = \emptyset$ then
18:        $\mathcal{N}_i^{\star} \leftarrow \{\arg\min_{j \in \mathcal{N}_i^{r}} \ell_j\}$
19:     end if
20:     $\bar{x} \leftarrow \frac{1}{|\mathcal{N}_i^{\star}|} \sum_{j \in \mathcal{N}_i^{\star}} x_j^t$
21:     $x_i^{t+1} \leftarrow \alpha x_i^t + (1-\alpha)\bar{x} - \gamma g_i^t$  {Update the local estimate}
22:  end for
23:  return $x_i^T$
Algorithm 1 The training algorithm for each node $i$ in Mozi.
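For concreteness, the two-stage selection of Algorithm 1 can be sketched in NumPy as follows (a simplified illustration; the loss_fn callback that evaluates an estimate on the node's current sample $\xi_i^t$ is an assumed interface, and the empty-set fallback in Lines 17-19 is rendered as picking the lowest-loss candidate):

import numpy as np

def mozi_aggregate(x_i, neighbor_estimates, loss_fn, rho=0.4):
    """Two-stage Mozi aggregation for one node (illustrative sketch)."""
    X = np.asarray(neighbor_estimates)
    # Stage 1: keep the ceil(rho * |N_i|) estimates closest to our own (Lines 7-10).
    n_keep = int(np.ceil(rho * len(X)))
    dists = np.linalg.norm(X - x_i, axis=1)
    candidates = X[np.argsort(dists)[:n_keep]]
    # Stage 2: keep candidates whose loss on the local sample is no worse
    # than our own; fall back to the best candidate if none qualifies (Lines 11-19).
    own_loss = loss_fn(x_i)
    losses = np.array([loss_fn(x) for x in candidates])
    selected = candidates[losses <= own_loss]
    if len(selected) == 0:
        selected = candidates[np.argmin(losses)][None, :]
    return selected.mean(axis=0)  # Line 20

def mozi_update(x_i, neighbor_estimates, grad_i, loss_fn,
                alpha=0.5, lr=0.05, rho=0.4):
    """GUF update (Line 21) with Mozi as the aggregation rule."""
    agg = mozi_aggregate(x_i, neighbor_estimates, loss_fn, rho)
    return alpha * x_i + (1.0 - alpha) * agg - lr * grad_i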

IV-B Complexity Analysis

The training process with Mozi is computationally efficient, as shown below:

Proposition 1.

(Cost of Mozi) The computational complexity of Mozi is $O(|\mathcal{N}_i| d)$ for each node $i$ at each iteration.

Proof.

For node $i$, at each iteration, Mozi aggregates the received estimates with three operations. First, Mozi selects the $\lceil \rho |\mathcal{N}_i| \rceil$ neighbors whose estimates are closest to its current estimate; the cost is $O(|\mathcal{N}_i| d)$. Second, Mozi calculates the loss of the selected estimates on the stochastic sample; the cost is $O(\lceil \rho |\mathcal{N}_i| \rceil d)$. Finally, Mozi takes at most $O(\lceil \rho |\mathcal{N}_i| \rceil d)$ to average the estimates with better performance. Since $\lceil \rho |\mathcal{N}_i| \rceil \leq |\mathcal{N}_i|$, the overall computational complexity of Mozi is $O(|\mathcal{N}_i| d)$. ∎

Compared to existing aggregation rules, the complexity of the average aggregation, median-based aggregation and BRIDGE is also $O(|\mathcal{N}_i| d)$, while Krum and Bulyan have a complexity of $O(|\mathcal{N}_i|^2 d)$. So we conclude that Mozi maintains the same efficiency as the former solutions and performs much better than the latter, especially when the number of connected neighbor nodes becomes large.

IV-C Convergence Analysis

We adopt the following assumptions for our convergence analysis, which are also commonly used in [nedic2009distributed, lian2017can, xie2019zeno].

Assumption 2.

We assume that all worker nodes are initialized with the same model parameter $x^0$.

Since Byzantine nodes are trying to compromise the training procedure, we have the following assumption.

Assumption 3.

(Monotonic Performance) We assume that the performance of benign estimates is better than that of Byzantine estimates on the predefined training distribution at each iteration. Specifically, let $x^t$ and $\tilde{x}^t$ be a benign and a Byzantine estimate at the $t$-th iteration, respectively. Then,

$$\mathbb{E}_{\xi \sim \mathcal{D}}\left[F(x^t; \xi)\right] \leq \mathbb{E}_{\xi \sim \mathcal{D}}\left[F(\tilde{x}^t; \xi)\right]. \tag{9}$$

To ensure the convergence of decentralized systems without Byzantine nodes, we also assume the bounded variance of estimates and gradients.

Assumption 4.

(Bounded Variance) We assume that the differences between the gradients, and between the estimates, generated by benign nodes are uniformly bounded. Specifically, there exist scalars $\sigma_1$ and $\sigma_2$ such that, for $\forall i, j \in V^b$,

$$\mathbb{E}\left\|g_i^t - g_j^t\right\| \leq \sigma_1, \tag{10}$$
$$\mathbb{E}\left\|x_i^t - x_j^t\right\| \leq \sigma_2. \tag{11}$$

We further assume that the loss function is convex and Lipschitz continuous. With Assumptions 1-4 and this assumption on the loss function, we state the convergence of the benign induced subgraph, which has been proven in [nedic2009distributed, lian2017can].

Lemma 2.

(Convergence of Benign Induced Subgraph) We assume that the sub-decentralized system, $G^b$, uniformly converges. Specifically, let $x^*$ be the optimal model. For $\forall i \in V^b$, the estimate and loss differences with respect to the optimal ones are uniformly bounded by predefined constants $\Delta_1$ and $\Delta_2$, i.e.,

$$\mathbb{E}\left\|x_i^t - x^*\right\| \leq \Delta_1, \qquad \mathbb{E}\left[F(x_i^t; \xi)\right] - \mathbb{E}\left[F(x^*; \xi)\right] \leq \Delta_2.$$

Let $G$ be a decentralized system with Byzantine nodes that follows the Byzantine-resilient strategy in Algorithm 1 to train a model. Under the given assumptions, we show that the differences among the estimates of all benign nodes are uniformly bounded.

Lemma 3.

(Consistency of Benign Nodes) Consider the above decentralized system with Assumptions 1-4. We claim that for $\forall i, j \in V^b$, there exists a constant $\Delta_3$ such that

$$\mathbb{E}\left\|x_i^t - x_j^t\right\| \leq \Delta_3. \tag{12}$$

Proof.

See Appendix B. ∎

Then, we prove that the proposed Mozi satisfies uniform Byzantine-resilience.

Theorem 1.

(Convergence of Mozi) Consider the above decentralized system with Assumptions 1-4. Let $x^*$ be the optimal estimate. For any benign node $i \in V^b$, we have

$$\mathbb{E}\left\|x_i^t - x^*\right\| \leq \Delta_4, \tag{13}$$

where $\Delta_4$ is a predefined constant.

Proof.

See Appendix C. ∎

We also show that the differences between the losses of each benign node and the optimal loss are uniformly bounded.

Corollary 1.

(Convergence of Loss) Consider the above decentralized system with Assumptions 1-4. Let $x^*$ be the optimal estimate. For any benign node $i \in V^b$, we have

$$\mathbb{E}\left[F(x_i^t; \xi)\right] - \mathbb{E}\left[F(x^*; \xi)\right] \leq \Delta_5, \tag{14}$$

where $\Delta_5$ is a predefined constant.

Proof.

See Appendix D. ∎

Note that Theorem 1 and Corollary 1 place no restriction on the ratio of Byzantine nodes connected to each benign node. Therefore, the proposed Mozi is Byzantine-tolerant against an arbitrary number of faulty nodes, as long as the connectivity of the benign induced subgraph (Assumption 1) is satisfied.

V Experiments

Fig. 2: The worst accuracy of the benign nodes with different network sizes (panels: 30, 50, and 70 nodes).
Fig. 3: The worst accuracy of the benign nodes with different connection ratios (panels: 0.2, 0.4, and 0.6).
Fig. 4: The worst accuracy of the benign nodes under the Gaussian attack (panels: Byzantine ratios 0.1, 0.3, and 0.5).
Fig. 5: The worst accuracy of the benign nodes under the bit-flip attack (panels: Byzantine ratios 0.1, 0.3, and 0.5).
Fig. 6: The worst accuracy of the benign nodes under the Mhamdi attack (panels: Byzantine ratios 0.1, 0.3, and 0.5).

V-A Experimental Setup and Configurations

Datasets.  We evaluate our defense solution with a DL-based image classification task. Specifically, we train a Convolutional Neural Network (CNN) over the CIFAR10 dataset [krizhevsky2009learning]. This CNN includes two max-pooling layers and three fully connected layers [mhamdi2018hidden]. We adopt a batch size of 256 and a decaying learning rate with an initial value of 0.05. We also conduct the same evaluations on the simpler MNIST dataset; the experimental results lead to the same conclusion and can be found in Appendix E.

Implementation of decentralized systems.  The network topology of the decentralized system in our consideration is defined by a connection ratio between the nodes, i.e., the probability that a node is connected to another node. To ensure the connectivity assumption, we first generate a decentralized network in which all the nodes strictly follow the learning procedure, and then randomly add Byzantine nodes to the network. To simulate various adversarial environments, we adopt a new parameter, the Byzantine ratio, defined as the number of Byzantine nodes divided by the total number of nodes in the network. We assume that the ratio of Byzantine neighbors of each benign node is lower than $1 - \rho$; without loss of generality, we set $\rho$ as 0.4 for all benign nodes. We train the deep learning model in a synchronous mode and simulate the operations of the decentralized system by running the nodes serially at each iteration. All our experiments are conducted on a server equipped with Xeon Silver 4214 @2.20 GHz CPUs and an NVIDIA Tesla P40 GPU.
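As an illustration, one way to realize this setup with networkx (a sketch; the regenerate-until-connected loop is our interpretation of how the connectivity assumption is enforced):

import random
import networkx as nx

def build_topology(n_benign, n_byzantine, connection_ratio):
    """Generate a decentralized topology (illustrative sketch)."""
    # Resample until the benign subgraph is connected (Assumption 1).
    while True:
        g = nx.erdos_renyi_graph(n_benign, connection_ratio)
        if nx.is_connected(g):
            break
    # Attach Byzantine nodes with the same connection probability.
    for b in range(n_benign, n_benign + n_byzantine):
        g.add_node(b)
        for v in range(n_benign):
            if random.random() < connection_ratio:
                g.add_edge(b, v)
    return g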

Implementation of Byzantine-resilient defenses.  In addition to our proposed Mozi, we also implement four other defenses for comparison: three extended from existing popular defenses for centralized systems, i.e., DKrum, Dmedian and DBulyan, and one designed specifically for decentralized systems, i.e., BRIDGE [yang2019bridge]. For all these defenses, we set $\alpha$ to be 0.5 in the GUF. As the baseline, we consider the same decentralized system configuration without Byzantine nodes, using the average aggregation rule (Equation 2); the model trained in this setting can be regarded as the optimal one. For each defense deployed in the decentralized system, we measure the testing accuracy of the trained model on each benign node, and report the worst accuracy among all nodes to represent the effectiveness of this defense.

V-B Convergence

For an aggregation rule in a decentralized system, the essential functionality is to achieve uniform convergence, i.e., the model on each benign node must converge to the correct one. We evaluate the convergence of Mozi under different configurations.

Network size.  We first evaluate the convergence of our solution under different network sizes. It is more difficult to achieve uniform convergence when there are more nodes. In our experiments, we consider a decentralized system with 30, 40 and 50 nodes respectively [mhamdi2018hidden]. The connection ratio is set as 0.4. Figure 2 shows the worst accuracy during the training phase on CIFAR10.

We can observe that only DKrum, DBulyan and our proposed Mozi achieve the same convergence as the baseline; Mozi converges to a slightly better model at a higher speed. In contrast, BRIDGE and Dmedian exhibit poor convergence, especially as the network size grows.

Network connection ratio.  This factor can also affect model convergence: it takes more effort and time for all nodes to reach consensus when the network is more densely connected. We evaluate such impact with different connection ratios (0.2, 0.4 and 0.6), while fixing the number of nodes at 30.

Figure 3 illustrates that most defense solutions in our consideration have satisfactory convergence performance when the connection ratio is small (0.2 and 0.4). Our proposed Mozi has better convergence performance when the connection ratio is 0.6, while BRIDGE and Dmedian cannot produce correct models at this high connectivity.

V-C Byzantine Fault Tolerance

We evaluate the performance of different defense strategies under various Byzantine attacks. We set the connection ratio of the evaluated system as 0.4 and the number of benign nodes as 30. We consider different Byzantine ratios (0.1, 0.3 and 0.5).

Gaussian attack.  We first use a simple attack to test Byzantine resilience: in each iteration, the adversarial nodes broadcast to their neighbors random estimate vectors drawn from a Gaussian distribution. We refer to this attack as the Gaussian attack.

Figure 4 shows the model training performance under the Gaussian attack. The advantage of Mozi over other strategies is obvious. Dmedian, DBulyan and BRIDGE do not uniformly converge in all systems. DKrum fails to converge at the Byzantine ratio of 0.5. Only Mozi can generate the correct model regardless of the Byzantine ratio.

Bit-flip attack.  We also implement a bit-flip attack [xie2018generalized] to evaluate these defenses, where the adversarial nodes flip the signs of their floating-point estimates and broadcast them to their neighbors.
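For reference, minimal sketches of these two simple attacks (the Gaussian mean and standard deviation here are illustrative choices; the paper does not fix them in this section):

import numpy as np

def gaussian_attack(x_shape, std=1.0):
    """Broadcast a random estimate vector drawn from a Gaussian distribution."""
    return np.random.normal(0.0, std, size=x_shape)

def bit_flip_attack(x):
    """Flip the sign of every coordinate of the estimate before broadcasting."""
    return -x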

Figure 5 shows the convergence results: the advantage of Mozi over the other strategies is even more obvious. Mozi matches the baseline regardless of the Byzantine ratio, indicating that it fully resists the bit-flip attack. In contrast, DBulyan and BRIDGE do not converge uniformly when the Byzantine ratio is high (0.3 and 0.5), and DKrum and Dmedian also fail to defeat the bit-flip attack when the Byzantine ratio is 0.5.

Sophisticated attack.  To fully evaluate the BFT of our proposed approach, we adopt a more sophisticated attack, the Mhamdi attack [mhamdi2018hidden]: the adversary collects all the uploaded estimates from the neighbor nodes, then carefully designs its own estimate to be indistinguishable from the benign ones while still compromising the training process. The Mhamdi attack has been shown to be effective against most defenses in centralized systems [mhamdi2018hidden].

The results are shown in Figure 6. We observe that Mozi always succeeds across the different Byzantine ratios. In contrast, the other solutions fail to defeat the Mhamdi attack in some cases: all of them fail to converge when the Byzantine ratio is 0.5, and Dmedian and BRIDGE cannot converge even when the Byzantine ratio is 0.1.

V-D Computation Cost

To evaluate the computation cost of Mozi, we measure the average training and aggregation time for one iteration on CIFAR10. We consider a decentralized system with 30 nodes and a connection ratio of 0.4. Figure 7 shows the average time of the different defense solutions. We can see that the training time for these solutions is nearly identical, but the aggregation time differs significantly. Mozi finishes one iteration in a much shorter time than DKrum (8X faster) and DBulyan (30X faster). The reason is that while Mozi only calculates the distances between a node and its neighbors, DKrum and DBulyan have to compare the distances among all pairs of neighbors. Mozi is slightly slower than Dmedian and BRIDGE; but considering the poor convergence of Dmedian and BRIDGE under Byzantine attacks demonstrated in Section V-C, we conclude that Mozi offers the strongest Byzantine resilience with acceptable computation overhead.

Fig. 7: The average training and aggregation time of one iteration for different aggregation rules.

VI Conclusion

In this paper, we explore Byzantine Fault Tolerance in decentralized learning systems. We formally define the Byzantine Generals Problem in this setting and demonstrate that a decentralized system is highly vulnerable to Byzantine attacks. We show that existing Byzantine-resilient solutions for centralized systems cannot be used to protect decentralized systems due to their security flaws and inefficiency. We then propose a uniform Byzantine-resilient approach, Mozi, to defeat Byzantine attacks in decentralized learning. We conduct theoretical analysis to prove that Mozi has strong convergence guarantees. Experimental results reveal that Mozi can resist both simple and sophisticated Byzantine attacks with low computation overhead under different system configurations and training tasks.

References