SPDL: Blockchain-secured and Privacy-preserving Decentralized Learning

Decentralized learning involves training machine learning models over remote mobile devices, edge servers, or cloud servers while keeping data localized. Many studies have shown how to preserve privacy, enhance training performance, or introduce Byzantine resilience, but none of them considers all three simultaneously. We therefore face the following problem: how can we efficiently coordinate the decentralized learning process while simultaneously maintaining learning security and data privacy? To address this issue, in this paper we propose SPDL, a blockchain-secured and privacy-preserving decentralized learning scheme. SPDL seamlessly integrates blockchain, a Byzantine Fault-Tolerant (BFT) consensus protocol, a BFT Gradient Aggregation Rule (GAR), and differential privacy into one system, ensuring efficient machine learning while providing data privacy, Byzantine fault tolerance, transparency, and traceability. To validate our scheme, we provide rigorous convergence and regret analyses in the presence of Byzantine nodes. We also build an SPDL prototype and conduct extensive experiments demonstrating that SPDL is effective and efficient with strong security and privacy guarantees.


1 Introduction

With the increasing amount of data and the growing complexity of machine learning models, there is a pressing demand for utilizing the computational hardware and storage owned by various entities in a distributed network. State-of-the-art distributed machine learning schemes adopt the three major network topologies shown in Fig. 1. Federated learning [bonawitz2017practical] utilizes the centralized network illustrated in Fig. 1(a), where a parameter server aggregates the gradients computed by distributed devices and updates the global model for them; privacy is preserved because devices compute locally without communicating with each other. The fragility of this centralized topology lies in the parameter server being a single point of failure (the server might crash or be Byzantine). To address this issue, El-Mhamdi et al. [el2020genuinely] proposed the Byzantine-resilient learning network shown in Fig. 1(b), which replaces the centralized server with a server group in which no more than f of the servers can be Byzantine.

Fig. 1: Three major network topologies adopted in distributed learning

In this paper, we go a step further to break the barriers between the parameter servers and the computational devices, and allow all nodes to train models in a decentralized network, as shown in Fig. 1(c). Such a decentralized network is frequently adopted in ad hoc networks, edge computing, the Internet of Things (IoT), decentralized applications (DApps), etc. It can greatly unleash the potential for building large-scale (even worldwide) machine learning models that fully utilize the available computational resources [lian2017can]. Besides, devices such as mobile phones, IoT sensors, and vehicles are generating large amounts of data nowadays. A decentralized network filled with real-time big data can lead to tremendous improvements in large-scale applications, e.g., illness detection, outbreak discovery, and disaster warning, which involve a large number of decentralized edge and cloud servers from different regions and/or countries. However, this poses a new critical challenge: how can we efficiently coordinate the decentralized learning process while simultaneously maintaining learning security and data privacy?

Specifically, decentralized learning confronts the following challenges. 1) Without a fully trusted centralized custodian, users lack the trust, and hence the incentive, to participate in the learning process; as a result, the volume of data may well be insufficient to train a reliable model. 2) A Byzantine node, which can behave arbitrarily (e.g., crash or launch attacks), might prevent the model from converging or interrupt the training process. 3) It is challenging to trade off privacy and security against efficiency, and to make data sharing frictionless while guaranteeing privacy and transparency.

To overcome the above challenges, we propose SPDL, a decentralized learning framework that simultaneously ensures strong security using blockchain (as an immutable distributed ledger), BFT consensus, and a BFT GAR, and preserves privacy through local gradient computation and differential privacy (DP). Blockchain, as a key component, maintains a complete, immutable, and traceable record of the machine learning process, covering user registration, per-round gradients, and model parameters. With blockchain, a user can identify illegal and Byzantine peers, update its local model without concern, and ultimately trust the SPDL scheme. The BFT consensus algorithm and the BFT GAR are embedded into the blockchain: the BFT consensus algorithm ensures consistency of model transitions across rounds, while the BFT GAR offers an effective method for detecting and filtering Byzantine gradients in each round. Concerning privacy, nodes compute gradients on their local training data and share only perturbed gradients with peers, which provides strong privacy protection.

Our contributions are summarized as follows:

  1. To the best of our knowledge, this is the first secure and privacy-preserving machine learning scheme for decentralized networks in which the learning process is free from trusted parameter servers.

  2. SPDL makes use of DP for data privacy protection and seamlessly embeds BFT consensus and a BFT GAR into a blockchain system, endowing model training with Byzantine fault tolerance, transparency, and traceability while retaining high efficiency.

  3. We conduct rigorous convergence and regret analyses on SPDL in the presence of Byzantine nodes, build a prototype, and carry out extensive experiments to demonstrate the feasibility and effectiveness of SPDL.

This paper is organized as follows. Section 3 outlines the preliminary knowledge needed for the development of SPDL. Section 4 details the SPDL protocol. The convergence and regret analyses are presented in Section 5. Evaluation results are reported in Section 6. We summarize the most related work in Section 2 and conclude the paper in Section 7.

2 Related Work

2.1 Privacy and Byzantine Resilience in Distributed Learning

Private learning schemes include secure multiparty computation, homomorphic encryption, differential privacy, and secure aggregation models. For details, we refer interested readers to two comprehensive surveys [liu2021machine, yang2019federated]. Su and Vaidya [su2016fault] introduced the distributed optimization problem in the presence of Byzantine failures. The problem is formulated as one in which each node holds a local cost function and the goal is to optimize a global cost function. The proposed method, the synchronous Byzantine gradient method (SBG), first trims the largest and the smallest gradients, then computes the average of the minimum and the maximum of the remaining values. This approach sheds light on providing Byzantine resilience for distributed learning. Following this idea, many Byzantine-resilient aggregation rules have been proposed, all working towards a common goal: trimming Byzantine values more precisely and efficiently. Blanchard et al. [blanchard2017machine] were among the earliest to tackle the Byzantine resilience problem in distributed learning with the Krum algorithm, which guarantees convergence despite Byzantine workers (in a server-worker architecture). Krum inspired many subsequent BFT aggregation rules, including Median [yin2018byzantine], Bulyan [guerraoui2018hidden], and MDA [el2020genuinely]. A recent work [guerraoui2021differential] demonstrates that these aggregation rules can function together with the DP technique under proper assumptions.

2.2 Blockchain-Enhanced Distributed Learning

The BinDaaS [bhattacharya2019bindaas] framework provides a blockchain-based deep learning service, ensuring data privacy and confidentiality in sharing Electronic Health Records (EHRs). With BinDaaS, a patient can mine a block filled with a private health record, which can be accessed by legitimate doctors. BinDaaS trains each model on a patient's private EHRs independently of others' data. In contrast, decentralized learning intends to train a global model using the data owned by individual nodes, raising more security and privacy concerns. Hu et al. [9615370] utilized a blockchain and a game-theoretic approach to protect user privacy in federated learning for mobile crowdsensing. The collective extortion (CE) strategy was proposed in [hu2021nothing] as an incentive mechanism that regulates workers' behavior. These two game-based approaches cannot strictly guarantee the safety of model training against Byzantine nodes. Lu et al. [lu2019blockchain] developed a learning scheme that protects privacy via differential privacy and proposed the Proof of Training Quality (PoQ) consensus algorithm for model convergence. This scheme does not consider Byzantine users who might disturb the training process by proposing erroneous model parameters.

FL-Block [FLblock] allows end devices to train a global model secured by a PoW-based blockchain. LearningChain [learning2018chain] is a differential-privacy-based scheme that protects each party's data privacy, also resting on a PoW-based blockchain. Warnat-Herresthal et al. [warnat2021swarm] proposed the concept of swarm learning, which is analogous to decentralized learning and in which each node can join the learning process managed by Ethereum (again, PoW-based) and smart contracts. Compared with a pure adoption of centralized federated learning schemes, PoW-based blockchains help avoid the single point of failure and mitigate poisoning attacks caused by a central parameter server. With PoW, a miner can propose a block containing the model parameters used for a global model update. However, this method cannot rigorously prevent malicious miners from harming the training process by proposing wrong updates: the PoW consensus itself is not sufficient to judge whether given parameters are Byzantine or not. Therefore, PoW-based blockchains are vulnerable to poisoning attacks launched by Byzantine nodes. Besides, PoW-based blockchains are hard to scale since PoW incurs substantial overhead and heavy consumption of computational resources. Another noteworthy system is Biscotti [shayan2020biscotti], which utilizes a commitment scheme to protect data privacy and adopts a novel Proof-of-Federation-based blockchain to guarantee security in decentralized learning. With the commitment scheme, the aggregation of parameters is verifiable and tamper-proof.

In this paper, we leverage the DP technique for privacy protection since adding noise (which only requires an addition with pre-calculated noise) is more efficient than employing complicated cryptographic tools. In addition, using DP and a BFT consensus, we ensure Byzantine fault tolerance for the whole training process, which existing works cannot achieve. Our unique contributions lie in two aspects: 1) providing security and privacy guarantees for the complete training process by seamlessly integrating DP, BFT aggregation, and blockchain in one system in the presence of Byzantine nodes; and 2) offering rigorous analyses of Byzantine fault tolerance, convergence, and regret for decentralized learning.

3 Preliminaries

3.1 Decentralized Learning

With the proliferation of machine learning tasks that need to process massive data, distributed learning was proposed to coordinate a large number of devices in a training process so as to achieve rapid convergence. In recent years, research on distributed learning has mainly been carried out along two directions, centralized learning and decentralized learning, whose schematic diagrams are shown in Fig. 1. A centralized topology uses a parameter server (PS) to coordinate all workers by gathering gradients, performing updates, and broadcasting the new parameters, whereas in a decentralized topology all nodes are considered equal and exchange information without the intervention of a PS.

Decentralized learning has many advantages over centralized learning. In [lian2017can], Lian et al. rigorously proved that a decentralized algorithm has lower communication complexity and the same convergence rate as its counterpart under a centralized parameter-server model. Furthermore, a centralized topology might not hold in a decentralized network where no node can be trusted enough to act as a parameter server. Therefore, we consider the following decentralized optimization problem during a learning process:

$$\min_{x} \; F(x) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\zeta_i \sim D_i} \big[ \ell(x, \zeta_i) \big],$$

where n is the network size, D_i is the local data distribution for node i, and ℓ(x, ζ) denotes the loss function given model parameter x and data sample ζ. Let g_i^t denote the stochastic gradient computed by node i in round t and g̃_i^t its perturbed version. In a fully connected graph, during any synchronous round each node performs a deterministic aggregation function Agg (e.g., the average function) on the perturbed gradients received from all peers to update its local parameter, i.e.,

$$x_i^{t+1} = x_i^{t} - \gamma \cdot \mathrm{Agg}\big( \tilde{g}_1^{t}, \ldots, \tilde{g}_n^{t} \big),$$

where γ is the learning rate. Notice that since Agg is deterministic, all nodes hold exactly the same parameter provided they initialize it with the same value. In this case, we denote by A an arbitrary synchronous distributed learning algorithm that updates the mutual parameter x, and use x^t to denote the output parameter at round t. In the rest of this paper we omit the subscript i when all peers hold the same local parameter.
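To make the synchronous update concrete, the following minimal Python sketch runs this scheme on a toy least-squares problem; the function names, the plain-averaging aggregation, and the toy loss are illustrative assumptions rather than the paper's implementation (SPDL later replaces the average with a BFT GAR).

import numpy as np

def local_gradient(x, data):
    # Stochastic gradient of a toy least-squares loss 0.5 * ||A x - b||^2
    A, b = data
    return A.T @ (A @ x - b)

def synchronous_round(x, local_data, gamma=0.02, sigma=0.01, rng=None):
    # One synchronous round: every node computes a local gradient, perturbs it
    # with Gaussian noise, shares it, and all nodes apply the same deterministic
    # aggregation (plain averaging here), so local parameters stay identical.
    rng = rng or np.random.default_rng(0)
    perturbed = [local_gradient(x, d) + rng.normal(0.0, sigma, x.shape)
                 for d in local_data]
    return x - gamma * np.mean(perturbed, axis=0)

# Toy usage: 4 nodes, each holding a private (A_i, b_i) data shard
rng = np.random.default_rng(1)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
x = np.zeros(3)
for _ in range(50):
    x = synchronous_round(x, shards)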

3.2 Blockchain Basics

Blockchain, as a distributed ledger, refers to a chain of blocks linked by hashes and spread over all the nodes in a peer-to-peer network, namely the blockchain network. A full node stores the full blockchain in its local database. A blockchain starts from a genesis block, and each block except the genesis block is chained to the previous block by referencing its hash. Typically, there are two categories of blockchain systems based on scale and openness: permissioned and permissionless. In this paper, we adopt a permissioned blockchain since it provides higher speed and stricter registration control than permissionless ones.

Blockchain consists of three major components: the blockchain network, the distributed ledger, and the consensus algorithm. A blockchain system organizes registered nodes into a P2P network, forming a complete graph. A distributed ledger is immutable and can be organized as a chain, a Directed Acyclic Graph (DAG), or a mesh. In this paper, we use a chain as the data structure of our ledger. As the core of a blockchain system, the consensus process determines how to append a new block to the chain. Two types of consensus algorithms are commonly adopted: proof-of-resources and message passing. Proof-of-resources means that nodes compete for proposing blocks by demonstrating their utilization of resources, e.g., computational power, stake, storage, memory, or specific trusted hardware. Message-passing-based consensus, on the other hand, has been widely researched in the area of distributed computing. Such algorithms always come with clear assumptions on nodes' faulty behaviors, such as fail-stop faults and Byzantine attacks. In this paper, we leverage a Byzantine fault-tolerant (BFT) consensus algorithm, which can tolerate Byzantine nodes that launch arbitrary attacks.

3.3 Gradient Aggregation Rule (GAR)

A Gradient Aggregation Rule (GAR) is used to aggregate the gradients received from peers in each round. A traditional GAR averages gradients to reduce errors. To cope with gradients generated by Byzantine nodes, a GAR can be designed more elaborately to inject robustness. For example, Krum and Multi-Krum are the pioneering GARs that satisfy Byzantine resilience [blanchard2017machine]. The essence of these two GARs is to choose the gradient that is closest to its n − f − 2 nearest neighbors (n denotes the network size and f the number of Byzantine nodes), based on the assumption that the honest majority should have similar gradients. Median [yin2018byzantine] and MDA [el2020genuinely] are two other GARs that adopt analogous ideas to ensure BFT gradient aggregation. In this paper, SPDL leverages Krum as the BFT GAR but is not limited to it.
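As a point of contrast between a traditional averaging GAR and a Byzantine-resilient one, the short sketch below implements the coordinate-wise Median rule mentioned above next to plain averaging; the Krum rule used by SPDL is sketched in Section 4.3.4 where its selection score appears. Function names are illustrative.

import numpy as np

def average_gar(gradients):
    # Traditional GAR: plain averaging (not Byzantine resilient).
    return np.mean(gradients, axis=0)

def median_gar(gradients):
    # Coordinate-wise median: each output coordinate is the median of that
    # coordinate across all received gradients, which bounds the influence
    # of a minority of Byzantine gradients.
    return np.median(gradients, axis=0)

# Example: 4 honest gradients near [1, 1] and one Byzantine outlier
grads = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0], [100.0, -100.0]])
print(average_gar(grads))  # dragged far away by the outlier
print(median_gar(grads))   # stays close to the honest gradients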

3.4 Differential Privacy

The privacy guarantee is of vital importance if the nodes carrying out decentralized learning do not allow their local training data to be shared. Although each node communicates with its neighbors by transmitting parameters instead of raw data, the risk of leaking information still exists [wang2019beyond]. Differential privacy is an effective method to avoid leaking information about any single individual by perturbing the responses to query operations, no matter what auxiliary information an adversarial node might have. In decentralized learning, the process of exchanging parameters involves a sequence of such queries. Differential privacy in our setup can be formally defined as follows:

Definition 1.

((ε, δ)-differential privacy) Denote by A an arbitrary synchronous distributed learning algorithm that updates the model parameter x. For any node i in a decentralized system and any two possible local datasets D_i and D_i' with D_i' differing from D_i by at most one record, if for any round t and any set of outputs S it holds that

$$\Pr\big[\mathcal{A}^{t}(D_i) \in S\big] \;\le\; e^{\varepsilon_i} \Pr\big[\mathcal{A}^{t}(D_i') \in S\big] + \delta_i,$$

we then claim that A preserves (ε, δ)-differential privacy, and pack all the per-node budgets ε_i (resp. δ_i) into ε (resp. δ).

Definition 2.

For any function q, the L2-sensitivity of q is defined as

$$\Delta_2 q = \max_{D, D'} \lVert q(D) - q(D') \rVert_2$$

for all D, D' differing in at most one element.

Differential privacy can be realized by adding Gaussian noise to the query results [Jiang2021TKDE]. The following lemma, proved in [yu2021decentralized], shows how to properly choose the Gaussian noise.

Lemma 1.

For each node transmitting gradients perturbed by Gaussian noise drawn from N(0, σ²), the gradient exchanges within T successive rounds preserve (ε, δ)-differential privacy as long as σ is at least a threshold determined by ε, δ, T, and the learning rate γ; the exact bound is derived in [yu2021decentralized].

In brief, the differentially private scheme we employ in this paper is sketched as follows. For any node i, we use random Gaussian noise to perturb its gradients before they are transmitted to other nodes. When nodes obtain the result of the aggregation function whose inputs are these perturbed gradients, differential privacy for node i is guaranteed.
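A minimal sketch of this perturbation step is given below; the noise scale sigma is taken as an input and must be calibrated to the (ε, δ) budget, the number of rounds, and the learning rate as stated in Lemma 1 (the calibration formula itself is not reproduced here).

import numpy as np

def perturb_gradient(grad, sigma, rng=None):
    # Gaussian mechanism applied to a gradient before it is shared with peers.
    # sigma must be calibrated to the (epsilon, delta) budget, the number of
    # rounds, and the learning rate (Lemma 1); here it is simply an input.
    rng = rng or np.random.default_rng()
    return grad + rng.normal(loc=0.0, scale=sigma, size=grad.shape)

g = np.array([0.2, -0.5, 1.3])
g_shared = perturb_gradient(g, sigma=0.05)   # only g_shared ever leaves the node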

4 The Protocol

4.1 Model and Assumptions

We focus on scenarios where nodes are able to communicate with each other in a decentralized network. Specifically, we formalize the decentralized communication topology as a directed and fully connected graph G = (V, E), where V denotes the set of all peers and for any i, j ∈ V we have (i, j) ∈ E. Time is divided into epochs (denoted by e), each consisting of synchronous rounds (denoted by t), and a model can be trained within each epoch. We denote the frequently used notions of transaction, block, blockchain, and chain of block headers by tx, B, BC, and HC, respectively.

We assume the network is unreliable, with at most f possible Byzantine nodes among the n nodes, where f is bounded as required by the BFT consensus and the BFT GAR. A Byzantine node can behave arbitrarily. For example, it may refuse to compute gradients, transmit arbitrary or unlawful gradients to mislead correct nodes, or cast incorrect votes during the consensus process. When a Byzantine node chooses not to send any data in a synchronous round, its neighbors act as if a default (e.g., zero) gradient had been received.

Fig. 2: SPDL workflow (the t-th round)

4.2 Design Objectives

In this subsection, we briefly summarize our design goals.

  1. Decentralization: SPDL should work in a decentralized network setting without the intervention of any centralized party such as a parameter server.

  2. Differential Privacy: SPDL should guarantee (ε, δ)-DP by adding random Gaussian noise to perturb gradients. Meanwhile, we aim to strike a balance between privacy leakage and convergence rate.

  3. Byzantine Fault-Tolerance: SPDL can ensure convergence against at most f Byzantine nodes, which can behave arbitrarily.

  4. Immutability, Transparency, and Traceability: the full record of the machine learning process should be immutable and transparent, and provide traceability enabling Byzantine node detection.

Symbol: Description
g_i^t: the gradient computed by node i in round t
ξ_i^t: the random noise added to node i's gradient in round t
g̃_i^t: the perturbed gradient transmitted by node i in round t
γ: the learning rate
x^t: the model parameters at the end of round t
ζ_i: the data randomly sampled from node i's local dataset D_i
n, f: the number of nodes and the number of Byzantine nodes
σ²: the variance of the Gaussian noise
τ: the upper bound of the tolerated gradient deviation
ĝ: the output of the BFT GAR
B_t^e, BC_t^e: the block and the blockchain in the t-th round of the e-th epoch
ε, δ: the budgets of differential privacy
d: the dimension of gradients
N_i: the neighbors of node i
TABLE I: Summary of Notations

4.3 Protocol Details

4.3.1 Initialization

Initially, each node creates a pair of private key sk and public key pk, and generates its unique 256-bit identity id based on pk. Public keys and identities are broadcast through the network so that they are publicly known to all nodes. Then a genesis block is created, which records the information of the nodes that initially participate in the blockchain network. A newly joining node should have a related transaction tx added to the blockchain before joining the permissioned network, where tx contains necessary information including its pk, id, IP address, etc. Initially, all nodes hold the same positive reputation value R. Each node initializes itself with a learning rate γ, the number of total rounds T, and the variance σ² of the Gaussian noise used to perturb the gradients. For simplicity and consistency of the iterative process, we assume all nodes start the learning procedure with the same initial parameter value x^0. After initialization, nodes undergo a leader election process to determine who is in charge of the training process.
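The key and identity generation can be sketched as follows using the third-party Python ecdsa package; deriving the 256-bit identity as the SHA-256 hash of the serialized public key is our assumption, since the text only states that the identity is based on the public key.

import hashlib
import ecdsa  # third-party package: pip install ecdsa

def bootstrap_identity():
    # Generate an ECDSA key pair and derive a 256-bit node identity.
    # Hashing the serialized public key with SHA-256 is an illustrative choice.
    sk = ecdsa.SigningKey.generate(curve=ecdsa.SECP256k1)   # private key
    vk = sk.get_verifying_key()                             # public key
    node_id = hashlib.sha256(vk.to_string()).hexdigest()    # 256-bit identity
    return sk, vk, node_id

sk, vk, node_id = bootstrap_identity()
print(node_id)  # broadcast pk and id so peers can verify signed messages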

4.3.2 Leader Election

1  (h_i, π_i) = VRF(sk_i, seed)
2  broadcast (h_i, π_i) to the network and receive (h_j, π_j) from peers
3  while TRUE do
4      if received (h_j, π_j) from node j && VerifyVRF(pk_j, h_j, π_j) = 1 && R_j > 0 then
5          add (j, h_j) to the candidate set C
6      if |C| ≥ n − 1 or Time() > timeout then
7          select the largest h_j from C and obtain the corresponding node j
8          Output: the leader j
Algorithm 1 Leader Election

Each node executes the leader election algorithm shown in Algorithm 1. The algorithm is based on a verifiable random function (VRF), which takes as input a private key and a random seed, and outputs a hash string h as well as the corresponding proof π. Each contender broadcasts (h, π) to the network and receives those of its peers. Note that each node is assigned a reputation variable R; a node with R = 0 is prohibited from being a leader. The node with the largest h and a nonzero reputation is then recognized as the leader responsible for the blockchain consensus. The probability that more than one leader is selected is extremely small since h has a space of 2^256 possible values if we adopt the commonly used SHA-256; if this extreme case nevertheless happens, all nodes relaunch the leader election process to ensure that only one leader is finally selected. We also set a timeout for the leader election process, measured with Time(), which extracts the current UNIX time. To summarize, our leader election algorithm achieves the following three basic functionalities (a simplified sketch follows the list below):

  • A node with a zero reputation value has no right of being a leader.

  • The leader election process possesses full randomness and unpredictability properties.

  • A Byzantine node cannot disguise itself as a leader since VRF ensures that the proof is unforgeable.
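The following simplified Python sketch illustrates the selection logic of Algorithm 1. For brevity it replaces the VRF with an HMAC-based stand-in computed in one place, whereas in SPDL each node evaluates the VRF with its own private key and peers verify the accompanying proof; the reputation gating and the largest-output rule follow the description above.

import hashlib
import hmac

def pseudo_vrf(secret_key: bytes, seed: bytes) -> int:
    # Stand-in for a VRF: returns a deterministic 256-bit output.
    # A real VRF also returns a proof that peers verify with the public key;
    # HMAC-SHA256 is used here only to illustrate the selection logic.
    out = hmac.new(secret_key, seed, hashlib.sha256).digest()
    return int.from_bytes(out, "big")

def elect_leader(candidates, seed: bytes):
    # candidates: dict node_id -> (secret_key, reputation).
    # Nodes with zero reputation are ineligible; among the rest, the node
    # with the largest VRF output becomes the leader.
    eligible = {nid: pseudo_vrf(sk, seed)
                for nid, (sk, rep) in candidates.items() if rep > 0}
    return max(eligible, key=eligible.get) if eligible else None

nodes = {"n1": (b"key-1", 1), "n2": (b"key-2", 1), "n3": (b"key-3", 0)}  # n3 is banned
print(elect_leader(nodes, seed=b"epoch-7"))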

After leader election, nodes start the round-based training process, with each round consisting of the gradient computation and blockchain consensus processes.

4.3.3 Gradient Computation

At each round, node i exchanges its perturbed gradient with all other nodes in the blockchain network. Specifically, each node keeps a true stochastic gradient g_i and a perturbed one g̃_i to be shared. The whole exchange process can be summarized in the following steps:

  • Local gradient computation: compute the local stochastic gradient g_i^t over a mini-batch ζ_i^t randomly sampled from the local dataset D_i.

  • Adding noise: add random Gaussian noise ξ_i^t with variance σ² to the local gradient, yielding the perturbed gradient g̃_i^t = g_i^t + ξ_i^t to be shared.

  • Broadcast gradients: send the perturbed local gradient g̃_i^t to all other nodes, and receive the perturbed gradients g̃_j^t from the other nodes at the same time.

1  Initialize: x^0, learning rate γ, number of total rounds T, and noise variance σ²
2  for t = 0 to T − 1 do
3      // Local computation
4      randomly sample ζ_i^t from D_i and compute the local stochastic gradient g_i^t
5      // Adding noise
6      randomly generate Gaussian noise ξ_i^t ~ N(0, σ²) and set g̃_i^t = g_i^t + ξ_i^t
7      // Broadcast gradients
8      broadcast g̃_i^t to the network and receive g̃_j^t from each peer j
Algorithm 2 Gradient Computation

4.3.4 Blockchain Consensus

1  // To prevent deadlock, each node starts a view change if the timer exceeds the timeout
2  PRE-PREPARE
3  if role is leader then
4      ĝ ← GAR(g̃_1, …, g̃_n)            // BFT aggregation (Krum in this paper)
5      form a new block B recording ĝ
6      broadcast the signed message <PRE-PREPARE, B, ĝ>
7  PREPARE
8  if role is follower then
9      compute ĝ_i ← GAR(g̃_1, …, g̃_n)
10     while receive <PRE-PREPARE, B, ĝ> do
11         if B and ĝ are valid and ||ĝ − ĝ_i|| ≤ τ then
12             broadcast <PREPARE, B, ĝ>
13 COMMIT
14 while receive 2f + 1 valid <PREPARE, B, ĝ> do
15     broadcast <COMMIT, B, ĝ>
16 DECIDE
17 while receive 2f + 1 valid <COMMIT, B, ĝ> do
18     append B to the local blockchain, update the local model with ĝ, and update reputations
Algorithm 3 Blockchain Consensus

The blockchain consensus process deeply integrates a blockchain, a BFT consensus protocol (e.g., PBFT, Tendermint), and a BFT aggregation function (e.g., Krum, Median). In this paper, we adopt the Practical Byzantine Fault Tolerance (PBFT) protocol as our consensus backbone due to its effectiveness, as validated by Hyperledger Sawtooth. The aggregation rule used in the blockchain consensus is Krum. Concretely, the blockchain consensus consists of four phases: PRE-PREPARE, PREPARE, COMMIT, and DECIDE.

In the PRE-PREPARE phase, the leader computes an aggregated gradient ĝ = Krum(g̃_1, ..., g̃_n), where Krum is an (α, f)-Byzantine-resilient aggregation function. The core idea of Krum is to eliminate the gradients that are too far away from the others. We use the Euclidean distance to measure how far apart two gradients are, and define S_i to be the set of the n − f − 2 gradients closest to g̃_i. We expect the Krum function to choose one gradient that is the “closest” to its surrounding gradients. More precisely, the output of Krum is one of its input gradients, and the index of this gradient is

$$i^{*} = \operatorname*{arg\,min}_{i} \sum_{j \in S_i} \lVert \tilde{g}_i - \tilde{g}_j \rVert^{2} .$$

The leader takes ĝ = g̃_{i*} as the aggregation result and forms a new block B which records ĝ. Then the leader broadcasts a signed pre-prepare message <PRE-PREPARE, B, ĝ>.
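A minimal Python sketch of this Krum selection, as the leader might run it in the PRE-PREPARE phase, is shown below; parameter names are illustrative.

import numpy as np

def krum(gradients, f):
    # Krum: return the gradient whose summed squared distance to its
    # n - f - 2 nearest neighbors is smallest.
    # gradients: array-like of shape (n, d) holding the perturbed gradients.
    # f: assumed upper bound on Byzantine nodes (requires n > 2f + 2).
    gradients = np.asarray(gradients, dtype=float)
    n = len(gradients)
    k = n - f - 2                                    # number of neighbors scored
    dists = np.linalg.norm(gradients[:, None, :] - gradients[None, :, :], axis=2) ** 2
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(dists[i], i))[:k]   # drop distance to itself
        scores.append(nearest.sum())
    return gradients[int(np.argmin(scores))]

# Example: 5 honest gradients near 1.0 and one Byzantine outlier, f = 1
rng = np.random.default_rng(2)
grads = np.vstack([rng.normal(loc=1.0, scale=0.1, size=(5, 3)), [[50.0, 50.0, 50.0]]])
print(krum(grads, f=1))   # one of the honest gradients is selected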

In the PREPARE phase, each follower computes its own aggregate ĝ_i based on the perturbed gradients it has received locally, then waits for pre-prepare messages. If a pre-prepare message is received, the follower first verifies the digital signature and the block (height, block hash, etc.). Then it compares ĝ with ĝ_i; the requirement is that ||ĝ − ĝ_i|| ≤ τ, where τ is a small tolerated variation. This condition indicates that each follower should have a view of the non-Byzantine gradients similar to the leader's. If the verification passes, the follower broadcasts a prepare message <PREPARE, B, ĝ>.

In the COMMIT phase, once a node has received 2f + 1 valid prepare messages, it broadcasts a commit message and enters the following phase.

In the DECIDE phase, upon receiving 2f + 1 valid commit messages, a node appends the new block B to its local blockchain BC, updates its local model using the aggregated gradient ĝ, and finally updates the reputations. The reputation of a node is reduced if its gradient deviates from the aggregated gradient by more than τ. Even though we do not explicitly introduce the view change, we do have such a process to address the case where the leader is Byzantine. To avoid deadlock, we set a timeout in the blockchain consensus process. If the timeout expires, each node broadcasts a view-change message and waits for its peers' responses. Upon receiving a quorum of view-change messages, a node abandons the current round.
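The quorum counting and the reputation update in the COMMIT and DECIDE phases can be sketched as follows; the 2f + 1 threshold, the deviation test against tau, and the penalty size are our reading of the description above rather than exact protocol constants.

import numpy as np

def has_quorum(messages, f):
    # PBFT-style quorum: proceed once 2f + 1 matching, validly signed messages
    # for the same block have been collected (signature verification omitted).
    return len(messages) >= 2 * f + 1

def decide(chain, block, aggregated, shared_grads, reputation, tau, penalty=1):
    # DECIDE-phase sketch: append the block to the local chain, then reduce the
    # reputation of any node whose shared gradient deviates from the aggregated
    # gradient by more than tau (the penalty size here is illustrative).
    chain.append(block)
    for node_id, g in shared_grads.items():
        if np.linalg.norm(g - aggregated) > tau:
            reputation[node_id] = max(0, reputation[node_id] - penalty)
    return chain, reputation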

5 Theoretical Analysis

In this section, we provide both convergence analysis and regret analysis on SPDL in the presence of Byzantine nodes.

Without loss of generality, let g̃_1^t, ..., g̃_{n−f}^t be the perturbed gradients sent out by honest nodes in round t, and b_1^t, ..., b_f^t be the gradients sent out by the possible Byzantine nodes in round t. We assume that the gradients derived by correct nodes are independently sampled from a common random variable whose expectation is the true gradient g. By adding Gaussian noise, a perturbed gradient can then be considered as an instance of this random variable shifted by a noise term ξ, where ξ follows the Gaussian distribution with mean 0 and variance σ². Therefore the expectation of a perturbed gradient is also g. In this section, if we only concentrate on a certain round t, we omit the superscript t on variables when ambiguity can be avoided from context.

Definition 3.

((α, f)-Byzantine Resilience) Let 0 ≤ α < π/2 and let f be the number of Byzantine nodes in a distributed system; then our anti-Byzantine mechanism with output ĝ is said to be (α, f)-Byzantine resilient if

$$\big\langle \mathbb{E}[\hat{g}],\, g \big\rangle \;\ge\; (1 - \sin\alpha)\, \lVert g \rVert^{2} \;>\; 0 .$$

Theorem 1.

In round , If and

where

(1)

our anti-Byzantine mechanism achieves -Byzantine Resilience with

Proof.

Denote by the set of closest gradients of the -th node, the collection of the perturbed gradients in sent by correct nodes, while the gradients in sent by the Byzantine nodes. Let be the index of the gradient chosen by , we then have

(the above inequality holds since is convex.)

If is one of the correct nodes, i.e., ,

The above inequality holds since and are both 0. Because there are exactly correct nodes, we have

If is one of the Byzantine nodes, i.e., ,

where is any correct node. By the definition of , a node labeled is correct, but is farther away from any node in . Therefore we have:

Combining the two results where is a correct node or a Byzantine node, we have:

Since

then

Therefore

Finally, we have

Regret analysis is commonly used in online learning to investigate the loss difference between two learning methods. We therefore present a regret analysis of SPDL to show the loss difference incurred with and without Byzantine nodes.

In a decentralized system with Byzantine nodes, let x^t be the model parameter of a node in round t, and let ζ^t be the raw data randomly sampled by the nodes in round t. The loss function in round t can then be written as ℓ(x^t, ζ^t). Considering the destructive effect of Byzantine nodes on the training process, we denote by x̂^t the parameter that the system would learn per round in the absence of Byzantine nodes. It is meaningful to compare the gap between the loss values computed with the two different parameters x^t and x̂^t. If there is no Byzantine node, we assume x̂^t is updated using any correct gradient given by a random node.

Definition 4.

(Regret)
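Reading Definition 4 together with the setup above, a standard instantiation of the regret over T rounds (the notation here is our assumption, not necessarily the authors' original formulation) is

$$\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \Big[ \ell\big(x^{t}, \zeta^{t}\big) - \ell\big(\hat{x}^{t}, \zeta^{t}\big) \Big],$$

where x^t is the parameter learned in the presence of Byzantine nodes and x̂^t the parameter learned in their absence.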

Theorem 2.

If the loss function satisfies -Lipschitz continuity, and , then , where

Proof.

By the property of -Lipschitz continuity, we have

Since

we have

Then we obtain

Thus the theorem can be immediately proved by setting .

Fig. 3: Test error evolution with various network sizes n.

Fig. 4: Latency of different stages with respect to the network size n.

Fig. 5: Test error evolution with various Byzantine ratios r.

Fig. 6: Test error evolution with different batch sizes.

Fig. 7: Test error evolution with different privacy budgets ε.

6 Evaluation

6.1 Configuration

We implement SPDL with 3,500 lines of Python code and conduct the experiments on a Dell PowerEdge R740 server with two CPUs (Intel Xeon 4214R, 24 cores, 2.40 GHz) and 128 GB of RAM. SPDL adopts the gRPC framework and a P2P network for the underlying communication, PyTorch as the machine learning library, and a blockchain system running the PBFT consensus algorithm. We make SPDL open source on GitHub (https://github.com/isSPDL/SPDL). Nodes bootstrap by generating key pairs using ECDSA, initializing the genesis block, establishing the gRPC connections, exchanging node lists, and joining the P2P network. We evaluate SPDL on the image classification task over the MNIST dataset, which consists of 70,000 images of handwritten digits in 10 classes. The dataset is equally divided into n groups, with each group assigned to one node. Each node adds Gaussian noise to its local gradients under a fixed default noise and privacy setting (if not stated otherwise). We evaluate the performance of SPDL using the following standard metrics. 1) Test error: the fraction of wrong predictions among all predictions on the test dataset; we measure the test error with respect to rounds, network size, batch size, privacy budget, and Byzantine ratio. 2) Latency: the latency of each round.

6.2 Evaluation Results

Convergence with Network Size: For simplicity, we denote by "PURE" the decentralized learning scheme that does not leverage any DP technique, BFT GAR, or blockchain system, and by "DP" a decentralized learning scheme that adds only the DP technique on top of PURE. We first compare SPDL with the PURE and DP schemes in a non-Byzantine environment. As shown in Fig. 3, the test error nearly converges after 20 rounds, but fluctuates considerably when n is as small as four. For larger n, all schemes achieve almost the same convergence. The SPDL and DP schemes sometimes achieve lower test error than PURE for certain network sizes in our experiments, because adding noise can prevent the training process from over-fitting. Besides, the network size does not impact the convergence rate, and a larger network size contributes to more stable convergence.

Latency: To better illustrate the latency of each round, we divide a round into three stages: local gradient computation plus noise addition, gradient exchange, and blockchain consensus. As Fig. 4 shows, the three stages are of the same order of magnitude. The latency grows with n simply because more nodes contend for computational resources. The MNIST classification task can be finished quickly (about 0.1 s per round), so the computation stage contributes relatively little in our experiments. When the machine learning task becomes more demanding (e.g., 10 min per round), the consensus and communication overhead is acceptable and could even be ignored.

Convergence in the Presence of Byzantine Nodes: We then compare the three schemes PURE, DP, and SPDL with respect to the Byzantine ratio r, defined as the number of Byzantine nodes divided by n. We fix the network size and consider several values of r, where r = 0 represents the non-Byzantine case. As shown in Fig. 5, the results of the non-Byzantine experiments indicate that the three schemes achieve similar convergence. However, the DP and PURE schemes exhibit high test error when r > 0, and fail to ensure model convergence even in the presence of only a few Byzantine nodes. The results clearly show that SPDL still guarantees the same convergence under different levels of Byzantine attacks.

Batch Size: We then present the performance of the three schemes PURE, DP, and SPDL with two different batch sizes (abbreviated as "BS") in Fig. 6. With the smaller batch size, the test error of all deployments fluctuates considerably, with the PURE scheme outperforming the others because the added noise perturbs the model convergence. However, Fig. 6(b) indicates that increasing the batch size ensures stable convergence and makes SPDL perform as well as the PURE scheme.

Privacy Budget: We finally test the SPDL scheme with two different privacy budgets ε; a smaller ε represents a stronger privacy guarantee. The results presented in Fig. 7 demonstrate that in one of the two settings shown in Fig. 7, adding noise under either privacy budget yields similar convergence, whereas in the other setting a smaller ε causes a larger test error. This implies that the tradeoff between accuracy and privacy preservation should be carefully adjusted according to the specific demands on privacy protection and model accuracy.

7 Conclusion

SPDL is a new decentralized machine learning scheme that ensures efficiency while achieving strong security and privacy guarantees. In particular, SPDL utilizes BFT consensus and a BFT GAR to protect model updates from harsh Byzantine behaviors, leverages blockchain for transparency and traceability, and adopts the DP technique for privacy protection. We provide rigorous theoretical analysis of the effectiveness of our scheme and conduct extensive studies of the performance of SPDL under varying network size, batch size, privacy budget, and Byzantine ratio.

References