Blockchain Assisted Decentralized Federated Learning (BLADE-FL): Performance Analysis and Resource Allocation

01/18/2021 · by Jun Li, et al. · CSIRO; Princeton University

Federated learning (FL), as a distributed machine learning paradigm, promotes personal privacy by keeping the processing of clients' raw data local. However, relying on a centralized server for model aggregation, standard FL is vulnerable to server malfunctions, untrustworthy servers, and external attacks. To address this issue, we propose a decentralized FL framework by integrating blockchain into FL, namely, blockchain assisted decentralized federated learning (BLADE-FL). In a round of the proposed BLADE-FL, each client broadcasts its trained model to other clients, competes to generate a block based on the received models, and then aggregates the models from the generated block before its local training of the next round. We evaluate the learning performance of BLADE-FL, and develop an upper bound on the global loss function. Then we verify that this bound is convex with respect to the number of overall rounds K, and optimize the computing resource allocation for minimizing the upper bound. We also note that there is a critical problem of training deficiency, caused by lazy clients who plagiarize others' trained models and add artificial noise to disguise their cheating behaviors. Focusing on this problem, we explore the impact of lazy clients on the learning performance of BLADE-FL, and characterize the relationship among the optimal K, the learning parameters, and the proportion of lazy clients. Based on the MNIST and Fashion-MNIST datasets, we show that the experimental results are consistent with the analytical ones. To be specific, the gap between the developed upper bound and the experimental results is lower than 5%, and the computing resource allocation optimized from the upper bound effectively minimizes the loss function.


1 Introduction

With the development of the Internet of Things (IoT), the amount of data from end devices is exploding at an unprecedented rate. Conventional machine learning (ML) technologies encounter the problem of how to efficiently collect distributed data from various IoT devices for centralized processing [TheNextGrandChallenges]. To tackle this transmission bottleneck, distributed machine learning (DML) has emerged to process data at the network edge in a distributed manner [8805879]. DML can alleviate the burden on the central server by dividing a task into sub-tasks assigned to multiple nodes. However, DML needs to exchange samples when training a task [DBLP:conf/aistats/McMahanMRHA17], posing a serious risk of privacy leakage [9090973]. As such, federated learning (FL) [DBLP:journals/corr/KonecnyMRR16], proposed by Google as a novel DML paradigm, shows its potential advantages [8951246]. In an FL system, a machine learning model is trained across multiple distributed clients with local datasets and then aggregated on a centralized server. FL is able to cooperatively implement machine learning tasks without raw data transmission, thereby promoting clients' data privacy [9048613, DBLP:journals/corr/abs-2007-02056, DBLP:journals/tifs/WeiLDMYFJQP20]. FL has been applied to various data-sensitive scenarios, such as smart health-care, E-commerce [DBLP:journals/tist/YangLCT19], and the Google project Gboard [DBLP:journals/corr/abs-1912-01218].

However, due to centralized aggregation of models, standard FL is vulnerable to server malfunctions and external attacks, incurring either inaccurate model updates or even training failures. In order to solve this single-point-failure issue, blockchain [nakamoto2008peer, DBLP:journals/fgcs/ReynaMCSD18, 8436042] has been applied to FL systems. Leveraging the advantages of blockchain techniques, the work in [DBLP:journals/corr/abs-1808-03949] developed a blockchain-enabled FL architecture to validate the uploaded parameters and investigated system performance, such as the block generation rate and learning latency. Later, the work in [DBLP:conf/cyberc/MartinezFH19] incorporated Delegated Proof of Stake (DPoS) into blockchain-enabled FL to enhance the delay performance at the expense of robustness. The recent work in [DBLP:journals/tii/LuHDMZ20a] developed a tamper-proof architecture that utilized blockchain to enhance system security when sharing parameters, and proposed a novel consensus mechanism, i.e., Proof of Quality (PoQ), to optimize the reward function. Since model aggregations are fulfilled by miners in a decentralized manner, blockchain-enabled FL can solve the single-point-failure problem. In addition, owing to a validation process of local training, FL can be extended to untrustworthy devices in a public network [8470083].

Although the above-mentioned works resorted to blockchain architectures to avoid single-point-failure, they inevitably introduced a third party, i.e., the miners inherent in blockchain, to store the aggregated models distributively, causing potential information leakage. Also, these works did not analyze the convergence of model training, which is important for evaluating FL learning performance. In addition, the consumption of resources, e.g., computing capability, caused by mining in blockchain [8946151] is generally not taken into account in these works. However, the resources consumed by mining are not negligible compared with those consumed by FL model training [DBLP:journals/fgcs/ReynaMCSD18]. Hence, blockchain-enabled FL needs to balance resource allocation between training and mining.

In this work, we propose a novel blockchain assisted decentralized FL (BLADE-FL) architecture. In BLADE-FL, the training and mining processes are incorporated and implemented at each client, i.e., a client conducts both model training and mining tasks with its own computing capability. We analyze an upper bound on the loss function to evaluate the learning performance of BLADE-FL. Then we optimize the computing resource allocation between local training and mining on each client to approach the optimal learning performance. We also pay special attention to a security issue that inherently exists in BLADE-FL, known as the lazy client problem. In this problem, lazy clients try to save their computing resources by directly plagiarizing models from others, leading to training deficiency and performance degradation.

The main contributions of this paper can be summarized as follows.

  • We propose a novel blockchain-assisted FL framework, called BLADE-FL, to overcome the issues raised by centralized aggregation in conventional FL systems. In each round of BLADE-FL, clients first train local models and broadcast them to others. Then they act as miners, competing to generate a block based on the received models. Afterwards, each client aggregates the models from the verified block to form an initial model for the local training of the next round. Compared with conventional blockchain-enabled FL, our BLADE-FL helps promote privacy against model leakage, and guarantees tamper-resistant model updates in a trusted blockchain network.

  • We analyze an upper bound on the loss function to evaluate the learning performance of BLADE-FL. In particular, we minimize the upper bound by optimizing the computing resource allocation between training and mining, and further explore the relationship among the optimal number of integrated rounds, the training time per iteration, the mining time per block, the number of clients, and the learning rate.

  • We develop a lazy model for BLADE-FL, where the lazy clients plagiarize others’ weights and add artificial noises. Moreover, we develop an upper bound on the loss function for this case, and investigate the impact of the number of lazy clients and the power of artificial noises on the learning performance.

  • We provide experimental results, which are consistent with the analytical results. In particular, the developed upper bound on the loss function is tight with respect to the experimental results (e.g., the gap can be lower than 5%), and the optimized resource allocation approaches the minimum of the loss function.

Notation Description
The set of training samples in the -th client
The -th client
The total number of clients
The total number of lazy clients
The variance of artificial noise added by lazy clients
The total number of integrated rounds
The number of iterations of local training
The global loss function
The local loss function of the -th client
Local model weights of the -th client at the -th integrated round
Global model weights aggregated from local models at the -th integrated round
Local model weights of the -th lazy client at the -th integrated round
Learning rate of the gradient descent algorithm
Training time per iteration
Mining time per block
Total computing time constraint of an FL task
TABLE I: Summary of main notation

The remainder of this paper is organized as follows. Section 2 first introduces the background of this paper. Then we propose BLADE-FL in Section 3, and optimize the upper bound on the loss function in Section 4. Section 5 investigates the issue of lazy clients. The experimental results are presented in Section 6. Section 7 concludes this paper. In addition, Table I lists the main notation used in this paper.

2 Background

2.1 Federated Learning

In an FL system, there are multiple clients, each possessing a local dataset. Each client trains its local model, e.g., a deep neural network, based on its local data and transmits the trained model to the server. Upon receiving the weights from all the clients, the server performs a global model aggregation. There are a number of communication rounds for exchanging models between the server and the clients. Each round consists of an uploading phase, where the clients upload their local models, and a downloading phase, where the server aggregates the models and broadcasts the result to the clients. The clients then update their local models based on the global one.

In each communication round, the server performs a global aggregation according to some combining rule, e.g., a weighted average that combines the local weights of all clients into the aggregated global weights. The global loss function is defined as a weighted sum of the clients' local loss functions [DBLP:journals/corr/abs-1907-09693]. In FL, each client is trained locally to minimize its local loss function, while the entire system is trained to minimize the global loss function. The FL system finally outputs the global model obtained after the overall communication rounds. Different from the training process in conventional DML systems [DBLP:conf/aistats/McMahanMRHA17], each client in FL only shares its local model, rather than its personal data, to update the global model [9247530], promoting the clients' privacy.
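To make the aggregation rule and the global loss concrete, the Python sketch below implements a dataset-size-weighted combination of the kind described above. The function names and the use of plain NumPy arrays for model weights are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def aggregate(local_weights, dataset_sizes):
    """Dataset-size-weighted average of local model weights (FedAvg-style)."""
    total = float(sum(dataset_sizes))
    aggregated = np.zeros_like(local_weights[0])
    for w_i, d_i in zip(local_weights, dataset_sizes):
        aggregated += (d_i / total) * w_i   # clients with more data weigh more
    return aggregated

def global_loss(local_losses, dataset_sizes):
    """Global loss as the dataset-size-weighted sum of the local losses."""
    total = float(sum(dataset_sizes))
    return sum((d_i / total) * f_i for f_i, d_i in zip(local_losses, dataset_sizes))

# Example: three clients; the client with 200 samples dominates the average.
w = aggregate([np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])],
              [100, 200, 100])
print(w, global_loss([0.9, 0.5, 0.7], [100, 200, 100]))
```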

2.2 Blockchain

Blockchain is a shared and decentralized ledger. In a blockchain system, each block stores a group of transactions. The blocks are linked together into a chain by referencing the hash value of the previous block. Owing to this cryptographic chain structure, any modification of the data within a block invalidates the links of all subsequent blocks. Therefore, it is practically impossible to tamper with data that has already been stored in the blockchain.

In addition, thanks to the consensus mechanism, each transaction included in a newly generated block is also immutable. The consensus mechanism validates the data within the blocks and ensures that all the nodes participating in the blockchain store the same data. The most prevalent consensus mechanism is Proof of Work (PoW), used in the Bitcoin system [nakamoto2008peer]. In Bitcoin, the process of block generation is as follows. First, a node broadcasts a transaction with its signature to the blockchain network via the gossip protocol [DBLP:journals/sigops/DemersGHILSSST88]. Then the nodes in the blockchain verify the transaction by checking the signature. Afterward, each node collects the verified transactions and competes to generate a new block that includes these transactions by finding a one-time number (called a nonce) that makes the hash value of the block meet a specific target. The node that finds a proper nonce is eligible to generate the new block and broadcasts it to the entire network. Finally, the nodes validate the new block and append the verified block to the existing blockchain [DBLP:journals/cem/PuthalMMKD18]. Notably, the work in PoW is a mathematical problem that is easy to verify but extremely hard to solve. The nodes in the blockchain consume massive computing resources to solve this problem. This process is called mining, and those who take part in it are known as miners. Because of the mining process, PoW can defend against attacks as long as the total computing power of malicious devices is less than that of honest devices (i.e., the 51% attack threshold) [DBLP:journals/cacm/EyalS18].
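As a minimal illustration of the nonce search described above, the sketch below hashes a candidate block with increasing nonces until the SHA-256 digest falls below a difficulty target, and shows how any node can cheaply verify the result. The block layout and the integer target test are simplified assumptions, not Bitcoin's exact format.

```python
import hashlib
import json

def mine_block(transactions, prev_hash, difficulty_bits=16):
    """Search for a nonce so that SHA-256(block) has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        block = {"prev_hash": prev_hash, "transactions": transactions, "nonce": nonce}
        digest = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
        if int(digest, 16) < target:
            return block, digest              # proof of work found
        nonce += 1

def verify_block(block, digest, difficulty_bits=16):
    """Re-hashing the block is cheap, so every node can check the proof of work."""
    recomputed = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return recomputed == digest and int(digest, 16) < (1 << (256 - difficulty_bits))

block, digest = mine_block(["signed model update of client 3"], prev_hash="0" * 64)
print(verify_block(block, digest))            # True
```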

In this context, blockchain is safe and reliable with the aid of the chain structure and consensus mechanism. Driven by these merits, we deploy the blockchain to replace the central server, and build up a decentralized FL network with privacy protection.

3 Proposed Framework

In this section, we detail the proposed BLADE-FL framework in Section 3.1 and the computing resource allocation model in Section 3.2.

3.1 BLADE-FL

The BLADE-FL network consists of clients, each with equal computing power. (In this paper, computing resource and computing power are used interchangeably; both are measured in CPU cycles per second.) In this network, each client acts as not only a trainer but also a miner, and the role transition is designed as follows. First, each client (as a trainer) trains its local model, and then broadcasts the local model to the entire network as a requested transaction of the blockchain. Second, the client (as a miner) mines the block that includes all the local models that are ready to be aggregated. Once the newly generated block is validated by the majority of clients, the verified models in the block are immutable. Without the intervention of any centralized server, each client performs the global aggregation to update its local model by using all the shared models in the validated block. Suppose that the uploading and downloading phases cannot be tampered with by external attackers.

Let us consider that all the clients deploy the same time allocation strategy for local training and mining. In other words, all the clients start the training at the same time, and then turn to mining simultaneously. In this context, for each global model update and block generation, we define an integrated round for BLADE-FL that combines a communication round of FL and a mining round of blockchain. As illustrated in Fig. 1, an integrated round consists of the following steps. (At the very beginning of the first integrated round, each client initializes its local parameters, such as the initial weights and the learning rate.)


Fig. 1: Key steps in an integrated round of the proposed BLADE-FL.


Local Training. Each client performs the local training by iterating the learning algorithm times to update its own model .

Model Broadcasting and Verification. Each client signs its model with a digital signature and propagates the model as a requested transaction. The other clients verify the transaction of the requesting client (i.e., the identity of the client).

Mining. Upon receiving the models from others, all the clients compete to mine the -th block.

Block Validation. All the clients append the new block onto their local ledgers only if the block is validated.

Local Updating. Upon receipt of the verified transactions in this block, each client updates its local model. Then the system proceeds to the next integrated round.

In contrast to [DBLP:journals/tii/LuHDMZ20a], BLADE-FL does not rely on an additional third party for global aggregation, thereby promoting privacy against model leakage. From the above steps, the consensus mechanism builds a bridge between the local models of the clients and the model aggregation. Thanks to PoW, BLADE-FL guarantees tamper-resistant model updates in a trusted blockchain network.
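The toy sketch below strings the steps of one integrated round together from the ledger's point of view, with local training assumed to have already produced the clients' weight vectors. The block layout, the single simulated miner, and the dataset-size-weighted aggregation are illustrative assumptions consistent with the description above, not the paper's implementation; digital signatures are omitted.

```python
import hashlib
import json
import numpy as np

def one_integrated_round(local_weights, dataset_sizes, chain, difficulty_bits=12):
    """One integrated round of BLADE-FL (toy): broadcast, mine, validate, aggregate."""
    # Step 2: the locally trained models are broadcast as transactions.
    transactions = [w.tolist() for w in local_weights]

    # Step 3: every client competes to mine the block; one winner is simulated here.
    prev_hash = chain[-1]["hash"] if chain else "genesis"
    nonce, target = 0, 1 << (256 - difficulty_bits)
    while True:
        body = json.dumps({"prev": prev_hash, "txs": transactions, "nonce": nonce})
        digest = hashlib.sha256(body.encode()).hexdigest()
        if int(digest, 16) < target:
            break
        nonce += 1

    # Step 4: the validated block is appended to every client's local ledger.
    chain.append({"prev": prev_hash, "txs": transactions, "nonce": nonce, "hash": digest})

    # Step 5: each client aggregates the models in the block to start the next round.
    total = float(sum(dataset_sizes))
    aggregated = sum((d / total) * np.array(t) for d, t in zip(dataset_sizes, transactions))
    return aggregated, chain

models = [np.array([0.1, 0.2]), np.array([0.3, 0.4])]
aggregated, chain = one_integrated_round(models, dataset_sizes=[100, 300], chain=[])
print(aggregated, len(chain))
```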

3.2 Computing Resource Allocation Model

In this subsection, we model the time required for training and mining, to show the relationship between FL and blockchain in BLADE-FL.

Block Generation Rate: The block generation rate is determined by the computational complexity of the hash function and the total computing power of the blockchain network (i.e., the total CPU cycles). The average number of CPU cycles required to generate a block in PoW is determined by the mining difficulty [DBLP:journals/tpds/XuWLGLYG19]. (Following PoW, the mining difficulty is adjusted at intervals but remains unaltered within each interval; thus, we consider the average number of CPU cycles per block to be invariant over a period with a fixed mining difficulty.) Thus, we define the average generation time of a block as

(1)

where the denominator is the number of CPU cycles per second of each client. Given a fixed mining difficulty, the average generation time of a block is a constant.

Local Training Rate: Recall that the local training of each client contains iterations. The training time consumed by each training iteration at the -th client is given by [9242286]

(2)

where the first quantity denotes the number of samples at the -th client, and the second denotes the number of CPU cycles required to train one sample. This paper considers that each client is equipped with the same hardware resources (e.g., CPU, battery, and cache memory). Therefore, each client is loaded with the same number of local samples and requires the same number of CPU cycles per sample. However, the contents of the samples owned by different clients are diverse. For simplicity, we assume that each client uses the same training algorithm and trains the same number of iterations for its local model update. Consequently, each client has an identical local training time per iteration, which we treat as a constant.

Consider that a typical FL learning task is required to be accomplished within a fixed total duration. Given the same hardware configuration, each client has the same total number of CPU cycles. From (1) and (2), the number of iterations of local training in each integrated round is given by

(3)

where ⌊·⌋ denotes the floor function, and the total number of integrated rounds K is a positive integer. Furthermore, the per-round training time multiplied by K gives the total training time, while the per-round mining time multiplied by K gives the total mining time. Under the constraint on the total computing time, we notice that the longer the mining takes, the less time remains for training. This is because (3) implies a fundamental tradeoff in BLADE-FL: the more iterations each client trains locally, the fewer integrated rounds the BLADE-FL network can perform. Moreover, due to the floor operation in (3), there may exist some computing time left over. We stress that this extra time is not sufficient to perform another integrated round, and thereby the global model cannot be updated during this period. In this context, we ignore this residual computing time in the following analysis.
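To make the time budget behind (3) concrete, the sketch below computes how many local iterations fit into each of K integrated rounds given a total time budget T, a training time tau per iteration, and a mining time t_mine per block. The explicit formula floor((T/K - t_mine)/tau) and the symbol names are a reconstruction assumed from the description of (3), not necessarily the paper's exact expression.

```python
import math

def local_iterations_per_round(T, K, tau, t_mine):
    """Iterations of local training per integrated round under a total time budget T.

    Each of the K rounds spends t_mine on mining; the remaining time in the round is
    spent on local training at tau time units per iteration (assumed reconstruction of (3)).
    """
    budget_per_round = T / K - t_mine      # time left for training within one round
    if budget_per_round <= 0:
        return 0                           # mining alone exhausts the round budget
    return math.floor(budget_per_round / tau)

# The tradeoff in (3): more integrated rounds leave fewer local iterations per round.
T, tau, t_mine = 1000.0, 1.0, 10.0
for K in (5, 10, 20, 40):
    print(K, local_iterations_per_round(T, K, tau, t_mine))
```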

In what follows, we optimize the learning performance of BLADE-FL based on (3).

4 Performance Analysis of the BLADE-FL System

In this section, we evaluate the learning performance of BLADE-FL with the upper bound on the loss function in Section 4.1, and optimize the learning performance with respect to the number of integrated rounds in Section 4.2.

4.1 Achievable Upper Bound Analysis

Existing works such as [3]-[11] evaluated the learning performance of the standard FL based on the loss function, where a smaller value of the loss function corresponds to a learning model with higher accuracy. Recently, the work in [DBLP:journals/jsac/WangTSLMHC19] derived an upper bound on the loss function between the iterations of local training and global aggregation.

Compared with the standard FL, our BLADE-FL replaces the centralized server with a blockchain network for global aggregation. Notably, the training process and the aggregation rule are the same as the centralized FL. Thus, the derived upper bound on the loss function in [DBLP:journals/jsac/WangTSLMHC19] can be applied to BLADE-FL.

We make the following assumption for all the clients.

Assumption
  For any two different model weights w and w', we assume that each local loss function F_i satisfies:

  1. F_i is convex;

  2. F_i is ρ-Lipschitz, i.e., |F_i(w) − F_i(w')| ≤ ρ‖w − w'‖;

  3. F_i is L-smooth, i.e., ‖∇F_i(w) − ∇F_i(w')‖ ≤ L‖w − w'‖.

According to Assumption 4.1, the global loss function F is also convex, ρ-Lipschitz, and L-smooth [DBLP:conf/pkdd/KarimiNS16].

The work in [DBLP:journals/jsac/WangTSLMHC19] also introduced the following definition to capture the divergence between the gradient of the local loss function and that of the global loss function.

Definition (Gradient Divergence [DBLP:journals/jsac/WangTSLMHC19])

For each client, we define an upper bound on the gap between its local gradient and the global gradient. The global gradient divergence is then the weighted average of these per-client bounds over all clients.
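A reconstruction of this definition in the notation of the cited work [DBLP:journals/jsac/WangTSLMHC19] reads as follows; the symbols δ_i, δ, F_i, F, and the dataset sizes |D_i| are assumed, since the original notation was lost.

```latex
\[
  \|\nabla F_i(\mathbf{w}) - \nabla F(\mathbf{w})\| \le \delta_i
  \quad \text{for all } \mathbf{w},
  \qquad
  \delta \triangleq \frac{\sum_{i=1}^{N} |\mathcal{D}_i|\,\delta_i}{\sum_{i=1}^{N} |\mathcal{D}_i|} .
\]
```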

This divergence is related to the distribution of local datasets over different clients. From [DBLP:journals/jsac/WangTSLMHC19], the following lemma presents an upper bound on the loss function in the standard FL.

Lemma ([DBLP:journals/jsac/WangTSLMHC19])

An upper bound on the loss function is given by

(4)

where

(5)

where the quantities involved are the initial weight, the optimal global weight, and the learning rate of the gradient descent algorithm, respectively.

From (3), we have . Substituting into (4) yields

(6)

where

(7)

and denotes the upper bound on the loss function in BLADE-FL.

The upper bound in (6) shows that the learning performance depends on the total number of integrated rounds K, the local training time per iteration, the average mining time per block, the learning rate, the data distribution, and the total computing time. From Definition 4.1, the gradient divergence is fixed given the clients' datasets, and the learning rate is preset. Recall that the training time per iteration and the mining time per block are both constants from (1) and (2). Given any fixed values of the remaining parameters, the bound in (6) is a univariate function of K. In the following theorem, we verify that it is a convex function with respect to K.

Theorem

The upper bound on the loss function in (6) is convex with respect to K.

Proof:

From (6), we define

(8)

where the two auxiliary quantities are defined accordingly. Since the bound is a univariate function of K, we can optimize it over K. Notice that the remaining terms are independent of K, while the auxiliary term is a function of K. Therefore, we compute the first and second derivatives with respect to K, respectively, as

(9)

Then, we have

(10)

and

(11)

We substitute (9) into (11), and obtain

(12)

Thus, is convex. Since we have , we prove that is convex [DBLP:journals/pieee/LiFL20].

Remark

In practice, K should not be too small, since a tiny K will make the system vulnerable to external attacks [9119406].

4.2 Optimal Computing Resource Allocation

First, the following theorem gives the optimal solution that minimizes the upper bound.

Theorem

Given any fixed values of the remaining system parameters, the optimal number of integrated rounds that minimizes the upper bound on the loss function in (6) is given by

(13)

when .

Proof:

Let and . We first have

(14)

where

(15)

Using (14), we obtain

(16)

Then, we approximate as a quadratic term with Taylor expansion:

(17)

Thus, can be written as

(18)

To solve the convex problem, we let , i.e.,

(19)

Finally, we have

(20)

This completes the proof.

Then, under a fixed total computing time constraint, let us focus on the effect of the training time per iteration and the mining time per block on the optimal number of integrated rounds through the following corollary. (The analytical results in Corollaries 1-5 are stated with respect to the number of integrated rounds; due to the fundamental tradeoff between the number of integrated rounds and the number of local iterations, the opposite results with respect to the latter also hold.)

Corollary

Given fixed values of the other parameters, the optimal number of integrated rounds decreases as either the training time per iteration or the mining time per block goes up. In this case, more time is allocated to training when the training time per iteration gets larger, or to mining when the mining time per block gets larger.

Proof:

This corollary is a straightforward result from Theorem 4.2.

Recall that the two quantities above are the training time per iteration and the mining time per block. From Corollary 4.2, the longer a local training iteration takes, the more computing power is allocated to local training at each client. Similarly, each client allocates more computing power to mining when the mining time per block is larger.

Next, we investigate the impact of the gradient divergence and the number of clients on the optimal number of integrated rounds when the other parameters are fixed, through the following corollaries (i.e., Corollary 4.2 and Corollary 4.2).

Corollary

Given fixed values of the other parameters, the optimal number of integrated rounds becomes larger as the gradient divergence grows. In this case, more time is allocated to mining.

Proof:

Without the approximation in (17), we first set the first derivative to zero, i.e.,

(21)

For simplicity, we let

(22)

where

(23)

Then, the first derivative of is given by

(24)

Notice that the first term in (22) is a decreasing function, the second term is an increasing function, and the term in (25) is a decreasing function, respectively. Thus, the solution of (22) drops as the gradient divergence grows. Finally, we conclude that the optimal number of integrated rounds increases as the divergence rises.

Corollary

Given fixed values of the other parameters, the optimal number of integrated rounds becomes smaller as the number of clients grows. In this case, more time is allocated to training.

Proof:

Based on Corollary 4.2, the proof of this corollary is straightforward, since the gradient divergence drops as the number of clients grows, from Definition 1.

The explanation of Corollary 4.2 is that each client may have trained an accurate local model but not an accurate global model (the gradient divergence is large), and thus BLADE-FL needs to perform more global aggregations, especially when the number of clients is small. This paper considers a sufficient number of honest clients in BLADE-FL to defend against malicious mining [DBLP:conf/trustbus/AbramsonHPPB20]. When the number of clients is sufficiently large, the divergence converges to its mean value according to the law of large numbers. In this context, Corollary 4.2 shows that the optimal number of integrated rounds approaches a constant as the divergence converges, and further implies that it becomes independent of the number of clients.

Corollary

Given fixed values of the other parameters, the optimal number of integrated rounds increases as the learning rate grows. Meanwhile, the upper bound in (6) drops as the learning rate grows, provided that the learning rate is not too large.

Proof:

From (23), we know that increases as the learning rate rises, which leads to larger . Thus, from the proof of Corollary 4.2, descends as ascends. Then the derivative of the function with respect to is

(26)

where the condition on the learning rate holds. This indicates that the loss function decreases as the learning rate increases while the condition is satisfied. However, the condition fails when the learning rate is sufficiently large. In this case, the bound no longer decreases with the learning rate, resulting in a larger loss function. This completes the proof.

The reason behind Corollary 4.2 is that the global model may not converge when each client is allocated limited learning resources and a small learning rate. In addition, a higher learning rate may lead to faster convergence but a less accurate local model. To compensate for the inaccurate training, more computing power is allocated to local training. In practice, the learning rate is decided by the learning algorithm, and the learning rates of different learning algorithms are diverse. Therefore, we can treat the learning rate as a constant in BLADE-FL.

5 Performance Analysis with Lazy Clients

Different from conventional FL, a new problem of learning deficiency caused by lazy clients emerges in the BLADE-FL system. This issue fundamentally originates from the lack of an effective detection and penalty mechanism in an unsupervised network such as blockchain, where a lazy client is able to plagiarize models from others to save its own computing power. A lazy client does not contribute to the global aggregation, and even causes training deficiency and performance degradation. To study this issue, we first model the lazy clients in Section 5.1. Then, we develop an upper bound on the loss function to evaluate the learning performance of BLADE-FL in the presence of lazy clients in Section 5.2. Next, we investigate the impact of the ratio of lazy clients and the power of artificial noise on the learning performance in Section 5.3. In this section, suppose that there exist lazy clients in BLADE-FL, and define the lazy ratio as the fraction of lazy clients among all clients.

5.1 Model of Lazy Clients

A lazy client can simply plagiarize another client's model before mining a new block. To avoid being spotted by the system, each lazy client adds artificial noise to its model weights as

(27)

where the first set denotes the set of lazy clients, and the noise term is an artificial noise vector following a Gaussian distribution with zero mean and a given variance. As Fig. 2 illustrates, a client is identified as a lazy client if it plagiarizes an uploaded model from others and adds artificial noise to it in Step ①. Except for the plagiarism in Step ①, the lazy clients follow the honest clients in performing Steps ②-⑤. (In this paper, we assume that each client is honest in the mining because of the mining reward.)


Fig. 2: Model of the lazy client in BLADE-FL.
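The lazy-client behavior in (27) can be sketched as follows: the lazy client copies a plagiarized weight vector and perturbs it with zero-mean Gaussian noise. The NumPy representation and the variance argument are illustrative assumptions.

```python
import numpy as np

def lazy_update(plagiarized_weights, noise_variance, rng=None):
    """Model of a lazy client following (27): copy another client's weights and add
    zero-mean Gaussian noise with the given variance to disguise the plagiarism."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, np.sqrt(noise_variance), size=plagiarized_weights.shape)
    return plagiarized_weights + noise

honest_weights = np.zeros(8)
print(lazy_update(honest_weights, noise_variance=0.01, rng=np.random.default_rng(0)))
```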

5.2 Achievable Upper Bound with Lazy Clients

In this subsection, we develop an upper bound on the loss function with the lazy ratio and the power of artificial noise in the following theorem.

Theorem

Using the model of lazy clients in (27), an upper bound on the loss function after K integrated rounds with a given lazy ratio is given by

(28)

where denotes the aggregated weights of BLADE-FL with lazy clients after integrated rounds, and denotes the performance degradation caused by lazy clients after integrated rounds.

Proof:

Define the model weights of lazy clients as

(29)

where the first term on the right-hand side denotes the model parameters that are plagiarized by the lazy clients.

Since the loss function is ρ-Lipschitz, the proof of Lemma 1 in [DBLP:journals/jsac/WangTSLMHC19] has shown that

(30)

Therefore, the upper bound can be expressed as [DBLP:journals/jsac/WangTSLMHC19]

(31)

In addition, plugging (6) into (31), we have

(32)

From (32), we further have

(33)

If each lazy client adds Gaussian noise with the same variance to its plagiarized model, the squared norm of the noise follows a (scaled) chi-square distribution whose number of degrees of freedom equals the model dimension. Using its mean value, we have

(34)

The upper bound in (31) can be written as

(35)

This completes the proof.

Thereafter, we use the upper bound in (28) to evaluate the learning performance of BLADE-FL with lazy clients.
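As a quick numerical check of the chi-square argument used in the proof, the snippet below verifies that the expected squared norm of a zero-mean Gaussian noise vector equals its dimension times the per-coordinate variance; the dimension, variance, and trial count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, sigma2, trials = 100, 0.01, 10000

# For n ~ N(0, sigma2 * I_dim), ||n||^2 / sigma2 is chi-square with `dim` degrees of
# freedom, hence E[||n||^2] = dim * sigma2.
squared_norms = (rng.normal(0.0, np.sqrt(sigma2), size=(trials, dim)) ** 2).sum(axis=1)
print(squared_norms.mean(), dim * sigma2)   # the two values should be close
```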

5.3 Discussion on Performance with Lazy Clients

Practically, a lazy client tends to add noise that is neither too large nor too small in order to conceal itself. To this end, the noise power is required to be comparable to the magnitude of the model weights.

Remark

From (28), the plagiarism behavior and the artificial noise each contribute a term to the bound, with the plagiarism term dominating. This indicates that the plagiarism has a more significant effect on the learning performance than the noise perturbation.

Then, we reveal the impact of the lazy ratio and the noise variance on the optimal number of integrated rounds in the following corollary.

Corollary

The optimal number of integrated rounds that minimizes the bound in (28) decreases as either the lazy ratio or the noise variance grows.


Fig. 3: Visual illustration of the standard MNIST dataset
Proof:

From the definition of in (8), we let

(36)

As such, represents the loss function of BLADE-FL with lazy clients.

Since we have

(37)

and

(38)

we obtain that is still convex with respect to .

Furthermore, we let . Plugging this into , we have

(39)

Then we let

(40)

and express (39) as

(41)

We notice that

(42)

Thus is an increasing function with respect to . Let

(43)

Thus (41) can be rewritten as

(44)

where the first quantity grows as either the lazy ratio or the noise variance increases, while the remaining quantities vary with the number of integrated rounds accordingly. Finally, the optimal number of integrated rounds that minimizes the bound in (28) decreases as either the lazy ratio or the noise variance grows. This concludes the proof.

When the system is infested with a large number of lazy clients (i.e., the lazy ratio approaches 1), more computing power should be allocated to local training to compensate for the insufficient learning.


Fig. 4: Visual illustration of the Fashion-MNIST dataset.

6 Experimental Results


Fig. 5: Comparison of the upper bound in (6) and the experimental results under two parameter settings, (a) and (b).

In this section, we evaluate the analytical results with various learning parameters under a limited total computing time. First, we evaluate the developed upper bound in (6), and then investigate the optimal number of integrated rounds under different values of the training time per iteration, the mining time per block, the number of clients, the learning rate, the lazy ratio, and the power of artificial noise.

6.1 Experimental Setting

1) Datasets: In our experiments, we use two datasets under a non-IID setting to demonstrate the loss function and accuracy for different parameter values.

MNIST. The standard MNIST handwritten digit recognition dataset consists of 60,000 training examples and 10,000 testing examples [726791]. Each example is a 28×28 grayscale image of a handwritten digit from 0 to 9. In Fig. 3, we illustrate several samples from the standard MNIST dataset.

Fashion-MNIST. The Fashion-MNIST clothing dataset has 10 classes, namely T-shirt, trousers, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot, as illustrated in Fig. 4.

Fig. 6: Loss function and accuracy versus the number of integrated rounds for different values of the training time per iteration (MNIST and Fashion-MNIST).

Optimal training time (MNIST / Fashion-MNIST) | Maximal accuracy (MNIST / Fashion-MNIST)
40 / 46 | 87.44% / 59.57%
58 / 70 | 82.16% / 57.18%
64 / 82 | 66.47% / 50.11%
TABLE II: The optimal training time and corresponding accuracy for different values of the training time per iteration.

2) FL setting. Each client trains a Multi-Layer Perceptron (MLP) model. The MLP network has a single hidden layer containing 256 hidden units with rectified linear unit (ReLU) activations, followed by a softmax output over 10 classes (corresponding to the 10 digits in Fig. 3 and the 10 clothing classes in Fig. 4).
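For reference, a PyTorch sketch of an MLP of this shape (one hidden layer with 256 ReLU units and a 10-class softmax output) is given below. The 784-dimensional input for 28×28 grayscale images and the use of PyTorch are assumptions, since the paper does not specify the framework.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Single-hidden-layer MLP: 784 -> 256 (ReLU) -> 10 classes (softmax via log-probabilities)."""
    def __init__(self, in_dim=28 * 28, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),              # 28x28 grayscale image -> 784-dimensional vector
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
            nn.LogSoftmax(dim=1),      # 10-class output
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
print(model(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```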

3) Parameter setting. In our experiments, we set default values for the total computing time, the number of samples per client, the number of clients, the mining time per block, the number of lazy clients, and the learning rate, where time is normalized by the training time per iteration.

6.2 Experiments on Performance of BLADE-FL

Fig. 7: Loss function and accuracy versus the number of integrated rounds for different values of the mining time per block (MNIST and Fashion-MNIST).

Optimal mining time (MNIST / Fashion-MNIST) | Maximal accuracy (MNIST / Fashion-MNIST)
60 / 30 | 87.47% / 61.51%
64 / 40 | 85.68% / 60.34%
72 / 48 | 79.32% / 55.68%
TABLE III: The optimal mining time and corresponding accuracy for different values of the mining time per block.

Fig. 5 plots the gap between the developed upper bound in (6) and the experimental results. The learning rate and the lazy ratio are set to different values in conditions (a) and (b), respectively. First, we can see that the developed bound is close to, but always higher than, the experimental one under both conditions. Second, both the developed upper bound and the experimental results are convex with respect to the number of integrated rounds, which agrees with Theorem 4.1. Third, both the upper bound in (6) and the experimental results reach their minimum at the same optimal number of integrated rounds.

Fig. 8: Loss function and accuracy versus the number of integrated rounds for different numbers of clients (MNIST and Fashion-MNIST).

Number of clients | Optimal mining time (MNIST / Fashion-MNIST) | Maximal accuracy (MNIST / Fashion-MNIST)
N=10 | 70 / 42 | 74.52% / 52.66%
N=15 | 60 / 36 | 75.74% / 55.83%
N=20 | 50 / 30 | 82.89% / 62.91%
N=25 | 50 / 30 | 83.03% / 62.64%
TABLE IV: The optimal mining time and corresponding accuracy for different numbers of clients.

Fig. 6 plots the experimental results of the loss function and accuracy on MNIST and Fashion-MNIST for different values of the training time per iteration, while Table II shows the optimal training time and the corresponding accuracy. Here, the other parameters are set to their default values. First, Fig. 6(a) shows that a larger training time per iteration leads to a larger loss function. This is because, from (3), fewer local iterations and integrated rounds can be completed within the fixed time budget as the training time per iteration grows. Second, from Table II, the longer a training iteration takes, the more training time each client allocates. For example, using MNIST, the optimal training time increases from 40 to 64 as the training time per iteration rises. This observation is consistent with Corollary 4.2.

Fig. 7 plots the experimental results of the loss function and accuracy on MNIST and Fashion-MNIST for different values of the mining time per block, while Table III shows the optimal mining time and the corresponding accuracy. First, Fig. 7(a) shows that a larger mining time per block leads to a larger loss function, since, from (3), fewer local iterations and integrated rounds can be completed within the fixed time budget as the mining time grows. Second, from Table III, the optimal number of integrated rounds reduces as the mining time per block rises, but the optimal mining time goes up. For example, using MNIST, the optimal mining time increases from 60 to 72 as the mining time per block grows. This observation agrees with Corollary 4.2.

Fig. 9: Loss function and accuracy versus the number of integrated rounds for different values of the learning rate (MNIST and Fashion-MNIST).

Optimal mining time (MNIST / Fashion-MNIST) | Maximal accuracy (MNIST / Fashion-MNIST)
54 / 30 | 74.70% / 58.57%
60 / 54 | 88.17% / 72.50%
72 / 42 | 85.51% / 70.14%
TABLE V: The optimal mining time and corresponding accuracy for different values of the learning rate.

Fig. 8 shows the experimental results of the loss function and accuracy on MNIST and Fashion-MNIST for different numbers of clients, while Table IV illustrates the optimal mining time and the corresponding accuracy. The other parameters are set to their default values. First, from Table IV, we notice that the optimal mining time drops as the number of clients increases, which is consistent with Corollary 4.2. For example, using MNIST, the optimal mining time drops from 70 to 50 as the number of clients rises. Second, from Fig. 8(a), a larger number of clients leads to a lower loss function. This is because the involved datasets are larger as the number of clients grows, which causes a smaller loss function. Third, from both Fig. 8(a) and (b), the loss function approaches a fixed value when the number of clients is sufficiently large (e.g., N=20 and N=25). This observation is in line with Corollary 4.2.

Fig. 10: Loss function and accuracy versus the number of integrated rounds under various lazy ratios (MNIST and Fashion-MNIST).

Optimal training time (MNIST / Fashion-MNIST) | Maximal accuracy (MNIST / Fashion-MNIST)
30 / 50 | 85.53% / 54.86%
40 / 50 | 85.33% / 54.76%
50 / 80 | 78.11% / 48.92%
50 / 80 | 78.80% / 46.25%
TABLE VI: The optimal training time and corresponding accuracy for different values of the lazy ratio.

Fig. 9 plots the experimental results of the loss function and accuracy on MNIST and Fashion-MNIST for different values of the learning rate, while Table V illustrates the optimal mining time and the corresponding accuracy. First, from Table V, we find that the optimal mining time rises as the learning rate grows, which is in line with Corollary 4.2. For example, using MNIST, the optimal mining time rises from 54 to 72 as the learning rate grows. Second, from Fig. 9(a), the loss function drops as the learning rate increases, except when the learning rate is too large. This is because the bound grows significantly when the learning rate is too large, and our developed upper bound is no longer suitable. For example, beyond a certain learning rate, the loss function increases as the learning rate rises in our experiments on both MNIST and Fashion-MNIST.

6.3 Experiments on Performance with Lazy Clients

Fig. 10 plots the experimental results of the loss function and accuracy on MNIST and Fashion-MNIST for different values of the lazy ratio, while Table VI shows the optimal training time and the corresponding accuracy. The power of artificial noise is fixed. First, from Table VI, it is observed that the optimal training time steps up as the lazy ratio increases. For example, using MNIST, the time allocated to training rises from 30 to 50 as the lazy ratio increases. This observation is consistent with Corollary 5.3. Second, from Fig. 10(a), the learning performance degrades as the lazy ratio grows. This is because more lazy clients are involved in the system as the lazy ratio grows, leading to lower training efficiency.

Fig. 11: Loss function and accuracy versus the number of integrated rounds under various powers of artificial noise (MNIST and Fashion-MNIST).

Optimal training time (MNIST / Fashion-MNIST) | Maximal accuracy (MNIST / Fashion-MNIST)
30 / 50 | 78.35% / 57.44%
50 / 50 | 77.22% / 53.19%
50 / 50 | 59.96% / 52.06%
50 / 60 | 50.94% / 44.08%
TABLE VII: The optimal training time and corresponding accuracy for different values of the power of artificial noise.

Fig. 11 plots the experimental results of the loss function and accuracy on MNIST and Fashion-MNIST for different values of the power of artificial noise, while Table VII shows the optimal training time and the corresponding accuracy. The lazy ratio is fixed. First, from Table VII, we notice that the optimal training time grows as the noise power increases, which agrees with Corollary 5.3. For example, using MNIST, the optimal training time grows from 30 to 50 as the noise power increases. Second, from Fig. 11(a), the learning performance of BLADE-FL (i.e., loss function and accuracy) degrades as the noise power grows larger.

7 Conclusions

In this paper, we have proposed a BLADE-FL framework that integrates the training and mining processes at each client, to overcome the single-point-failure problem of a centralized network while maintaining the privacy-promoting capability of the FL system. In order to evaluate the learning performance of BLADE-FL, we have developed an upper bound on the loss function. We have also verified that the upper bound is convex with respect to the total number of integrated rounds K and have minimized the upper bound by optimizing K. Moreover, we have investigated a problem unique to the proposed BLADE-FL system, called the lazy client problem, and have derived an upper bound on the loss function with lazy clients. We have presented experimental results that are consistent with the analytical results. In particular, the developed upper bound is close to the experimental results (e.g., the gap can be lower than 5%), and the optimal K that minimizes the upper bound also minimizes the loss function in the experiments.

References