Secure Aggregation for Buffered Asynchronous Federated Learning

Federated learning (FL) typically relies on synchronous training, which is slow due to stragglers. While asynchronous training handles stragglers efficiently, it does not ensure privacy because it is incompatible with secure aggregation protocols. A buffered asynchronous training protocol known as FedBuff was recently proposed to bridge the gap between synchronous and asynchronous training, mitigating stragglers while also ensuring privacy. FedBuff allows users to send their updates asynchronously and preserves privacy by storing the updates in a private buffer inside a trusted execution environment (TEE). TEEs, however, have limited memory, which limits the buffer size. Motivated by this limitation, we develop a buffered asynchronous secure aggregation (BASecAgg) protocol that does not rely on TEEs. Conventional secure aggregation protocols cannot be applied in the buffered asynchronous setting since the buffer may hold local models from different rounds, so the masks that users apply to protect their models may not cancel out. BASecAgg addresses this challenge by carefully designing the masks so that they cancel out even when they correspond to different rounds. Our convergence analysis and experiments show that BASecAgg has almost the same convergence guarantees as FedBuff without relying on TEEs.


1 Introduction

Federated learning (FL) allows users to collaboratively train a machine learning model without sharing their data, thereby protecting their privacy [18]. The training is typically coordinated by a central server. The main idea that enables decentralized training without data sharing is that each user trains a local model using its own dataset and the global model maintained by the server. The users then share only their local models with the server, which updates the global model and pushes it back to the users for the next training round until convergence. Recent studies, however, showed that sharing the local models still breaches the privacy of the users through inference or inversion attacks, e.g., [10, 19, 30, 11]. To overcome this challenge, secure aggregation protocols were developed to ensure that the server only learns the aggregated global model without revealing the individual local models [4, 24, 13, 29, 9, 2]. FL protocols commonly rely on synchronous training [18], which suffers from stragglers since the server waits for the updates of a sufficient number of users at each round. Asynchronous FL tackles this by incorporating the updates of the users as soon as they arrive at the server [27, 26, 6, 7]. While asynchronous FL handles stragglers efficiently, it is not compatible with secure aggregation protocols designed for synchronous FL. This is because these protocols securely aggregate many local models together each time the global model is updated, and hence they are not suitable for asynchronous FL, in which each single local model updates the global model. Another approach that can be applied in asynchronous FL to protect the privacy of the users is local differential privacy (LDP) [25]. In this approach, each user adds noise to the local model before sharing it with the server. This approach, however, degrades the training accuracy.

In [20], an asynchronous aggregation protocol known as FedBuff has been proposed to mitigate stragglers and enable secure aggregation jointly. FedBuff enables secure aggregation through trusted execution environments (TEEs) such as Intel Software Guard Extensions (SGX) [8]. Specifically, the individual updates are not incorporated by the server as soon as they arrive. Instead, the server keeps the received local models in a TEE-enabled secure buffer of size $K$, where $K$ is a tunable parameter. The server then updates the global model when the buffer is full. This idea has been shown to be several times faster than conventional synchronous FL schemes.

Contributions. Since TEEs have limited memory, which limits the buffer size $K$, and are inefficient compared to untrusted hardware [8], we instead develop a buffered asynchronous secure aggregation protocol that does not rely on TEEs. The main challenge in leveraging conventional secure aggregation protocols in the buffered asynchronous setting is that the pairwise masks may not cancel out. This is because the asynchronous nature of this setting may result in the buffer holding local models from different rounds, while the pairwise masks cancel out only if they belong to the same round. This requires a careful design of the masks such that they can be cancelled even if they do not correspond to the same round. Specifically, our contributions are as follows.

  1. We propose BASecAgg, a buffered asynchronous secure aggregation protocol that extends a recently proposed synchronous secure aggregation protocol known as LightSecAgg [28] to the buffered asynchronous setting. The key idea of BASecAgg is to design the masks such that they cancel out even if they correspond to different training rounds.

  2. We extend the convergence analysis of [20] to the case where the local updates are quantized, which is necessary for secure aggregation protocols to protect the privacy of the local updates.

  3. Our extensive experiments on the MNIST and CIFAR-10 datasets show that BASecAgg has almost the same convergence guarantees as FedBuff despite the quantization.

2 Related Works

Secure aggregation protocols typically rely on exchanging pairwise random seeds and secret sharing them to tolerate user dropouts [4, 24, 13, 2]. The running time of such approaches, however, increases significantly with the number of dropped users since the server needs to reconstruct the mask of each dropped user. Recently, a secure aggregation protocol known as LightSecAgg has been proposed to address this challenge [28]. In LightSecAgg, unlike the prior works, the server does not reconstruct the pairwise random seeds of each dropped user. Instead, the server directly reconstructs the aggregate mask of all surviving users. This one-shot reconstruction results in much faster training. It is also worth noting that the protocol of [29] is based on the one-shot reconstruction idea, but unlike LightSecAgg it requires a trusted third party.

Prior secure aggregation protocols [4, 24, 13, 2] are designed for synchronous FL algorithms such as FedAvg [18], which suffer from stragglers. Asynchronous FL handles this problem by updating the global model as soon as the server receives any local model [27, 26, 6, 7]. The larger the staleness of a local model, the greater the error when updating the global model [27]. To address this staleness problem, an asynchronous protocol known as FedAsync has been developed in [27] that updates the global model through staleness-aware weighted averaging of the old global model and the received local model. In [6], an asynchronous protocol known as FedAt has been proposed, which bridges the gap between synchronous and asynchronous FL through a semi-synchronous protocol that groups the users, synchronously updates the model of each group, and then asynchronously updates the global model across groups. Similarly, a semi-synchronous FL protocol has been developed in [21] to handle the staleness problem while also mitigating Byzantine users.

Asynchronous FL, however, is not compatible with secure aggregation. A potential approach to ensure privacy then is through DP approaches that add noise to the local models before sharing them with the server [26]. A similar approach has also been leveraged in [12] to develop a privacy-preserving protocol for a limited class of learning problems, such as linear regression, logistic regression, and least-squares support vector machines, in the vertically partitioned (VP) asynchronous decentralized FL setting. Adding noise, however, degrades the training accuracy. In [20], an asynchronous aggregation protocol known as FedBuff has been proposed to mitigate stragglers while ensuring privacy. The key idea of FedBuff is that the server stores the local models in a TEE-enabled secure buffer of size $K$ until the buffer is full and then securely aggregates them. Due to the memory limitations of TEEs, this approach is only feasible when $K$ is small. This motivates us in this work to develop a buffered asynchronous secure aggregation protocol without TEEs.

3 Synchronous Secure Aggregation

In this section, we provide an overview of secure aggregation in synchronous FL.
The goal in FL is to collaboratively learn a global model $\mathbf{x}$ of dimension $d$, using the local datasets of $N$ users without sharing them. This problem can be formulated as minimizing a global loss function as follows

$$\min_{\mathbf{x}} F(\mathbf{x}) = \sum_{i=1}^{N} p_i F_i(\mathbf{x}), \qquad (1)$$

where $F_i$ is the local loss function of user $i$ and the $p_i \ge 0$ are weight parameters that indicate the relative impact of the users, selected such that $\sum_{i=1}^{N} p_i = 1$.
This problem is solved iteratively. At round $t$, the server sends the global model $\mathbf{x}(t)$ to the users. Some of the users may drop out due to various reasons such as poor wireless connectivity. We assume that at most $D$ users may drop out in any round. We denote the set of surviving users at round $t$ by $\mathcal{U}(t)$ and the set of dropped users by $\mathcal{D}(t)$. User $i$ updates the global model by carrying out

$Q$ local stochastic gradient descent (SGD) steps to obtain its local model $\mathbf{x}_i(t)$. The goal of the server is to obtain the sum of the local models of the surviving users to update its global model as $\mathbf{x}(t+1) = \sum_{i \in \mathcal{U}(t)} \mathbf{x}_i(t)$.

The server then sends $\mathbf{x}(t+1)$ to the users for the next round. While the users do not share their data with the server and only share their local models, the local models still reveal significant information about their datasets [10, 19, 30, 11]. To address this challenge, a secure aggregation protocol known as SecAgg was developed in [3] to ensure that the server does not learn anything about the local models except the aggregate $\sum_{i \in \mathcal{U}(t)} \mathbf{x}_i(t)$ at round $t$. Specifically, we assume that up to $T$ users can collude with each other as well as with the server to reveal the local models of other users. The secure aggregation protocol then must ensure that nothing is revealed beyond the aggregate model despite such collusions.

3.1 Overview of SecAgg

We now provide an overview of SecAgg. In this discussion, we omit the round index $t$ for simplicity since the procedure is the same at each round. SecAgg ensures privacy against any subset of up to $T$ colluding users and resiliency against up to $D$ dropped users provided that $N - D > T$.
In SecAgg, the users mask their models before sharing them with the server using random keys. Specifically, each pair of users $i, j$ agrees on a pairwise random seed $s_{i,j}$. Moreover, user $i$ also uses a private random seed $b_i$, which is used when the update of this user is delayed but eventually reaches the server. The model of user $i$ is then masked as follows

$$\mathbf{y}_i = \mathbf{x}_i + \mathrm{PRG}(b_i) + \sum_{j: i<j} \mathrm{PRG}(s_{i,j}) - \sum_{j: i>j} \mathrm{PRG}(s_{j,i}), \qquad (2)$$

where $\mathrm{PRG}$ is a pseudo-random generator. The server then reconstructs the private random seed of each surviving user and the pairwise random seeds of each dropped user, and recovers the aggregate model of the surviving users as follows

$$\sum_{i \in \mathcal{U}} \mathbf{x}_i = \sum_{i \in \mathcal{U}} \mathbf{y}_i - \sum_{i \in \mathcal{U}} \mathrm{PRG}(b_i) + \sum_{j \in \mathcal{D}} \Big( \sum_{i \in \mathcal{U}: i > j} \mathrm{PRG}(s_{j,i}) - \sum_{i \in \mathcal{U}: i < j} \mathrm{PRG}(s_{i,j}) \Big). \qquad (3)$$
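As a concrete illustration, the pairwise-mask cancellation in (2) can be sketched in a few lines of Python. The dimension, field size, user count, and seed values below are toy stand-ins; there are no dropouts, so the per-user private seeds are omitted.

```python
import numpy as np

# Toy demonstration of pairwise-mask cancellation in SecAgg (no dropouts).
d, p = 4, 2**13 - 1  # small dimension and prime field, for illustration only
N = 3                # number of users

def prg(seed):
    # Expand a shared seed into a pseudorandom mask over Z_p.
    return np.random.default_rng(seed).integers(0, p, size=d)

rng = np.random.default_rng(0)
models = [rng.integers(0, 100, size=d) for _ in range(N)]
# Pairwise seeds agreed between each pair i < j (illustrative values).
seeds = {(i, j): 1000 * i + j for i in range(N) for j in range(i + 1, N)}

def mask(i):
    y = models[i].copy()
    for j in range(N):
        if i < j:
            y = (y + prg(seeds[(i, j)])) % p   # user i adds PRG(s_ij)
        elif j < i:
            y = (y - prg(seeds[(j, i)])) % p   # user i subtracts PRG(s_ji)
    return y

# Every mask is added by one user and subtracted by the other, so the
# aggregate of the masked models equals the aggregate of the plain models.
aggregate = sum(mask(i) for i in range(N)) % p
assert np.array_equal(aggregate, sum(models) % p)
```

Each individual `mask(i)` looks uniformly random to the server; only the sum reveals anything.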

3.2 Overview of LightSecAgg

Next, we provide an overview of LightSecAgg. LightSecAgg has three design parameters: $T$, which represents the privacy guarantee; $D$, which represents the dropout guarantee; and $U$, which represents the targeted number of surviving users. These parameters must be selected such that $N - D \ge U > T$. In LightSecAgg, user $i$ selects a random mask $\mathbf{z}_i$ and partitions it into $U - T$ sub-masks denoted by $[\mathbf{z}_i]_1, \dots, [\mathbf{z}_i]_{U-T}$. User $i$ also selects another $T$ random masks denoted by $[\mathbf{n}_i]_{U-T+1}, \dots, [\mathbf{n}_i]_{U}$. These partitions are then encoded through a $(U, U-T)$ Maximum Distance Separable (MDS) code [17] as follows

$$[\tilde{\mathbf{z}}_i]_j = \big([\mathbf{z}_i]_1, \dots, [\mathbf{z}_i]_{U-T}, [\mathbf{n}_i]_{U-T+1}, \dots, [\mathbf{n}_i]_{U}\big) \cdot \mathbf{W}_j, \qquad (4)$$

where $\mathbf{W}_j$ is the $j$-th column of a $U \times N$ Vandermonde matrix $\mathbf{W}$. After that, user $i$ sends $[\tilde{\mathbf{z}}_i]_j$ to user $j$. User $i$ then masks its model as $\mathbf{y}_i = \mathbf{x}_i + \mathbf{z}_i$.

The goal of the server now is to recover the aggregate model $\sum_{i \in \mathcal{U}_1} \mathbf{x}_i$, where $\mathcal{U}_1$ is the set of surviving users in this phase. To do so, each surviving user $j$ sends $\sum_{i \in \mathcal{U}_1} [\tilde{\mathbf{z}}_i]_j$ to the server. The server then directly recovers $\sum_{i \in \mathcal{U}_1} [\mathbf{z}_i]_k$ for $k \in \{1, \dots, U-T\}$ through MDS decoding when it receives at least $U$ messages from the surviving users. We denote this subset of the surviving users by $\mathcal{U}_2$, where $|\mathcal{U}_2| \ge U$. Finally, the server recovers the aggregate model as $\sum_{i \in \mathcal{U}_1} \mathbf{x}_i = \sum_{i \in \mathcal{U}_1} \mathbf{y}_i - \sum_{i \in \mathcal{U}_1} \mathbf{z}_i$.
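The one-shot recovery rests on the MDS property of the Vandermonde encoding in (4): any $U$ encoded shares determine the full length-$U$ message. The sketch below demonstrates this over a prime field; the number of users, the thresholds, the field size, and the evaluation points are all illustrative choices, not the paper's parameters.

```python
import numpy as np

# Sketch of LightSecAgg-style mask encoding with a Vandermonde-based MDS
# code over a prime field Z_p. All parameter values are illustrative.
p = 2**13 - 1          # prime field size
N, U, T = 6, 4, 2      # users, recovery threshold, privacy threshold

def modinv(a):
    # Modular inverse via Fermat's little theorem (p is prime).
    return pow(int(a) % p, p - 2, p)

rng = np.random.default_rng(1)
sub_masks = rng.integers(0, p, size=U - T)   # partitions of the mask
noise = rng.integers(0, p, size=T)           # T random masks for privacy
msg = np.concatenate([sub_masks, noise])     # length-U message

alphas = range(1, N + 1)                     # distinct evaluation points
V = np.array([[pow(a, r, p) for a in alphas] for r in range(U)], dtype=object)
shares = np.dot(msg, V) % p                  # one encoded share per user

def solve_mod(A, b):
    # Gaussian elimination over Z_p for a square system A x = b.
    n = len(b)
    M = [[int(A[i][j]) % p for j in range(n)] + [int(b[i]) % p]
         for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        inv = modinv(M[c][c])
        M[c] = [v * inv % p for v in M[c]]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c]
                M[r] = [(M[r][j] - f * M[c][j]) % p for j in range(n + 1)]
    return [M[r][n] for r in range(n)]

# MDS property: any U of the N shares determine the message exactly.
idx = [0, 2, 3, 5]
recovered = solve_mod(V[:, idx].T, shares[idx])
assert recovered[:U - T] == [int(x) for x in sub_masks]
```

The first $U - T$ recovered coordinates are the sub-masks; the remaining $T$ coordinates are the random padding that keeps any $T$ shares statistically independent of the mask.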

4 Buffered Asynchronous Secure Aggregation

In this section, we provide a brief overview of FedBuff [20]. Then, we illustrate the incompatibility of the conventional secure aggregation with asynchronous FL in Section 4.1. Later on, in Section 4.2, we introduce BASecAgg.

In asynchronous FL, the updates of the users are not synchronized, while the goal is the same as in synchronous FL: to minimize the global loss function in (1). In the buffered asynchronous setting, the server stores each local model that it receives in a buffer of size $K$ and updates the global model when the buffer is full. In our setting, this buffer is not a secure buffer. Hence, our goal is to design a secure aggregation protocol where the users send masked updates to protect their privacy in a way that the server can aggregate the local updates, while the server (and potential colluding users) learns no information about the local updates beyond the aggregate of the updates stored in the buffer.

FedBuff. Before presenting our protocol, BASecAgg, we first provide an overview of the buffered asynchronous aggregation framework named FedBuff [20] and describe the challenges that render SecAgg incompatible with this framework. The key intuition of FedBuff is to introduce a new design parameter $K$, the buffer size at the server, so that FedBuff has two degrees of freedom, $K$ and the concurrency, while synchronous FL frameworks have only one degree of freedom, the concurrency. The concurrency is the number of users training concurrently and is an important parameter that provides a trade-off between training time and data inefficiency. Synchronous FL speeds up training by increasing the concurrency, but higher concurrency results in data inefficiency [20]. In FedBuff, however, high concurrency coupled with a proper value of $K$ results in fast training. In other words, the additional degree of freedom allows the server to update the global model more frequently than the concurrency alone would allow, which enables FedBuff to achieve data efficiency at high concurrency.

At round $t$, users locally train the model by carrying out $Q$ local SGD steps. When the local update is done, user $i$ sends the difference between the downloaded global model and the updated local model to the server. The local update of user $i$ sent to the server at round $t$ is given by

$$\Delta_i(t) = \mathbf{x}(t_i) - \mathbf{x}_i^{(Q)}, \qquad (5)$$

where $t_i$ is the latest round index at which the global model was downloaded by user $i$ and $t$ is the round index at which the local update is sent to the server; hence the staleness of user $i$ is given by $\tau_i = t - t_i$. $\mathbf{x}_i^{(Q)}$ denotes the local model after $Q$ local SGD steps, and the local model at user $i$ is updated as

$$\mathbf{x}_i^{(q+1)} = \mathbf{x}_i^{(q)} - \eta_\ell \, g_i\big(\mathbf{x}_i^{(q)}; \zeta_i^{(q)}\big), \qquad (6)$$

for $q = 0, \dots, Q-1$, where $\mathbf{x}_i^{(0)} = \mathbf{x}(t_i)$ and $\eta_\ell$ denotes the learning rate of the local updates. $g_i(\mathbf{x}; \zeta_i)$ denotes the stochastic gradient with respect to the random sample $\zeta_i$ on user $i$, and we assume $\mathbb{E}_{\zeta_i}[g_i(\mathbf{x}; \zeta_i)] = \nabla F_i(\mathbf{x})$ for all $\mathbf{x}$, where $F_i$ is the local loss function of user $i$ defined in (1). The server stores the received local updates in a buffer of size $K$. When the buffer is full, the server updates the global model by subtracting the aggregate of all local updates from the current global model. Specifically, the global model at the server is updated as

$$\mathbf{x}(t+1) = \mathbf{x}(t) - \eta_g \cdot \frac{1}{K} \sum_{i \in \mathcal{S}(t)} s(\tau_i) \, \Delta_i(t), \qquad (7)$$

where $\mathcal{S}(t)$ is the index set of the users whose local models are in the buffer at round $t$ and $\eta_g$ is the learning rate of the global updates. $s(\tau)$ is a function that compensates for the staleness, satisfying $s(0) = 1$ and monotonically decreasing as $\tau$ increases. There are many functions that satisfy these two properties, and we consider a polynomial function $s_a(\tau) = (1 + \tau)^{-a}$ as it shows similar or better performance than the other functions, e.g., the Hinge or Constant staleness functions [27].
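The buffered server-side step in (7) can be sketched as follows; the staleness exponent, learning rate, and toy updates are illustrative assumptions, not values from the paper.

```python
import numpy as np

def staleness_weight(tau, a=0.5):
    # Polynomial compensation s_a(tau) = (1 + tau)^(-a):
    # s_a(0) = 1 and the weight decreases monotonically with staleness.
    return (1.0 + tau) ** (-a)

def server_step(x_global, buffer, eta_g=1.0):
    # buffer holds K pairs (delta_i, tau_i); apply the staleness-weighted
    # average of the buffered updates, then take a global step.
    K = len(buffer)
    agg = sum(staleness_weight(tau) * delta for delta, tau in buffer) / K
    return x_global - eta_g * agg

x = np.zeros(3)
buffer = [(np.ones(3), 0), (2 * np.ones(3), 3)]   # one fresh, one stale update
x_new = server_step(x, buffer)
# Weighted average: (1.0 * 1 + (1+3)**-0.5 * 2) / 2 = 1 in every coordinate.
assert np.allclose(x_new, -np.ones(3))
```

Note how the stale update (weight $4^{-0.5} = 0.5$) contributes half as much as the fresh one, which is exactly the damping role of $s_a(\tau)$.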

Privacy and Dropout Model. We assume at most $D$ users may drop out in any round, and a threat model where the users and the server are honest but curious: they follow the protocol but try to infer the local updates of the other users. The secure aggregation protocol guarantees that nothing beyond the aggregate of the local updates is revealed, even if up to $T$ users collude with the server. We consider information-theoretic privacy: for every subset $\mathcal{T}$ of users of size at most $T$, we must have the mutual information $I\big(\{\Delta_i(t)\}_{i \in \mathcal{S}(t)}; \mathbf{Y}(t) \,\big|\, \sum_{i \in \mathcal{S}(t)} \Delta_i(t), \mathbf{Z}_{\mathcal{T}}(t)\big) = 0$, where $\mathbf{Y}(t)$ and $\mathbf{Z}_{\mathcal{T}}(t)$ are the collections of information at the server and at the users in $\mathcal{T}$ at round $t$, respectively.

4.1 Incompatibility of SecAgg with Buffered Asynchronous FL

As described in Section 3.1, SecAgg [4] is designed for synchronous FL. At round $t$, each pair of users $i, j$ agrees on a pairwise random seed $s_{i,j}(t)$ and generates a random vector by running the PRG on $s_{i,j}(t)$ to mask the local update. This additive structure has the unique property that the pairwise random vectors cancel out when the server aggregates the masked models, because user $i$ adds $\mathrm{PRG}(s_{i,j}(t))$ to its model and user $j$ subtracts $\mathrm{PRG}(s_{i,j}(t))$ from its model.

In buffered asynchronous FL, however, the cancellation of the pairwise random masks based on the key agreement protocol is not guaranteed due to the mismatch in staleness between users. Specifically, at round $t$, user $i$ sends the masked model to the server that is given by

$$\mathbf{y}_i(t) = \Delta_i(t) + \mathrm{PRG}(b_i(t_i)) + \sum_{j: i<j} \mathrm{PRG}(s_{i,j}(t_i)) - \sum_{j: i>j} \mathrm{PRG}(s_{j,i}(t_i)), \qquad (8)$$

where $\Delta_i(t)$ is the local update defined in (5) and $t_i$ is the round at which user $i$ downloaded the global model. When $t_i \neq t_j$, the pairwise random vectors in $\mathbf{y}_i(t)$ and $\mathbf{y}_j(t)$ do not cancel out since $\mathrm{PRG}(s_{i,j}(t_i)) \neq \mathrm{PRG}(s_{i,j}(t_j))$. We note that the staleness of each user is not known a priori, hence each pair of users cannot agree on the same pairwise random seed in advance.
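This failure mode is easy to reproduce numerically. In the toy sketch below (arbitrary seeds, dimension, and field size), masks derived from the same per-round seed cancel, while a stale user's mask leaves a residual in the aggregate.

```python
import numpy as np

# Toy illustration of why per-round pairwise masks fail under asynchrony.
d, p = 4, 2**13 - 1   # illustrative dimension and prime field

def prg(seed):
    return np.random.default_rng(seed).integers(0, p, size=d)

x_i, x_j = np.full(d, 5), np.full(d, 7)
seed_r1, seed_r2 = 11, 22   # pairwise seeds for rounds 1 and 2 (arbitrary)

# Same round: the shared mask cancels in the aggregate.
y_i = (x_i + prg(seed_r1)) % p
y_j = (x_j - prg(seed_r1)) % p
assert np.array_equal((y_i + y_j) % p, (x_i + x_j) % p)

# User j is stale (still masks with the round-1 seed) while user i has
# moved on to the round-2 seed: the masks no longer cancel.
y_i_stale = (x_i + prg(seed_r2)) % p
leaky = (y_i_stale + y_j) % p
assert not np.array_equal(leaky, (x_i + x_j) % p)  # residual masks remain
```

The residual `prg(seed_r2) - prg(seed_r1)` corrupts the aggregate, which is exactly the mismatch BASecAgg is designed to avoid.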

4.2 The Proposed BASecAgg Protocol

To address the challenge of asynchrony in buffered asynchronous secure aggregation, we propose BASecAgg by adapting the idea of one-shot recovery leveraged in LightSecAgg [28] to our setting. We provide a brief overview of LightSecAgg in Section 3.2. Our key intuition is to encode the local masks in a way that allows the server to recover the aggregate of the masks from the encoded masks via a one-shot computation, even though the masks are generated in different training rounds.

BASecAgg has three phases. First, each user generates a random mask to protect the privacy of its local update, and further creates encoded masks via a $T$-private Maximum Distance Separable (MDS) code that provides privacy against up to $T$ colluding users. Each user sends one of the encoded masks to each of the other users for the purpose of one-shot recovery. Second, each user trains a local model and converts it from the domain of real numbers to the finite field, since generating random masks and MDS encoding must be carried out in the finite field to provide information-theoretic privacy. The quantized model is then masked by the random mask generated in the first phase and sent to the server. The server stores the masked update in the buffer. Third, when the buffer is full, the server aggregates the masked updates in the buffer. To remove the randomness in the aggregate of the masked updates, the server reconstructs the aggregated masks of the users in the buffer. To do so, each surviving user sends the aggregate of its encoded masks to the server. After receiving a sufficient number of aggregated encoded masks, the server reconstructs the aggregate of the masks and hence the aggregate of the local updates. We now describe these three phases in detail.

4.2.1 Offline Encoding and Sharing of Local Masks

User $i$ generates a mask $\mathbf{z}_i(t_i)$ uniformly at random from the finite field $\mathbb{F}_p^d$, where $t_i$ is the global round index at which user $i$ downloads the global model from the server. The mask is partitioned into $U - T$ sub-masks denoted by $[\mathbf{z}_i(t_i)]_1, \dots, [\mathbf{z}_i(t_i)]_{U-T}$, where $U$ denotes the targeted number of surviving users and $N - D \ge U > T$. User $i$ also selects another $T$ random masks denoted by $[\mathbf{n}_i(t_i)]_{U-T+1}, \dots, [\mathbf{n}_i(t_i)]_{U}$. These partitions are then encoded through a $(U, U-T)$ Maximum Distance Separable (MDS) code as follows

$$[\tilde{\mathbf{z}}_i(t_i)]_j = \big([\mathbf{z}_i(t_i)]_1, \dots, [\mathbf{z}_i(t_i)]_{U-T}, [\mathbf{n}_i(t_i)]_{U-T+1}, \dots, [\mathbf{n}_i(t_i)]_{U}\big) \cdot \mathbf{W}_j, \qquad (9)$$

where $\mathbf{W}_j$ is the $j$-th column of a $U \times N$ Vandermonde matrix $\mathbf{W}$. After that, user $i$ sends $[\tilde{\mathbf{z}}_i(t_i)]_j$ to user $j$. At the end of this phase, each user $j$ has $[\tilde{\mathbf{z}}_i(t_i)]_j$ from every user $i \in [N]$.

4.2.2 Training, Quantizing, Masking, and Uploading of Local Updates

Each user trains its local model as in (5) and (6). User $i$ quantizes its local update from the domain of real numbers to the finite field $\mathbb{F}_p$, since masking and MDS encoding are carried out in the finite field to provide information-theoretic privacy. The field size $p$ is assumed to be large enough to avoid any wrap-around during secure aggregation.

Quantization is a challenging task as it must be performed in a way that ensures the convergence of the global model. Moreover, the quantization should allow the representation of negative integers in the finite field and enable computations to be carried out in the quantized domain. Therefore, we cannot utilize well-known gradient quantization techniques such as [1], which represents the sign of a negative number separately from its magnitude. BASecAgg addresses this challenge with a simple stochastic quantization strategy combined with the two's complement representation, as described next. For any positive integer $c$, we first define a stochastic rounding function as

$$Q_c(x) = \begin{cases} \dfrac{\lfloor cx \rfloor}{c} & \text{with probability } 1 - (cx - \lfloor cx \rfloor), \\[4pt] \dfrac{\lfloor cx \rfloor + 1}{c} & \text{with probability } cx - \lfloor cx \rfloor, \end{cases} \qquad (10)$$

where $\lfloor x \rfloor$ is the largest integer less than or equal to $x$; this rounding function is unbiased, i.e., $\mathbb{E}[Q_c(x)] = x$. The parameter $c$ is a design parameter that determines the number of quantization levels. The variance of $Q_c(x)$ decreases as the value of $c$ increases, which will be described in Lemma 1 in Appendix A in detail. We then define the quantized update

$$\overline{\Delta}_i(t) = \phi\big(c_\ell \cdot Q_{c_\ell}(\Delta_i(t))\big), \qquad (11)$$

where the function $Q_{c_\ell}$ from (10) is carried out element-wise and $c_\ell$ is a positive integer parameter that determines the quantization level of the local updates. The mapping function $\phi: \mathbb{Z} \to \mathbb{F}_p$ is defined to represent a negative integer in the finite field by using the two's complement representation,

$$\phi(x) = \begin{cases} x & \text{if } x \ge 0, \\ p + x & \text{if } x < 0. \end{cases} \qquad (12)$$

To protect the privacy of the local updates, user $i$ masks the quantized update in (11) as

$$\mathbf{y}_i(t_i) = \overline{\Delta}_i(t) + \mathbf{z}_i(t_i), \qquad (13)$$

and sends the pair $(\mathbf{y}_i(t_i), t_i)$ to the server. The local round index $t_i$ will be used in two cases: (1) when the server identifies the staleness of each local update and compensates for it, and (2) when the users aggregate the encoded masks for one-shot recovery, which will be explained in Section 4.2.3.
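The stochastic rounding in (10) and the two's-complement field embedding in (12) can be sketched as follows; the field size and quantization level are illustrative choices, not the paper's values.

```python
import numpy as np

p = 2**13 - 1        # prime field size (illustrative)
c = 4                # quantization parameter (illustrative)

def stochastic_round(x, c, rng):
    # Unbiased integer rounding of c*x: round up with probability equal
    # to the fractional part of c*x, so E[result] = c*x.
    scaled = c * np.asarray(x, dtype=float)
    low = np.floor(scaled)
    return (low + (rng.random(scaled.shape) < scaled - low)).astype(np.int64)

def to_field(z):
    # Two's-complement-style embedding into Z_p: non-negative integers
    # map to themselves, a negative z maps to p + z.
    return np.where(z < 0, p + z, z) % p

def from_field(y):
    # Inverse map: field elements above (p-1)/2 decode as negatives.
    y = np.asarray(y, dtype=np.int64)
    return np.where(y > (p - 1) // 2, y - p, y)

rng = np.random.default_rng(0)
delta = np.array([-1.25, 0.5, 2.0])       # a toy local update
z = stochastic_round(delta, c, rng)       # here c*delta is integral: [-5, 2, 8]
assert np.array_equal(z, np.array([-5, 2, 8]))
assert np.array_equal(from_field(to_field(z)), z)   # lossless round trip
```

Because $\phi$ is additive modulo $p$, masking and aggregation can be done entirely in the field and only the final aggregate needs to be demapped.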

4.2.3 One-shot Aggregate-update Recovery and Global Model Update

The server stores the masked update $\mathbf{y}_i(t_i)$ in the buffer, and when the buffer of size $K$ is full the server aggregates the $K$ masked local updates. In this phase, the server intends to recover

$$\sum_{i \in \mathcal{S}(t)} s(\tau_i) \, \Delta_i(t), \qquad (14)$$

where $\Delta_i(t)$ is the local update in the real domain defined in (5), $\mathcal{S}(t)$ (with $|\mathcal{S}(t)| = K$) is the index set of users whose local updates are stored in the buffer and aggregated by the server at round $t$, and $s(\cdot)$ is the staleness function defined in (7). To do so, the first step is to reconstruct $\sum_{i \in \mathcal{S}(t)} s(\tau_i) \mathbf{z}_i(t_i)$. This is challenging as the decoding must be performed in the finite field, but the value of $s(\tau_i)$ is a real number. To address this problem, we introduce a quantized staleness function $\overline{s}: \{0, 1, \dots\} \to \mathbb{F}_p$,

$$\overline{s}(\tau) = c_s \, Q_{c_s}\big(s(\tau)\big), \qquad (15)$$

where $Q_{c_s}$ is the stochastic rounding function defined in (10) and $c_s$ is a positive integer that determines the quantization level of the staleness function. Then, the server broadcasts the information $\{(i, t_i)\}_{i \in \mathcal{S}(t)}$ to all surviving users. After identifying the selected users in $\mathcal{S}(t)$, the local round indices, and the corresponding staleness, user $j$ aggregates its encoded sub-masks into $\sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) [\tilde{\mathbf{z}}_i(t_i)]_j$ and sends the result to the server for the purpose of one-shot recovery. The key difference between BASecAgg and LightSecAgg is that in BASecAgg the time stamp $t_i$ of the encoded masks can differ across users, hence user $j$ must aggregate each encoded mask with the proper round index. Due to the commutative property of coding and linear operations, each $\sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) [\tilde{\mathbf{z}}_i(t_i)]_j$ is an encoded version of $\sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) [\mathbf{z}_i(t_i)]_k$ for $k \in \{1, \dots, U-T\}$ using the MDS (Vandermonde) matrix defined in (9). Thus, after receiving any $U$ results from the surviving users in $\mathcal{U}_2$, where $|\mathcal{U}_2| \ge U$, the server reconstructs $\sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) [\mathbf{z}_i(t_i)]_k$ for $k \in \{1, \dots, U-T\}$ via MDS decoding. By concatenating the aggregated sub-masks, the server can recover $\sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) \mathbf{z}_i(t_i)$. Finally, the server obtains the desired global update as follows

$$\sum_{i \in \mathcal{S}(t)} s(\tau_i) \Delta_i(t) \approx \frac{1}{c_\ell c_s} \, \phi^{-1}\Big( \sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) \mathbf{y}_i(t_i) - \sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) \mathbf{z}_i(t_i) \Big), \qquad (16)$$

where $\overline{\Delta}_i(t)$ is defined in (11) and $\phi^{-1}: \mathbb{F}_p \to \mathbb{Z}$ is the demapping function defined as follows

$$\phi^{-1}(\overline{x}) = \begin{cases} \overline{x} & \text{if } 0 \le \overline{x} < \frac{p-1}{2}, \\ \overline{x} - p & \text{if } \frac{p-1}{2} \le \overline{x} < p. \end{cases} \qquad (17)$$

Finally, the server updates the global model as $\mathbf{x}(t+1) = \mathbf{x}(t) - \frac{\eta_g}{K} \cdot \frac{1}{c_\ell c_s} \phi^{-1}\big(\sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) \mathbf{y}_i(t_i) - \sum_{i \in \mathcal{S}(t)} \overline{s}(\tau_i) \mathbf{z}_i(t_i)\big)$, which is equivalent to

$$\mathbf{x}(t+1) = \mathbf{x}(t) - \frac{\eta_g}{K} \sum_{i \in \mathcal{S}(t)} Q_{c_s}\big(s(\tau_i)\big) \, Q_{c_\ell}\big(\Delta_i(t)\big), \qquad (18)$$

where $Q_{c_\ell}$ and $Q_{c_s}$ are the stochastic rounding functions defined in (10) with respect to the quantization parameters $c_\ell$ and $c_s$, respectively.

5 Convergence Analysis

In this section, we provide the convergence guarantee of BASecAgg in the $L$-smooth and non-convex setting. For simplicity, we consider the constant staleness function $s(\tau_i) = 1$ for all $i$ in (18). Then, the global update equation of BASecAgg is given by

$$\mathbf{x}(t+1) = \mathbf{x}(t) - \frac{\eta_g}{K} \sum_{i \in \mathcal{S}(t)} Q_{c_\ell}\big(\Delta_i(t)\big), \qquad (19)$$

where $Q_{c_\ell}$ is the stochastic rounding function defined in (10), $c_\ell$ is the positive integer that determines the quantization level, and $\Delta_i(t)$ is the local update of user $i$ defined in (5). We first introduce our assumptions, which are commonly made in analyzing FL algorithms [16, 20, 22, 23].

Assumption 1.

(Unbiasedness of local SGD). $\mathbb{E}_{\zeta_i}[g_i(\mathbf{x}; \zeta_i)] = \nabla F_i(\mathbf{x})$ for all $i \in [N]$ and $\mathbf{x} \in \mathbb{R}^d$, where $g_i$ is the stochastic gradient estimator of user $i$ defined in (6).

Assumption 2.

(Lipschitz gradient). $F_1, \dots, F_N$ in (1) are all $L$-smooth: for all $i \in [N]$ and $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, $\|\nabla F_i(\mathbf{x}) - \nabla F_i(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$.

Assumption 3.

(Bounded variance of local and global gradients). The variance of the stochastic gradients at each user is bounded, i.e., $\mathbb{E}_{\zeta_i}\|g_i(\mathbf{x}; \zeta_i) - \nabla F_i(\mathbf{x})\|^2 \le \sigma_\ell^2$ for all $i \in [N]$ and $\mathbf{x} \in \mathbb{R}^d$. For the global loss function $F$ defined in (1), $\frac{1}{N} \sum_{i=1}^{N} \|\nabla F_i(\mathbf{x}) - \nabla F(\mathbf{x})\|^2 \le \sigma_g^2$ holds.

Assumption 4.

(Bounded gradient). For all $i \in [N]$ and $\mathbf{x} \in \mathbb{R}^d$, $\|\nabla F_i(\mathbf{x})\|^2 \le G$.

In addition, we make an assumption on the staleness of the local updates under asynchrony [20].

Assumption 5.

(Bounded staleness). For each global round index $t$ and all users $i$, the staleness $\tau_i = t - t_i$ is not larger than a certain threshold $\tau_{\max}$, where $t_i$ is the latest round index at which the global model was downloaded by user $i$.

Now, we state our main result for the convergence guarantee of BASecAgg.

Theorem 1.

Selecting constant local and global learning rates $\eta_\ell$ and $\eta_g$ appropriately, the global model iterates in (19) achieve the following ergodic convergence rate

(20)

where the constants depend on $L$, $\sigma_\ell^2$, $\sigma_g^2$, $G$, $\tau_{\max}$, and the quantization parameter $c_\ell$ through the quantized variance bound $\sigma_q^2$ (see Appendix A).

The proof of Theorem 1 is provided in Appendix A.

Remark 1.

Theorem 1 shows that the convergence rates of BASecAgg and FedBuff (see Corollary 1 in [20]) are the same except for the increased variance of the local updates due to the quantization noise in BASecAgg. The increase in variance is negligible for a large quantization parameter $c_\ell$, as demonstrated in our experiments in Section 6.

6 Experiments

In this section, we demonstrate the convergence performance of BASecAgg compared to the buffered asynchronous FL scheme from [20], termed FedBuff. We measure the performance in terms of the model accuracy evaluated over the test samples with respect to the global round index $t$.

Datasets and network architectures. We consider an image classification task on the MNIST dataset [15] and the CIFAR-10 dataset [14]. For the MNIST dataset, we train LeNet [15]. For the CIFAR-10 dataset, we train the convolutional neural network (CNN) used in [27]. These network architectures are sufficient for our needs as our goal is to evaluate various schemes, not to achieve the best accuracy. More details about the hyperparameters are provided in Appendix B.

Setup. We consider a buffered asynchronous FL setting with $N$ users and a single server having a buffer of size $K$. For the IID data distribution, the training samples are shuffled and partitioned among the $N$ users. For asynchronous training, we assume the staleness of each user is uniformly distributed, i.e., $\tau_i \sim \mathrm{Unif}[0, \tau_{\max}]$, as used in [27]. We set the field size $p$ to the largest prime within the available bit width.

Implementations. We implement two schemes, FedBuff and BASecAgg. The key difference between the two schemes is that in BASecAgg the local updates are quantized and converted into the finite field to provide privacy for the individual local updates, while in FedBuff all operations are carried out over the domain of real numbers. For both schemes, to compensate for the staleness of the local updates, we employ two strategies for the weighting function: a constant function $s(\tau) = 1$ and a polynomial function $s_a(\tau) = (1 + \tau)^{-a}$.

Empirical results. In Figures 1(a) and 1(b), we demonstrate that BASecAgg has almost the same performance as FedBuff on both the MNIST and CIFAR-10 datasets, even though BASecAgg introduces quantization noise to protect the privacy of the individual local updates of the users. This is because the quantization noise in BASecAgg is negligible, as explained in Remark 1. To compensate for the staleness of the local updates over the finite field in BASecAgg, we implement the quantized staleness function defined in (15) with quantization parameter $c_s$, which mitigates the staleness as effectively as the original staleness function carried out over the domain of real numbers.

(a) MNIST dataset.
(b) CIFAR-10 dataset.
Figure 1: Accuracy of BASecAgg and FedBuff with two strategies for the weighting function to mitigate the staleness: a constant function (no compensation) named Constant, and a polynomial function named Poly, $s_a(\tau) = (1 + \tau)^{-a}$.
(a) MNIST dataset.
(b) CIFAR-10 dataset.
Figure 2: Accuracy of BASecAgg and FedBuff with various values of the quantization parameter $c_\ell$.

Performance with various quantization levels. To investigate the impact of the quantization, we measure the performance with various values of the quantization parameter $c_\ell$ on the MNIST and CIFAR-10 datasets in Fig. 2. We observe that an intermediate value of $c_\ell$ achieves the best performance, while very small or very large values of $c_\ell$ perform poorly. This is because the value of $c_\ell$ provides a trade-off between two sources of quantization noise: 1) the rounding error from the stochastic rounding function defined in (10), and 2) the wrap-around error when modulo operations are carried out in the finite field. When $c_\ell$ is small, the rounding error is dominant, while the wrap-around error is dominant when $c_\ell$ is large. To find a proper value of $c_\ell$, we can utilize the auto-tuning algorithm proposed in [5].
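A toy numerical example of this trade-off: with a larger quantization scale, the aggregated integers can exceed the field size, and the modulo reduction silently corrupts the sum. The tiny field and toy values below are chosen only to make the overflow visible.

```python
# Wrap-around error when the quantization scale is too large for the field.
p = 127                     # tiny prime field (illustrative)
small = [30, 40, 50]        # quantized updates at a modest scale
large = [60, 80, 100]       # the same updates at double the scale

agg_small = sum(small) % p  # 120: fits in the field, aggregation is exact
agg_large = sum(large) % p  # 240 mod 127 = 113: wrap-around corrupts the sum

assert agg_small == sum(small)
assert agg_large != sum(large)
```

A smaller scale avoids the overflow but makes the rounding noise relatively larger, which is exactly the trade-off controlled by $c_\ell$.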

7 Conclusions

In this paper, we have proposed a buffered asynchronous secure aggregation protocol (BASecAgg) that is not based on TEEs. Independence from TEEs allows BASecAgg to use any buffer size, unlike FedBuff. The crux of BASecAgg is that it designs the masks of the users such that they cancel out in the buffer even if they belong to different training rounds. Our convergence analysis and experiments show that BASecAgg has almost the same convergence guarantees as FedBuff.

References

Appendix A Theoretical Guarantees of BASecAgg: Proof of Theorem 1

The proof of Theorem 1 directly follows from the following useful lemma, which shows that unbiasedness and bounded variance still hold for the quantized gradient estimator for any quantization parameter $c$.

Lemma 1.

For the quantized gradient estimator $Q_c(g_i(\mathbf{x}; \zeta_i))$ with a given vector $\mathbf{x}$, where $\zeta_i$ is a uniform random variable representing the sample drawn, $g_i$ is a gradient estimator such that $\mathbb{E}_{\zeta_i}[g_i(\mathbf{x}; \zeta_i)] = \nabla F_i(\mathbf{x})$ and $\mathbb{E}_{\zeta_i}\|g_i(\mathbf{x}; \zeta_i) - \nabla F_i(\mathbf{x})\|^2 \le \sigma_\ell^2$, and the stochastic rounding function $Q_c$ is given in (10), the following holds:

$$\mathbb{E}\big[Q_c\big(g_i(\mathbf{x}; \zeta_i)\big)\big] = \nabla F_i(\mathbf{x}), \qquad (21)$$
$$\mathbb{E}\big\|Q_c\big(g_i(\mathbf{x}; \zeta_i)\big) - \nabla F_i(\mathbf{x})\big\|^2 \le \sigma_q^2, \qquad (22)$$

where $\sigma_q^2 = \sigma_\ell^2 + \frac{d}{4c^2}$.

Proof.

(Unbiasedness). Given $Q_c$ in (10) and any random variable $X$, it follows that

$$\mathbb{E}\big[Q_c(X) \,\big|\, X\big] = X, \qquad (23)$$

from which we obtain the unbiasedness condition in (21),

$$\mathbb{E}\big[Q_c\big(g_i(\mathbf{x}; \zeta_i)\big)\big] = \mathbb{E}_{\zeta_i}\Big[\mathbb{E}\big[Q_c\big(g_i(\mathbf{x}; \zeta_i)\big) \,\big|\, \zeta_i\big]\Big] = \mathbb{E}_{\zeta_i}\big[g_i(\mathbf{x}; \zeta_i)\big] = \nabla F_i(\mathbf{x}). \qquad (24)$$

(Bounded variance). Next, we observe that each coordinate of $Q_c(X) - X$ is a zero-mean random variable supported on an interval of length $1/c$, so

$$\mathbb{E}\big\|Q_c(X) - X\big\|^2 \le \frac{d}{4c^2}, \qquad (25)$$

from which we obtain the bounded variance condition in (22) as follows,

$$\mathbb{E}\big\|Q_c\big(g_i(\mathbf{x}; \zeta_i)\big) - \nabla F_i(\mathbf{x})\big\|^2 = \mathbb{E}\big\|Q_c\big(g_i(\mathbf{x}; \zeta_i)\big) - g_i(\mathbf{x}; \zeta_i)\big\|^2 + \mathbb{E}\big\|g_i(\mathbf{x}; \zeta_i) - \nabla F_i(\mathbf{x})\big\|^2 \qquad (26)$$
$$\le \frac{d}{4c^2} + \sigma_\ell^2 = \sigma_q^2, \qquad (27)$$

where (26) follows from the fact that the rounding noise has zero mean conditioned on $g_i(\mathbf{x}; \zeta_i)$, so the cross term vanishes, and (27) follows from (25). ∎
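As a numerical sanity check on Lemma 1, the unbiasedness and the per-coordinate variance bound $1/(4c^2)$ of the stochastic rounding in (10) can be verified empirically; the sample count and the test point below are arbitrary.

```python
import numpy as np

# Monte Carlo check of the stochastic rounding function Q_c in (10):
# unbiased, with per-coordinate variance at most 1/(4 c^2).
rng = np.random.default_rng(0)
c, x, n = 8, 0.37, 200_000

low = np.floor(c * x)
# n independent draws of Q_c(x): round c*x up with prob. frac(c*x).
samples = (low + (rng.random(n) < c * x - low)) / c

assert abs(samples.mean() - x) < 1e-3          # E[Q_c(x)] is close to x
assert samples.var() <= 1 / (4 * c**2) + 1e-6  # Var[Q_c(x)] <= 1/(4 c^2)
```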

Now, the update equation of BASecAgg is equivalent to the update equation of FedBuff except that BASecAgg has an additional source of randomness, the stochastic quantization $Q_{c_\ell}$, which also satisfies unbiasedness and bounded variance. One can show the convergence rate of BASecAgg presented in Theorem 1 by exchanging the gradient estimator $g_i$ and the variance bound $\sigma_\ell^2$ in [20] with the quantized estimator $Q_{c_\ell}(g_i)$ and the variance bound $\sigma_q^2$, respectively.

Appendix B Experiment Details

In this appendix, we provide more details about the experiments of Section 6.

Hyperparameters. For all experiments, we tune the hyperparameters based on the validation accuracy for each dataset by partitioning a fraction of the training samples into a validation dataset. We use mini-batch SGD for all tasks. We select the best parameters for the global learning rate $\eta_g$, the local learning rate $\eta_\ell$, the regularization parameter, and the staleness exponent $a$ by sweeping over a grid of candidate values. We have found that the best values of $\eta_g$, $\eta_\ell$, and the regularization parameter are the same for both the MNIST and CIFAR-10 datasets, while the best value of $a$ differs between the two datasets.