Byzantine-Resilient Secure Federated Learning

07/21/2020 ∙ by Jinhyun So, et al. ∙ University of Southern California and University of California, Riverside

Secure federated learning is a privacy-preserving framework to improve machine learning models by training over large volumes of data collected by mobile users. This is achieved through an iterative process where, at each iteration, users update a global model using their local datasets. Each user then masks its local model via random keys, and the masked models are aggregated at a central server to compute the global model for the next iteration. As the local models are protected by random masks, the server cannot observe their true values. This presents a major challenge for the resilience of the model against adversarial (Byzantine) users, who can manipulate the global model by modifying their local models or datasets. Towards addressing this challenge, this paper presents the first single-server Byzantine-resilient secure aggregation framework (BREA) for secure federated learning. BREA is based on an integrated stochastic quantization, verifiable outlier detection, and secure model aggregation approach to guarantee Byzantine-resilience, privacy, and convergence simultaneously. We provide theoretical convergence and privacy guarantees and characterize the fundamental trade-offs in terms of the network size, user dropouts, and privacy protection. Our experiments demonstrate convergence in the presence of Byzantine users, and comparable accuracy to conventional federated learning benchmarks.


I Introduction

Federated learning is a distributed training framework that has received significant interest in recent years, as it allows machine learning models to be trained over the vast amounts of data collected by mobile devices [1, 2]. In this framework, training is coordinated by a central server who maintains a global model, which is updated by the mobile users through an iterative process. At each iteration, the server sends the current version of the global model to the mobile devices, who update it using their local data and create a local model. The server then aggregates the local updates of the users and updates the global model for the next iteration [1, 2, 3, 4, 5, 6, 7, 8].

Security and privacy considerations of distributed learning are mainly focused on two seemingly separate directions: 1) ensuring the robustness of the global model against adversarial manipulations and 2) protecting the privacy of individual users. The first direction aims at ensuring that the trained model is robust against Byzantine faults that may occur in the training data or during protocol execution. These faults may result either from an adversarial user who can manipulate the training data or the information exchanged during the protocol, or from device malfunctioning. Notably, it has been shown that even a single Byzantine fault can significantly alter the trained model [9]. The primary approach for defending against Byzantine faults is to compare the local updates received from different users and remove the outliers at the server [9, 10, 11, 12]. Doing so, however, requires the server to learn the true values of the local updates of each individual user. The second direction aims at protecting the privacy of the individual users, by keeping each local update private from the server and the other users participating in the protocol [2, 3, 4, 5, 6, 7]. This is achieved through what is known as a secure aggregation protocol [2]. In this protocol, each user masks its local update through additive secret sharing using private and pairwise random keys before sending it to the server. Once the masked models are aggregated at the server, the additional randomness cancels out and the server learns the aggregate of all user models. At the end of the protocol, the server learns no information about the individual models beyond the aggregated model, as they are masked by the random keys unknown to the server. In contrast, conventional distributed training frameworks that perform gradient aggregation and model updates using the true values of the gradients may reveal extensive information about the local datasets of the users, as shown in [13, 14, 15].

This presents a major challenge in developing a Byzantine-resilient and, at the same time, privacy-preserving federated learning framework. On the one hand, robustness against Byzantine faults requires the server to obtain the individual model updates in the clear, to be able to compare the updates from different users with each other and remove the outliers. On the other hand, protecting user privacy requires each individual model to be masked with random keys; as a result, the server only observes the masked model, which appears as a uniformly random vector that could correspond to any point in the parameter space. Our goal is to reconcile these two critical directions. In particular, we want to address the following question:

“How can one make federated learning protocols robust against Byzantine adversaries while preserving the privacy of individual users?”

In this paper, we propose the first single-server Byzantine-resilient secure aggregation framework, BREA, towards addressing this problem. Our framework is built on the following main principles. Given a network of mobile users with up to adversaries, each user initially secret shares its local model with the other users, through a verifiable secret sharing protocol [16]. However, doing so requires the local models to be masked by uniformly random vectors in a finite field [17], whereas the model updates during training are performed in the domain of real numbers. In order to handle this problem, BREA utilizes stochastic quantization to transfer the local models from the real domain into a finite field.

Verifiable secret sharing allows the users to perform consistency checks to validate the secret shares and ensure that every user follows the protocol. However, a malicious user can still manipulate the global model by modifying its local model or private dataset. BREA handles such attacks through a robust gradient descent approach, enabled by secure computations over the secret shares of the local models. To do so, each user locally computes the pairwise distances between the secret shares of the local models belonging to other users, and sends the computation results to the server. Since these computations are carried out using the secret shares, users do not learn the true values of the local models belonging to other users.

In the final phase, the server collects the computation results from a sufficient number of users, recovers the pairwise distances between the local models, and performs user selection for model aggregation. The user selection protocol is based on a distance-based outlier removal mechanism [9], to remove the effect of potential adversaries and to ensure that the selected models are sufficiently close to an unbiased gradient estimator. After the user selection phase, the secret shares of the models belonging to the selected users are aggregated locally by the mobile users. The server then gathers the secure computation results from the users, reconstructs the true value of the aggregate of the selected user models, and updates the global model. Our framework guarantees the privacy of individual user models; in particular, the server learns no information about the local models beyond their aggregated value and the pairwise distances.

In our theoretical analysis, we demonstrate provable convergence guarantees for the model and robustness guarantees against Byzantine adversaries. We then identify the theoretical performance limits in terms of the fundamental trade-offs between the network size, user dropouts, number of adversaries, and privacy protection. Our results demonstrate that, in a network with mobile users, BREA can theoretically guarantee: i) robustness of the trained model against up to Byzantine adversaries, ii) tolerance against up to user dropouts, iii) privacy of each local model, against the server and up to colluding users, as long as , where is the number of selected models for aggregation.

We then numerically evaluate the performance of BREA and compare it to the conventional federated learning protocol, the federated averaging scheme of [1]. To do so, we implement BREA in a distributed network of users with up to Byzantine users who can send arbitrary vectors to the server or to the honest users. We demonstrate that BREA guarantees convergence in the presence of Byzantine users, with a convergence rate comparable to that of federated averaging. BREA also achieves test accuracy comparable to the federated averaging scheme, even though it incurs a quantization loss in order to preserve the privacy of individual users.

II Related Work

In the non-Byzantine federated learning setting, secure aggregation is performed through a procedure known as additive masking [2], [18]. In this setup, users first agree on pairwise secret keys using a Diffie-Hellman type key exchange protocol [19]. After this step, users send a masked version of their local model to the server, where the masking is done using pairwise and private secret keys. Additive masking has the property that, when the masked models are aggregated at the server, the additive masks cancel out, allowing the server to learn the aggregate of the local models. On the other hand, no information is revealed to the server about the local models beyond their aggregated value, which protects the privacy of individual users. This process works well if no users drop during the execution of the protocol. In wireless environments, however, users may drop from the protocol at any time due to variations in channel conditions or user preferences. Such user dropouts are handled by letting each user secret share its private and pairwise keys through Shamir’s secret sharing [17]. Then, the server can remove the additive masks by collecting the secret shares from the surviving users. This approach leads to a quadratic communication overhead in the number of users. More recent approaches have focused on reducing the communication overhead, by training in a smaller parameter space [20], autotuning the parameters [21], or by utilizing coding techniques [6].
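
To make the key-agreement step concrete, the following Python sketch shows a Diffie-Hellman-type exchange with toy parameters; the modulus, base, and private exponents below are illustrative placeholders rather than the actual parameters of [19].

```python
# Toy sketch of a Diffie-Hellman-type pairwise key agreement between users u and v.
# The prime r, base g, and private exponents are small illustrative values only.
r, g = 2039, 7                                   # toy prime modulus and base
secret_u, secret_v = 123, 456                    # users u and v keep these private
pub_u, pub_v = pow(g, secret_u, r), pow(g, secret_v, r)     # exchanged in the clear
s_uv = pow(pub_v, secret_u, r)                   # user u computes the pairwise seed
assert s_uv == pow(pub_u, secret_v, r)           # user v derives the same seed s_uv
```

The resulting shared value can then seed the pseudorandom generator that produces the pairwise mask shared by the two users.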

Another line of work has focused on differentially-private federated learning approaches [22, 23], to protect the privacy of personally-identifiable information against inference attacks that may be initiated from the trained model. Although our focus is not on differential privacy, we believe our approach may in principle be combined with differential privacy techniques [24], which is an interesting future direction. Another important direction in federated learning is the study of fairness and how to avoid biasing the model towards specific users [25, 26]. The convergence properties of federated learning models are investigated in [27].

Distributed training protocols have been extensively studied in the Byzantine setting using clear (unmasked) model updates [9, 10, 11, 12]. The main defense mechanism to protect the trained model against Byzantine users is to compare the model updates received from different users and remove the outliers. Doing so ensures that the selected model updates are close to each other, as long as the network has a sufficiently large number of honest users. A related line of work is model poisoning attacks, which are studied in [28, 29].

In concurrent work, a Byzantine-robust secure gradient descent algorithm has been proposed for a two-server model in [30]. However, unlike federated learning (which is based on a single-server architecture) [1, 2], that work requires two honest (non-colluding) servers that both interact with the mobile users and communicate with each other to carry out a secure two-party protocol, but do not share any sensitive information with each other in an attempt to breach user privacy. In contrast, our goal is to develop a single-server Byzantine-resilient secure training framework, to facilitate robust and privacy-preserving training architectures for federated learning. Compared to two-server models, single-server models carry the additional challenge that all information has to be collected at a single server, while the individual models of the mobile users must still be kept private from the server.

The remainder of the paper is organized as follows. In Section III, we provide background on federated learning. Section IV presents our system model along with the key parameters that are used to evaluate the system performance. Section V introduces our framework and the details of the specific components. Section VI presents our theoretical results, whereas our numerical evaluations are provided in Section VII, to demonstrate the convergence and Byzantine-resilience of the proposed framework. The paper is concluded in Section VIII. The following notation is used throughout the paper. We represent a scalar variable with , whereas represents a vector. A set is denoted by , whereas refers to the set . The term i.i.d. refers to independent identically distributed random variables.

III Background

Fig. 1: Secure aggregation in federated learning. At iteration , the server sends the current state of the global model, denoted by , to the mobile users. User forms a local model by updating the global model using its local dataset. The local models are aggregated in a privacy-preserving protocol at the server, who then updates the global model, and sends the new model, , to the mobile users.

Federated learning is a distributed training framework that enables machine learning models to be trained in mobile networks while preserving the privacy of mobile users. Training is coordinated by a central server who maintains a global model with dimension . The goal is to train the global model using the data held at mobile devices, by minimizing a global objective function as,

(1)

The global model is updated locally by mobile users on sensitive private datasets, by letting

(2)

where is the total number of mobile users, denotes the local objective function of user , is the number of data points in user ’s private dataset , and . For simplicity, we assume that users have equal-sized datasets, i.e., for all .

Training is performed through an iterative process where mobile users interact through the central server to update the global model. At each iteration, the server shares the current state of the global model, denoted by , with the mobile users. Each user creates a local model,

(3)

where is an estimate of the gradient of the cost function and is a random variable representing the random sample (or a mini-batch of samples) drawn from . We assume that the private datasets have the same distribution and are i.i.d. where is a uniform random variable such that each is an unbiased estimator of the true gradient, i.e.,

(4)

The local models are aggregated at the server in a privacy-preserving protocol, such that the server only learns the aggregate of a large fraction of the local models, ideally the sum of all user models , but no further information is revealed about the individual models beyond their aggregated value. Using the aggregate of the local models, the server updates the global model for the next iteration,

(5)

where is the learning rate at iteration , and sends the updated model to the users. This process is illustrated in Figure 1.

Conventional secure aggregation protocols require each user to mask its local model using random keys before aggregation [2, 6, 31]. This is typically done by creating pairwise keys between the users through a key exchange protocol [19]. Using the pairwise keys, each pair of users agree on a pairwise random seed . User also creates a private random seed , which protects the privacy of the local model in case the user is delayed instead of being dropped, in which case the pairwise keys are not sufficient for privacy, as shown in [2]. User then sends a masked version of its local model , given by

(6)

to the server, where PRG is a pseudo random generator. User then secret shares as well as with the other users, via Shamir’s secret sharing [17]. For computing the aggregate of the user models, the server collects either the secret shares of the pairwise seeds belonging to a dropped user, or the shares of the private seed belonging to a surviving user (but not both). The server then recovers the private seeds of the surviving users and the pairwise seeds of the dropped users, and removes them from the aggregate of the masked models,

(7)

and obtains the aggregate of the local models as shown in (7), where and denote the sets of surviving and dropped users, respectively.
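
As a minimal sketch of the masking and unmasking steps in (6)-(7), the snippet below uses a seeded NumPy generator as a stand-in for the PRG and a toy modulus; the key sizes, the actual field, and the Shamir-based recovery of the seeds are abstracted away, and all names and parameters are illustrative assumptions.

```python
import numpy as np

R = 2**20          # toy mask range (assumed; not the actual BREA field size)

def prg(seed, dim):
    """Toy stand-in for the PRG expanding a seed into a mask vector."""
    return np.random.default_rng(seed).integers(0, R, size=dim)

def mask(x_i, i, b_i, s, n, dim):
    """y_i = x_i + PRG(b_i) + sum_{j>i} PRG(s_ij) - sum_{j<i} PRG(s_ij), cf. (6)."""
    y = x_i + prg(b_i, dim)
    for j in range(n):
        if j != i:
            key = s[(min(i, j), max(i, j))]
            y = y + (prg(key, dim) if i < j else -prg(key, dim))
    return y % R

def unmask(sum_of_masked, survivors, dropped, b, s, dim):
    """Server removes the private masks of survivors and the dangling pairwise
    masks left behind by dropped users, cf. (7)."""
    out = sum_of_masked.copy()
    for i in survivors:
        out = out - prg(b[i], dim)
    for d in dropped:
        for i in survivors:
            key = s[(min(i, d), max(i, d))]
            out = out - (prg(key, dim) if i < d else -prg(key, dim))
    return out % R

# toy check with 3 users, where user 2 drops out after the seeds are agreed on
n, dim = 3, 4
b = {i: 10 + i for i in range(n)}                                  # private seeds
s = {(a, c): 100 * a + c for a in range(n) for c in range(a + 1, n)}  # pairwise seeds
x = {i: np.arange(dim) + i for i in range(n)}                      # toy local models
survivors, dropped = [0, 1], [2]
masked_sum = sum(mask(x[i], i, b[i], s, n, dim) for i in survivors) % R
assert np.array_equal(unmask(masked_sum, survivors, dropped, b, s, dim), (x[0] + x[1]) % R)
```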

IV Problem Formulation

In this section, we describe the Byzantine-resilient secure aggregation problem, by extending the conventional secure aggregation scenario from Section III to the case when some users, known as Byzantine adversaries, can manipulate the trained model by modifying their local datasets or by sharing false information during the protocol.

We consider a distributed network with mobile users and a single server. User holds a local model of dimension . (For notational clarity, throughout Sections IV and V, we omit the iteration number from the models.) The goal is to aggregate the local models at the server, while protecting the privacy of individual users. However, unlike the non-Byzantine setting of Section III, the aggregation operation in the Byzantine setting should be robust against potentially malicious users. To this end, we represent the aggregation operation by a function,

(8)

where is a set of users selected by the server for aggregation. The role of is to remove the effect of potentially Byzantine adversaries on the trained model, by removing the outliers. Similar to prior works on federated learning, our focus is on computationally-bounded parties, whose strategies can be described by a probabilistic polynomial time algorithm [2].

We evaluate the performance of a Byzantine-resilient secure aggregation protocol according to the following key parameters:

  • Robustness against Byzantine users: We assume that up to users are Byzantine (malicious), who manipulate the protocol by modifying their local datasets or by sharing false information during protocol execution. The protocol should be robust against such Byzantine adversaries.

  • Privacy of local models: The aggregation protocol should protect the privacy of any individual user from the server and any collusions between up to users. Specifically, the local model of any user should not be revealed to the server or the remaining users, even if up to users cooperate with each other by sharing information. (Collusions that may occur between the server and the users are beyond the scope of our paper.)

  • Tolerance to user dropouts: Due to potentially poor wireless channel conditions, we assume that up to users may get dropped or delayed at any time during protocol execution. The protocol should be able to tolerate such dropouts, i.e., the privacy and convergence guarantees should hold even if up to users drop or get delayed.

In this paper, we present a single-server Byzantine-resilient secure aggregation framework (BREA) for the computation of (8). BREA consists of the following key components:

  1. Stochastic quantization: Users initially quantize their local models from the real domain to the domain of integers, and then embed them in a field of integers modulo a prime . To do so, our framework utilizes stochastic quantization, which is instrumental in our theoretical convergence guarantees.

  2. Verifiable secret sharing of the user models: Users then secret share their quantized models using a verifiable secret sharing protocol. This ensures that the secret shares created by the mobile users are valid, i.e., Byzantine users cannot cheat by sending invalid secret shares.

  3. Secure distance computation: In this phase, users compute the pairwise distances between the secret shares of the local models, and send the results to the server. Since this computation is performed using the secret shares of the models instead of their true values, users do not learn any information about the actual model parameters.

  4. User selection at the server: Upon receiving the computation results from the users, the server recovers the pairwise distances between the local models and selects the set of users whose models will be included in the aggregation, by removing the outliers. This ensures that the aggregated model is robust against potential manipulations from Byzantine users. The server then announces the list of the selected users.

  5. Secure model aggregation: In the final phase, each user locally aggregates the secret shares of the models selected by the server, and sends the computation result to the server. Using the computation results, the server recovers the aggregate of the models of the selected users, and updates the model.

In the following, we describe the details of each phase.

V The BREA Framework

In this section, we present the details of the BREA framework for Byzantine-resilient secure federated learning.

V-A Stochastic Quantization

In BREA, the operations for verifiable secret sharing and secure distance computations are carried out over a finite field for some large prime . To this end, user initially quantizes its local model from the domain of real numbers to the finite field. We assume that the field size is large enough to avoid any wrap-around during secure distance computation and secure model aggregation, which will be described in Sections V-C and V-E, respectively.

Quantization is a challenging task, as it should be performed in a way that ensures the convergence of the model. Moreover, the quantization function should allow the representation of negative integers in the finite field, and facilitate computations to be performed in the quantized domain. Therefore, we cannot utilize well-known gradient quantization techniques such as the one in [32], which represents the sign of a negative number separately from its magnitude. BREA addresses this challenge with a simple stochastic quantization strategy as follows. For any integer , we first define a stochastic rounding function:

(9)

where is the largest integer less than or equal to , and note that this function is unbiased, i.e., . Parameter is a tuning parameter that corresponds to the number of quantization levels. Variance of decreases as the value of increases, which will be detailed in Lemma 1 in Section VI. We then define the quantized model,

(10)

where the function from (9) is carried out element-wise, and a mapping function is defined to represent a negative integer in the finite field by using two’s complement representation,

(11)
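
A minimal sketch of this quantization step is given below; the field size and the number of quantization levels are assumed values chosen only for illustration, and the rounding here returns integer levels directly, so the demapping divides by the number of levels to return to the real domain.

```python
import numpy as np

p = 2**31 - 1      # assumed prime field size (the paper only requires a large enough prime)
q = 1024           # assumed number of quantization levels (the tuning parameter in (9))
rng = np.random.default_rng(0)

def stochastic_round(x, q):
    """Round q*x down or up at random so that the result is unbiased, cf. (9)."""
    scaled = q * np.asarray(x, dtype=np.float64)
    low = np.floor(scaled)
    return (low + (rng.random(scaled.shape) < scaled - low)).astype(np.int64)

def to_field(x, q, p):
    """Quantize and embed in the finite field; negatives wrap to p - |value|, cf. (10)-(11)."""
    return np.mod(stochastic_round(x, q), p)

def from_field(y, q, p):
    """Inverse mapping back to the reals, cf. (20)."""
    z = np.where(y > p // 2, y.astype(np.int64) - p, y.astype(np.int64))
    return z / q

x = np.array([0.25, -1.7, 3.14159])
x_bar = to_field(x, q, p)
print(from_field(x_bar, q, p))     # each entry is within 1/q of the corresponding entry of x
```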

V-B Verifiable Secret Sharing of the User Models

BREA protects the privacy of individual user models through verifiable secret sharing. This is to ensure that the individual user models are kept private while preventing the Byzantine users from breaking the integrity of the protocol by sending invalid secret shares to the other users.

To do so, user secret shares its quantized model with the other users through a non-interactive verifiable secret sharing protocol [16]. Our framework leverages Feldman’s verifiable secret sharing protocol from [16], which combines Shamir’s secret sharing with homomorphic encryption. In this setup, each party creates the secret shares using Shamir’s secret sharing [17], then broadcasts commitments to the coefficients of the polynomial they use for Shamir’s secret sharing, so that other parties can verify that the secret shares are constructed correctly. To verify the secret shares from the given commitments, the protocol leverages the homomorphic property of exponentiation, i.e., , whereas the privacy protection is based on the assumption that computation of the discrete logarithm in the finite field is intractable.

The individual steps carried out for verifiable secret sharing in our framework are as follows. Initially, the server and users agree on distinct elements from . This can be done offline by using a conventional majority-based consensus protocol [33, 34]. User then generates secret shares of the quantized model by forming a random polynomial of degree ,

(12)

in which the vectors are generated uniformly at random from by user . User then sends a secret share of to user , denoted by,

(13)

To make these shares verifiable, user also broadcasts commitments to the coefficients of , given by,

(14)

where denotes a generator of , and all arithmetic is taken modulo for some large prime such that divides .

Upon receiving the commitments in (14), each user can verify the validity of the secret share by checking the equality,

(15)

where all arithmetic is taken modulo . This commitment scheme ensures that the secret shares are created correctly from the polynomial in (12), hence they are valid. On the other hand, as we assume the intractability of computing the discrete logarithm [16], the server or the users cannot compute the discrete logarithm and reveal the quantized model from in (14).
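
The snippet below sketches Feldman-style sharing and verification for a single scalar secret; the primes, generator, and evaluation points are small illustrative values, and a real instantiation operates on model vectors with cryptographically large parameters.

```python
# Minimal sketch of Feldman-style verifiable secret sharing for one scalar secret.
# p = 1019 and r = 2*1019 + 1 = 2039 are toy primes with p dividing r - 1.
import random

p = 1019                       # field in which the shares live
r = 2039                       # prime modulus for the commitments
g = pow(2, (r - 1) // p, r)    # element of order p in Z_r^*, used as the generator

def share(secret, degree, eval_points):
    """Shamir shares f(theta) plus commitments g^{c_k} to the coefficients, cf. (12)-(14)."""
    coeffs = [secret % p] + [random.randrange(p) for _ in range(degree)]
    shares = {theta: sum(c * pow(theta, k, p) for k, c in enumerate(coeffs)) % p
              for theta in eval_points}
    commitments = [pow(g, c, r) for c in coeffs]
    return shares, commitments

def verify(theta, share_val, commitments):
    """Check g^{f(theta)} == prod_k (g^{c_k})^{theta^k}, cf. (15)."""
    lhs = pow(g, share_val, r)
    rhs = 1
    for k, com in enumerate(commitments):
        rhs = rhs * pow(com, pow(theta, k, p), r) % r
    return lhs == rhs

shares, coms = share(secret=123, degree=2, eval_points=[1, 2, 3, 4, 5])
assert all(verify(t, s, coms) for t, s in shares.items())
```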

V-C Secure Distance Computation

Verifiable secret sharing of the model parameters, as described in Section V-B, ensures that the users follow the protocol correctly by creating valid secret shares. However, malicious users can still try to manipulate the trained model by modifying their local models instead. In this case, the secret shares will be created correctly but according to a false model. In order to ensure that the trained model is robust against such adversarial manipulations, BREA leverages a distance-based outlier detection mechanism, such as in [35, 9]. The main principle behind these mechanisms is to compute the pairwise distances between the local models and select a set of models that are sufficiently close to each other. On the other hand, the outlier detection mechanism in BREA has to protect the privacy of local models, and performing the distance computations on the true values of the model parameters would breach the privacy of individual users.

We address this by a privacy-preserving distance computation approach, in which the pairwise distances are computed locally by each user, using the secret shares of the model parameters received from the other users. In particular, upon receiving the secret shares of the model parameters as described in Section V-B, user computes the pairwise distances,

(16)

between each pair of users , and sends the result to the server. Since the computations in (16) are performed over the secret shares, user learns no information about the true values of the model parameters and of users and , respectively. Finally, we note that the computation results from (16) are scalar values.
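
As an illustration, the toy sketch below Shamir-shares two already-quantized models over a small prime field and computes the per-user share distances of (16); the field size, sharing degree, and evaluation points are placeholder assumptions. Since each report is the evaluation of a polynomial of degree twice the sharing degree, a sufficiently large subset of reports determines the true distance, as described next.

```python
import numpy as np

p = 2**13 - 1        # small illustrative prime (the actual field is much larger)
rng = np.random.default_rng(0)

def shamir_share(vec, t, thetas):
    """Degree-t Shamir shares of each coordinate of vec, one share vector per point in thetas."""
    coeffs = [np.array(vec, dtype=np.int64)] + [rng.integers(0, p, len(vec)) for _ in range(t)]
    return {th: sum(c * pow(th, k, p) for k, c in enumerate(coeffs)) % p for th in thetas}

def share_distance(s_j, s_k):
    """d_jk(theta_i) = || s_j(theta_i) - s_k(theta_i) ||^2 computed over the field, cf. (16)."""
    diff = (s_j - s_k) % p
    return int(np.sum(diff * diff) % p)

thetas = list(range(1, 8))                       # evaluation points agreed on by all parties
x_j, x_k = [3, 1, 4], [2, 7, 1]                  # two (already quantized) toy local models
shares_j = shamir_share(x_j, t=2, thetas=thetas)
shares_k = shamir_share(x_k, t=2, thetas=thetas)
# user i holds shares_j[theta_i] and shares_k[theta_i] and reports this scalar to the server:
d_reports = {th: share_distance(shares_j[th], shares_k[th]) for th in thetas}
```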

V-D User Selection at the Server

Upon receiving the computation results in (16) from a sufficient number of users, the server reconstructs the true values of the pairwise distances. During this phase, Byzantine users may send incorrect computation results to the server; hence, the reconstruction process should be able to correct the potential errors that may occur in the computation results due to malicious users. Our decoding procedure is based on the decoding of Reed-Solomon codes.

The main intuition of the decoding process is that the computations from (16) correspond to evaluation points of a univariate polynomial of degree at most , where

(17)

for and . Accordingly, can be viewed as the encoding polynomial of a Reed-Solomon code with degree at most , such that the missing computations due to the dropped users correspond to the erasures in the code, and manipulated computations from Byzantine users refer to the errors in the code. Therefore, the decoding process of the server corresponds to decoding an Reed-Solomon code with at most erasures and at most errors. By utilizing well-known Reed-Solomon decoding algorithms [36], the server can recover the polynomial and obtain the true value of the pairwise distances by using the relation .
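
The sketch below illustrates only the erasure-handling part of this step: recovering the constant term of a polynomial from a subset of its evaluations by Lagrange interpolation over a prime field. Correcting the errors injected by Byzantine users additionally requires an error-correcting Reed-Solomon decoder (e.g., Berlekamp-Welch), which is not shown here.

```python
def interpolate_at_zero(points, p):
    """Recover f(0) from evaluations {theta: f(theta)} of a polynomial of degree
    < len(points) over GF(p) by Lagrange interpolation (erasures only, no errors)."""
    total = 0
    for th_i, y_i in points.items():
        num, den = 1, 1
        for th_j in points:
            if th_j != th_i:
                num = num * (-th_j) % p
                den = den * (th_i - th_j) % p
        total = (total + y_i * num * pow(den, p - 2, p)) % p   # den^(p-2) = den^-1 mod prime p
    return total

# toy check: f(x) = 42 + 5x + 7x^2 over GF(8191); any 3 evaluation points recover f(0) = 42
p = 8191
f = lambda x: (42 + 5 * x + 7 * x * x) % p
assert interpolate_at_zero({th: f(th) for th in (2, 5, 9)}, p) == 42
```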

At the end, the server learns the pairwise distances

(18)

between the models of each pair of users . Then the server converts (18) from the finite field to the real domain as follows,

(19)

for , where is the integer parameter in (9) and the demapping function is defined as

(20)

We assume the field size is large enough to ensure the correct recovery of the pairwise distances,

(21)
(22)

where is the stochastic rounding function defined in (9) and (21) holds if

(23)

By utilizing the pairwise distances in (22), the server carries out a distance-based outlier removal algorithm to select the set of users to include in the final model aggregation. The outlier removal procedure of BREA follows the multi-Krum algorithm from [35, 9]. The main difference is that our framework considers the multi-Krum algorithm in a quantized stochastic gradient setting, as BREA utilizes quantized gradients instead of the true gradients, in order to enable privacy-preserving computations on the secret shares. We present the theoretical convergence guarantees of this quantized multi-Krum algorithm in Section VI, and numerically demonstrate its convergence behaviour in our experiments in Section VII.

In this setup, the server selects users through the following iterative process. At each iteration , the server selects one user, denoted by , by finding

(24)

where denotes the index set of the users selected in up to iterations and is a score function assigned to user at iteration . The score function of user is defined as

(25)

where denotes the set of users whose models are closest to the model of user . After selecting , the server updates the selected index set as where . After iterations, the server obtains the index set .
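
A simplified sketch of this distance-based selection rule is given below; the tie-breaking and the exact definition of the neighbour set entering the score in (25) may differ from the multi-Krum variant used by BREA, and the parameter names are assumptions.

```python
import numpy as np

def multi_krum_select(dist, n, a, m):
    """Simplified selection in the spirit of multi-Krum: dist[j][k] is the squared
    distance between models j and k, a is the assumed number of Byzantine users,
    and m users are selected iteratively by minimizing a score as in (25)."""
    selected, remaining = [], list(range(n))
    closest = n - a - 2                     # number of neighbours entering each score
    for _ in range(m):
        scores = {}
        for j in remaining:
            d_j = sorted(dist[j][k] for k in remaining if k != j)
            scores[j] = sum(d_j[:closest])  # sum of distances to the closest remaining users
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# toy example: user 3 is far from everyone else and is never selected
dist = np.array([[0, 1, 2, 90],
                 [1, 0, 1, 95],
                 [2, 1, 0, 99],
                 [90, 95, 99, 0]])
print(multi_krum_select(dist, n=4, a=1, m=2))   # two of the honest users, e.g. [0, 1]
```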

V-E Secure Model Aggregation

The final phase of BREA is to securely aggregate the local models of the selected users, without revealing the individual models to the server. To do so, the server initially announces the list of selected users via broadcasting. We denote the set of selected users by . Then, each user locally aggregates the secret shares belonging to the selected users,

(26)

and sends the result to the server. Upon receiving the computation results from a sufficient number of users, the server can decode the aggregate of the models through the decoding of Reed-Solomon codes.

The intuition of the decoding process is similar to the decoding of the pairwise distances described in Section V-D. Specifically, the computations from (26) can be viewed as evaluation points of a univariate polynomial of degree at most ,

(27)

One can then observe that is the encoding polynomial of a Reed-Solomon code with degree at most , the missing computations due to the dropped users correspond to the erasures in the code, and manipulated computations from Byzantine users correspond to the errors in the code. Therefore, the decoding process of the server corresponds to decoding an Reed-Solomon code with at most erasures and at most errors. Hence, by using a Reed-Solomon decoding algorithm, the server can recover the polynomial and obtain the true value of the aggregate of the selected user models by using the relation . We note that the total number of users selected by the server for aggregation, i.e., , should be sufficiently large, which can be agreed offline between the users and the server. Hence, if the set announced by the server is too small (e.g., consisting of a single user), the honest users may opt not to send the computation results.
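
The following self-contained toy sketch walks through (26) and the subsequent recovery with a small prime, degree-1 sharing, and no dropouts or Byzantine reports; in that error-free case the Reed-Solomon decoding reduces to Lagrange interpolation of each coordinate at zero. All numerical values are placeholders.

```python
import numpy as np

p = 2**13 - 1
rng = np.random.default_rng(1)
thetas = list(range(1, 6))                        # five users' evaluation points

def shamir_share(vec, t):
    coeffs = [np.array(vec, dtype=np.int64)] + [rng.integers(0, p, len(vec)) for _ in range(t)]
    return {th: sum(c * pow(th, k, p) for k, c in enumerate(coeffs)) % p for th in thetas}

def interp_at_zero(points):
    out = 0
    for ti, yi in points.items():
        num = den = 1
        for tj in points:
            if tj != ti:
                num, den = num * (-tj) % p, den * (ti - tj) % p
        out = (out + yi * num * pow(den, p - 2, p)) % p
    return out

models = {1: np.array([3, 1, 4]), 2: np.array([1, 5, 9]), 3: np.array([2, 6, 5])}
shares = {j: shamir_share(v, t=1) for j, v in models.items()}       # cf. (13), unverified here
selected = [1, 3]                                                   # set announced by the server
local_agg = {th: sum(shares[j][th] for j in selected) % p for th in thetas}   # cf. (26)
aggregate = np.array([interp_at_zero({th: int(local_agg[th][i]) for th in thetas})
                      for i in range(3)])
assert np.array_equal(aggregate, (models[1] + models[3]) % p)       # server recovers the sum
```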

Upon learning the aggregate of the user models, the server updates the global model for the next iteration as follows,

(28)

where is the demapping function defined in (20) and is the integer parameter in (9). We assume that the field size is large enough to avoid wrap-around in such that

(29)
(30)

where (29) follows from (10).

Finally, it follows from (30) that the update equation in (28) is equivalent to

(31)

where is the stochastic rounding function defined in (9).
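
A hedged sketch of the resulting server-side update is shown below; the exact scaling that undoes the quantization and the averaging over the selected users is an assumption here, as are the variable names.

```python
import numpy as np

def demap(y, p):
    """phi^{-1}: map field elements back to signed integers, cf. (20)."""
    y = np.asarray(y, dtype=np.int64)
    return np.where(y > p // 2, y - p, y)

def server_update(w, recovered_aggregate, lr, m, q, p):
    """Dequantize the recovered aggregate of the m selected quantized models and take a
    gradient step; dividing by m*q (averaging plus undoing the q quantization levels)
    is an assumed reading of the scaling in (28)."""
    return w - lr * demap(recovered_aggregate, p) / (m * q)
```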

Having described all of the above steps, we now present the overall BREA framework in Algorithm 1.

0:  Local dataset of users , number of iterations .  
0:  Global model .  
1:  for iteration  do
2:     for user  do
3:        Download the global model from the server.
4:        Create a local model from (3).
5:        Create the quantized model from (10).
6:        Generate secret shares from (13) and send to user .
7:        Generate commitments from (14) and broadcast to all users.
8:        Verify the secret shares by testing (15).
9:        Compute from (16) and send the results to the server.
10:     Server recovers in (18) from the computation results by utilizing Reed-Solomon decoding.
11:     Server converts from the finite field to the real domain to obtain the pairwise distances from (19).
12:     Server selects the set by utilizing the multi-Krum algorithm [9] based on the pairwise distances .
13:     Server broadcasts the set to all users.
14:     for user  do
15:        Compute from (26) and send the result to the server.
16:     Server recovers from the computation results by utilizing Reed-Solomon decoding.
17:     Server updates the global model, .
Algorithm 1 Byzantine-Resilient Secure Aggregation (BREA)

VI Theoretical Analysis

In this section, we analyze the fundamental performance limits of BREA. The global model update equation of BREA can be expressed as follows,

(32)

where is the aggregation operation from (8) and represents the user selection and model aggregation procedures from Sections V-D and V-E, respectively, while is the stochastic rounding function defined in (9).

As described in Section III, the local model created by an honest user is an unbiased estimator of the gradient, i.e., with where and is a uniform random variable representing the random sample (or a mini-batch of samples) drawn from the dataset. We define the local standard deviation of the gradient estimator by

(33)

for all . The model created by a Byzantine user can refer to any random vector , which we represent as . Accordingly, the quantized model belonging to a Byzantine user could refer to any vector in .

Our first lemma states the unbiasedness and bounded variance of the quantized gradient estimator for any vector .

Lemma 1.

For the quantized gradient estimator with a given vector where is a uniform random variable representing the sample drawn, is a gradient estimator such that and , and the stochastic rounding function is given in (9), the following holds,

(34)
(35)

where .

Proof.

(Unbiasedness) Given in (9) and any random variable , it follows that,

(36)

from which we obtain the unbiasedness condition in (34),

(37)

(Bounded variance) Next, we observe that,

(38)

from which one can obtain the bounded variance condition in (35) as follows,

(39)
(40)

where (39) follows from the triangle inequality and (40) follows from (38). ∎
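
Lemma 1 can also be checked numerically; the short simulation below estimates the mean and variance of the stochastic rounding of a single coordinate and shows that the estimate is unbiased while its variance shrinks as the number of quantization levels grows. The sample coordinate and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 0.3141                                   # a single model coordinate

def q_round(x, q, n_samples):
    """Draw n_samples realizations of the stochastic rounding in (9), scaled back by q."""
    low = np.floor(q * x)
    return (low + (rng.random(n_samples) < q * x - low)) / q

for q in (1, 8, 64):
    samples = q_round(x, q, 200_000)
    print(q, samples.mean(), samples.var())  # mean stays close to 0.3141; variance shrinks with q
```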

As discussed in Section IV, Byzantine users can manipulate the training protocol via two means, either by modifying their local model (directly or by modifying the local dataset), or by sharing false information during protocol execution. In this section, we demonstrate how BREA provides robustness in both cases. We first focus on the former case and study the resilience of the global model, i.e., conditions under which the trained model remains close to the true model, even if some users modify their local models adversarially. The second case, i.e., robustness of the protocol when some users exchange false information during the protocol execution, will be considered in Theorem 1.

In order to evaluate the resilience of the global model against Byzantine adversaries, we adopt the notion of -Byzantine resilience from [35].

Definition 1 (-Byzantine resilience, [35]).

Let be any angular value and be any integer. Let be any i.i.d. random vectors such that with . Let be any random vectors. Then, function is -Byzantine resilient if, for any ,

(41)

satisfies, i) , and ii) for , is bounded above by where denotes a generic constant.

Lemma 2 below states that if the standard deviation caused by random sample selection and quantization is smaller than the norm of the true gradient, and , then the aggregation function from (32) is -Byzantine resilient where depends on the ratio of the standard deviation over the norm of the gradient [35].

Lemma 2.

Assume that and where

(42)

Let be i.i.d. random vectors in such that with and . Then, the aggregation function from (32) is -Byzantine resilient where is defined by .

Proof.

From Lemma 1, and . Then, the quantized multi-Krum algorithm described in Section V-D, i.e., the multi-Krum algorithm applied to the quantized vectors , is -Byzantine resilient by Proposition of [35]. Hence, function in (32) is -Byzantine resilient. ∎

We now state our main result for the theoretical performance guarantees of BREA.

Theorem 1.

We assume that: 1) the cost function is three times differentiable with continuous derivatives, and is bounded from below, i.e., ; 2) the learning rates satisfy and ; 3) the second, third, and fourth moments of the quantized gradient estimator do not grow too fast with the norm of the model, i.e., , for some constants and ; 4) there exists a constant such that for all , ; 5) the gradient of the cost function satisfies that for , there exist constants and such that

(43)
(44)

Then, BREA guarantees,

  • (Robustness against Byzantine users) The protocol executes correctly against up to Byzantine users and the trained model is -Byzantine resilient.

  • (Convergence) The sequence of the gradients converges almost surely to zero,

    (45)
  • (Privacy) The server or any group of up to users cannot compute an unknown local model. For any set of users of size at most ,

    (46)

    for all where denotes the messages that the members of receive.

for any , where is the number of selected models for aggregation.

Remark 1.

The two conditions and are instrumental in the convergence of stochastic gradient descent algorithms [37]. Condition states that the learning rates decrease fast enough, whereas condition bounds the rate of their decrease, to ensure that the learning rates do not decrease too fast.

Remark 2.

We consider a general (possibly non-convex) objective function . In such scenarios, proving the convergence of the model directly is challenging, and various approaches have been proposed instead. Our approach follows [37] and [9], where we prove the convergence of the gradient to a flat region instead. We note, however, that such a region may refer to any stationary point, including the local minima as well as saddle and extremal points.

Proof.

(Robustness against Byzantine users) The -Byzantine resilience of the trained model follows from Lemma 2. We next provide sufficient conditions for BREA to correctly evaluate the update function (32), in the presence of up to Byzantine users. Byzantine users may send any arbitrary random vector to the server or other users in every step of the protocol in Section V. In particular, Byzantine users can create and send incorrect computations in three attack scenarios: i) sending invalid secret shares in (13), ii) sending incorrect secure distance computations in (16), and iii) sending incorrect aggregate of the secret shares in (26).

The first attack scenario occurs when the secret shares in (13) do not refer to the same polynomial from (12). BREA utilizes verifiable secret sharing to prevent such attempts. The correctness (validity) of the secret shares can be verified by testing (15), whenever the majority of the surviving users are honest, i.e.,  [16, 33].

The second attack scenario can be detected and corrected by the Reed-Solomon decoding algorithm. In particular, as described in Section V-D, given , can be viewed as evaluation points of the polynomial given in (17) whose degree is at most . The decoding process at the server then corresponds to the decoding of an Reed-Solomon code with at most erasures and at most errors. As an Reed-Solomon code with erasures can tolerate a maximum number of errors [36], the server can recover the correct pairwise distances as long as , i.e. .

The third attack scenario can also be detected and corrected by the Reed-Solomon decoding algorithm. As described in Section V-E, are evaluation points of polynomial in (27) of degree at most . This decoding process corresponds to the decoding of an Reed-Solomon code with at most erasures and at most errors. As such, the server can recover the desired aggregate model as long as . Therefore, combining this with the condition of Lemma 2, the sufficient condition under which BREA guarantees robustness against Byzantine users is given by

(47)

(Convergence) We now consider the update equation in (32) and prove the convergence of the random sequence . From Lemma 2, the quantized multi-Krum function in (32) is -Byzantine resilient. Hence, from Proposition of [35], the random sequence converges almost surely to zero,

(48)

(Privacy) As described in Section V-B, we assume the intractability of computing discrete logarithms, hence the server or any user cannot compute from in (14). It is therefore sufficient to prove the privacy of each individual model against a group of colluding users, in the case where has size . If users cannot get any information about , then neither can fewer than users. Without loss of generality, let and where in (13) is the secret share of sent from user to user , in (16) is the pairwise distance of the secret shares sent from users and to user , and in (26) is the aggregate of the secret shares. As and are determined by , we can simplify the left hand side of (46) as

(49)

For any , is independent of . Hence, we have,

(50)

Then, for any realization of vectors , we obtain,

(51)

where (51) follows from the fact that any evaluation points define a unique polynomial of degree , which completes the proof of privacy. ∎

VII Experiments

In this section, we demonstrate the convergence and resilience properties of BREA compared to conventional federated learning, i.e., the federated averaging scheme from [1], which is termed FedAvg throughout the section. We measure the performance in terms of the cross entropy loss evaluated over the training samples and the model accuracy evaluated over the test samples, with respect to the iteration index, .

Fig. 2: Test accuracy of BREA and FedAvg [1] for different numbers of Byzantine users.
Fig. 3: Convergence of BREA and FedAvg [1] for different numbers of Byzantine users.
Fig. 4: Convergence of BREA for different values of the quantization parameter in (9) with Byzantine users.

Network architecture: We consider an image classification task with classes on the MNIST dataset [38] and train a convolutional neural network with 6 layers [1], including two convolutional layers with stride 1, where the first and the second layers have 32 and 64 channels, respectively, and each is followed by ReLU activation and a max pooling layer. It also includes a fully connected layer with units and ReLU activation, followed by a final softmax output layer.
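
A sketch of this architecture in PyTorch is shown below; the kernel sizes, pooling windows, and the width of the fully connected layer are assumptions, since those values are not fully specified above.

```python
import torch
import torch.nn as nn

class MnistCNN(nn.Module):
    """6-layer CNN for MNIST; the 5x5 kernels, 2x2 max pooling, and the 512-unit
    fully connected layer are assumed values."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_classes),   # softmax is applied inside the cross entropy loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

assert MnistCNN()(torch.zeros(1, 1, 28, 28)).shape == (1, 10)
```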

Experiment setup: We assume a network of users where users may collude and users are malicious. We consider two cases for the number of Byzantine users: i) Byzantine users () and ii) Byzantine users (). Honest users utilize the ADAM optimizer [39] to update the local model by setting the size of the local mini-batch sample to for all where is the total number of iterations. Byzantine users generate vectors uniformly at random from where we set the field size , which is the largest prime within bits. For both schemes, BREA and FedAvg, the number of models to be aggregated is set to . FedAvg randomly selects models at each iteration while BREA selects users from (24).

Convergence and robustness against Byzantine users: Figure 2 shows the test accuracy of BREA and FedAvg for different numbers of Byzantine users. We can observe that BREA with and Byzantine users is as efficient as FedAvg with Byzantine users, while FedAvg does not tolerate Byzantine users. Figure 3 presents the cross entropy loss of BREA versus FedAvg for different numbers of Byzantine users. We omit FedAvg with Byzantine users as it diverges. We observe that BREA with Byzantine users achieves convergence at a rate comparable to FedAvg with Byzantine users, while providing robustness against Byzantine users and being privacy-preserving. For all cases of BREA in Figures 2 and 3, we set the quantization value in (9) to