Secure Byzantine-Robust Machine Learning

06/08/2020 · by Lie He, et al.

Increasingly, machine learning systems are being deployed to edge servers and devices (e.g. mobile phones) and trained in a collaborative manner. Such distributed/federated/decentralized training raises a number of concerns about the robustness, privacy, and security of the procedure. While extensive work has been done on tackling robustness, privacy, or security individually, their combination has rarely been studied. In this paper, we propose a secure two-server protocol that offers both input privacy and Byzantine-robustness. In addition, this protocol is communication-efficient and fault-tolerant, and enjoys local differential privacy.




Problem setup, privacy, and robustness

We consider a distributed setup of n user devices, which we call workers, aided by two additional servers. Each worker i holds its own private part of the training dataset. The workers want to collaboratively train a public machine learning model, benefitting from the joint training data of all participants. In every training step, each worker i computes its own private model update (for example a gradient based on its own data), denoted by the vector g_i. The aggregation protocol aims to compute the sum z = Σ_i g_i (or a robust version of this aggregation), which is then used to update a public model. While the result z is public in all cases, the protocol must keep each g_i private from any adversary or other workers.

Security model. We consider honest-but-curious servers which do not collude with each other but may collude with malicious workers. An honest-but-curious server follows the protocol but may try to inspect all messages. We also assume that all communication channels are secure. We guarantee the strong notion of input privacy: the servers and workers learn nothing more about each other than what can be inferred from the public output z of the aggregation.

Byzantine robustness model. We allow the standard Byzantine worker model, which assumes that workers can send arbitrary adversarial messages trying to compromise the process. We assume that up to a fixed fraction of the workers are Byzantine, i.e. are malicious and do not follow the protocol.

Additive secret sharing. Secret sharing is a way to split a secret into multiple parts such that no individual part leaks the secret. Formally, suppose a scalar x is a secret and its holder shares it among m parties through secret-shared values ⟨x⟩. In this paper we only consider additive secret sharing, where ⟨x⟩ denotes the set of shares {x_1, ..., x_m} satisfying x = x_1 + ... + x_m, with x_j held by party j. Crucially, it must not be possible to reconstruct x from any strict subset of the shares. For a vector g, its secret-shared value ⟨g⟩ is simply the component-wise scalar secret sharing.

Two-server setting. We assume there are two non-colluding servers: the model server (S1) and the worker server (S2). S1 holds the output z of each aggregation and thus also the machine learning model, which is public to all workers. S2 holds intermediate values used to perform Byzantine-robust aggregation. Another key assumption is that the servers have no incentive to collude with workers, perhaps enforced via a huge penalty if exposed.
It is realistic to assume that the communication link between the two servers S1 and S2 is faster than the individual links to the workers. To perform robust aggregation, the servers need access to a sufficient number of Beaver's triples. These are data-independent values required to implement secure multiplication in MPC on both servers, and can be precomputed beforehand. For completeness, the classic multiplication algorithm is given in Appendix LABEL:sec:beaver_s_mpc_protocol.

Byzantine-robust aggregation oracles. Most existing robust aggregation algorithms rely on distance measures to identify potential adversarial behavior [blanchard2017machine, yin2018byzantinerobust, mhamdi2018hidden, li2019rsa, ghosh2019robust]. All such distance-based aggregation rules can be directly incorporated into our proposed scheme, making them secure. While many of the aforementioned papers assume that the workers have i.i.d. datasets, our protocol is oblivious to the distribution of the data across the workers. In particular, it also works with schemes such as [li2019rsa, ghosh2019robust] designed for non-i.i.d. data.
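To make the secret-sharing and multiplication building blocks concrete, here is a minimal sketch of two-party additive sharing and Beaver's-triple multiplication over a prime field. The field size, function names, and dealer setup are our own illustrative choices, not details fixed by the paper.

```python
import random

P = 2**61 - 1  # illustrative prime field for additive sharing

def share(x):
    """Split x into two additive shares mod P; one per server."""
    r = random.randrange(P)
    return (x - r) % P, r

def open_shares(sh):
    """Reconstruct the secret by adding both shares."""
    return (sh[0] + sh[1]) % P

def beaver_mul(x_sh, y_sh, triple_sh):
    """Multiply secret-shared x and y with a precomputed triple c = a*b.

    Only d = x - a and e = y - b are opened; they leak nothing because
    a and b are uniformly random and unknown to either party alone."""
    a_sh, b_sh, c_sh = triple_sh
    d = (x_sh[0] - a_sh[0] + x_sh[1] - a_sh[1]) % P  # opened value x - a
    e = (y_sh[0] - b_sh[0] + y_sh[1] - b_sh[1]) % P  # opened value y - b
    # z = c + d*b + e*a + d*e = x*y, computed share-wise:
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % P
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % P
    return z0, z1

# A dealer precomputes the data-independent triple offline.
a, b = random.randrange(P), random.randrange(P)
triple = (share(a), share(b), share(a * b % P))

x, y = 1234, 5678
z = open_shares(beaver_mul(share(x), share(y), triple))
assert z == x * y
```

The identity c + (x-a)b + (y-b)a + (x-a)(y-b) = xy is why revealing only d and e suffices; this is the multiplication subroutine the servers invoke for the distance computations.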


Figure : WorkerSecretSharing: each worker secret-shares its update locally and uploads them to S1 and S2 separately.
Figure : RobustWeightSelection: compute and reveal model distances on S2 and select a robust set of indices, represented by a weight vector, by calling the Byzantine-robust oracle.
Figure : AggregationAndUpdate: Compute and reveal aggregation on S1. S1 updates the public model.
Figure : WorkerPullModel: Each worker pulls model from S1.
Figure : Illustration of protocol:two_server:robustdist. The orange components on the servers represent computation-intensive operations with low communication cost between the servers.

Secure aggregation protocol: two-server model

Each worker first splits its private vector into two additive secret shares and transmits one share to each server, ensuring that neither server can reconstruct the original vector on its own. The two servers then execute our secure aggregation protocol. At the level of the servers, the protocol is a two-party computation (2PC). In the case of non-robust aggregation, the servers simply add all shares (we present this case in detail in protocol:two_server:nonrobust). In the robust case, which is our main interest here, the two servers exactly emulate an existing Byzantine-robust aggregation rule, at the cost of revealing only the pairwise distances of worker gradients to S2 (the robust algorithm is presented in protocol:two_server:robustdist). Finally, the resulting aggregated output vector is sent back to all workers and applied as the update to the public machine learning model.

Non-robust aggregation

In each round, protocol:two_server:nonrobust consists of two stages:

  • [nolistsep]

  • WorkerSecretSharing (fig:diagram:1): each worker i randomly splits its private input g_i into two additive secret shares g_i = g_i^(1) + g_i^(2). This can be done e.g. by sampling a large noise vector r_i and using (g_i - r_i, r_i) as the shares. Worker i sends g_i^(1) to S1 and g_i^(2) to S2. We write ⟨g_i⟩ for the secret-shared value distributed over the two servers.

  • AggregationAndUpdate (fig:diagram:3): given some weights w_i, each server locally computes its share of z = Σ_i w_i g_i. Then S2 sends its resulting share to S1, so that S1 can compute z in the clear. S1 updates the public model with z.
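The two stages above can be sketched in a few lines of Python. The Gaussian noise and its scale are illustrative assumptions; in practice the noise must be large enough to statistically hide each gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def secret_share(g, scale=1e3):
    """Split gradient g into two additive shares; `scale` is a hypothetical
    noise magnitude chosen only for illustration."""
    r = rng.normal(0.0, scale, size=g.shape)
    return g - r, r  # share for S1, share for S2

# WorkerSecretSharing: each of n workers holds a private gradient
# and uploads one share to each server.
n, d = 5, 4
gradients = [rng.normal(size=d) for _ in range(n)]
s1_shares, s2_shares = zip(*(secret_share(g) for g in gradients))

# AggregationAndUpdate: each server sums its shares locally ...
s1_sum = sum(s1_shares)
s2_sum = sum(s2_shares)
# ... then S2 sends its single aggregate share to S1, which reveals the sum.
z = s1_sum + s2_sum

assert np.allclose(z, sum(gradients))
```

Note that S2 only ever transmits the sum of its shares, so S1 learns the aggregate but no individual share.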

Our secure aggregation protocol is extremely simple and, as we discuss later, has very low communication overhead, does not require cryptographic primitives, gives strong input privacy, is compatible with differential privacy, and is robust to worker dropouts and failures. We believe this makes our protocol especially attractive for federated learning applications. We now argue correctness and privacy. It is clear that the output of the above protocol satisfies z = Σ_i w_i g_i, ensuring that all workers compute the right update. To argue about the privacy guarantees, we track the values stored by each of the servers and workers:

  • [nolistsep]

  • S1: the secret shares g_i^(1) and the sum of the other shares Σ_i g_i^(2).

  • S2: the secret shares g_i^(2).

  • Worker i: g_i and the output z.

Clearly, the workers have no information other than the aggregate z and their own data. S2 only holds the secret shares g_i^(2), which on their own leak no information about any worker's data. Hence, perhaps surprisingly, S2 learns nothing in this process. S1 holds its own secret shares and the sum of the other shares. If there is only a single worker, then z = g_1 and S1 is allowed to learn everything. With two or more workers, S1 cannot recover information about any individual secret share from the sum. Thus, S1 learns z and nothing else.

Robust aggregation

We now describe how protocol:two_server:robustdist replaces the simple aggregation with any distance-based robust aggregation rule, such as Multi-Krum [blanchard2017machine], which relies on computing the distances ||g_i - g_j|| for all pairs (i, j). The key idea is to use two-party MPC to perform these computations in a secure way.

  • [nolistsep]

  • WorkerSecretSharing (fig:diagram:1): as before, each worker i secret-shares its update g_i over the two servers S1 and S2 as ⟨g_i⟩.

  • RobustWeightSelection (fig:diagram:2): after collecting all secret-shared values ⟨g_i⟩, the servers compute each pairwise difference ⟨g_i - g_j⟩ locally. S2 then reveals, to itself exclusively, all pairwise Euclidean distances ||g_i - g_j|| in plain text, with the help of precomputed Beaver's triples and protocol:beaver. The distances are kept private from S1 and the workers. S2 then feeds these distances to the distance-based robust aggregation rule (e.g. Multi-Krum), which returns (on S2) a weight vector w (a selected subset of indices can be converted to a vector of binary values), and secret-shares it with S1 for aggregation.

  • AggregationAndUpdate (fig:diagram:3): given the weight vector w from the previous step, we would like S1 to compute z = Σ_i w_i g_i. S2 secret-shares the values w_i with S1 instead of sending them in plain text, since they may be private. S1 then reveals to itself, but not to S2, the value of z in plain text using secret-shared multiplication, and updates the public model.

  • WorkerPullModel (fig:diagram:4): Workers pull the latest public model on S1 and update it locally.
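The distance step of RobustWeightSelection can be sketched as follows. The helper names and the real-valued variant of Beaver's triples (generated inline by a trusted dealer for brevity) are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def share_vec(v, scale=1e3):
    """Additively share a vector; `scale` is an illustrative noise level."""
    r = rng.normal(0.0, scale, size=v.shape)
    return v - r, r  # (share on S1, share on S2)

def beaver_dot(x_sh, y_sh):
    """Inner product of secret-shared vectors with a fresh Beaver triple.
    Over the reals the identity c + d*b + e*a + d*e = x*y holds
    coordinate-wise; only d = x - a and e = y - b are opened."""
    a = rng.normal(size=x_sh[0].shape)
    b = rng.normal(size=y_sh[0].shape)
    a_sh, b_sh, c_sh = share_vec(a), share_vec(b), share_vec(a * b)
    d = (x_sh[0] - a_sh[0]) + (x_sh[1] - a_sh[1])  # opened value x - a
    e = (y_sh[0] - b_sh[0]) + (y_sh[1] - b_sh[1])  # opened value y - b
    z0 = c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e  # S1's local share
    z1 = c_sh[1] + d * b_sh[1] + e * a_sh[1]          # S2's local share
    return float(np.sum(z0 + z1))  # revealed to S2 only in the protocol

# Squared distance ||g_i - g_j||^2 from locally computed share differences,
# without either server seeing g_i or g_j in the clear.
g_i, g_j = rng.normal(size=8), rng.normal(size=8)
gi_sh, gj_sh = share_vec(g_i), share_vec(g_j)
diff_sh = (gi_sh[0] - gj_sh[0], gi_sh[1] - gj_sh[1])  # local on S1 / S2
dist2 = beaver_dot(diff_sh, diff_sh)

assert np.isclose(dist2, np.sum((g_i - g_j) ** 2))
```

In the actual protocol the opening of d and e is the only server-to-server exchange per multiplication, which is why the inter-server link dominates the extra communication.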

The key difference between the robust and the non-robust aggregation scheme is the weight-selection phase, where S2 computes all pairwise distances and uses them to run a robust aggregation rule in a black-box manner. S2 computes these distances i) without leaking any information to S1, and ii) without itself learning anything other than the pairwise distances (and in particular none of the actual values g_i). To perform this computation, S1 and S2 use precomputed Beaver's triples (protocol:beaver in the Appendix), which can be made available in a scalable way [smart2018taas].

Salient features

Overall, our protocols are very resource-light and straightforward from the perspective of the workers. Further, since we use Byzantine-robust aggregation, our protocols are provably fault-tolerant even if a large fraction of workers misbehave. This further lowers the requirements on a worker.

Communication overhead. Consider the time to upload a model (or gradient) to a server, the time to download one from a server, and the time to transmit data between the servers. In applications, the individual uplink from workers to servers is typically the main bottleneck, as it is usually much slower than the downlink, and the bandwidth between the servers can be very large. For our protocols, the time spent on the uplink is within a factor of 2 of the non-secure variants. Moreover, our protocol requires only one round of communication, which is an advantage over interactive proofs.

Fault tolerance. The workers in protocol:two_server:nonrobust and protocol:two_server:robustdist are completely stateless across rounds, and no offline phase is required. This means that workers can start participating in the protocols simply by pulling the latest public model. Further, our protocols are unaffected if some workers drop out in the middle of a round. Unlike in [bonawitz2017practical], there is no entanglement between workers and we do not face unbounded recovery issues.

Compatibility with local differential privacy. A byproduct of our protocol is that it can convert differentially private mechanisms such as [abadi2016deep], which only guarantee privacy of the aggregate model, into the stronger locally differentially private mechanisms, which guarantee user-level privacy.

Other Byzantine-robust oracles. We can also use robust aggregation rules which are not based on pairwise distances, such as Byzantine SGD [alistarh2018byzantine].
Since the basic structure is very similar to protocol:two_server:robustdist, we defer protocol:two_server:byzantinesgd to the appendix.

Security. The security of protocol:two_server:nonrobust is straightforward, as discussed above. The security of protocol:two_server:robustdist again relies on the separation of information between S1 and S2, with neither the workers nor S1 learning anything other than the aggregate z. We will next formally prove that this is true even in the presence of malicious workers.

[t] Two-Server Secure Aggregation (Non-robust variant) Setup: n workers (non-Byzantine) with private vectors g_i. Two non-colluding servers S1 and S2. Workers: (WorkerSecretSharing)

  1. [nolistsep,noitemsep]

  2. split private g_i into two additive secret shares (such that g_i = g_i^(1) + g_i^(2))

  3. send g_i^(1) to S1 and g_i^(2) to S2


  1. [nolistsep,noitemsep]

  2. For each worker i, S1 collects g_i^(1) and S2 collects g_i^(2)

  3. (AggregationAndUpdate):

    1. [nolistsep,noitemsep]

    2. On S1 and S2, compute the local share of z = Σ_i g_i

    3. S2 sends its share of z to S1

    4. S1 reveals z to everyone

[t] Two-Server Secure Robust Aggregation (Distance-Based) Setup: n workers, of which up to f are Byzantine. Two non-colluding servers S1 and S2. Workers: (WorkerSecretSharing)

  1. [nolistsep,noitemsep]

  2. split private g_i into two additive secret shares (such that g_i = g_i^(1) + g_i^(2))

  3. send g_i^(1) to S1 and g_i^(2) to S2


  1. [nolistsep,noitemsep]

  2. For each worker i, S1 collects gradient share g_i^(1) and S2 collects g_i^(2)

  3. (RobustWeightSelection):

    1. [nolistsep,noitemsep]

    2. For each pair (i, j), compute their Euclidean distance ||g_i - g_j||:

      • [nolistsep,noitemsep]

      • On S1 and S2, compute ⟨g_i - g_j⟩ locally

      • Use precomputed Beaver's triples (see protocol:beaver) to reveal the distance ||g_i - g_j|| on S2

    3. S2 performs the robust aggregation rule MultiKrum() on the revealed distances, obtaining a weight vector w

    4. S2 secret-shares w with S1

  4. (AggregationAndUpdate):

    1. [nolistsep,noitemsep]

    2. On S1 and S2, use MPC multiplication to compute the local share of z = Σ_i w_i g_i

    3. S2 sends its share of z to S1

    4. S1 reveals z to all workers.


  1. [nolistsep,noitemsep]

  2. (WorkerPullModel): Collect z and update the model locally
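To illustrate the oracle invoked in RobustWeightSelection, the following is a sketch of Multi-Krum-style scoring, as in Blanchard et al. [blanchard2017machine], run on the distance matrix revealed on S2. The function signature, parameter names, and toy data are our own illustrative choices.

```python
import numpy as np

def multi_krum_weights(dist2, f, m):
    """Given the matrix of squared pairwise distances (as revealed on S2),
    return a 0/1 weight vector selecting the m lowest-scoring workers.
    Each worker's score is the sum of its squared distances to its
    n - f - 2 closest neighbours, following Multi-Krum."""
    n = dist2.shape[0]
    k = n - f - 2
    scores = []
    for i in range(n):
        d = np.delete(dist2[i], i)           # distances to the other workers
        scores.append(np.sort(d)[:k].sum())  # k closest neighbours
    selected = np.argsort(scores)[:m]        # m most central workers
    w = np.zeros(n)
    w[selected] = 1.0
    return w

# Toy run: worker 4 is Byzantine and far from everyone else.
rng = np.random.default_rng(2)
grads = rng.normal(size=(5, 3))
grads[4] += 100.0
dist2 = ((grads[:, None, :] - grads[None, :, :]) ** 2).sum(-1)
w = multi_krum_weights(dist2, f=1, m=3)

assert w[4] == 0.0  # the outlier is excluded from the aggregation
```

The returned 0/1 vector is exactly the weight vector w that S2 secret-shares with S1 in step 4 of RobustWeightSelection.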

Theoretical guarantees


In the following lemma we show that protocol:two_server:robustdist gives exactly the same result as its non-privacy-preserving counterpart.

[Exactness of protocol:two_server:robustdist] The resulting z in protocol:two_server:robustdist is identical to the output of the non-privacy-preserving version of the robust aggregation rule used.

After secret-sharing the g_i to the two servers, protocol:two_server:robustdist performs the local differences ⟨g_i - g_j⟩. Using shared-value multiplication via Beaver's triples, S2 obtains the list of true Euclidean distances ||g_i - g_j||. This list is fed to a distance-based robust aggregation rule oracle, entirely on S2. Therefore, the resulting indices used in the aggregation are identical to those of the non-privacy-preserving robust aggregation. With the exactness of the protocol established, we next focus on the privacy guarantee.


We prove probabilistic (information-theoretic) notions of privacy, which give the strongest guarantee possible. Formally, we aim to show that the distribution of the secret does not change even after conditioning on all observations made by all participants, i.e. each worker, S1, and S2. This implies that the observations carry absolutely no information about the secret. Our results rely on the existence of simple additive secret-sharing protocols as discussed in the Appendix. Each worker only receives the final aggregate z at the end of the protocol and is not involved in any other manner, hence no information can be leaked to it. We now examine S1. The proofs below rely on Beaver's triples, which we summarize in the following lemma.

[Beaver's triples] Suppose we secret-share x and y between S1 and S2 and want to compute the product xy on S2. There exists a protocol which enables this computation using precomputed shares, such that S1 does not learn anything and S2 only learns xy.

Due to the page limit, we defer the details of Beaver's triples and the multiplication of secret shares, as well as the proofs of the next two theorems, to the Appendix.

[Privacy for S1] Let w be the output of the Byzantine oracle, or a vector of 1s (in the non-robust case), and consider the Beaver's triples used in the multiplications together with S1's shares of all secret-shared values. Then, for every worker i, the distribution of g_i conditioned on these values observed by S1 is identical to its distribution conditioned on the public output z alone.

Note that the conditioned values are what S1 observes throughout the algorithm, including the intermediate values of the shared-value multiplications. For S2, the theorem to prove is slightly different, because S2 does not know the output z of the aggregation. In fact, S2 is more similar to an independent system which knows little about the underlying task, the model weights, etc. We show that while S2 observes many intermediate values, it can learn no more than what can be inferred from the model distances.

[Privacy for S2] Let w be the output of the Byzantine oracle, or a vector of 1s (in the non-robust case), and consider the Beaver's triples used in the multiplications together with S2's shares of all secret-shared values. Then, for every worker i, the distribution of g_i conditioned on these values observed by S2 is identical to its distribution conditioned on the pairwise distances alone.


Note that the conditioned values are what S2 observes throughout the algorithm, including the intermediate values of the shared-value multiplications. The model distances indeed only leak similarity among the workers. Such similarity, however, does not tell S2 anything about the parameters themselves: the leeway attack of [mhamdi2018hidden] exploits distance-based rules precisely because distances cannot distinguish two gradients differing by evenly spread noise from two gradients differing strongly in a single parameter. This means the leaked information has low impact on privacy. It is also worth noting that curious workers can only inspect others' values through the public model and its updates. This is because, in our scheme, workers do not interact directly and there is only one round of communication between servers and workers, so the only message a worker receives is the public model update.

Combining with differential privacy

While input privacy is our main goal, our approach is naturally compatible with other, orthogonal notions of privacy. Global differential privacy (DP) [shokri2015privacy, abadi2016deep, chase2017private] is mainly concerned with the privacy of the aggregated model, and whether it leaks information about the training data. On the other hand, local differential privacy (LDP) [evfimievski2003limiting, kasiviswanathan2011can] is a stronger notion which is also concerned with the training process itself: it requires that no communication transmitted by a worker leaks information about its data. In general, it is hard to learn deep models satisfying LDP using iterate perturbation, the standard mechanism for DP [bonawitz2017practical]. Our non-robust protocol is naturally compatible with local differential privacy. Consider the usual iterative optimization algorithm which in each round t performs

    x_{t+1} = x_t - η (z_t + ν_t)    (eq:opt-update)

Here z_t is the aggregate update, x_t are the model parameters, η is the step size, and ν_t is the noise added for DP [abadi2016deep].

[From DP to LDP] Suppose that the noise ν_t in eq:opt-update is sufficient to ensure that the set of model parameters {x_t} satisfies DP. Then, running eq:opt-update while using our secure aggregation protocol to compute the aggregate z_t satisfies LDP.

Unlike existing approaches, we do not face a tension between differential privacy, which relies on real-valued vectors, and cryptographic tools, which operate solely on discrete or quantized objects. This is because our protocols do not rely on any cryptographic primitives, in contrast to e.g. [bonawitz2017practical]. In particular, the vectors g_i can be full-precision (real-valued) and do not need to be quantized. Thus, our secure aggregation protocol can be integrated with a mechanism with global DP properties, e.g. [abadi2016deep], to obtain local DP guarantees for the resulting mechanism.
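A minimal sketch of one noisy update in the spirit of eq:opt-update. The clipping rule and parameter names (`clip`, `sigma`) are illustrative assumptions in the style of abadi2016deep, not the exact mechanism of the paper; the sum of clipped gradients stands in for the output of the secure aggregation protocol.

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_update(x, grads, lr, clip, sigma):
    """One step of noisy SGD (iterate perturbation). When the sum of
    clipped gradients is produced by the secure aggregation protocol,
    no party observes an individual gradient, so the same noise that
    gives global DP yields local DP."""
    # Per-worker norm clipping (illustrative bound `clip`).
    clipped = [g * min(1.0, clip / np.linalg.norm(g)) for g in grads]
    z = sum(clipped)  # in the protocol, computed via secure aggregation
    noise = rng.normal(0.0, sigma * clip, size=x.shape)  # DP noise nu_t
    return x - lr * (z + noise) / len(grads)

x = np.zeros(4)
grads = [rng.normal(size=4) for _ in range(5)]
x_new = dp_update(x, grads, lr=0.1, clip=1.0, sigma=1.0)
```

Because the protocol works directly on real-valued shares, no quantization of the gradients or the noise is needed before secret sharing.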

Empirical analysis of overhead

Figure (figures/performance.pdf): Left: actual time spent; Right: time adjusted for network bandwidth.

We present an illustrative simulation on a local machine (i7-8565U) to demonstrate the overhead of our scheme. We use PyTorch with MPI to train a neural network of 1.2 million parameters on the MNIST dataset. We compare the following three settings: simple aggregation with 1 server, secure aggregation with 2 servers, and robust secure aggregation with 2 servers (with Krum [blanchard2017machine]). The number of workers is always 5. fig:performance shows the time spent on each part of training for one aggregation step: batch gradient computation, uploading and downloading gradients between workers and servers, and communication between the servers. Note that the server-to-server communication could be further reduced by employing more efficient aggregation rules. Since the simulation is run on a local machine, the time spent on communication is underestimated. In the right-hand figure, we therefore adjust the times by assuming 100 Mbps bandwidth for the worker-to-server links and 1 Gbps for the server-to-server link. Even in this scenario, the overhead from private aggregation is small. Furthermore, the additional overhead of the robustness module is moderate compared to standard training, even for realistic deep-learning settings. For comparison, a zero-knowledge-proof-based approach needs 0.03 seconds to encode a submission of 100 integers [corrigan2017prio].

Literature review

Secure Aggregation. In the standard distributed setting with one server, [bonawitz2017practical] proposes a secure aggregation rule which is also fault-tolerant. They generate a shared secret key for each pair of users. The secret keys are used to construct masks for the input gradients such that the masks cancel each other after aggregation. To achieve fault tolerance, they employ Shamir's secret sharing. To deal with active adversaries, they use a public key infrastructure (PKI) as well as a second mask applied to the input. A follow-up work, [mandal2018nike], minimizes the pairwise communication by outsourcing the key generation to two non-colluding cryptographic secret providers. However, both protocols are still not scalable, because each worker needs to compute a shared secret key and a noise mask for every other client. When recovering from failures, all live clients need to be notified and must send their corresponding masks to the server, which introduces significant extra communication. In contrast, workers in our scheme are freed from coordinating with other workers, which leads to a more scalable system.

Byzantine-Robust Aggregation/SGD. [blanchard2017machine] first proposes Krum and Multi-Krum for training machine learning models in the presence of Byzantine workers. [mhamdi2018hidden] proposes a general enhancement recipe for Byzantine-resilient rules, termed Bulyan, to defend against poisoning attacks. [alistarh2018byzantine] gives a robust SGD training scheme with optimal sample complexity and number of SGD computations. [muozgonzlez2019byzantinerobust] uses HMMs to detect and exclude Byzantine workers in federated learning. [yin2018byzantinerobust] proposes median- and trimmed-mean-based Byzantine-robust algorithms which achieve optimal statistical performance. Many of the aforementioned papers require i.i.d. datasets. Byzantine-robust rules for non-i.i.d. datasets have appeared only recently [li2019rsa, ghosh2019robust].
Further, [xie2018phocas] extends the Byzantine setting to attackers manipulating the data transfer between workers and server, and [xie2018zeno] extends it to tolerate an arbitrary number of Byzantine workers. [pillutla2019robust] proposes a robust aggregation rule, RFA, which is also privacy-preserving. However, it is only robust to noise in the dataset and is not Byzantine-tolerant, because it relies on the workers to compute aggregation weights according to the protocol. [corrigan2017prio] proposes a private and robust aggregation system (Prio) based on secret-shared non-interactive proofs (SNIPs). Similar to our setting, the clients secret-share their inputs and send the shares to the servers for aggregation. Each client sends its input shares along with a SNIP proof which the servers use to validate the submission. However, generating a SNIP proof on the client is expensive, and the cost grows with the dimension of the submitted vector. Moreover, the robustness of Prio is limited to validating the range of the data, e.g. checking that an input encoding a binary value is indeed 0 or 1; it does not help if malicious clients misreport their private data.

Inference As A Service. An orthogonal line of work is inference as a service, or oblivious inference, in which a user encrypts its own data and uploads it to a server for inference. The works gilad2016cryptonets, rouhani2017deepsecure, hesamifard2017cryptodl, liu2017oblivious, mohassel2017secureml, chou2018faster, juvekar2018gazelle, riazi2019xonn fall into the general category of two-party computation (2PC). A number of issues have to be addressed: non-linear activations must be replaced with MPC-friendly activations, and floating-point numbers must be represented as integers. [ryffel2019partially] uses functional encryption on polynomial networks. [gilad2016cryptonets] also has to adapt activations to polynomial activations and max pooling to scaled mean pooling.

Server-Aided MPC. One common setting for training machine learning models with MPC is the server-aided case [mohassel2017secureml, chen2019secure]. In previous works, both the model weights and the data are stored as shared values, which in turn makes the inference process computationally very costly. Another issue is that only a limited number of operations (function evaluations) are supported on shared values; approximating non-linear activation functions therefore again introduces significant overhead. In our paper, the computation of gradients is local to the workers, and only the output gradients are sent to the servers. Thus no adaptation of the workers' neural network architectures for MPC is required.


In this paper, we propose a novel secure and Byzantine-robust aggregation rule. To our knowledge, this is the first work to address these two key properties jointly. Our algorithm is simple and fault tolerant and scales well with the number of workers. The protocol is based on two non-colluding honest-but-curious auxiliary servers. In addition, the Byzantine-robust aggregation rule used internally can be replaced by any existing distance-based robust rule. The communication overhead of our algorithm is roughly bounded by a factor of 2. The computation overhead, as shown in protocol:beaver, is simply the cost of a few linear operations which is marginal. No computational overhead is incurred on the workers.