Problem setup, privacy, and robustness
We consider a distributed setup of $n$ user devices, which we call workers, with the help of two additional servers. Each worker $i$ has its own private part of the training dataset. The workers want to collaboratively train a public machine learning model that benefits from the joint training data of all participants. In every training step, each worker computes its own private model update (for example, a gradient based on its own data), denoted by the vector $x_i$. The aggregation protocol aims to compute the sum $z = \sum_{i=1}^{n} x_i$ (or a robust version of this aggregation), which is then used to update a public model. While the result $z$ is public in all cases, the protocol must keep each $x_i$ private from any adversary or other workers.
Security model. We consider honest-but-curious servers which do not collude with each other but may collude with malicious workers. An honest-but-curious server follows the protocol but may try to inspect all messages. We also assume that all communication channels are secure. We guarantee the strong notion of input privacy, which means the servers and workers know nothing more about each other than what can be inferred from the public output of the aggregation $z$.
Byzantine robustness model. We allow the standard Byzantine worker model, which assumes that workers can send arbitrary adversarial messages trying to compromise the process. We assume that a fraction of up to $\alpha$ of the workers is Byzantine, i.e. malicious and not following the protocol.
Additive secret sharing. Secret sharing is a way to split any secret into multiple parts such that no part leaks the secret. Formally, suppose a scalar $a$ is a secret and the secret holder shares it with $k$ parties through secret-shared values $\langle a \rangle$. In this paper, we only consider additive secret sharing, where $\langle a \rangle$ is a notation for the set $\{a_i\}_{i=1}^{k}$ which satisfies $a = \sum_{i=1}^{k} a_i$, with $a_i$ held by party $i$. Crucially, it must not be possible to reconstruct $a$ from any single $a_i$. For vectors like $x_i$, their secret-shared values $\langle x_i \rangle$ are simply component-wise scalar secret-shared values.
Two-server setting. We assume there are two non-colluding servers: a model server (S1) and a worker server (S2). S1 holds the output of each aggregation and thus also the machine learning model, which is public to all workers. S2 holds intermediate values to perform Byzantine aggregation. Another key assumption is that the servers have no incentive to collude with workers, perhaps enforced via a potentially huge penalty if exposed.
It is realistic to assume that the communication link between the two servers S1 and S2 is faster than the individual links to the workers. To perform robust aggregation, the servers will need access to a sufficient number of Beaver’s triples. These are data-independent values required to implement secure multiplication in MPC on both servers, and can be precomputed beforehand. For completeness, the classic algorithm for multiplication is given in the Appendix (sec:beaver_s_mpc_protocol).
Byzantine-robust aggregation oracles. Most existing robust aggregation algorithms rely on distance measures to identify potential adversarial behavior [blanchard2017machine, yin2018byzantinerobust, mhamdi2018hidden, li2019rsa, ghosh2019robust]. All such distance-based aggregation rules can be directly incorporated into our proposed scheme, making them secure. While many of the aforementioned papers assume that the workers have i.i.d. datasets, our protocol is oblivious to the distribution of the data across the workers. In particular, our protocol also works with schemes such as [li2019rsa, ghosh2019robust] designed for non-i.i.d. data.
Secure aggregation protocol: two-server model
Each worker first splits its private vector into two additive secret shares and transmits one to each server, ensuring that neither server can reconstruct the original vector on its own. The two servers then execute our secure aggregation protocol. On the level of servers, the protocol is a two-party computation (2PC). In the case of non-robust aggregation, servers simply add all shares (we present this case in detail in protocol:two_server:nonrobust). In the robust case, which is our main interest here, the two servers exactly emulate an existing Byzantine-robust aggregation rule, at the cost of revealing only distances of worker gradients on the server (the robust algorithm is presented in protocol:two_server:robustdist). Finally, the resulting aggregated output vector is sent back to all workers and applied as the update to the public machine learning model.
Non-robust aggregation
In each round, protocol:two_server:nonrobust consists of two stages:

WorkerSecretSharing (fig:diagram:1): each worker $i$ randomly splits its private input $x_i$ into two additive secret shares $x_i = x_i^{(1)} + x_i^{(2)}$. This can be done e.g. by sampling a large noise $\xi_i$ and then using $x_i - \xi_i$ and $\xi_i$ as the shares. Worker $i$ sends $x_i^{(1)}$ to S1 and $x_i^{(2)}$ to S2. We write $\langle x_i \rangle$ for the secret-shared values distributed over the two servers.

AggregationAndUpdate (fig:diagram:3): Given some weights $\{p_i\}_{i=1}^{n}$, each server locally computes its share of $\langle \sum_{i=1}^{n} p_i x_i \rangle$. Then S2 sends its share to S1 so that S1 can compute $z = \sum_{i=1}^{n} p_i x_i$. S1 updates the public model with $z$.
Our secure aggregation protocol is extremely simple and, as we will discuss later, has very low communication overhead, does not require cryptographic primitives, gives strong input privacy, is compatible with differential privacy, and is robust to worker dropouts and failures. We believe this makes our protocol especially attractive for federated learning applications.
We now argue about correctness and privacy. It is clear that the output $z$ of the above protocol satisfies $z = \sum_{i=1}^{n} p_i x_i$, ensuring that all workers compute the right update. Now we argue about the privacy guarantees. We track the values stored by each of the servers and workers:

S1: The secret shares $\{x_i^{(1)}\}_{i=1}^{n}$ and the sum of the other shares $\sum_{i=1}^{n} p_i x_i^{(2)}$.

S2: The secret shares $\{x_i^{(2)}\}_{i=1}^{n}$.

Worker $i$: $x_i$ and $z$.
Clearly, the workers have no information other than the aggregate $z$ and their own data. S2 only has the secret shares, which on their own leak no information about any data. Hence, perhaps surprisingly, S2 learns no information in this process. S1 has its own secret shares and also the sum of the other shares. If $n = 1$, then $z = x_1$ and hence S1 is allowed to learn everything. If $n > 1$, then S1 cannot recover information about any individual secret share from the sum. Thus, S1 learns $z$ and nothing else.
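The two stages of the non-robust protocol can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`share`, `weighted_sum`, `aggregate`) and the `noise_scale` parameter are hypothetical, vectors are plain Python lists, and both servers are simulated in a single process. With floating-point noise the shares mask the input only approximately, so the sketch demonstrates correctness of the aggregate rather than the privacy guarantee.

```python
import random

def share(x, noise_scale=1e6, rng=random):
    """Split vector x into two additive shares (x - xi, xi).

    noise_scale is an illustrative choice for the 'large noise' xi.
    """
    xi = [rng.uniform(-noise_scale, noise_scale) for _ in x]
    return [a - b for a, b in zip(x, xi)], xi

def weighted_sum(shares, weights):
    """Local computation on one server: sum_i p_i * share_i, per coordinate."""
    dim = len(shares[0])
    return [sum(p * s[k] for p, s in zip(weights, shares)) for k in range(dim)]

def aggregate(worker_vectors, weights, rng=random):
    """Simulate WorkerSecretSharing followed by AggregationAndUpdate."""
    s1_shares, s2_shares = [], []
    for x in worker_vectors:
        first, second = share(x, rng=rng)
        s1_shares.append(first)   # sent to S1
        s2_shares.append(second)  # sent to S2
    z1 = weighted_sum(s1_shares, weights)  # computed on S1
    z2 = weighted_sum(s2_shares, weights)  # computed on S2, then sent to S1
    return [a + b for a, b in zip(z1, z2)]  # S1 reconstructs z
```

Calling `aggregate([[1.0, 2.0], [3.0, -1.0]], [1.0, 1.0])` returns a vector close to `[4.0, 1.0]`, up to floating-point rounding. Neither share list alone determines any worker's input in the idealized (exact arithmetic) version of this scheme.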
Robust aggregation
We now describe how protocol:two_server:robustdist replaces the simple aggregation with any distance-based robust aggregation rule, e.g. Multi-Krum [blanchard2017machine], which relies on computing the distances $\|x_i - x_j\|$ for all pairs of workers. The key idea is to use two-party MPC to do these computations in a secure way.

WorkerSecretSharing (fig:diagram:1): As before, each worker $i$ secret-shares $\langle x_i \rangle$ over the two servers S1 and S2.

RobustWeightSelection (fig:diagram:2): After collecting all secret-shared values $\{\langle x_i \rangle\}_{i=1}^{n}$, the servers compute the pairwise differences $\{\langle x_i - x_j \rangle\}_{i<j}$ locally. S2 then reveals, to itself exclusively, in plain text all of the pairwise Euclidean distances between workers $\{\|x_i - x_j\|\}_{i<j}$ with the help of precomputed Beaver’s triples and protocol:beaver. The distances are kept private from S1 and the workers. S2 then feeds these distances to the distance-based robust aggregation rule (e.g. Multi-Krum), which returns (on S2) a weight vector $p = \{p_i\}_{i=1}^{n}$ (a selected subset of indices can be converted to a vector of binary values), and secret-shares it with S1 for aggregation.

AggregationAndUpdate (fig:diagram:3): Given the weight vector $p$ from the previous step, we would like S1 to compute $\sum_{i=1}^{n} p_i x_i$. S2 secret-shares the values $\{p_i\}$ with S1 instead of sending them in plain text, since they may be private. Then S1 reveals to itself, but not to S2, in plain text the value of $z = \sum_{i=1}^{n} p_i x_i$ using secret-shared multiplication and updates the public model.

WorkerPullModel (fig:diagram:4): Workers pull the latest public model from S1 and update it locally.
The key difference between the robust and the non-robust aggregation scheme is the weight selection phase, where S2 computes all pairwise distances and uses them to run a robust aggregation rule in a black-box manner. S2 computes these distances i) without leaking any information to S1, and ii) without itself learning anything other than the pairwise distances (and in particular none of the actual values of $x_i$). To perform such a computation, S1 and S2 use precomputed Beaver’s triples (protocol:beaver in the Appendix), which can be made available in a scalable way [smart2018taas].
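To make the distance computation concrete, the following sketch multiplies secret-shared values with Beaver's triples and uses that to obtain a squared Euclidean distance from shares alone. It is an illustration under assumptions that are not in the paper: values are integers modulo `P` (chosen here only so that the arithmetic is exact), triples come from a simulated trusted dealer (`beaver_triple`) standing in for precomputed triples, and both servers run in one process. All function names are hypothetical.

```python
import random

P = 2**61 - 1  # modulus for this sketch (an illustrative choice)

def share(v, rng):
    """Additively share integer v modulo P as (share_on_S1, share_on_S2)."""
    r = rng.randrange(P)
    return (v - r) % P, r

def beaver_triple(rng):
    """Simulated dealer: shares of random a, b and of c = a * b."""
    a, b = rng.randrange(P), rng.randrange(P)
    return share(a, rng), share(b, rng), share(a * b % P, rng)

def beaver_mul(x_sh, y_sh, triple):
    """Return shares of x * y given shares of x, y and one Beaver triple."""
    (a1, a2), (b1, b2), (c1, c2) = triple
    # Each server masks its shares with the triple; the masked values
    # e = x - a and f = y - b are exchanged and opened (they look random).
    e = (x_sh[0] - a1 + x_sh[1] - a2) % P
    f = (y_sh[0] - b1 + y_sh[1] - b2) % P
    z1 = (c1 + e * b1 + f * a1 + e * f) % P  # computed on S1
    z2 = (c2 + e * b2 + f * a2) % P          # computed on S2
    return z1, z2

def squared_distance(xi_sh, xj_sh, rng):
    """xi_sh, xj_sh: per-coordinate (S1 share, S2 share) of two vectors."""
    total1 = total2 = 0
    for (u1, u2), (v1, v2) in zip(xi_sh, xj_sh):
        d = ((u1 - v1) % P, (u2 - v2) % P)  # local difference on each server
        z1, z2 = beaver_mul(d, d, beaver_triple(rng))
        total1, total2 = (total1 + z1) % P, (total2 + z2) % P
    # S1 sends total1 to S2, which alone reconstructs the squared distance.
    return (total1 + total2) % P
```

In this sketch only S2 ends up with the reconstructed distance, mirroring the role split described above; the opened values `e` and `f` are masked by the triple and therefore carry no information about the inputs.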
Salient features
Overall, our protocols are very resource-light and straightforward from the perspective of the workers. Further, since we use Byzantine-robust aggregation, our protocols are provably fault-tolerant even if a large fraction of workers misbehave. This further lowers the requirements on a worker. In particular:
Communication overhead. We distinguish the time to upload a model (or gradient) to a server, the time to download from a server, and the time to transmit data between servers. In applications, the individual uplink speed from workers to servers is typically the main bottleneck, as it is typically much slower than the downlink, and the bandwidth between servers can be very large. For our protocols, the time spent on the uplink is within a factor of 2 of the non-secure variants, since each worker uploads two shares instead of one vector. Besides, our protocol only requires one round of communication, which is an advantage over interactive proofs.
Fault tolerance. The workers in protocol:two_server:nonrobust and protocol:two_server:robustdist are completely stateless across multiple rounds and there is no offline phase required. This means that workers can start participating in the protocols simply by pulling the latest public model. Further, our protocols are unaffected if some workers drop out in the middle of a round. Unlike in [bonawitz2017practical], there is no entanglement between workers and we don’t face unbounded recovery issues.
Compatibility with local differential privacy. As a byproduct, our protocol can be used to convert differentially private mechanisms such as [abadi2016deep], which only guarantee the privacy of the aggregate model, into the stronger locally differentially private mechanisms, which guarantee user-level privacy.
Other Byzantine-robust oracles. We can also use some robust aggregation rules which are not based on pairwise distances, such as Byzantine SGD [alistarh2018byzantine]. Since the basic structures are very similar to protocol:two_server:robustdist, we put protocol:two_server:byzantinesgd in the appendix.
Security. The security of protocol:two_server:nonrobust is straightforward, as we previously discussed. The security of protocol:two_server:robustdist again relies on the separation of information between S1 and S2, with neither the workers nor S1 learning anything other than the aggregate $z$. We will next formally prove that this is true even in the presence of malicious workers.
protocol:two_server:nonrobust: Two-server secure aggregation (non-robust variant)
Setup: $n$ workers (non-Byzantine) with private vectors $x_i$. Two non-colluding servers S1 and S2.
Workers: (WorkerSecretSharing)

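split private $x_i$ into additive secret shares $\langle x_i \rangle = \{x_i^{(1)}, x_i^{(2)}\}$ (such that $x_i = x_i^{(1)} + x_i^{(2)}$)

send $x_i^{(1)}$ to S1 and $x_i^{(2)}$ to S2
Servers:

$\forall i$, S1 collects $x_i^{(1)}$ and S2 collects $x_i^{(2)}$

(AggregationAndUpdate):

On S1 and S2, compute $\langle \sum_{i=1}^{n} x_i \rangle$ locally

S2 sends its share of $\langle \sum_{i=1}^{n} x_i \rangle$ to S1

S1 reveals $z = \sum_{i=1}^{n} x_i$ to everyone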

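protocol:two_server:robustdist: Two-server secure robust aggregation (distance-based)
Setup: $n$ workers, $\alpha n$ of which are Byzantine. Two non-colluding servers S1 and S2.
Workers: (WorkerSecretSharing)

split private $x_i$ into additive secret shares $\langle x_i \rangle = \{x_i^{(1)}, x_i^{(2)}\}$ (such that $x_i = x_i^{(1)} + x_i^{(2)}$)

send $x_i^{(1)}$ to S1 and $x_i^{(2)}$ to S2
Servers:

$\forall i$, S1 collects gradient share $x_i^{(1)}$ and S2 collects $x_i^{(2)}$

(RobustWeightSelection):

For each pair $(i, j)$ of workers, compute their Euclidean distance $\|x_i - x_j\|$:

On S1 and S2, compute $\langle x_i - x_j \rangle = \langle x_i \rangle - \langle x_j \rangle$ locally

Use precomputed Beaver’s triples (see protocol:beaver) to compute the distance $\|x_i - x_j\|^2$

S2 performs the robust aggregation rule $p = $ MultiKrum($\{\|x_i - x_j\|\}_{i<j}$)

S2 secret-shares $\langle p \rangle$ with S1

(AggregationAndUpdate):

On S1 and S2, use MPC multiplication to compute $\langle \sum_{i=1}^{n} p_i x_i \rangle$ locally

S2 sends its share of $\langle \sum_{i=1}^{n} p_i x_i \rangle$ to S1

S1 reveals $z = \sum_{i=1}^{n} p_i x_i$ to all workers.

Workers:

(WorkerPullModel): Collect $z$ and update model locally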
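The weight selection that S2 runs on the revealed distances can be sketched as follows. This is a hedged illustration of a Multi-Krum style rule in the spirit of [blanchard2017machine], not the authors' code: the function name `multi_krum_weights`, the parameters `f` (assumed number of Byzantine workers) and `m` (number of workers to keep), and the choice of unnormalized binary weights are assumptions of this sketch.

```python
def multi_krum_weights(sq_dist, f, m):
    """Select m workers with the lowest Krum scores and return 0/1 weights.

    sq_dist[i][j] is the squared Euclidean distance between workers i and j.
    The Krum score of worker i sums its n - f - 2 smallest distances to
    other workers; low scores indicate workers close to a large cluster.
    """
    n = len(sq_dist)
    k = n - f - 2
    if k < 1:
        raise ValueError("need n - f - 2 >= 1")
    scores = []
    for i in range(n):
        nearest = sorted(sq_dist[i][j] for j in range(n) if j != i)[:k]
        scores.append(sum(nearest))
    # sorted() is stable, so ties are broken by worker index.
    selected = set(sorted(range(n), key=lambda i: scores[i])[:m])
    return [1 if i in selected else 0 for i in range(n)]
```

The function only needs the pairwise distances, which is why it can be run on S2 in a black-box manner without access to any worker's vector. The resulting binary vector plays the role of the weights $p$ that S2 secret-shares with S1.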
Theoretical guarantees
Exactness
In the following lemma we show that protocol:two_server:robustdist gives exactly the same result as the non-privacy-preserving version.
Lemma (Exactness of protocol:two_server:robustdist). The resulting $z$ in protocol:two_server:robustdist is identical to the output of the non-privacy-preserving version of the robust aggregation rule used.
Proof. After secret-sharing $x_i$ to the two servers, protocol:two_server:robustdist computes the local differences $\langle x_i - x_j \rangle$. Using shared-value multiplication via Beaver’s triples, S2 obtains the list of true Euclidean distances $\{\|x_i - x_j\|\}_{i<j}$. The result is fed to a distance-based robust aggregation rule oracle, all solely on S2. Therefore, the resulting indices (weights $p$) used in $z = \sum_{i=1}^{n} p_i x_i$ are identical to those of the non-privacy-preserving robust aggregation.
With the exactness of the protocol established, we next focus on the privacy guarantee.
Privacy
We prove probabilistic (information-theoretic) notions of privacy, which give the strongest guarantee possible. Formally, we aim to show that the distribution of the secret does not change even after being conditioned on all observations made by all participants, i.e. each worker $i$, S1 and S2. This implies that the observations carry absolutely no information about the secret. Our results rely on the existence of simple additive secret-sharing protocols as discussed in the Appendix.
Each worker $i$ only receives the final aggregated $z$ at the end of the protocol and is not involved in any other manner. Hence no information can be leaked to them. We will now examine S1. The proofs below rely on Beaver’s triples, which we summarize in the following lemma.
Lemma (Beaver’s triples). Suppose we secret-share $\langle x \rangle$ and $\langle y \rangle$ between S1 and S2 and want to compute $xy$ on S2. There exists a protocol which enables such computation using precomputed shares $\langle a \rangle$, $\langle b \rangle$, $\langle c \rangle$ with $c = ab$ such that S1 does not learn anything and S2 only learns $xy$.
Due to the page limit, we defer the details about Beaver’s triples, multiplying secret shares, as well as the proofs of the next two theorems to the Appendix.
Theorem 1 (Privacy for S1). Let $z = \sum_{i=1}^{n} p_i x_i$, where $p$ is the output of the Byzantine oracle or a vector of 1s (non-private). Then for all workers $i$, conditioning on everything S1 observes (its shares $\langle \cdot \rangle^{(1)}$ of the secret-shared values, its shares of the Beaver’s triples used in the multiplications, the intermediate values opened during those multiplications, and $z$) reveals no more than $z$ itself:
$$\mathbb{P}\left(x_i \mid \text{all values observed by S1}\right) = \mathbb{P}\left(x_i \mid z\right).$$
Note that the conditioned values are what S1 observes throughout the algorithm, including the intermediate values of the shared-value multiplication.
For S2, the theorem to prove is a bit different because S2 does not know the output of the aggregation $z$. In fact, S2 is more similar to an independent system which knows little about the underlying tasks, model weights, etc. We show that while S2 observes many intermediate values, it learns no more than what can be inferred from the model distances.
Theorem 2 (Privacy for S2). Let $p$ be the output of the Byzantine oracle or a vector of 1s (non-private). Then for all workers $i$, conditioning on everything S2 observes (its shares $\langle \cdot \rangle^{(2)}$ of the secret-shared values, its shares of the Beaver’s triples used in the multiplications, and the intermediate values opened during those multiplications) reveals no more than the pairwise distances:
$$\mathbb{P}\left(x_i \mid \text{all values observed by S2}\right) = \mathbb{P}\left(x_i \mid \{\|x_i - x_j\|\}_{i<j}\right).$$
Note that the conditioned values are what S2 observes throughout the algorithm, including the intermediate values of the shared-value multiplication. The model distances indeed only leak similarity among the workers. Such similarity, however, does not tell S2 anything about the parameters themselves: the leeway attack of [mhamdi2018hidden] exploits distance-based rules precisely because distances do not distinguish two gradients that differ by evenly distributed noise from two gradients that differ greatly in a single parameter. This means the leaked information has a low impact on privacy.
It is also worth noting that curious workers can only inspect others’ values by learning from the public model/update. This is because in our scheme workers don’t interact directly and there is only one round of communication between servers and workers. So the only message a worker receives is the public model update.
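The information-theoretic claim for a single additive share can be checked by brute force in a toy setting. The sketch below assumes, purely for illustration, that secrets live in the integers modulo a small `modulus` and that the mask is uniform over that range (the paper's shares are real-valued); `share_distribution` is a hypothetical helper. It enumerates every possible mask and records which share one server would see.

```python
from collections import Counter

def share_distribution(secret, modulus):
    """Distribution of the share seen by one server, over all equally
    likely masks. The other server holds the mask itself."""
    return Counter((secret - mask) % modulus for mask in range(modulus))
```

For any two secrets the resulting distributions coincide, e.g. `share_distribution(2, 7) == share_distribution(5, 7)`, and every residue appears exactly once. Observing one share therefore does not change the distribution of the secret, which is the notion of privacy used in the theorems above.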
Combining with differential privacy
While input privacy is our main goal, our approach is naturally compatible with other orthogonal notions of privacy. Global differential privacy (DP) [shokri2015privacy, abadi2016deep, chase2017private] is mainly concerned with the privacy of the aggregated model, and whether it leaks information about the training data. On the other hand, local differential privacy (LDP) [evfimievski2003limiting, kasiviswanathan2011can]
is a stronger notion which is also concerned with the training process itself. It requires that every communication transmitted by a worker does not leak information about its data. In general, it is hard to learn deep learning models satisfying LDP using iterate perturbation (which is the standard mechanism for DP) [bonawitz2017practical]. Our non-robust protocol
is naturally compatible with local differential privacy. Consider the usual iterative optimization algorithm which in each round $t$ performs an update of the form
$$w_{t+1} = w_t + (g_t + \nu_t). \qquad \text{(eq:optupdate)}$$
Here $g_t$ is the aggregate update, $w_t$ is the model parameters, and $\nu_t$ is the noise added for DP [abadi2016deep].
Theorem 3 (From DP to LDP). Suppose that the noise $\nu_t$ in eq:optupdate is sufficient to ensure that the set of model parameters $\{w_t\}$ satisfies DP. Then, running eq:optupdate while using protocol:two_server:nonrobust to compute $g_t$ by securely aggregating the workers' updates satisfies LDP.
Unlike existing approaches, we do not face a tension between differential privacy, which relies on real-valued vectors, and cryptographic tools, which operate solely on discrete/quantized objects. This is because our protocols do not rely on any cryptographic primitives, in contrast to e.g. [bonawitz2017practical]. In particular, the vectors can be full-precision (real-valued) and do not need to be quantized. Thus, our secure aggregation protocol can be integrated with a mechanism which has global DP properties, e.g. [abadi2016deep], to obtain local DP guarantees for the resulting mechanism.
Empirical analysis of overhead
[Figure fig:performance: figures/performance.pdf]
We present an illustrative simulation on a local machine (i7-8565U) to demonstrate the overhead of our scheme. We use PyTorch with MPI to train a neural network of 1.2 million parameters on the MNIST dataset. We compare the following three settings: simple aggregation with 1 server, secure aggregation with 2 servers, and robust secure aggregation with 2 servers (with Krum [blanchard2017machine]). The number of workers is always 5. fig:performance shows the time spent on all parts of training for one aggregation step, broken down into the time spent on batch gradient computation, the time spent on uploading and downloading gradients, and the time spent on communication between servers. Note that the server-to-server communication could be further reduced by employing more efficient aggregation rules. Since the simulation is run on a local machine, the time spent on communication is underestimated. In the right-hand figure, we adjust the times by assuming that the worker-to-server link has 100 Mbps bandwidth and the server-to-server link has 1 Gbps. Even in this scenario, we can see that the overhead from private aggregation is small. Furthermore, the additional overhead of the robustness module is moderate compared to standard training, even for realistic deep-learning settings. For comparison, a zero-knowledge-proof-based approach needs to spend 0.03 seconds to encode a submission of 100 integers [corrigan2017prio].
Literature review
Secure Aggregation. In the standard distributed setting with 1 server, [bonawitz2017practical] proposes a secure aggregation rule which is also fault tolerant. They generate a shared secret key for each pair of users. The secret keys are used to construct masks for the input gradients so that the masks cancel each other after aggregation. To achieve fault tolerance, they employ Shamir’s secret sharing. To deal with active adversaries, they use a public key infrastructure (PKI) as well as a second mask applied to the input. A follow-up work [mandal2018nike] minimizes the pairwise communication by outsourcing the key generation to two non-colluding cryptographic secret providers. However, both protocols are still not scalable because each worker needs to compute a shared secret key and a noise mask for every other client. When recovering from failures, all live clients need to be notified and send their corresponding masks to the server, which introduces significant extra communication. In contrast, workers in our scheme are freed from coordinating with other workers, which leads to a more scalable system.
Byzantine-Robust Aggregation/SGD. [blanchard2017machine] first proposes Krum and Multi-Krum for training machine learning models in the presence of Byzantine workers. [mhamdi2018hidden] proposes a general enhancement recipe for Byzantine-resilient rules, termed Bulyan, to defend against poisoning attacks. [alistarh2018byzantine] presents a robust SGD training scheme with provably optimal sample complexity and number of SGD computations. [muozgonzlez2019byzantinerobust] uses an HMM to detect and exclude Byzantine workers for federated learning. [yin2018byzantinerobust] proposes median and trimmed-mean based Byzantine-robust algorithms which achieve optimal statistical performance. Many of the aforementioned papers require i.i.d. datasets. Byzantine-robust rules for non-i.i.d. datasets have appeared only recently [li2019rsa, ghosh2019robust]. Further, [xie2018phocas] extends the Byzantine setting to attackers manipulating data transfer between workers and server, and [xie2018zeno] extends it to tolerate an arbitrary number of Byzantine workers.
[pillutla2019robust] proposes a robust aggregation rule, RFA, which is also privacy-preserving. However, it is only robust to noise in the dataset and is not Byzantine-tolerant, because it relies on workers to compute aggregation weights according to the protocol. [corrigan2017prio] proposes a private and robust aggregation system (Prio) based on secret-shared non-interactive proofs (SNIP). Similar to our setting, the clients secret-share their inputs and send them to the servers for aggregation. In fact, each client sends input shares along with a SNIP proof which is used by the servers to validate the submission. However, the generation of a SNIP proof on the client is expensive and the cost increases with the dimension of the vectors submitted. Besides, the robustness in that paper is limited to validating the range of the data, e.g. validating that the input for a binary value is 0 or 1. It does not help if malicious clients misreport their private data.
Inference As A Service. An orthogonal line of work is inference as a service, or oblivious inference. A user encrypts its own data and uploads it to the server for inference. The works [gilad2016cryptonets, rouhani2017deepsecure, hesamifard2017cryptodl, liu2017oblivious, mohassel2017secureml, chou2018faster, juvekar2018gazelle, riazi2019xonn] fall into the general category of two-party computation (2PC). A number of issues have to be taken into account: the non-linear activations should be replaced with MPC-friendly activations, and floating-point numbers have to be represented as integers. [ryffel2019partially] uses functional encryption on polynomial networks. [gilad2016cryptonets] also has to adapt activations to polynomial activations and max pooling to scaled mean pooling.
Server-Aided MPC. One common setting for training machine learning models with MPC is the server-aided case [mohassel2017secureml, chen2019secure]. In previous works, both the model weights and the data are stored as shared values, which in turn makes the inference process computationally very costly. Another issue is that only a limited number of operations (function evaluations) are supported by shared values. Therefore, approximating non-linear activation functions again introduces significant overhead. In our paper, the computation of gradients is local to the workers, and only the output gradients are sent to the servers. Thus no adaptations of the workers’ neural network architectures for MPC are required.
Conclusion
In this paper, we propose a novel secure and Byzantine-robust aggregation rule. To our knowledge, this is the first work to address these two key properties jointly. Our algorithm is simple and fault tolerant and scales well with the number of workers. The protocol is based on two non-colluding honest-but-curious auxiliary servers. In addition, the Byzantine-robust aggregation rule used internally can be replaced by any existing distance-based robust rule. The communication overhead of our algorithm is roughly bounded by a factor of 2. The computation overhead, as shown in protocol:beaver, is simply the cost of a few linear operations, which is marginal. No computational overhead is incurred on the workers.