Distributed Privacy-Preserving Prediction

10/25/2019 · by Lingjuan Lyu, et al.

In privacy-preserving machine learning, individual parties are reluctant to share their sensitive training data due to privacy concerns. Even the trained model parameters or predictions can leak a serious amount of private information. To address these problems, we demonstrate a generally applicable Distributed Privacy-Preserving Prediction (DPPP) framework, in which, instead of sharing the more sensitive data or model parameters, an untrusted aggregator combines only multiple models' predictions under a provable privacy guarantee. Our framework integrates two main techniques to guarantee individual privacy. First, we improve the previous analysis of the Binomial mechanism to achieve distributed differential privacy. Second, we utilize homomorphic encryption to ensure that the aggregator learns nothing but the noisy aggregated prediction. We empirically evaluate the effectiveness of our framework on various datasets and compare it with other baselines. The experimental results demonstrate that our framework has performance comparable to the non-private frameworks and delivers better results than the local differentially private and standalone frameworks.


1 Introduction

In the real world, many practical applications would benefit from large-scale machine learning across sensitive datasets owned by different parties. This trend is motivated by the fact that the data owned by a single organization may be very homogeneous, resulting in an overfit model that delivers inaccurate results when applied to other data. On the other hand, there is strong demand to perform machine learning in a collaborative manner, since massive amounts of data and computational power are often required to obtain accurate results at test time. A common practice in distributed learning is to parallelize computation among multiple parties, which reduces the demand for resources on any single party while reaping the benefits of multiple data sources.

However, increasing privacy and confidentiality concerns pose obstacles to such collaboration [ohno2004protecting]. For the sake of privacy, most approaches cannot afford to share the trained model publicly. Even the predictions output by a trained model can reveal information about the training data through black-box attacks [tramer2016stealing, shokri2017membership]. Therefore, neither the training data, the trained model, nor the model's predictions should be shared directly. Meanwhile, these privacy concerns can be largely alleviated if appropriate privacy-preserving schemes are applied before the relevant statistic is released.

In consideration of privacy concerns in a distributed setting, instead of sharing the more sensitive local data or model parameters of any party, we examine an alternative approach which ensures that the aggregated prediction is differentially private, without compromising privacy or degrading test accuracy. We are thus motivated to build a distributed privacy-preserving prediction (DPPP) framework, which distinguishes itself from existing techniques by allowing parties to keep full control of their own data, and by distributing among multiple parties the total amount of noise required to guarantee global differential privacy (GDP) of the final prediction.

In particular, we study the applicability of DPPP to horizontally partitioned databases, where multiple parties each own different groups of individuals with similar features. For example, different hospitals, each holding the same kind of information for different patients, can collaboratively perform statistical analyses of the union of their patients, while ensuring privacy for each patient. Consequently, instead of training a centralized model on the whole database D, the database is partitioned into N disjoint subsets D_1, ..., D_N, held by parties who are unwilling to make their training data, model parameters or model predictions public, or to share them with others. Each subset D_i consists of party i's training data and labels, and an individual model is trained separately on each D_i.

To better preserve privacy, we investigate a predicate function f that releases a model prediction for any test record. In the distributed scenario, the answer to each test record is obtained by applying f to the disjoint subsets and summing their predictions. Considering individual privacy and potential black-box attacks [papernot2017practical], we appropriately perturb model predictions before releasing them for aggregation. The primitive of this provably private noisy sum was demonstrated by Blum et al. [blum2005practical]. Each party first trains a local model based on its training data D_i; to answer the prediction query for any test point, each party applies the predicate function f, which returns its votes for all classes as a one-hot prediction vector that sums up to 1. Hence, the aggregated prediction for each class falls within the range [0, N], where N is the total number of parties. It then follows that the sensitivity of the aggregate is 1 with respect to the change of any record, corresponding to the scenario where exactly one party changes its data. In other words, changing one record can affect at most one party's prediction, so the aggregated prediction for each class can change by at most 1. To further minimize the privacy leakage from local predictions, maintain utility and ensure aggregator obliviousness, we combine distributed differential privacy and homomorphic encryption. This simple privacy-preserving aggregation of local model predictions is agnostic to the underlying machine-learning techniques (cf. differentially-private stochastic gradient descent in [abadi2016deep]).
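To make the sensitivity argument concrete, here is a short NumPy sketch (purely illustrative; the variable names are ours rather than the paper's) that builds one-hot prediction vectors for a handful of parties and checks that changing a single record, and hence at most one party's vote, moves each coordinate of the aggregate by at most 1.

```python
import numpy as np

num_parties, num_classes = 5, 3

# One-hot prediction vectors: party i votes for class votes[i].
votes = np.array([0, 2, 2, 1, 2])
one_hot = np.eye(num_classes, dtype=int)[votes]          # shape (5, 3)
aggregate = one_hot.sum(axis=0)                          # per-class vote counts

# A neighbouring database: one record changes, so at most one party's vote can flip.
votes_neighbor = votes.copy()
votes_neighbor[0] = 1                                    # party 0 now votes for class 1
aggregate_neighbor = np.eye(num_classes, dtype=int)[votes_neighbor].sum(axis=0)

# Each per-class count changes by at most 1.
print(aggregate, aggregate_neighbor, np.abs(aggregate - aggregate_neighbor).max())
```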

Our contributions are summarized as follows:

  • We formulate a distributed privacy-preserving prediction framework, named DPPP, which combines distributed differential privacy (DDP) and homomorphic encryption to ensure individual privacy, maintain utility and provide aggregator obliviousness, i.e., the aggregator learns nothing but the noisy aggregated prediction.

  • We explore the stability of the Binomial Mechanism (BM) to guarantee (ε, δ)-differential privacy in the distributed setting, and formally provide the tightest bound to date.

  • The experimental results on various datasets demonstrate that DPPP delivers comparable performance to the non-private frameworks, and yields better results than the local differentially private and standalone frameworks.

2 Preliminaries and Related Work

2.1 Distributed Differential Privacy

Suppose the aggregator evaluates a function f on the randomized results of the N parties (we use hatted variables to denote the randomized versions of a party's local result), and let K denote a set of compromised parties, with its complement denoting the uncompromised set. A formal DDP definition adapted from [shi2011privacy] is given as follows.

Definition 1 ((ε, δ)-Distributed DP).

Let ε > 0 and 0 ≤ δ < 1. We say that the data randomization procedure, with randomness over the joint distribution of the parties' noise, preserves (ε, δ)-distributed differential privacy (DDP) with respect to the function f and under a γ fraction of uncompromised parties if the following condition holds: for any neighbouring databases that differ in one record, for any measurable subset S of the output range, and for any subset of at least γN honest parties, the randomized aggregate satisfies the (ε, δ)-guarantee conditioned on the compromised parties' randomness (a sketch of this condition is given below).
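For concreteness, the omitted condition can be sketched in the standard (ε, δ) form used by Shi et al.; the notation below is ours and may differ from the original paper:

```latex
\Pr\big[\, f(\hat{r}) \in S \;\big|\; \text{randomness of the compromised parties} \,\big]
\;\le\;
e^{\varepsilon}\,
\Pr\big[\, f(\hat{r}') \in S \;\big|\; \text{randomness of the compromised parties} \,\big]
\;+\; \delta ,
```

where the hatted vectors denote the randomized local results computed on the two neighbouring databases and S is any measurable subset of the output range.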

In the above definition, the probability is conditioned on the randomness contributed by the compromised parties; i.e., the definition ensures that if at least γN participants are honest and uncompromised, the accumulated noise has a magnitude similar to that required for GDP. For differentially private aggregation of local statistics, DDP therefore permits each party to randomize its local statistic to a lesser degree than local differential privacy (LDP) would require.

2.2 Homomorphic Encryption

Additive homomorphic encryption allows the calculation of the encrypted sum of plaintexts from their corresponding ciphertexts. Although there are several additive homomorphic cryptographic schemes, we use the threshold variant of Paillier scheme [damgaard2001generalisation] in our framework, because it not only allows additive homomorphic encryption, but also distributes decryption among parties.

In this cryptosystem, a party can encrypt a plaintext m with the public key (n, g) as

E(m) = g^m · r^n mod n^2,    (1)

where r ∈ Z*_n (Z*_n denotes the multiplicative group of invertible elements of Z_n) is selected randomly and privately by each party. The homomorphic properties of this cryptosystem can be described as:

E(m_1, r_1) · E(m_2, r_2) = E(m_1 + m_2 mod n, r_1 · r_2 mod n),    (2)
E(m_1, r_1)^k = E(k · m_1 mod n, r_1^k mod n),    (3)

where m_1, m_2 are the plaintexts that need to be encrypted, r_1, r_2 are the private random values, and k is a constant.

In this paper, a (t, N)-threshold Paillier cryptosystem is adopted, in which the private key is distributed among the N parties (as shares s_1, ..., s_N), so that no single party has the complete private key. For any ciphertext c, each party P_i (i = 1, ..., N) computes a partial decryption c_i with its own partial private key s_i as:

c_i = c^{2Δ s_i} mod n^2, with Δ = N!.    (4)

Then, based on the combining algorithm in [damgaard2001generalisation], at least t partial decryptions are required to recover the plaintext m.
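To illustrate the additive homomorphism of Eqs. (1)-(3), the following minimal Python sketch implements plain (non-threshold) Paillier with tiny hard-coded primes; it is for illustration only and omits the threshold key sharing and partial decryption of Eq. (4).

```python
import math
import random

# Toy key generation (insecure: real deployments use >= 1024-bit moduli).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)          # private key component lambda
mu = pow(lam, -1, n)                  # valid because g = n + 1

def encrypt(m):
    """Eq. (1): E(m) = g^m * r^n mod n^2, with random r in Z*_n."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Standard Paillier decryption with L(x) = (x - 1) // n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1, c2 = encrypt(17), encrypt(25)
assert decrypt((c1 * c2) % n2) == 42        # Eq. (2): product of ciphertexts -> sum of plaintexts
assert decrypt(pow(c1, 3, n2)) == 3 * 17    # Eq. (3): exponentiation -> scalar multiplication
print("homomorphic checks passed")
```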

2.3 Multi-party Privacy

In the multi-party scenario, where data is sourced from multiple parties and the server is not trustworthy, individual privacy has to be protected. Without homomorphic encryption, each party has to add sufficient noise to its statistics before sending them to the central server to ensure LDP. Since the aggregation sums up the individual noise shares, the accumulated noise might render the aggregate useless. To preserve privacy without significantly degrading utility, differential privacy can be made distributed by combining it with cryptographic protocols, as evidenced by the following schemes:

Rastogi et al. [rastogi2010differentially] allow an untrusted aggregator to privately sum over multiple data sources by combining distributed differential privacy and Paillier encryption. However, their proposed Distributed Laplace Perturbation Algorithm (DLPA) generates Laplace noise using four Gaussian variables, while our solution is simpler and more efficient, exploiting the stability property of the Binomial distribution.

Ács et al. [acs2011have] propose a differentially private aggregation of smart metering data over multiple slots. Smart meters are grouped into clusters, where each cluster covers thousands of smart meters corresponding to a quarter of a city. The Laplace-noise-tainted readings are sent to an electricity distributor, but the requirement that all meters in a cluster share pairwise keys largely limits the scheme's applicability.

Shi et al. [shi2011privacy] formalize the notion of distributed differential privacy (DDP) to ensure (ε, δ)-differential privacy of the aggregate of time-series data in every time period, where the noise is sourced from multiple participants. The symmetric geometric distribution is used as a discrete approximation to the Laplace distribution, and each participant adds noise probabilistically. However, their encryption scheme relies on a trusted dealer to allocate secrets that sum to 0 to the aggregator and participants. Moreover, their construction is not robust against malicious nodes and node failures.

More recently, Agarwal et al. [agarwal2018cpsgd] demonstrate that the Binomial mechanism can achieve nearly the same utility as the Gaussian mechanism. However, they focus on the privacy of the gradients aggregated from clients in federated learning, which is different from the problem studied in this work. More importantly, they do not provide a complete scheme to protect against an untrusted aggregator.

In the absence of a trusted aggregator, a distributed implementation is highly desired. The most relevant one, called "Our Data, Ourselves" (ODO) [dwork2006our], is used to generate shares of random noise. For example, the shares of random Binomial noise can be generated by coin flipping, which is secure against malicious parties; however, it requires communication among all parties, and its expensive verifiable secret sharing results in a number of multiplications and additions on shares that grows with the number of parties, making it unattractive when the number of parties is large.

Another line of work is Private Aggregation of Teacher Ensembles (PATE), proposed by Papernot et al. [papernot2016semi]. PATE first trains an ensemble of teachers on disjoint subsets of private data. These teachers are then used to train a student model that can accurately mimic the ensemble. However, PATE assumes a trusted aggregator, which counts the teacher votes assigned to each class, adds carefully calibrated Laplace noise to the resulting vote histogram, and outputs the class with the most noisy votes as the ensemble's prediction. PATE therefore fails to address an important issue, namely the protection of individual teachers' private information. In practical scenarios like federated learning [mcmahan2016federated, mcmahan2018learning], these teachers could be mobile participants with high privacy requirements.

3 Problem Definition

Similar to Shi et al. [shi2011privacy], we consider an untrusted aggregator who may have arbitrary auxiliary information. For example, the aggregator may collude with a set of compromised parties, who can reveal their data and noise values to the aggregator as a form of auxiliary information. Our goal is to guarantee the privacy of each individual's data against such an untrusted aggregator, even when the aggregator has arbitrary auxiliary information. To achieve this goal, we blind and encrypt the local statistics of the parties before sharing them with the aggregator. Moreover, like most previous work [rastogi2010differentially, acs2011have, shi2011privacy], to ensure the correctness and functionality of the system, we do not consider protocol breakers, since breaking the protocol is undesirable in many practical settings and is not in the commercial interest of collaborative service providers offering a prediction service. We remark that our privacy model is stronger than that of [shi2011privacy] in the sense that we do not trust the aggregator, and our framework is also robust to party failures, as indicated in Section 6. This assumption is often more realistic, as hospitals or financial institutions are probably unwilling to disclose their private statistics to any third party, and one or more parties may fail to upload their encrypted values or fail to respond, as in a denial-of-service attack.

In particular, we assume fewer than 1/3 of the teachers are compromised; the rest are assumed to be honest, and decryption can be carried out by the remaining 2/3 of the teachers using threshold Paillier. Similar to PATE, we first train an ensemble of teachers on disjoint datasets; an aggregator then aggregates local predictions instead of local data or local model parameters, which makes our approach applicable to arbitrary and mixed classifiers, i.e., the classifiers of different teachers could differ. To preserve privacy and maintain utility, we combine distributed differential privacy (DDP) and homomorphic encryption. Specifically, for DDP, we distribute the noise generation task among multiple teachers, who jointly contribute randomness to ensure differential privacy of the global statistic, i.e., the aggregated prediction. While each teacher may add less noise, the aggregated noise must meet the required level to guarantee the privacy of the global statistic upon release. Hence, the ultimate goal is to release a differentially private global statistic, i.e., a noisy aggregated prediction, where the added noise is calibrated to guarantee (ε, δ)-differential privacy of the aggregated prediction. Moreover, since cryptographic protocols require discrete inputs, instead of adding floating-point Gaussian noise to each individual's prediction, we leverage a Binomial mechanism to generate discrete Binomial noise, as described in the next section.

4 Binomial Mechanism

The Binomial Mechanism is based on the Binomial distribution Bin(m, p), where m is the number of tosses and p is the success probability. We now define the Binomial Mechanism (BM) for queries with output space {0, 1}^c, where c is the number of classes; i.e., for each data point, a predicate function returns a one-hot prediction vector with c elements that sum up to 1. Consider party i's database D_i and predicate f: let f(D_i) be the local prediction vector produced by party i given data D_i, and let its j-th element be the prediction (vote status) for class j. If party i assigns class j to the input, then the j-th element equals 1 while all other elements are 0. The noisy vote count for each class equals the sum of the parties' votes plus Binomial noise; replacing the whole database D with a neighbouring database D' differing in only one row changes the summation in each class by at most 1. Bounding the ratio of the probabilities that a given output occurs with inputs D and D' amounts to bounding the ratio of the probabilities that the noise equals k and k + 1, over the possible ranges of k. Given predicate f, the goal of the BM is to compute the noisy vote count for each class (each coordinate of the aggregated prediction), i.e., the vote count plus a random Binomial noise term, where the Binomial random variable is drawn independently for each class.
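A minimal sketch of the Binomial Mechanism from the aggregate (trusted-curator) viewpoint, assuming the number of tosses m has already been chosen according to Theorem 1; the parameter values below are placeholders of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

def binomial_mechanism(vote_counts, m):
    """Add zero-mean Binomial noise (Bin(m, 1/2) - m/2), independently per class."""
    noise = rng.binomial(m, 0.5, size=len(vote_counts)) - m / 2
    return vote_counts + noise

# Example: 250 teachers, 10 classes, strong consensus on class 0.
probs = np.array([0.7] + [0.3 / 9] * 9)
vote_counts = rng.multinomial(250, probs)
# m would be chosen from Theorem 1 for the target (epsilon, delta); 2000 is a placeholder.
noisy_counts = binomial_mechanism(vote_counts, m=2000)
print(vote_counts.argmax(), noisy_counts.argmax())   # argmax is stable under strong consensus
```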

Theorem 1.

(Tighter bound). For p = 1/2, the Binomial Mechanism is (ε, δ)-differentially private as long as the total number of tosses m exceeds a lower bound of order log(1/δ)/ε^2, made precise in Eq. (5) below. This lower bound is tighter than the one given in [dwork2006our], although both share the log(1/δ)/ε^2 scaling.

Proof.

For the Binomial distribution with p = 1/2, the centered variable s − m/2 is termed the Binomial noise, where s is a Binomial random variable sampled from Bin(m, 1/2), with mean m/2, obtained by performing m coin flips. To investigate how to size the Binomial noise, suppose s − m/2 is the random Binomial noise added to the vote count of each class.

For the Binomial random variable s with bias 1/2, the probability mass at k is Pr[s = k] = C(m, k) / 2^m.

Pure ε-differential privacy requires that the ratio of adjacent noise probabilities, Pr[s = k] / Pr[s = k + 1], be bounded by e^ε.

To express this requirement in terms of m in an algebraically simple way, we bound the ratio of consecutive Binomial probabilities, which constrains how far k may deviate from the mean m/2.

Therefore, the Binomial random variable achieves ε-differential privacy as long as the noise deviation from the mean stays within this bound. Note:

  • The resulting identity will be used in Eq. (5) later.

  • This noise bound is tighter than Dwork et al.'s [dwork2006our].

However, when the noise deviation exceeds this bound, ε-differential privacy is violated. Hence, we turn to the relaxed (ε, δ)-differential privacy, which requires that the probability of the noise deviating beyond the bound be at most δ.

Since the Binomial distribution is symmetric about its mean m/2, this requirement is equivalent to a one-sided tail bound. According to the Chernoff bound, the tail probability of a Binomial random variable decays exponentially in the squared deviation from its mean.

Rewriting the tail bound and substituting the ε-dependent deviation, the requirement for (ε, δ)-differential privacy reduces to the lower bound on m:

(5)

Therefore, Theorem 1 follows. ∎

It should be noted that our bound gives a constant-factor improvement over the original Binomial mechanism in [dwork2006our]. It also implies that the Binomial and Gaussian mechanisms perform nearly identically as the number of tosses grows large.

Unlike the Laplace or Gaussian distributions used in the original PATE [papernot2016semi, papernot2018scalable], the Binomial distribution avoids floating-point representation issues and enables efficient transmission, so it can be used seamlessly with a cryptosystem. Furthermore, the stability of the Binomial distribution, as stated in Lemma 1, facilitates distributing the noise among multiple teachers.

Lemma 1.

(Stability of the Binomial distribution). If X ~ Bin(m_1, p) and Y ~ Bin(m_2, p) are independent Binomial variables with the same success probability p, then X + Y is also a Binomial variable, with distribution Bin(m_1 + m_2, p).
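The stability property is easy to check empirically; the short sketch below (ours, not from the paper) compares the sum of independent Bin(m_i, 1/2) draws against a single draw from the pooled Binomial.

```python
import numpy as np

rng = np.random.default_rng(1)

m_shares = [400, 400, 400, 400, 400]      # e.g., 5 teachers contributing 400 tosses each
trials = 200_000

# Sum of independent Binomial shares with a common success probability p = 1/2 ...
summed = sum(rng.binomial(m_i, 0.5, size=trials) for m_i in m_shares)
# ... matches a single Binomial with the pooled number of tosses.
pooled = rng.binomial(sum(m_shares), 0.5, size=trials)

print(summed.mean(), pooled.mean())   # both ~ 1000  (m * p)
print(summed.var(),  pooled.var())    # both ~ 500   (m * p * (1 - p))
```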

5 Distributed Privacy-preserving Prediction

In the case of prediction for any unlabeled public test point x, the predicate function, given the test record x, returns a prediction for it. The aggregate of multiple predictions is the sum of the teachers' predictions, where each teacher's one-hot prediction vector is produced by its local model built on its individual training data; hence, the aggregate for each class equals a sum of N scalars. In distributed privacy-preserving prediction (DPPP), the goal is to privately release the aggregated prediction, i.e., the noisy sum of all local predictions and noise shares. Our framework aims to deliver a differentially private aggregated prediction that is close to the desired aggregate, while providing a privacy guarantee.

To realize this goal, we take inspiration from the PATE framework, which first trains an ensemble of teachers on disjoint subsets of the sensitive data and then has the aggregator aggregate their outputs; this makes our approach applicable to arbitrary and mixed classifiers. Moreover, we minimize the need for parties to trust any aggregator, improving on PATE as follows: teachers add shares of noise to their predictions before forwarding them to the aggregator. Aggregation at the aggregator then yields an aggregated prediction carrying the noise accumulated from all teachers. Hence, each teacher may add less noise: as long as the aggregated noise in the aggregated prediction meets the required noise level, the privacy of the final output (the aggregated prediction) is guaranteed, as stated in Theorem 2, which follows immediately from Theorem 1 and Lemma 1.

Theorem 2.

Suppose D and D' are neighboring databases that differ in one record; then each coordinate of the aggregated prediction differs by at most 1. Let the mechanism report the noisy aggregated prediction, where each teacher contributes an independent Binomial noise share. Then the mechanism satisfies (ε, δ)-differential privacy, provided the total number of tosses contributed by all teachers satisfies the lower bound of Theorem 1.

The privacy guarantee stems from the aggregation over the teacher ensemble. If teacher i assigns class j to input x, then the j-th element of its one-hot prediction equals 1 while all other elements are 0. For any test point x, an independent noise share is added to each teacher's prediction for each class j. Hence, the aggregated prediction for class j is the sum of the teachers' votes and noise shares for that class, and the aggregator outputs the predicted class as:

j*(x) = argmax_j { n_j(x) + r_j },    (6)

where n_j(x) is the number of teachers that assign class j to x, r_j is the aggregated Binomial noise for class j, and the total number of tosses behind r_j satisfies the bound in Theorem 1. When there is a strong consensus among the teachers, the label they almost all agree on (the maximum of the aggregated prediction) does not depend on any particular teacher model. Overall, DPPP provides a differentially private API: the privacy cost of each aggregated prediction made by the teacher ensemble is known. Semi-supervised learning can be used to train a student model with comparable utility given a limited set of labels from the aggregation mechanism [papernot2016semi, papernot2018scalable].

The frameworks of PATE and DPPP are illustrated in Fig. 1. As can be observed, the main difference between the original PATE and our DPPP is that teachers directly share non-private labels in PATE, while in DPPP each teacher releases a one-hot prediction vector, with the position of the value 1 corresponding to the predicted class. For example, for anomaly detection with two classes, each teacher outputs the prediction (1, 0) or (0, 1) for any test point x, indicating whether x is an anomaly or not.

To preserve the privacy of the final output (the aggregated prediction) before releasing it to the student, we distribute the total amount of noise using DDP: each teacher adds a share of noise to its local prediction. The noise shares are chosen such that their sum is sufficient to ensure (ε, δ)-differential privacy of the aggregated prediction, but a single share alone is not sufficient to ensure (ε, δ)-differential privacy of the local prediction, which therefore cannot be directly released to the aggregator. This necessitates the help of cryptographic techniques to maintain utility and ensure aggregator obliviousness, as evidenced in [rastogi2010differentially, acs2011have, shi2011privacy, lyu2018ppfa]. In particular, we combine DDP with a distributed cryptosystem to achieve this goal. As shown in Fig. 1, in DPPP each teacher first encrypts its noisy prediction before sending it to the aggregator. The aggregator computes the encryption of the noisy sum of all the local predictions by multiplying the received ciphertexts. This aggregated ciphertext is then sent back to all teachers, who use the threshold property to compute their respective decryption shares. Finally, the decryption shares are forwarded to the aggregator, who combines them to obtain the final decryption, i.e., the noisy aggregated prediction. Thanks to homomorphic encryption, DPPP provides aggregator obliviousness: the aggregator cannot learn anything about any individual teacher's prediction; the only information it learns is the aggregated noisy prediction, thereby significantly reducing privacy leakage.
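Putting the pieces together, the following plaintext simulation sketches the DPPP aggregation step under our own choice of parameters: each teacher adds its Binomial noise share to its one-hot prediction, the noisy predictions are summed, and the aggregator takes the argmax of Eq. (6). The Paillier encryption, aggregation over ciphertexts and threshold decryption of Fig. 1 are omitted here; Section 6 describes how the same sum is computed obliviously.

```python
import numpy as np

rng = np.random.default_rng(2)

num_teachers, num_classes = 100, 10
m_total = 2000                                   # total tosses required by Theorem 1 (placeholder)
m_share = int(np.ceil(m_total / num_teachers))   # tosses per teacher (gamma = 1, no compromised teachers)

def teacher_message(predicted_class):
    """One-hot prediction plus this teacher's zero-mean Binomial noise share."""
    one_hot = np.zeros(num_classes, dtype=float)
    one_hot[predicted_class] = 1.0
    noise_share = rng.binomial(m_share, 0.5, size=num_classes) - m_share / 2
    return one_hot + noise_share                 # in DPPP this vector would be Paillier-encrypted

# Simulate teachers with strong consensus on class 3.
local_predictions = rng.choice(num_classes, size=num_teachers,
                               p=[0.01] * 3 + [0.91] + [0.01] * 6)
aggregate = sum(teacher_message(c) for c in local_predictions)

print("predicted label:", int(aggregate.argmax()))   # Eq. (6): noisy argmax over classes
```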

Fig. 1: Overview of the PATE and DPPP: (1) an ensemble of teachers is trained on disjoint subsets of the sensitive data, (2) the query from the student is labeled using the ensemble.

6 Distributed cryptosystem

As part of DPPP, we design a secure aggregation protocol based on the threshold Paillier cryptosystem [cramer2001multiparty]. As shown in Protocol 1, the proposed secure aggregation protocol can calculate the summation of teachers’ local predictions without disclosing any of them.

Input: Noisy prediction from each teacher
Output: The aggregated noisy prediction
  1. Each teacher encrypts its noisy prediction according to Eq. (1), and sends the ciphertext to the aggregator;

  2. The aggregator multiplies the received ciphertexts to obtain the encryption of their sum, based on Eq. (2);

  3. The aggregator sends the aggregated ciphertext to t randomly chosen teachers;

  4. Each selected teacher calculates a partial decryption based on Eq. (4), and sends it to the aggregator;

  5. The aggregator combines all the partial decryptions from the chosen teachers to recover the summation of the noisy predictions.

Protocol 1 Secure aggregation protocol

As we can see, the protocol executes in two main phases. In the first phase, the aggregator aggregates the encrypted noisy predictions into a single ciphertext. In the second phase, a distributed decryption process is run to recover the aggregated noisy prediction. In this protocol, all the aggregator receives from the teachers are encrypted noisy predictions and partial decryptions, and all of its computations are conducted on encrypted data. The only thing the aggregator can learn is the summation of all teachers' noisy predictions, from which no individual teacher's local prediction can be inferred. Note that the key generation needs to be done only once, hence secret-sharing protocols can be used for this purpose.

Our proposed DPPP can be made robust against up to 1/3 of the teachers being compromised or failing by adopting the following two solutions: (i) (t, N)-threshold decryption [pedersen1991threshold, damgaard2001generalisation] requires the cooperation of at least t honest teachers for decryption, where t corresponds to the γ fraction of uncompromised teachers in Definition 1. If some teachers fail to send their decryption shares, (t, N)-threshold decryption ensures that a decryption can still be computed as long as at least t teachers respond, and the threshold Paillier cryptosystem can tolerate the passive corruption of the remaining teachers; (ii) during noise addition, each teacher inflates its Binomial noise share so that the randomness of any γ fraction of the teachers alone is still sufficient to ensure differential privacy, i.e., leaving out the other teachers' randomness does not break the guarantee (a sketch is given below). We remark that the fraction of honest parties can be set differently for different applications, with the decryption threshold t of the Paillier cryptosystem chosen accordingly. The assumption of fewer than 1/3 compromised parties is practical enough in most real scenarios.
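A short sketch of solution (ii) under our reading of the text, with m, N and γ denoting the required total tosses, the number of teachers and the honest fraction (our notation): each teacher inflates its share so that any γ fraction of honest teachers alone still contributes at least m tosses, which by Lemma 1 yields pooled Binomial noise at least as large as Theorem 1 requires.

```python
import math

def noise_share_tosses(m_total, num_teachers, gamma):
    """Tosses per teacher so that any gamma-fraction of honest teachers still supplies m_total tosses."""
    return math.ceil(m_total / (gamma * num_teachers))

m_total, num_teachers, gamma = 2000, 100, 2 / 3      # tolerate up to 1/3 compromised teachers
per_teacher = noise_share_tosses(m_total, num_teachers, gamma)

honest = math.ceil(gamma * num_teachers)             # worst case: only the honest teachers add noise
assert honest * per_teacher >= m_total               # pooled honest noise is Bin(>= m_total, 1/2) by Lemma 1
print(per_teacher, honest * per_teacher)
```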

7 Performance Evaluation

To demonstrate the effectiveness of our proposed DPPP, we compare it with the following four baseline frameworks.

Centralized non-private framework requires all teachers to send their local training data to the aggregator to train a global model. This framework delivers maximum utility, but minimum privacy.

Distributed non-private framework excludes both DP and the cryptosystem; teachers directly share their local predictions with the aggregator.

Local differentially private (LDP) framework excludes the cryptosystem but requires each teacher to add the full required level of noise to ensure (ε, δ)-LDP. The added noise is of the same level as the aggregated noise in DPPP, hence much more noise is added to each local prediction than in DPPP.

Standalone framework allows teachers to individually train local models on their limited local training data without any collaboration; an end user directly sends a test query to each teacher, and the local prediction is released under the guarantee of (ε, δ)-LDP. This framework is therefore expected to deliver maximum privacy, but minimum utility.

Fig. 2: Prediction accuracy of student queries for all datasets with varying ε.
Fig. 3: Prediction accuracy of student queries for all datasets with varying δ.

Here we remark that, under the same noise mechanism, our proposed DPPP performs similarly to a centralized/global DP framework that resorts to a trusted aggregator to add noise to the aggregated predictions or vote counts; the only difference is that DPPP adopts DDP to distribute the noise required for global DP among multiple teachers, while ensuring the privacy of each individual.

For experimentation, we investigate four different datasets. The MNIST dataset consists of a total of 70,000 handwritten digits formatted as 28x28 gray-level images, with the digit located at the center of the image (http://yann.lecun.com/exdb/mnist/). The classification objective is to classify the input as one of 10 possible digits ["0"-"9"]. The real-world Breast Cancer dataset (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) contains a total of 569 records with 32 features; each record is classified into one of two classes: malignant or benign. We randomly sampled 2/3 of the examples as the training set and used the remaining 1/3 as the test set. The NSL-KDD dataset is used for intrusion detection (https://www.unb.ca/cic/datasets/nsl.html); it contains a total of 125,973 records with 41 features, each classified into one of two classes: anomaly or normal. We experiment on both the whole NSL-KDD dataset and a smaller subset called NSL-KDD-20 (with 20% of the train and test data sampled from the NSL-KDD dataset).

In our experiments with MNIST, we stack two convolutional layers with max-pooling and one fully connected layer with ReLUs. We use an SVM with an RBF kernel for the other datasets. To simulate the situation in which each teacher holds only a limited subset of the whole database, all training records are randomly distributed among multiple teachers such that each teacher receives nearly the same number of records. Following the rationales provided in [papernot2016semi], we empirically find an appropriate number of teachers for each dataset by measuring the test accuracy of a teacher trained on one partition of the whole training set; we trained ensembles of 250, 100, 100 and 20 teachers for the MNIST, NSL-KDD-20, NSL-KDD and Breast Cancer datasets, respectively. To accommodate the randomness of noise addition, we run each experiment 20 times and report the average result. γ is set to 1 in all figures, i.e., assuming no compromised teachers.

In particular, the results of the standalone framework are averaged over all teachers. Since δ should be sufficiently small compared to the training data size [dwork2014algorithmic], we set δ inversely proportional to the training data size in all the private frameworks. We report the prediction accuracy of student queries in Fig. 2 and Fig. 3. As evidenced by Fig. 2, DPPP outperforms both the standalone and LDP frameworks for all datasets under varying privacy budgets ε. The centralized and distributed non-private frameworks achieve similar accuracy, indicating that the distributed non-private framework incurs negligible accuracy degradation compared with the centralized non-private framework. Moreover, DPPP yields accuracy comparable to the distributed non-private framework for moderate privacy budgets on the MNIST, NSL-KDD-20 and NSL-KDD datasets, which is also comparable to PATE, where each query has a low privacy budget [papernot2016semi].

We also notice that the impact of DPPP is more obvious on the Breast Cancer dataset than on the other datasets. One reason is that its limited data is split among a smaller number of teachers, whereas compensating for the noise introduced in Eq. (6) requires large ensembles. For a large number of teachers, the aggregated predictions are accurate despite the injection of large amounts of random noise to ensure privacy.

Moreover, to investigate the impact of δ, we vary δ with ε fixed. As can be observed in Fig. 3, the effectiveness of our proposed DPPP persists, and for a fixed ε, varying δ follows a trend similar to varying ε. However, we find less difference across different values of δ, which agrees with the findings reported in [abadi2016deep]. In contrast, Fig. 2 shows that varying the value of ε has a larger impact on accuracy.

To understand how the number of teachers affects the privacy cost, and to investigate the reason behind the comparable accuracies offered by DPPP, we study the disparity of labels assigned by teachers. Since each teacher's vote is represented by a bit vector, each element of the aggregated prediction vector represents the votes for one candidate label. We measure the difference in noisy votes between the most popular label and the second most popular label, i.e., the gap. If the true gap is small, the aggregated noise might change the assigned label from the first to the second. When the number of teachers increases, the gap becomes larger, hence the topmost vote has a larger quorum, allowing the aggregation mechanism to output the correct label in the presence of noise. That partly explains the high accuracy on the MNIST, NSL-KDD-20 and NSL-KDD datasets. However, a larger number of teachers implies a smaller training set allocated to each teacher, potentially reducing teacher accuracy; for a large enough ensemble, each teacher will have too little training data to be accurate. Hence, the accuracy of labels predicted by the ensemble depends greatly on the number of teachers: it needs to be large enough that the topmost vote has a quorum able to counter the injected noise, thereby lowering the impact of noise on accuracy, while also being limited by the trade-off between task complexity and data availability.

Computation and Communication Overhead: We use the typical 1024-bit key size and implement a (t, N)-threshold Paillier cryptosystem using the Paillier Encryption Toolbox [paillier]. The average computation time at a teacher is independent of the number of teachers and remains nearly constant. On the other hand, the time required by the aggregator may increase with the number of teachers, but this can be reduced by running the aggregator and teachers in parallel through MapReduce. In all cases, the computation overhead is quite small, most of which is spent on cryptographic operations. For communication complexity, a Paillier ciphertext is estimated at 2048 bits (256 bytes). Therefore, the total communication cost between the aggregator and any teacher can be estimated as 256*c*3 = 768*c bytes, where c is the number of classes and the factor 3 accounts for the three rounds of communication in Fig. 1, which is well within the realm of practicality as c is usually small.
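As a worked example of this estimate (our arithmetic, using the figures quoted above):

```python
CIPHERTEXT_BYTES = 2048 // 8     # a Paillier ciphertext under a 1024-bit key is ~2048 bits = 256 bytes
ROUNDS = 3                       # ciphertext up, aggregated ciphertext down, decryption share up (Fig. 1)

def per_teacher_bytes(num_classes):
    """Total traffic between the aggregator and one teacher for a single query."""
    return CIPHERTEXT_BYTES * num_classes * ROUNDS   # 768 * c bytes

print(per_teacher_bytes(2))      # binary tasks (NSL-KDD, Breast Cancer): 1536 bytes
print(per_teacher_bytes(10))     # MNIST, c = 10: 7680 bytes (~7.5 KB)
```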

8 Discussion

Different from the typical balanced and IID data distribution, in real practice, due to differences in sensor quality, ambient noise and skill level, the data collected by each teacher might be: (1) Unbalanced: Due to the varying capabilities of different teachers, some teachers may have a large amount of training data, while others have little or none; for substantially unbalanced data, most teachers have only a few examples and a few teachers have many. (2) Non-IID: The data collected by each teacher might not be representative of the population distribution.

These two aspects, usually considered in federated learning, might affect the accuracy of DPPP, especially when most teachers have few or extremely unrepresentative examples. To improve robustness to unbalanced and/or non-IID data distributions, current methods allow teachers to share locally trained model parameters or updates with the aggregator [mcmahan2016federated, mcmahan2018learning], but giving the aggregator access to all teachers' parameters or updates clearly risks privacy leakage. To share individual model updates privately, Bonawitz et al. [bonawitz2017practical] propose a secure aggregation protocol that aggregates local model updates into a weighted average used to update the global model on the aggregator. However, this incurs both extra computation and communication costs.

9 Conclusion and future work

We present a distributed privacy-preserving prediction framework that enables multiple parties to collaboratively deliver more accurate predictions through an aggregation mechanism. Distributed differential privacy via the Binomial mechanism and homomorphic encryption are combined to preserve individual privacy, maintain utility and ensure aggregator obliviousness. For the Binomial mechanism, we offer tighter bounds than previous work. Preliminary analysis and performance evaluation confirm the effectiveness of our framework. We plan to extend our framework to unbalanced and non-IID data distributions, and to conduct detailed comparisons between the Binomial mechanism and other DP mechanisms. We also expect to extend our framework to machine learning scenarios beyond classification.

Acknowledgments

The authors would like to thank Prof. Benjamin Rubinstein, Dr. Justin Bedo, Dr. Chris Culnane and Prof. Vanessa Teague for the early discussions on DDP and homomorphic encryption. The authors also would like to thank Dr. Xingjun Ma for discussions about unbalanced and non-IID data distributions.

References