Differentially Private Deep Learning with Smooth Sensitivity

03/01/2020 ∙ by Lichao Sun, et al. ∙ Salesforce ∙ University of Illinois at Chicago

Ensuring the privacy of sensitive data used to train modern machine learning models is of paramount importance in many areas of practice. One approach to studying these concerns is through the lens of differential privacy. In this framework, privacy guarantees are generally obtained by perturbing models in such a way that specifics of the data used to train the model are made ambiguous. A particular instance of this approach is the "teacher-student" framework, wherein the teacher, who owns the sensitive data, provides the student with useful, but noisy, information, hopefully allowing the student model to perform well on a given task without access to particular features of the sensitive data. Because stronger privacy guarantees generally involve more significant perturbation on the part of the teacher, deploying existing frameworks fundamentally involves a trade-off between the student's performance and the privacy guarantee. One of the most important techniques used in previous works involves an ensemble of teacher models, which return information to a student based on a noisy voting procedure. In this work, we propose a novel voting mechanism with smooth sensitivity, which we call Immutable Noisy ArgMax, that, under certain conditions, can bear very large random noising from the teacher without affecting the useful information transferred to the student. Compared with previous work, our approach improves over the state-of-the-art methods on all measures, and scales to larger tasks with both better performance and stronger privacy (ϵ ≈ 0). The proposed framework can be applied with any machine learning model, and provides an appealing solution for tasks that require training on a large amount of data.




1. Introduction

Recent years have witnessed impressive breakthroughs of deep learning in a wide variety of domains, such as image classification (he2016deep), natural language processing (devlin2018bert), and many more. Many attractive applications involve training models using highly sensitive data, for example diagnosis of diseases with medical records or genetic sequences (alipanahi2015predicting), mobile commerce behavior prediction (yan2017mobile), and location-based social network activity recognition (gong2018deepscan). In fact, many applications lack labeled sensitive data, which makes it challenging to build high-performance models. This may require the collaboration of two parties such that one party helps the other to build the machine learning model. One example is a “teacher-student” framework, in which the teacher owns the sensitive data and a well-trained model, and transfers its knowledge to help the student label the student's own unlabeled dataset. However, recent studies exploiting privacy leakage from deep learning models have demonstrated that private, sensitive training data can be recovered from released models (Papernot2017). Privacy protection is therefore a critical issue in this context, and so is developing methods that protect sensitive data from being disclosed and exploited.

Figure 1.

Overview of the proposed approach: first, the private data is partitioned into non-overlapping splits to train the teacher models. To train the student model, the ensemble of teachers aggregates its predictions on the example queried by the student, and then adds a large constant to the class with the highest vote count. The count vector is then randomly perturbed with noise, followed by an ArgMax operation. Finally, the student model is trained using the label returned by the teacher ensemble.

In order to protect the privacy of the training data and mitigate the effects of adversarial attacks, various privacy protection works have been proposed in the literature (michie1994machine; nissim2007smooth; samangouei2018defense; ma2018pdlm). The “teacher-student” learning framework with privacy constraints is of particular interest here, since it can provide a private student model without touching any sensitive data directly (hamm2016learning; pathak2010multiparty; papernot2017semi). The original purpose of a teacher-student framework is to transfer knowledge from the teacher model to help train a student that achieves performance similar to the teacher's. To satisfy the privacy-preserving requirement, knowledge from the teacher model is carefully perturbed with random noise before being passed to the student model. In this way, one hopes that an adversary cannot ascertain the contributions of specific individuals in the original dataset even if they have full access to the student model. Using the techniques of differential privacy, such protection can be guaranteed in certain settings. However, the current teacher-student frameworks (e.g. (Papernot2017) and (papernot2018scalable)) involve a trade-off between the student’s performance and privacy. This is because a substantial amount of noise is required to ensure privacy at the desired level, which degrades the information passed to the student and results in sub-optimal models. We summarize some challenges of the current teacher-student framework:


  • Unavoidable trade-off between performance and privacy cost. The main challenge is the trade-off between performance and privacy cost. In order to protect the dataset, large noise needs to be added to perturb the output of each query before the noisy feedback is returned to the student. However, when the perturbation is significant, the returned feedback can be misleading compared to the original information. This is a fundamental trade-off for current methods, i.e. one has to balance privacy against performance.

  • Hard to set a reasonable privacy budget. In practice, due to the trade-off described above, it is difficult to decide the privacy budget and noise scale for each query. This is because there is an inherent conflict between the student and the teacher: one prefers a useful model and the other is more concerned with protecting sensitive data.

  • Large number of teacher models. To spend less privacy cost per query, larger perturbation is required. To make sure the returned feedback is still useful after severe perturbation, a large number of teacher models is required. The hope is that with more models on the teacher side, the meaningful feedback gets more votes and can thus tolerate a higher level of noise. However, this brings new challenges. First, training too many teacher models on subsets of the data may degrade the performance of the teacher ensemble, since each model is effectively trained with much less data. Second, because of the above, the teacher side now has to balance the number of data subsets against the performance of the models, which makes the process much more complicated.

  • Hard to scale to more complex tasks. It is difficult for the current approaches to scale to more complex tasks that require training a more complex model with more data (e.g. ImageNet). For those tasks, a large amount of data is required to obtain a model with reasonable performance, so splitting the data into separate partitions is likely to lead to significant performance degradation.

In this paper, we develop a technique to address the aforementioned problems, which facilitates the deployment of accurate models with near zero privacy cost (NZC) when using smooth sensitivity. Instead of using the traditional noisy ArgMax, we propose a new approach named immutable noisy ArgMax, as described in Section 3. We redesign the aggregation approach by adding a constant to the current largest count of the count vector, which brings immutable noisy ArgMax into the teacher-student model. As a result, this method improves both privacy and utility over previous teacher-student based methods. The primary technical contribution of this paper is a novel mechanism for aggregating feedback from the teacher ensemble that is more robust to larger noise without changing the consensus of the teachers. We show that our proposed method improves the performance of the student model on all measures. Overall, our main research contributions are:


  • A high-performance differentially private framework with very low privacy cost. In this work, we redesign the query function $f$, also called the data transformer. We add a constant $c$ to the voting vector to ensure that the ArgMax is immutable after the noise perturbation. Our proposal can also be viewed as a generalization of existing teacher-student based work, which corresponds to $c = 0$. To the best of our knowledge, the proposed NZC is the first framework that proposes this mechanism. To further facilitate research in the area we will make our code publicly available.

  • A new mechanism with smooth sensitivity. Due to the properties of the proposed data transformer function $f$, we need to use a data-dependent analysis for the whole process. In this paper, we use the smooth sensitivity, which leverages the benefits of the proposed function and the properties of some specific datasets, so that we can receive useful query feedback at a very small privacy cost ($\epsilon \approx 0$). In addition, we also discuss three different sensitivity estimates with our proposed mechanism.

  • Empirical evaluation. We experimentally evaluate the proposed framework NZC on two standard datasets commonly used to evaluate machine learning models in a wide range of applications. Our results demonstrate that NZC can build a powerful student model that outperforms previous works, and that our design gives a more realistic solution.

2. Preliminary

In this section, we briefly review the background needed to derive our new methods. We first introduce some basics of differential privacy, followed by the ArgMax and noisy ArgMax mechanisms.

2.1. Differential Privacy

To satisfy the increasing demand for preserving privacy, differential privacy (DP) (10.1007/11681878_14) was proposed as a rigorous principle that guarantees provable privacy protection and has been extensively applied (Andres:2013; friedman2010data; apple_dp).

Let $f$ be a deterministic function that maps the dataset $D$ to the real numbers $\mathbb{R}$. This deterministic function $f$, in the context of differential privacy, is called a query function of the dataset $D$. For example, the query function may request the mean of a feature in the dataset or the gender of each sample. The goal in privacy is to ensure that when the query function is applied to a different but close dataset $D'$, its outputs are indistinguishable from those on the dataset $D$, so that the private information of individual entries in the dataset cannot be inferred by malicious attacks. Here, two datasets $D$ and $D'$ are regarded as adjacent datasets when they are identical except for one single item.

Informally speaking, a randomized data release mechanism for a query function is said to ensure DP if “neighboring” inputs induce similar distributions over the possible outputs of the mechanism. DP is particularly useful for quantifying privacy guarantees of queries computed over a database with sensitive entries. The formal definition of differential privacy is given below.

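Definition (Differential Privacy (dwork2011a, Definition 2.4)).

A randomized mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if for any adjacent data $D$ and $D'$, i.e. $\|D - D'\|_1 \le 1$, and any output $Y$ of $\mathcal{M}$, we have

$$\Pr[\mathcal{M}(D) = Y] \le e^{\epsilon} \cdot \Pr[\mathcal{M}(D') = Y] + \delta.$$

If $\delta = 0$, we say that $\mathcal{M}$ is $\epsilon$-differentially private. The parameter $\epsilon$ represents the privacy budget (dwork2011diff) that controls the privacy loss of $\mathcal{M}$. A larger value of $\epsilon$ indicates weaker privacy protection.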

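Definition (Privacy Loss).

For a randomized mechanism $\mathcal{M}$, adjacent data $D$ and $D'$, an auxiliary input $\mathrm{aux}$, and any output $o$ of $\mathcal{M}$, the privacy loss at $o$ is

$$c(o; \mathcal{M}, \mathrm{aux}, D, D') = \log \frac{\Pr[\mathcal{M}(\mathrm{aux}, D) = o]}{\Pr[\mathcal{M}(\mathrm{aux}, D') = o]}.$$

The privacy loss random variable $C(\mathcal{M}, \mathrm{aux}, D, D')$ is defined as $c(\mathcal{M}(D); \mathcal{M}, \mathrm{aux}, D, D')$, i.e. the random variable defined by evaluating the privacy loss at an outcome sampled from $\mathcal{M}(D)$.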

From the notion of DP, we know that the sensitivity of the deterministic function $f$ (i.e. a query function) with respect to the dataset is important for designing the mechanism for the query function. Different noise mechanisms require different sensitivity estimates. In previous studies of differentially private deep learning, all mechanisms used the classical sensitivity analysis, named “global sensitivity”. For example, the $\ell_2$-norm sensitivity $\Delta_2 f$ of the query function $f$ is used for the Gaussian mechanism and is defined as $\Delta_2 f = \max_{D, D'} \|f(D) - f(D')\|_2$, where $D$ and $D'$ are two neighboring datasets. The Laplacian mechanism uses the $\ell_1$-norm sensitivity $\Delta_1 f$ for random noise sampling. In essence, a smaller sensitivity means that the outputs of the query function are harder to distinguish across neighboring datasets.

A general method for ensuring that a deterministic query function $f$ satisfies $(\epsilon, \delta)$-differential privacy is to use perturbation mechanisms that add noise, calibrated to the sensitivity of $f$, to the query’s output (dwork2014foundations; dwork2010boosting; nissim2007smooth; duchi2013local).

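Theorem 1 ((dwork2014the)).

If the $\ell_1$-norm sensitivity of a deterministic function $f$ is $\Delta_1 f$, we have:

$$\mathcal{M}_f(D) \triangleq f(D) + \mathrm{Lap}\left(\frac{\Delta_1 f}{\epsilon}\right),$$

where $\mathcal{M}_f$ preserves $(\epsilon, 0)$-differential privacy, and $\mathrm{Lap}(b)$ is the Laplacian distribution with location $0$ and scale $b$.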
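As a minimal sketch of the Laplacian mechanism in Theorem 1, the snippet below releases a scalar query with noise of scale (sensitivity / epsilon). The function names, the counting-query value 42 and the parameter values are illustrative assumptions, not taken from the paper; the Laplace sample is drawn as the difference of two exponential variables so that only the Python standard library is needed.

```python
import random


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value: float, l1_sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Release true_value with Laplace noise of scale l1_sensitivity / epsilon."""
    if epsilon <= 0 or l1_sensitivity <= 0:
        raise ValueError("epsilon and l1_sensitivity must be positive")
    return true_value + sample_laplace(l1_sensitivity / epsilon, rng)


rng = random.Random(0)
# Hypothetical counting query (sensitivity 1) released 20,000 times at epsilon = 0.5.
releases = [laplace_mechanism(42.0, 1.0, 0.5, rng) for _ in range(20000)]
mean_abs_error = sum(abs(r - 42.0) for r in releases) / len(releases)
print(f"mean absolute error: {mean_abs_error:.2f} (noise scale is 2.0)")
```

The mean absolute error of a Laplace variable equals its scale, so halving epsilon doubles the expected error of each release.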

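Theorem 2 ((dwork2014the)).

If the $\ell_2$-norm sensitivity of a deterministic function $f$ is $\Delta_2 f$, we have:

$$\mathcal{M}_f(D) \triangleq f(D) + \mathcal{N}(0, \sigma^2),$$

where $\mathcal{N}(0, \sigma^2)$ is a random variable obeying the Gaussian distribution with mean $0$ and standard deviation $\sigma$. The randomized mechanism $\mathcal{M}_f(D)$ is $(\epsilon, \delta)$-differentially private if $\sigma \ge \sqrt{2 \ln(1.25/\delta)}\, \Delta_2 f / \epsilon$ and $\epsilon < 1$.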
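The calibration in Theorem 2 can be sketched as follows, assuming the standard Gaussian-mechanism condition that the noise standard deviation is at least sqrt(2 ln(1.25/delta)) times the l2 sensitivity divided by epsilon, with epsilon below 1. The helper names and the example parameters are hypothetical.

```python
import math
import random


def gaussian_sigma(l2_sensitivity: float, epsilon: float, delta: float) -> float:
    """Smallest sigma allowed by the classical Gaussian-mechanism bound."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this bound assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def gaussian_mechanism(true_value: float, l2_sensitivity: float,
                       epsilon: float, delta: float, rng: random.Random) -> float:
    """Release true_value with Gaussian noise calibrated by gaussian_sigma."""
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return true_value + rng.gauss(0.0, sigma)


sigma = gaussian_sigma(l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)
noisy_value = gaussian_mechanism(42.0, 1.0, 0.5, 1e-5, random.Random(0))
print(f"sigma = {sigma:.2f}")
```

Because sigma scales with 1/epsilon, a stricter privacy budget translates directly into proportionally wider noise.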

2.2. The ArgMax Mechanism

For any dataset $D = \{(x_k, y_k)\}_{k=1}^{n}$, the ArgMax mechanism is widely used as a query function when $v(x_k) \in \mathbb{N}^d$ is a vector of counts whose dimension equals the number of classes $d$ for sample $x_k$. This decision-making mechanism is similar to the softmax mechanism over the likelihood of each label, but instead of using the likelihood as the belief in each label, the ArgMax mechanism uses the counts given by the teacher ensemble. It follows immediately from the definition that the result of the ArgMax mechanism is immutable against a constant translation of the largest count, i.e.

$$\arg\max_i v(x_k)_i = \arg\max_i \hat{v}(x_k, c)_i, \qquad \hat{v}(x_k, c)_i = \begin{cases} v(x_k)_i + c & \text{if } i = \arg\max_j v(x_k)_j, \\ v(x_k)_i & \text{otherwise,} \end{cases}$$

for any constant $c \ge 0$, where we use the subscript $i$ to index through the vector.
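The translation property can be checked on a small example. The sketch below assumes the constant is added only to the current largest count, as in the rest of the paper; the vote vector and helper names are made up for illustration.

```python
def argmax(values):
    """Index of the largest entry; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def add_constant_to_top(counts, c):
    """Return a copy of counts with c added to the current largest count."""
    top = argmax(counts)
    return [v + c if i == top else v for i, v in enumerate(counts)]


# Hypothetical votes of 250 teachers over 5 classes.
votes = [3, 120, 7, 95, 25]
boosted = add_constant_to_top(votes, 10**6)
print(argmax(votes), argmax(boosted))
```

Adding a non-negative constant to the winning entry can only widen its lead, so the selected index is the same before and after.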

2.3. The Noisy ArgMax Mechanism

Now, we want to ensure that the outputs of the given well-trained teacher ensemble are differentially private. A simple algorithm is to add independently generated random noise (e.g. independent Laplacian or Gaussian noise) to each count and return the index of the largest noisy count. This noisy ArgMax mechanism, introduced in (dwork2014the), is $(\epsilon, 0)$-differentially private for the query given by the ArgMax mechanism.
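A minimal sketch of the noisy ArgMax with Laplacian noise is shown below. The parameter name `gamma` (inverse noise scale), the vote vector and the number of trials are assumptions for illustration. It shows how often the returned label departs from the plurality label at two noise levels.

```python
import random


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_argmax(counts, gamma: float, rng: random.Random) -> int:
    """Add independent Laplace(1/gamma) noise to each count, return the top index."""
    noisy = [v + sample_laplace(1.0 / gamma, rng) for v in counts]
    return max(range(len(noisy)), key=noisy.__getitem__)


rng = random.Random(1)
votes = [3, 120, 7, 95, 25]   # hypothetical counts; the plurality label is index 1
trials = 2000
flips_large_noise = sum(noisy_argmax(votes, 0.01, rng) != 1 for _ in range(trials))
flips_small_noise = sum(noisy_argmax(votes, 1.0, rng) != 1 for _ in range(trials))
print(flips_large_noise, flips_small_noise)
```

With noise scale 100 the gap of 25 votes between the top two classes is frequently overturned, while with noise scale 1 it essentially never is. This is the accuracy loss that stronger privacy causes in the plain mechanism.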

3. Our Approach

In this section, we introduce the specifics of our approach, which is illustrated in Figure 1. We first show the immutable noisy ArgMax mechanism, which is at the core of our framework. We then show how this property of immutable noisy ArgMax can be used in a differential private teacher-student training framework.

3.1. The Immutable Noisy ArgMax Mechanism

Definition (Immutable Noisy ArgMax Mechanism).

Given a sample $x$, a constant $c$ and a voting vector $v$, when $c$ is a very large positive constant, the $\arg\max$ of $v$ is unaffected by significant noise added to the voting vector $\hat{v}$.

One interesting observation about the noisy ArgMax mechanism is that when the aggregated results from the teachers are very concentrated (i.e. most of the predictions agree on a certain class) and of high counts (i.e. there is a large number of teachers), the result of the ArgMax does not change even under relatively large random noise. This scenario is likely to happen in practice if all the teacher models perform relatively well on the task. This observation also suggests that if we can make the largest count much larger than the remaining counts, we can achieve immutability under significant noise.

Let us define the data transformer as a function that converts a dataset into a count vector:

Definition (Data Transformer).

Given any dataset $D$, the output of the data transformer $f$ is an integer-valued vector, i.e. $f(D) \in \mathbb{Z}^{d}$, where $d$ is the dimension of the vector.

Definition (Distance-$n$ Data Transformer).

Given a dataset $D$ and a data transformer function $f$, distance $n$ means that the difference between the first and second largest counts given by $f(D)$ is larger than $n$.

Note that, in this paper, for each query, the data transformer is $f_{x, c}(D)$, where $D$ is the private dataset, $x$ is the student query, and $c$ is a constant chosen by the teacher.
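The distance-n condition can be checked directly on a count vector. The sketch below assumes the gap is measured on the raw vote counts, before any constant is added; the helper names and the three example queries are hypothetical.

```python
def top_two_gap(counts):
    """Difference between the largest and second-largest counts."""
    ordered = sorted(counts, reverse=True)
    return ordered[0] - ordered[1]


def is_distance_n(counts, n: int) -> bool:
    """True when the top-two gap is strictly larger than n."""
    return top_two_gap(counts) > n


def qualified_fraction(count_vectors, n: int) -> float:
    """Share of queries whose count vector satisfies distance-n."""
    return sum(is_distance_n(c, n) for c in count_vectors) / len(count_vectors)


# Three hypothetical queries answered by 250 teachers over 5 classes.
queries = [
    [3, 120, 7, 95, 25],    # gap 25
    [200, 30, 10, 5, 5],    # gap 170
    [84, 83, 40, 23, 20],   # gap 1
]
print(qualified_fraction(queries, 3))
```

A query such as the third one, where the teachers are nearly split, fails the distance-3 test and would not benefit from the data-dependent analysis.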

3.2. Smooth Sensitivity

Global Sensitivity. Next, we need to add noise to perturb the output of the data transformer. To do so, we first need to estimate the sensitivity of the function. As mentioned in the preliminaries, most previous deep learning approaches use the global sensitivity, defined below:

Definition (Global Sensitivity (dwork2014algorithmic)).

For $f: \mathcal{D} \to \mathbb{R}^d$ and all $D, D' \in \mathcal{D}$, the global sensitivity of $f$ (with respect to the $\ell_1$ metric) is

$$GS_f = \max_{D, D' : d(D, D') = 1} \|f(D) - f(D')\|_1,$$

where $D'$ is a neighbouring dataset of $D$ and $d(\cdot, \cdot)$ returns the distance between two datasets. Global sensitivity is a worst-case definition that does not take into consideration the properties of a particular dataset. This can be seen from the $\max$ operator, which finds the maximum distance over all possible datasets $D$ and their neighbouring datasets $D'$. It is not hard to see that if we use global sensitivity, the sensitivity of the data transformer function is $c + 1$ in each coordinate, since a neighbouring dataset can move the added constant $c$ to a different index.

Local Sensitivity. With global sensitivity, the noise magnitude depends on $GS_f$ and the privacy parameter $\epsilon$, but not on the dataset $D$ itself. This may not be ideal when analyzing data-dependent schemes, such as the teacher-student framework. The local measure of sensitivity reflects data-dependent properties and is defined as follows:

Definition (Local Sensitivity (dwork2014algorithmic)).

For $f: \mathcal{D} \to \mathbb{R}^d$ and a dataset $D \in \mathcal{D}$, the local sensitivity of $f$ (with respect to the $\ell_1$ metric) is

$$LS_f(D) = \max_{D' : d(D, D') = 1} \|f(D) - f(D')\|_1.$$

Note that the global sensitivity is $GS_f = \max_{D} LS_f(D)$, so $LS_f(D) \le GS_f$ for all $D \in \mathcal{D}$. When we use the local sensitivity to perturb our query (i.e. the output of the function $f$), the noise depends on the properties of the dataset $D$. Since it takes the data properties into account, it is a more precise estimate for data-dependent approaches. However, the local sensitivity may itself be sensitive, in which case results perturbed according to the local sensitivity do not satisfy the definition of differential privacy.

Smooth Sensitivity. In order to add database-specific noise with a smaller magnitude than the worst-case noise given by global sensitivity, and yet satisfy differential privacy, we introduce the smooth sensitivity (nissim2007smooth), which is a smooth upper bound $S_f$ on $LS_f$ such that adding noise proportional to $S_f$ is safe.

Definition (Smooth Sensitivity (nissim2007smooth)).

For $f: \mathcal{D} \to \mathbb{R}^d$, a dataset $D$ and $\beta > 0$, the $\beta$-smooth sensitivity of $f$ is

$$S^{*}_{f, \beta}(D) = \max_{D' \in \mathcal{D}} \left( LS_f(D') \cdot e^{-\beta\, d(D, D')} \right).$$

It is not hard to see that $S^{*}_{f, \beta}(D)$ is an upper bound on $LS_f(D)$. Now, the local sensitivity of the data transformer function $f$ on a given dataset $D$ (i.e. $LS_f(D)$) takes one of two different scales, $1$ or $c + 1$, depending on the dataset: $LS_f(D) = 1$ if $f$ is a distance-2 data transformer, and otherwise $LS_f(D) = c + 1$.

Based on the local sensitivity, we can add a large perturbation when the voting vector is in the first situation ($LS_f(D') = 1$ for all neighbouring $D'$). In this case, even if we add very large noise by assigning a small privacy budget to this specific query, the argmax of $f(D)$ does not change, because a large constant has been added at the index of the largest count.
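The two scales of local sensitivity can be illustrated by brute force. The sketch below makes two assumptions that are not stated explicitly in the text: a neighbouring dataset changes exactly one teacher's vote (one count drops by 1 and another rises by 1), and sensitivity is measured as the largest per-coordinate change of the count vector after the constant has been added. All names and vote vectors are hypothetical.

```python
def argmax(values):
    """Index of the largest entry; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def add_constant_to_top(counts, c):
    """Return a copy of counts with c added to the current largest count."""
    top = argmax(counts)
    return [v + c if i == top else v for i, v in enumerate(counts)]


def local_sensitivity(counts, c):
    """Largest per-coordinate change of the boosted vector over all
    neighbours obtained by moving a single vote between two classes."""
    base = add_constant_to_top(counts, c)
    worst = 0
    for src in range(len(counts)):
        if counts[src] == 0:
            continue  # no vote to move away from this class
        for dst in range(len(counts)):
            if dst == src:
                continue
            neighbour = list(counts)
            neighbour[src] -= 1
            neighbour[dst] += 1
            shifted = add_constant_to_top(neighbour, c)
            worst = max(worst, max(abs(a - b) for a, b in zip(base, shifted)))
    return worst


c = 1000
print(local_sensitivity([3, 120, 7, 95, 25], c))  # wide gap between the top two
print(local_sensitivity([10, 10, 5], c))          # tied top two
```

Under these assumptions the well-separated vector has sensitivity 1, because no single vote can move the constant, while the tied vector has sensitivity c + 1, because one vote relocates the constant to another index.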

Using the smooth sensitivity, we find that the $\beta$-smooth sensitivity of the data transformer behaves less like the global sensitivity and more like the local sensitivity. Given a specific dataset $D$, the $\beta$-smooth sensitivity can be $1$ when $f$ is a distance-3 data transformer function, which ensures that the largest local sensitivity over its neighbouring datasets is 1, i.e. $LS_f(D') = 1$ for every neighbouring $D'$. Otherwise, the $\beta$-smooth sensitivity can be $c + 1$.

Lemma (Noisy ArgMax Immutability).

Given any dataset $D$, a fixed noise perturbation vector and a data transformer function $f$, the noisy argmax of $f(D)$ is immutable when we add a sufficiently large constant $c$ to the current largest count of $f(D)$.

Lemma (Local Sensitivity with ArgMax Immutability).

Given any dataset $D$, its adjacent dataset $D'$, a fixed noise perturbation vector and a data transformer function $f$, the local sensitivity is $LS_f(D) = 1$ when the function $f$ is a distance-2 data transformer.

Theorem 3 (Differential Privacy with Noisy ArgMax Immutability).

Given any dataset $D$, its adjacent dataset $D'$, a fixed noise perturbation vector and a data transformer function $f$, when the sensitivity is 1 and the function $f$ is a distance-3 data transformer, the noisy argmax of both $f(D)$ and $f(D')$ is immutable and identical when we add a sufficiently large constant $c$ to the current largest count.

The proofs of the above lemmas and theorem are provided in the appendix. In essence, when fed a neighboring dataset $D'$, if the counts of $f(D)$ differ by $n$, the output of the ArgMax mechanism remains unchanged. This immutability, both to the noise and to the difference in counts caused by neighboring datasets, leaves the output of the teacher ensemble unchanged, and thus maintains the accuracy advantage of using teacher ensembles.

Discussion. From the above theorem, with smooth sensitivity the distance-$n$ data transformer $f$ has sensitivity 1 for some specific datasets. This suggests that for such a dataset, when we choose an appropriate $c$ (e.g. a very large constant), each query incurs a very small privacy cost. At the same time, the $\arg\max$ still preserves the useful information from the teacher ensemble.

3.3. Near-Zero-Cost Query Framework

Now, we are ready to describe our near-zero-cost (NZC) query framework. To protect the privacy of training data during learning, NZC transfers knowledge from an ensemble of teacher models trained on non-overlapping partitions of the data to a student model. Privacy guarantees may be understood intuitively and expressed rigorously in terms of differential privacy. The NZC framework consists of three key parts: (1) an ensemble of teacher models, (2) an aggregation and noise perturbation mechanism and (3) training of a student model.

Ensemble of teachers: In the scenario of teacher ensembles for classification, we first partition the dataset $D = \{(x_k, y_k)\}_{k=1}^{n}$ into $t$ disjoint sub-datasets $\{D_i\}_{i=1}^{t}$ and train each teacher $P_i$ separately on its own set $D_i$, where $i = 1, \dots, t$, $n$ is the number of samples in the dataset and $t$ is the number of teachers.
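The partitioning step can be sketched as below. The random shuffle before splitting and the round-robin assignment are assumptions made for illustration; the text only requires that the sub-datasets be disjoint. The toy dataset and function name are hypothetical.

```python
import random


def partition_dataset(examples, num_teachers: int, rng: random.Random):
    """Shuffle the examples and deal them into num_teachers disjoint shards."""
    if num_teachers <= 0 or num_teachers > len(examples):
        raise ValueError("need between 1 and len(examples) teachers")
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    # Shard t receives every num_teachers-th shuffled index, starting at t.
    return [[examples[i] for i in indices[t::num_teachers]]
            for t in range(num_teachers)]


# Toy dataset of 1,000 (sample, label) pairs split across 250 teachers.
examples = [(f"x{k}", k % 10) for k in range(1000)]
shards = partition_dataset(examples, 250, random.Random(0))
print(len(shards), len(shards[0]))
```

Each teacher here sees only 4 examples, which illustrates why a large ensemble can leave individual teachers with very little training data.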

Aggregation and noise perturbation mechanism: For each sample $x_k$, we collect the estimates of the labels given by each teacher, and construct a count vector $v(x_k) \in \mathbb{N}^d$, where each entry $v_j$ is given by $v_j = |\{i : P_i(x_k) = j,\ i = 1, \dots, t\}|$. For each mechanism with a fixed sample $x$, before adding random noise, we add a sufficiently large constant $c$, which gives a new count vector $\hat{v}(x, c)$. Our motivation is not to protect $x$, which comes from the student, but to protect the dataset $D$ held by the teacher. Basically, if we fix the partition, the teacher training and a query $x$ from the student, then we have a data transformer $f$ that transforms the target dataset $D$ into a count vector. More precisely, $x$ and a constant $c$ are used to define the data transformer $f_{x, c}$, and if we query $T$ times, then we have $T$ different data transformers, one for each query $x$. Then, by using a data transformer, we obtain a count vector $\hat{v}(x, c) = f_{x, c}(D)$.

Note that we use the following notation: $\hat{v}(x, c)$, also abbreviated as $\hat{v}$, denotes the output of the data transformer after adding a sufficiently large constant to the largest count, and $v(x)$ denotes the count vector before adding that constant.

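We add Laplacian random noise to the voting counts $\hat{v}(x, c)$ to introduce ambiguity:

$$\mathcal{M}(x) \triangleq \arg\max_j \left\{ \hat{v}(x, c)_j + \mathrm{Lap}\left(\frac{\Delta f}{\gamma}\right) \right\},$$

where $\gamma$ is a privacy parameter and $\mathrm{Lap}(b)$ is the Laplacian distribution with location $0$ and scale $b$. The parameter $\gamma$ influences the privacy guarantee, which we will analyze later.

Gaussian random noise is another choice for perturbing $\hat{v}(x, c)$ to introduce ambiguity:

$$\mathcal{M}(x) \triangleq \arg\max_j \left\{ \hat{v}(x, c)_j + \mathcal{N}(0, \sigma^2) \right\},$$

where $\mathcal{N}(0, \sigma^2)$ is the Gaussian distribution with mean $0$ and variance $\sigma^2$.

Intuitively, a small $\gamma$ and a large $\sigma$ lead to a strong privacy guarantee, but can degrade the accuracy of the labels returned by the pre-trained teacher models, as the noisy maximum above can differ from the true plurality.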

Unlike the original noisy argmax, our proposed immutable noisy argmax does not increase the privacy cost as the number of queries grows, provided we choose a sufficiently large constant $c$ and large random noise by setting a very small $\gamma$ for the Laplacian mechanism (or a large $\sigma$ for the Gaussian mechanism). Each query therefore costs almost zero privacy budget. By exploiting the property of immutable noisy argmax, we can answer a very large number of queries with near zero privacy budget (by setting a sufficiently large constant $c$ and large random noise for the mechanism).
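The aggregation step can be sketched end to end as follows. This is an illustrative reading of the mechanism, not the authors' released code: it assumes sensitivity 1 (so the Laplace scale is 1/gamma), and the teacher labels, the constant `c` and the value of `gamma` are hypothetical.

```python
import random


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def count_votes(teacher_labels, num_classes: int):
    """Count how many teachers voted for each class."""
    counts = [0] * num_classes
    for label in teacher_labels:
        counts[label] += 1
    return counts


def nzc_query(teacher_labels, num_classes: int, c: float,
              gamma: float, rng: random.Random) -> int:
    """Add c to the top count, perturb every count, return the noisy argmax."""
    counts = count_votes(teacher_labels, num_classes)
    top = max(range(num_classes), key=counts.__getitem__)
    counts[top] += c
    noisy = [v + sample_laplace(1.0 / gamma, rng) for v in counts]
    return max(range(num_classes), key=noisy.__getitem__)


rng = random.Random(0)
teacher_labels = [3] * 150 + [7] * 100   # 250 hypothetical teacher predictions
gamma = 1e-3                             # noise scale 1000, far above the vote gap
with_constant = [nzc_query(teacher_labels, 10, 10**6, gamma, rng) for _ in range(1000)]
without_constant = [nzc_query(teacher_labels, 10, 0, gamma, rng) for _ in range(1000)]
print(set(with_constant), len(set(without_constant)))
```

Setting `c = 0` recovers the plain noisy ArgMax, whose answers at this noise level are close to random, whereas the large constant keeps the plurality label in every trial.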

Student model:

The final step is to use the information returned by the teacher to train a student model. In previous works, due to the limited privacy budget, one can only query very few samples and optionally use semi-supervised learning to learn a better student model. Our proposed approach enables the student to issue a large number of queries with near zero privacy cost overall. As with the teacher models, the student model can be trained with any learning technique.

4. Privacy Analysis

We now analyze the differential privacy guarantees of our privacy counting approach. Namely, we keep track of the privacy budget throughout the student’s training using the moments accountant (Abadi et al., 2016). When teachers reach a strong quorum, this allows us to bound privacy costs more strictly.

4.1. Moment Accountant

To better keep track of the privacy cost, we use recent advances in privacy cost accounting. The moments accountant was introduced by (abadi2016deep), building on previous work (Bun and Steinke, 2016; Dwork and Rothblum, 2016; Mironov, 2016).

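Definition (Moments Accountant).

Let $\mathcal{M}: \mathcal{D} \to \mathcal{R}$ be a randomized mechanism and $D, D'$ a pair of adjacent databases. Let aux denote an auxiliary input. The moments accountant is defined as:

$$\alpha_{\mathcal{M}}(\lambda) \triangleq \max_{\mathrm{aux}, D, D'} \alpha_{\mathcal{M}}(\lambda; \mathrm{aux}, D, D'),$$

where $\alpha_{\mathcal{M}}(\lambda; \mathrm{aux}, D, D') \triangleq \log \mathbb{E}\left[\exp\left(\lambda\, C(\mathcal{M}, \mathrm{aux}, D, D')\right)\right]$ is the logarithm of the moment generating function of the privacy loss random variable.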

The moments accountant enjoys good properties of composability and tail bound as given in (abadi2016deep):

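[Composability]. Suppose that a mechanism $\mathcal{M}$ consists of a sequence of adaptive mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_k$, where $\mathcal{M}_i : \prod_{j=1}^{i-1} \mathcal{R}_j \times \mathcal{D} \to \mathcal{R}_i$. Then, for any output sequence $o_1, \dots, o_{k-1}$ and any $\lambda$,

$$\alpha_{\mathcal{M}}(\lambda; D, D') \le \sum_{i=1}^{k} \alpha_{\mathcal{M}_i}(\lambda; o_1, \dots, o_{i-1}, D, D'),$$

where $\alpha_{\mathcal{M}}$ is conditioned on $\mathcal{M}_i$’s output being $o_i$ for $i < k$.

[Tail bound] For any $\epsilon > 0$, the mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private for

$$\delta = \min_{\lambda} \exp\left(\alpha_{\mathcal{M}}(\lambda) - \lambda \epsilon\right).$$

Using these two properties, we can bound the moments of a randomized mechanism in terms of each sub-mechanism, and then convert the moments accountant into $(\epsilon, \delta)$-differential privacy using the tail bound.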

4.2. Analysis of Our Approach

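Theorem 4 (Laplacian Mechanism with Teacher Ensembles).

Suppose that on neighboring databases $D$, $D'$, the voting counts $\hat{v}(x, c)$ differ by at most $\Delta f$ in each coordinate. Let $\mathcal{M}_{x, c}$ be the mechanism that reports $\arg\max_j \left\{ \hat{v}(x, c)_j + \mathrm{Lap}(\Delta f / \gamma) \right\}$. Then $\mathcal{M}_{x, c}$ satisfies $(2\gamma, 0)$-differential privacy. Moreover, for any $\lambda$, aux, $D$ and $D'$,

$$\alpha(\lambda; \mathrm{aux}, D, D') \le 2\gamma^2 \lambda(\lambda + 1).$$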

For each query $x$, we use the aggregation mechanism with noise $\mathrm{Lap}(\Delta f / \gamma)$, which is $(2\gamma, 0)$-DP. Thus over $T$ queries, we get $\left(4T\gamma^2 + 2\gamma\sqrt{2T \ln\frac{1}{\delta}},\ \delta\right)$-differential privacy (dwork2014foundations). In our approach, we can choose a very small $\gamma$ for the mechanism of each query $x$, which leads to a very small privacy cost per query and thus a low overall privacy budget. Overall, the privacy cost approaches zero as $\gamma \to 0$. Note that $\gamma$ is a very small number but is not exactly zero: we can set $\gamma$ small enough to produce a very large noise scale that is still smaller than the constant $c$ added in $\hat{v}$. Similar results are also used in PATE (papernot2017semi); both our work and PATE are based on the proof of (dwork2014foundations). Note that, for neighboring databases $D$, $D'$, each teacher gets the same training data partition (that is, the same for the teacher with $D$ and with $D'$, not the same across teachers), with the exception of one teacher whose corresponding training data partition differs.
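The effect of a small gamma on the total privacy cost can be worked through numerically. The sketch below assumes two bounds in the form used by PATE (papernot2017semi): the composition bound of 4Tγ² + 2γ·sqrt(2T ln(1/δ)) over T queries, and the per-query moment bound α(λ) ≤ 2γ²λ(λ+1) combined with the tail bound, which gives ε = min over λ of (T·α(λ) + ln(1/δ)) / λ. The function names, the search range for λ and the parameter values are illustrative assumptions.

```python
import math


def composition_epsilon(gamma: float, num_queries: int, delta: float) -> float:
    """Epsilon from the composition bound over num_queries (2*gamma)-DP queries."""
    return (4.0 * num_queries * gamma ** 2
            + 2.0 * gamma * math.sqrt(2.0 * num_queries * math.log(1.0 / delta)))


def moments_epsilon(gamma: float, num_queries: int, delta: float,
                    max_order: int = 512) -> float:
    """Epsilon from summed per-query moment bounds and the tail bound,
    minimised over integer orders 1..max_order."""
    best = math.inf
    for order in range(1, max_order + 1):
        total_moment = num_queries * 2.0 * gamma ** 2 * order * (order + 1)
        best = min(best, (total_moment + math.log(1.0 / delta)) / order)
    return best


delta = 1e-5
# Moderate noise (gamma = 0.05) over 100 queries.
eps_comp = composition_epsilon(0.05, 100, delta)
eps_mom = moments_epsilon(0.05, 100, delta)
# Very large noise (gamma = 1e-4) over 9,000 queries.
eps_comp_small = composition_epsilon(1e-4, 9000, delta)
eps_mom_small = moments_epsilon(1e-4, 9000, delta)
print(f"{eps_comp:.3f} {eps_mom:.3f} {eps_comp_small:.4f} {eps_mom_small:.4f}")
```

Under these assumed bounds, shrinking gamma by a few orders of magnitude lets thousands of queries be answered for a total epsilon well below 0.1, which is only usable because the added constant keeps the answers informative at that noise level.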

The analysis of the Gaussian mechanism is based on Rényi differential privacy, and details are discussed in (papernot2018scalable). As with the Laplacian mechanism, we obtain a near-zero overall privacy budget by setting a large $\sigma$ and an even larger constant $c$.

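In the following, we show the relation between the constant $c$ and $\gamma$, and between $c$ and $\sigma$, in the two mechanisms when the sensitivity is 1. We first recall the following basic facts about the Laplacian and Gaussian distributions: if $\zeta \sim \mathrm{Lap}(1/\gamma)$ and $\xi \sim \mathcal{N}(0, \sigma^2)$, then for $c > 0$,

$$\Pr(|\zeta| \ge c) = e^{-\gamma c} \quad \text{and} \quad \Pr(|\xi| \ge c) \le 2 e^{-\frac{c^2}{2\sigma^2}}.$$

Now if each $|\zeta_j| < c$ (resp. $|\xi_j| < c$) for $j = 1, \dots, d$, where $d$ is the number of classes, then the $\arg\max$ will not change. We can apply a simple union bound to get an upper bound on the probability of these events:

$$\Pr\left(\max_{j} |\zeta_j| \ge c\right) \le d\, e^{-\gamma c} \quad \text{and} \quad \Pr\left(\max_{j} |\xi_j| \ge c\right) \le 2d\, e^{-\frac{c^2}{2\sigma^2}}.$$

Thus to obtain a failure probability at most $\tau$, in the Laplacian case we can take $c = \frac{1}{\gamma} \log(d/\tau)$, and in the Gaussian case we can take $c = \sqrt{2\sigma^2 \log(2d/\tau)}$.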
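The choice of the constant can be computed directly. The sketch below assumes the tail bounds Pr(|Lap(1/γ)| ≥ c) = exp(−γc) and Pr(|N(0, σ²)| ≥ c) ≤ 2·exp(−c²/(2σ²)) together with a union bound over the classes; `num_classes`, `tau` and the parameter values are hypothetical, and the code only checks that the noise magnitudes stay below the constant with the stated probability.

```python
import math


def laplace_constant(gamma: float, num_classes: int, tau: float) -> float:
    """Constant c for which the Laplacian union bound equals tau."""
    return math.log(num_classes / tau) / gamma


def gaussian_constant(sigma: float, num_classes: int, tau: float) -> float:
    """Constant c for which the Gaussian union bound equals tau."""
    return math.sqrt(2.0 * sigma ** 2 * math.log(2.0 * num_classes / tau))


def laplace_union_bound(gamma: float, num_classes: int, c: float) -> float:
    """Upper bound on the chance that any Laplace noise term reaches c."""
    return num_classes * math.exp(-gamma * c)


def gaussian_union_bound(sigma: float, num_classes: int, c: float) -> float:
    """Upper bound on the chance that any Gaussian noise term reaches c."""
    return 2.0 * num_classes * math.exp(-c ** 2 / (2.0 * sigma ** 2))


num_classes, tau = 10, 1e-6
gamma, sigma = 1e-4, 1e4
c_lap = laplace_constant(gamma, num_classes, tau)
c_gauss = gaussian_constant(sigma, num_classes, tau)
print(f"Laplace c = {c_lap:.0f}, Gaussian c = {c_gauss:.0f}")
```

The required constant grows only logarithmically in 1/tau but linearly in the noise scale, so a very small gamma (or large sigma) mainly dictates how large the constant must be.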

5. Experiments

In this section, we evaluate our proposed method along with previously proposed models.

5.1. Experimental Setup

We perform our experiments on two widely used datasets on differential privacy: SVHN (Netzer2011) and MNIST (LeCun1998). MNIST and SVHN are two well-known digit image datasets consisting of 60K and 73K training samples, respectively. We use the same data partition method and train the 250 teacher models as in (papernot2017semi). In more detail, for MNIST, we use 10,000 samples as the student dataset, and split it into 9,000 and 1,000 as a training and testing set for the experiment. For SVHN, we use 26,032 samples as the student dataset, and split it into 10,000 and 16,032 as training and testing set. For both MNIST and SVHN, the teacher uses the same network structure as in (papernot2017semi).

| Dataset | Aggregator | Queries answered | Privacy bound | Student accuracy | Clean votes accuracy | Ground truth accuracy |
|---|---|---|---|---|---|---|
| MNIST | LNMax ($\gamma$=20) | 100 | 2.04 | 63.5% | 94.5% | 98.1% |
| | LNMax ($\gamma$=20) | 1,000 | 8.03 | 89.8% | | |
| | LNMax ($\gamma$=20) | 5,000 | 8.03 | 94.1% | | |
| | LNMax ($\gamma$=20) | 9,000 | 8.03 | 93.4% | | |
| | NZC (, ) | 9,000 | 0 | 95.1% | | |
| | NZC (5 teachers only) | 9,000 | 0 | 97.8% | 97.5% | |
| SVHN | LNMax ($\gamma$=20) | 500 | 5.04 | 54.0% | 85.8% | 89.3% |
| | LNMax ($\gamma$=20) | 1,000 | 8.19 | 64.0% | | |
| | LNMax ($\gamma$=20) | 5,000 | 8.19 | 79.5% | | |
| | LNMax ($\gamma$=20) | 10,000 | 8.19 | 84.6% | | |
| | NZC (, ) | 10,000 | 0 | 85.7% | | |
| | NZC (5 teachers only) | 10,000 | 0 | 87.1% | 87.1% | |

Table 1. Classification accuracy and privacy of the students. LNMax refers to the method from (papernot2017semi). The number of teachers is set to 250 unless otherwise mentioned. We set $\delta = 10^{-5}$ to compute values of $\epsilon$ (with the exception of SVHN, where $\delta = 10^{-6}$). Clean votes refers to a student that is trained on the noiseless votes from all teachers. Ground truth refers to a student that is trained with ground truth query labels.
| Dataset | LNMax | NZC | Clean |
|---|---|---|---|
| MNIST | 93.02% | 94.33% | 94.37% |
| SVHN | 87.11% | 88.08% | 88.06% |

Table 2. Label accuracy of teacher ensembles when compared to the ground truth labels from various methods using 250 teachers. Clean denotes the aggregation without adding any noise perturbation.

5.2. Results on Teacher Ensembles

We primarily compare with (papernot2017semi), which also employs a teacher-student framework and has demonstrated strong performance. We did not compare with (papernot2018scalable) because its improvements are mostly in the privacy budget, while the improvement in student performance is marginal. (The open source implementation at https://github.com/tensorflow/privacy only generates Table 2 of the original paper (papernot2018scalable), which does not provide any model performance.) We used the implementation from the official GitHub repository (https://github.com/tensorflow/privacy); however, we were unable to reproduce the semi-supervised results. Therefore, in the following, we compare the results under the fully supervised setting for both approaches.

The results on the MNIST and SVHN datasets are shown in Table 1. The proposed approach achieves both better accuracy and a much lower privacy cost. In particular, the results are very close to the baseline results, where the student is trained using the non-perturbed votes from the teacher ensemble. The main reason is that NZC is more robust to the random perturbations for most of the queries, which helps the student obtain better-quality labels for training. We also achieve a low privacy cost, because our approach allows us to use a very large noise scale, as long as the constant $c$ is set to a suitably large value. To check this intuition, we calculate the number of correctly labeled queries from the teacher ensembles; the result is shown in Table 2. Our approach is clearly more robust to noise perturbation than the previous approach.

5.3. Parameter Analysis

The number of teachers has a significant impact on the performance of the student, because the teachers are trained on non-overlapping splits of the data. The more teachers there are, the less data each teacher has for training. This leads to less accurate individual teachers, which are therefore less likely to cast a correct vote for a query. As can be seen from Fig 2a, the performance of the teacher ensemble decreases as the number of teachers increases. This is more prominent for more challenging datasets (e.g. SVHN performance drops more significantly than MNIST). We note that, although the number of qualified samples increases as the number of teachers increases (see Fig 2c), this comes at the cost of more wrongly labeled queries, since the overall accuracy of the teachers has decreased. As a result, student performance is likely to be worse. However, previous approaches such as PATE (papernot2017semi) and Scale PATE (papernot2018scalable) require a large number of teachers due to privacy budget constraints. Our approach does not have this limitation. Therefore, we experimented with fewer teachers, and the results are shown in Table 2. The results with fewer teachers improve significantly and approach the performance obtained when training the student with the ground truth.
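To make the notion of qualified samples concrete, the sketch below computes the share of queries that pass a distance-n check. It assumes that a query is qualified when the gap between the largest and second-largest vote counts is strictly larger than n; the function names and the toy vote counts are hypothetical.

```python
import numpy as np

def top_two_gap(counts):
    """Difference between the largest and second-largest vote counts."""
    ordered = np.sort(np.asarray(counts))[::-1]
    return int(ordered[0] - ordered[1])

def qualified_fraction(vote_matrix, n):
    """Share of queries whose top-two gap is strictly larger than n."""
    gaps = np.array([top_two_gap(row) for row in vote_matrix])
    return float(np.mean(gaps > n))

# Toy example: 4 queries, 3 classes, 10 teachers per query.
votes = np.array([[8, 1, 1], [5, 4, 1], [4, 3, 3], [7, 3, 0]])
frac = qualified_fraction(votes, n=3)  # gaps are 7, 1, 1, 4
```

In this toy example two of the four queries have a gap above 3, so `frac` is 0.5.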

(%) 1 5 10 25 50 100 250
MNIST 98.99 98.31 96.71 95.03 91.94 91.45 81.18
SVHN 93.99 93.21 91.2 88.93 85.78 82.7 75.93
Table 3. Average label accuracy (%) of individual teacher models, compared to the ground truth labels, for different numbers of teachers (1 to 250).
(a) Number of teachers versus Performance
(b) Distance-n versus Qualified Samples with 250 Teachers
(c) Distance-3 Qualified Samples versus Number of Teachers
Figure 2. (a) shows the trade-off between the number of teachers and the performance of the teacher ensemble; (b) shows the percentage of qualified samples that satisfy distance-n over the whole dataset when using 250 teachers; (c) shows the percentage of distance-3 qualified samples over the dataset.

5.4. Discussion

As can be observed from the results, thanks to the proposed mechanism with smooth sensitivity, NZC can use far fewer teacher models, which improves the performance of the individual teacher models because more training data is available to each of them. A well-trained deep learning model requires a much larger dataset when the task is more challenging (e.g. see (sun2017revisiting; Papernot2017)). In this case, NZC offers a more appealing solution for challenging real-world applications that require more data to train: it yields a well-performing student model at a low privacy budget while using fewer teachers, which in turn improves the performance of each individual teacher (see Table 3). It is also interesting to note that PATE can be viewed as a special case of our proposal, obtained when we use a distance-0 data transformer.

6. Related Work

Differential privacy is increasingly regarded as a standard privacy principle that guarantees provable privacy protection (beimel2014bounds). Early work adopting differential privacy focuses on restricted classifiers with convex loss (bassily2014differentially; chaudhuri2011differentially; hamm2016learning; pathak2010multiparty; song2013stochastic). Stochastic gradient descent with differentially private updates was first discussed in (song2013stochastic), where each gradient update is perturbed by random exponential noise. Then, (abadi2016deep) proposed DP-SGD, a new optimizer that carefully adds random Gaussian noise to stochastic gradient descent for privacy-preserving deep learning. At each step, given a random set of examples, DP-SGD computes the gradient, clips the ℓ2 norm of each gradient, adds random Gaussian noise for privacy protection, and updates the model parameters based on the noisy gradient. The same work also introduced the moments accountant, which gives a more precise privacy estimate.
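The per-step procedure described above can be sketched as follows. This is an illustrative NumPy sketch rather than the reference implementation of (abadi2016deep): the function name `dp_sgd_step` and its parameters (`clip_norm`, `noise_multiplier`) are assumptions, per-example gradients are assumed to be supplied by the caller, and privacy accounting is omitted.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each gradient, add Gaussian noise, then step."""
    grads = np.asarray(per_example_grads, dtype=float)  # shape (batch, dim)
    # Scale each per-example gradient so its l2 norm is at most clip_norm.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * factors
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(grads)
    return params - lr * noisy_mean

# Toy usage with two hypothetical per-example gradients.
rng = np.random.default_rng(0)
params = np.zeros(2)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
updated = dp_sgd_step(params, grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=1.1, rng=rng)
```

Clipping bounds the influence of any single example on the update, which is what lets the noise scale be calibrated to `clip_norm`.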

Intuitively, DP-SGD can be easily adopted by most existing deep neural network models built on the SGD optimizer. Based on DP-SGD, (agarwal2018cpsgd) applies differential privacy to distributed stochastic gradient descent to achieve both communication efficiency and privacy preservation. (mcmahan2017learning) applies differential privacy to LSTM language models by combining federated learning and differentially private SGD to guarantee user-level privacy.

(papernot2017semi) proposed a general approach, private aggregation of teacher ensembles (PATE), which uses the teacher models' aggregate voting decisions to transfer knowledge for student model training. Our framework is also inspired by PATE, with a modification to the aggregation mechanism. To address the privacy issue, PATE adds carefully calibrated Laplacian noise to the aggregate voting decisions before they are communicated to the student. To improve the scalability of the original PATE model, (papernot2018scalable) proposed an advanced version of PATE that optimizes the voting behavior of the teacher models with Gaussian noise. PATE-GAN (jordon2018pate) applies PATE to GANs to provide privacy guarantees for data generated from the original data. However, existing PATE and Scale PATE spend a large privacy budget and train many teacher models. Our new approach overcomes these two limitations and achieves better performance in both accuracy and privacy budget. Unlike PATE and our model, DP-SGD is not a teacher-student model.
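The Laplacian noisy-max aggregation used by PATE can be sketched in a few lines. This is a simplified illustration, assuming Laplace noise of scale 1/gamma on each class count; the function name `pate_noisy_max` and the toy teacher predictions are hypothetical, and no privacy accounting is performed.

```python
import numpy as np

def pate_noisy_max(teacher_labels, num_classes, gamma, rng):
    """Aggregate teacher votes by taking the argmax of Laplace-noised counts."""
    # Count how many teachers voted for each class.
    counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
    # Perturb every count independently with Laplace noise of scale 1/gamma.
    noisy = counts + rng.laplace(0.0, 1.0 / gamma, size=num_classes)
    return int(np.argmax(noisy)), counts

# Toy usage: five hypothetical teachers voting over three classes.
rng = np.random.default_rng(0)
label, counts = pate_noisy_max(np.array([2, 2, 2, 1, 0]), num_classes=3,
                               gamma=0.05, rng=rng)
```

A smaller `gamma` means more noise and stronger privacy, but also a higher chance that the returned label differs from the plurality vote.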

(nissim2007smooth) first proposed smooth sensitivity and proved the DP guarantee under a data-dependent privacy analysis. Later, (papernot2017semi; papernot2018scalable) used a similar idea to study data-dependent privacy analysis in different scenarios. Compared with global sensitivity, smooth sensitivity gives a tighter, data-dependent sensitivity estimate, which allows less noise to be added per query. As a result, data-dependent differential privacy can improve both performance and privacy cost. For this reason, we adopt it in the mechanism proposed in this work.
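For reference, the three sensitivity notions compared above can be written as follows. The notation is an assumption and may differ from the main text: $f$ is the query, $D, D'$ are datasets, $d(\cdot,\cdot)$ counts the number of differing records, and $\beta > 0$ is the smoothing parameter.

```latex
\begin{align*}
GS_f &= \max_{D, D' :\, d(D, D') = 1} \lVert f(D) - f(D') \rVert_1
  && \text{(global sensitivity)} \\
LS_f(D) &= \max_{D' :\, d(D, D') = 1} \lVert f(D) - f(D') \rVert_1
  && \text{(local sensitivity at } D\text{)} \\
S^{*}_{f,\beta}(D) &= \max_{D'} \; LS_f(D') \, e^{-\beta\, d(D, D')}
  && \text{(} \beta\text{-smooth sensitivity)}
\end{align*}
% For every dataset D:  LS_f(D) \le S^{*}_{f,\beta}(D) \le GS_f.
```

Because the smooth sensitivity never exceeds the global sensitivity, noise calibrated to it is never larger, and is often much smaller on well-behaved datasets.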

7. Conclusion

We propose a novel voting mechanism with smooth sensitivity, the immutable noisy ArgMax, which produces stable output while tolerating very large noise. Based on this mechanism, we propose a simple but effective method for differential privacy under the teacher-student framework using smooth sensitivity. Our method benefits from the noise tolerance of the immutable noisy ArgMax, which leads to a near-zero privacy budget. Theoretically, we provide a detailed privacy analysis of the proposed approach. Empirically, our method outperforms previous methods in terms of both accuracy and privacy budget.

Appendix A Proofs

See 7


First, let us recall some facts discussed in the main paper. We first recall the following basic facts about the Laplacian and Gaussian distributions: if and , then for ,


Now if each (resp. ) for , then the will not change. We can apply a simple union bound to get an upper bound on the probability of these events.


Thus to obtain a failure probability at most , in the Laplacian case we can take , and in the Gaussian case we can take .
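The standard tail bounds behind this argument can be sketched as follows. All symbol names here are assumptions and may not match the notation of the main text: $b$ is the Laplace scale, $\sigma$ the Gaussian standard deviation, $L$ the number of classes, $t$ the threshold on the noise magnitude, and $\tau$ the target failure probability.

```latex
\begin{align*}
X \sim \mathrm{Lap}(b):\quad
  & \Pr\big[\,|X| \ge t\,\big] = e^{-t/b}, \\
Y \sim \mathcal{N}(0, \sigma^2):\quad
  & \Pr\big[\,|Y| \ge t\,\big] \le 2\, e^{-t^2/(2\sigma^2)}, \\
\text{union bound over } L \text{ coordinates}:\quad
  & \Pr\Big[\max_{j} |X_j| \ge t\Big] \le L\, e^{-t/b}, \qquad
    \Pr\Big[\max_{j} |Y_j| \ge t\Big] \le 2L\, e^{-t^2/(2\sigma^2)}, \\
\text{failure probability at most } \tau:\quad
  & t \ge b \,\ln\frac{L}{\tau} \ \ \text{(Laplacian)}, \qquad
    t \ge \sigma \sqrt{2 \ln\frac{2L}{\tau}} \ \ \text{(Gaussian)}.
\end{align*}
```

Under this sketch, if every noise draw has magnitude below $t$, then adding a constant of at least $2t$ to the largest count is sufficient to keep the argmax unchanged, since the difference of any two noise draws is then larger than $-2t$.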

Since we have a sufficiently large constant , or , then minus any sampled noise from either the Gaussian or the Laplacian distribution is larger than 0 with probability , where we can set to a very small number close to 0. Then, adding a positive number to the largest count of does not change the argmax result. ∎

See 8


First, we have (or ) and the function is a distance-2 data transformer. For any adjacent , is immutable, since can only modify count due to the . Because the distance is larger than 2, any modification of would not change the argmax. Assume, for contradiction, that the argmax changes, and let denote the largest count and denote the second largest count:

which conflicts with the distance-2 property of in every case. Hence and have the same argmax. ∎
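The argument can be checked exhaustively on a single vote histogram. The sketch below assumes that moving to an adjacent dataset changes the vote of at most one teacher, so one count decreases by 1 and another increases by 1; the function name and the example histograms are hypothetical.

```python
import numpy as np

def argmax_stable_under_one_vote_change(counts):
    """Check that no single changed vote can alter the argmax of the counts."""
    counts = np.asarray(counts)
    top = int(np.argmax(counts))
    k = len(counts)
    for src in range(k):          # class that loses one vote
        if counts[src] == 0:
            continue              # no teacher voted for this class
        for dst in range(k):      # class that gains that vote
            if dst == src:
                continue
            moved = counts.copy()
            moved[src] -= 1
            moved[dst] += 1
            if int(np.argmax(moved)) != top:
                return False
    return True

stable = argmax_stable_under_one_vote_change([6, 3, 1])    # top-two gap is 3
unstable = argmax_stable_under_one_vote_change([5, 4, 1])  # top-two gap is 1
```

A single changed vote shrinks the top-two gap by at most 2, so a gap larger than 2 leaves the winner strictly ahead; here `stable` is True and `unstable` is False.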

See 3


Given a dataset , by using Lemma 2, the local sensitivity of all neighbouring data is 1, while is a distance-2 data transformer for all . It therefore requires to be a distance-3 data transformer to ensure that the upper bound of the smooth sensitivity of is 1 for all .

By using Lemma 1, we can see that adding a sufficiently large count and noise perturbation also does not change the argmax information for both and . Then, we have the same argmax output over any , and DP also holds. ∎