Privacy concerns have surfaced with the increased adoption of machine learning in domains like healthcare (LeCun et al., 2015). One widely adopted framework for measuring privacy characteristics of randomized algorithms, such as machine learning techniques, is differential privacy (Dwork et al., 2006). Abadi et al. (2016) introduced an algorithm for differentially private stochastic gradient descent (DP-SGD), which made it feasible to scale differential privacy guarantees to neural networks. DP-SGD is now the de facto algorithm used for training neural networks with privacy guarantees.
There is, however, a crucial problem for models with a large number of parameters: it is difficult to achieve both non-trivial privacy guarantees and good accuracy. The reasons for this are numerous and involved, but the basic intuition is that DP-SGD involves clipping gradients and then adding noise to those gradients, and gradient clipping has an increasing relative effect as the number of model parameters grows. This reduces the applicability of differential privacy to deep learning in practice, where strong performance tends to require a large number of model parameters. For example, the best performing CIFAR-10 classifier from Zagoruyko and Komodakis (2016) has 32.5 million parameters, while the private baseline we describe in Section 4.1 has about 26 thousand.
In this work, we propose a method to mitigate this problem, making DP learning feasible for modern image classifiers. The method is based on the observation that each DP-SGD update consumes privacy budget in proportion to the quotient of the batch size and the training-set size. Thus, by increasing the number of effective training examples, we can improve accuracy while maintaining the same privacy guarantees. Since obtaining more private training samples will generally be nontrivial, we focus on using ‘public’ data to augment the training set. This involves assuming that there will be unlabeled public data that is sufficiently related to our private data to be useful. (If there is labeled public data, the situation is even better, but this is likely to be rare and is not the setting we consider here.) We think this is a reasonable assumption to make, for two reasons. First, it is an assumption made by most of the semi-supervised learning literature (Oliver et al., 2018). Second, one of our techniques will explicitly address the situation where the ‘sufficiently related’ clause partially breaks down.
In summary, our contributions are:
- We introduce an algorithm (DiversePublic) to select diverse representative samples from a public dataset to fine-tune a DP classifier.
- We then describe another algorithm (NearPrivate) that pays a privacy cost to reference the private training data when querying the public dataset for representative samples.
- We establish new state-of-the-art results on the MNIST and SVHN datasets by fine-tuning DP models using simple active learning techniques, and then improve upon those results further using DiversePublic and NearPrivate.
- We open source all of our experimental code. (We will release the repository in the next update.)
2.1 Differential Privacy
We reason about privacy within the framework of differential privacy (Dwork et al., 2006). In this paper, the random algorithm we analyze is the training algorithm, and guarantees are measured with respect to the training data.
Informally, an algorithm is said to be differentially private if its behavior is indistinguishable on pairs of datasets that only differ by one point. That is, an observer cannot tell whether a particular point was included in the model’s training set simply by observing the output of the training algorithm. Formally, for a training algorithm $M$ to be $(\varepsilon, \delta)$-differentially private, we require that it satisfies the following property for all pairs of datasets $d, d'$ differing in exactly one data point and all possible subsets $S$ of outputs:
$$\Pr[M(d) \in S] \le e^{\varepsilon} \, \Pr[M(d') \in S] + \delta$$
Here, $\delta$ is a (small) probability for which we tolerate the property not to be satisfied. The parameter $\varepsilon$ measures the strength of the privacy guarantee: the smaller the value of $\varepsilon$ is, the stronger the privacy guarantee is.
This guarantee is such that the output of a differentially private algorithm can be post-processed with no impact on the strength of the guarantee provided. In case privacy needs to be defined at a different granularity than individual training points, the guarantee degrades by a factor which, naively, is the number of points that are included in the granularity considered. See Dwork et al. (2014a) for further information.
2.2 Differentially Private Stochastic Gradient Descent (DP-SGD)
Building on earlier work (Chaudhuri et al., 2011; Song et al., 2013), Abadi et al. (2016) introduce a variant of stochastic gradient descent to train deep neural networks with differential privacy guarantees. Two modifications are made to vanilla SGD. First, gradients are computed on a per-example basis and clipped to have a maximum known norm. This bounds the sensitivity of the training procedure with respect to each individual training point. Second, noise calibrated to have a standard deviation proportional to this sensitivity is added to the average gradient. This results in a training algorithm known as differentially private SGD (DP-SGD). Unfortunately (as discussed in Section 1), DP-SGD does not perform well for models with large parameter counts, which motivates the improvements proposed in the next section.
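The two modifications to vanilla SGD can be illustrated with a minimal NumPy sketch. It assumes a flat parameter vector and per-example gradients that have already been computed; the function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are illustrative choices, not identifiers taken from Abadi et al. (2016) or from TensorFlow Privacy. Noise is added to the sum of clipped gradients before averaging, which is one common way to make the noise scale proportional to the sensitivity.

```python
import numpy as np


def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One illustrative DP-SGD update on a flat parameter vector.

    per_example_grads has shape (batch_size, num_params).
    """
    # Modification 1: clip each example's gradient to L2 norm <= clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # Modification 2: add Gaussian noise whose standard deviation is
    # proportional to the clipping norm (the per-example sensitivity).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]

    return params - lr * noisy_mean
```

With `noise_multiplier` set to zero the step reduces to SGD on clipped gradients, which is a convenient way to check the clipping logic in isolation.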
3 Improving Differentially Private Models with Active Learning
This paper introduces the following high-level process to improve the performance of an existing DP classifier: First, find a public insensitive dataset containing relevant samples. Second, carefully select a subset of public samples that can improve the performance of the classifier. Third, fine-tune the existing classifier on the selected samples with labels. We want to perform the selection and the fine-tuning in a way which does not compromise the privacy guarantees of the existing classifier.
The first step can be done using domain knowledge about the classifier, e.g., utilizing relevant public genomic datasets for a DP classifier of genomics data. We assume standard fine-tuning techniques for the last step. Therefore, the problem boils down to efficiently selecting samples from the public dataset while preserving privacy. We also assume a limit on the number of selected samples. This limit is relevant when the samples are unlabeled, in which case it controls the cost of labeling (e.g., hiring human annotators to process the selected samples).
We introduce two active learning algorithms, DiversePublic and NearPrivate, for sample selection. These algorithms make different assumptions about access to the private training data but are otherwise drop-in replacements in the end-to-end process.
3.1 Problem Statement and Baseline Methods
We are given a differentially private model trained and tested on private sensitive data and respectively. The privacy cost of training on is (we omit for brevity in this paper, but it composes similarly). We also have an extra set of public insensitive unlabeled data which can be utilized to further improve . Given the number of extra data points that we are allowed to request labels for, and the total privacy budget of the improved model , we want to efficiently pick where , using which we can fine-tune to of better performance on . The simplest baseline method we will use is to choose ‘random’ samples out of for fine-tuning. The other baseline that we use is uncertainty sampling (Settles, 2009), according to either the ‘entropy’ of the logits or the ‘difference’ between the two largest logits. These are widely used active learning methods and serve as strong baselines.
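The two uncertainty baselines can be sketched as follows. This is a hedged illustration: it assumes the ‘entropy’ is taken over the softmax of the logits, and the function names (`entropy_scores`, `margin_scores`, `select_uncertain`) are hypothetical rather than taken from the authors' code.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def entropy_scores(logits):
    """Entropy of the predicted class distribution; higher means more uncertain."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=1)


def margin_scores(logits):
    """Difference between the two largest logits; smaller means more uncertain."""
    top2 = np.sort(logits, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]


def select_uncertain(logits, n, method="entropy"):
    """Return the indices of the n most uncertain rows of `logits`."""
    if method == "entropy":
        order = np.argsort(-entropy_scores(logits))  # highest entropy first
    else:
        order = np.argsort(margin_scores(logits))    # smallest margin first
    return order[:n]
```

Both criteria only read the outputs of the DP model on public data, so by the post-processing property they add no privacy cost.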
3.2 The DiversePublic Method
The baseline methods may select many redundant examples to label.
To efficiently select diverse representative samples, we propose the DiversePublic method, adapted from clustering based active learning methods (Nguyen and Smeulders, 2004).
Given a DP model of cost , we obtain the ‘embeddings’ (the activations before the logits) of all , and perform PCA on as is done in Raghu et al. (2017).
Then we select a number (more than ) of uncertain points (according to, e.g., the logit entropy of the private model) from , project their embeddings onto the top few principal components, and cluster those projections into groups.
Finally, we pick a number of samples from each representative group up to in total and fine-tune the DP model with these .
Though this procedure accesses , it adds nothing to , since there is no private data referenced and the output of DP models can be post-processed at no additional privacy cost.
It can be applied even to models for which we cannot access the original training data (e.g., models published to a repository like TensorFlow Hub: https://www.tensorflow.org/hub). Algorithm 1 presents more details of the DiversePublic method.
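The selection pipeline described in this section (PCA on public embeddings, filtering to uncertain points, clustering their projections, and picking a few points per cluster) can be sketched in NumPy as below. This is an illustration under stated assumptions, not Algorithm 1 itself: it uses plain k-means as the clustering step and picks the points closest to each centroid, neither of which is fixed by the text above, and all function and parameter names are hypothetical.

```python
import numpy as np


def pca_components(x, num_components):
    """Return the data mean and the top principal directions (rows)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:num_components]


def nearest_center(x, centers):
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(x, k, rng, iters=50):
    """Plain Lloyd's algorithm, used here as an assumed clustering step."""
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest_center(x, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return nearest_center(x, centers), centers


def diverse_public_select(embeddings, uncertainty, num_uncertain,
                          num_clusters, per_cluster, num_components, rng):
    """Pick up to num_clusters * per_cluster public indices to label."""
    # PCA is fit on the embeddings of all public points.
    mean, comps = pca_components(embeddings, num_components)
    # Keep only the most uncertain public points and project them.
    uncertain_idx = np.argsort(-uncertainty)[:num_uncertain]
    proj = (embeddings[uncertain_idx] - mean) @ comps.T
    labels, centers = kmeans(proj, num_clusters, rng)

    chosen = []
    for j in range(num_clusters):
        members = np.flatnonzero(labels == j)
        # Assumption: take the members closest to the cluster centroid.
        dist = np.linalg.norm(proj[members] - centers[j], axis=1)
        chosen.extend(uncertain_idx[members[np.argsort(dist)[:per_cluster]]])
    return np.array(chosen)
```

Every input to this routine is derived from the DP model's outputs on public data, which is why the sketch never touches the private training set.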
3.3 The NearPrivate Method
The DiversePublic method works well under the assumption that has the same distribution as . However, this may not be a reasonable assumption in general (Oliver et al., 2018). For instance, there may be a subset of about which our pre-trained model has high uncertainty but which cannot improve performance if sampled. This may be because that subset contains corrupted data or it may be due simply to distribution shift. In order to mitigate this issue, we propose to check query points against our private data and decline to label query points that are too far from any training points (in projected-embedding-space). But doing this check while preserving privacy guarantees is nontrivial, since it involves processing the private training data itself in addition to .
Given a DP model of cost , we obtain the embeddings of all , and perform differentially private PCA (DP-PCA) (Dwork et al., 2014b) on at a privacy cost of . We select a number of uncertain points from both and . Then in the space of low dimensional DP-PCA projections, we assign each uncertain private example to exactly one uncertain public example according to Euclidean distance. This yields, for each uncertain public example, a count of ‘nearby’ uncertain private examples. Finally, we choose points with the largest values of from the uncertain public data. This sampling procedure is differentially private with cost due to the Laplace mechanism (Dwork et al., 2014a). The total privacy cost of NearPrivate is composed of the budgets expended to perform each of the three operations that depend on the private data: . Algorithm 2 presents more details.
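The counting and selection step of NearPrivate can be sketched as follows. The sketch covers only this one step and assumes the DP-PCA projections of the uncertain public and private points are already available. The names `near_private_select`, `epsilon`, and `sensitivity` are hypothetical. The default sensitivity of 1 assumes neighboring datasets differ by adding or removing one private example (which changes a single count by one); under a replace-one notion of neighboring it would be 2.

```python
import numpy as np


def near_private_select(public_proj, private_proj, n_labels, epsilon, rng,
                        sensitivity=1.0):
    """Pick public points with the most (noisy) nearby private points.

    public_proj:  (num_public, dim) projections of uncertain public points.
    private_proj: (num_private, dim) projections of uncertain private points.
    """
    # Assign each private point to exactly one public point: its nearest
    # neighbor under Euclidean distance.
    d2 = ((private_proj[:, None, :] - public_proj[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)

    # Histogram of how many private points were assigned to each public point.
    counts = np.bincount(nearest, minlength=len(public_proj)).astype(float)

    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noisy = counts + rng.laplace(0.0, sensitivity / epsilon, size=counts.shape)

    # Label the public points with the largest noisy counts.
    return np.argsort(-noisy)[:n_labels], noisy
```

Because each private example contributes to exactly one bin, the histogram has low sensitivity, which keeps the amount of Laplace noise small relative to the counts.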
We conduct experiments on the MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011) datasets. These may be seen as ‘toy’ datasets in the object recognition literature, but they are still challenging for DP object recognizers. In fact, at the time of this writing, there are no published examples of a differentially private SVHN classifier with both reasonable accuracy and non-trivial privacy guarantees. The baseline we establish in Section 4.2 is thus a substantial contribution by itself. For both datasets, we use the same model architecture as in the TensorFlow Privacy tutorials (found at https://github.com/tensorflow/privacy). We obtain by training on via DP-SGD (Abadi et al., 2016). Unless otherwise specified, we always aggregate results over 5 runs with different random seeds and use error bars to represent the standard deviation. We use the implementation of DP-SGD made available through the TensorFlow Privacy library (McMahan et al., 2018) with .
4.1 Experiments on MNIST
We use the Q-MNIST dataset as our source of public data. In particular, we use the 50,000 examples from the Q-MNIST dataset that are reconstructions of the lost MNIST testing digits. We perform a hyperparameter optimization and find a baseline model with higher accuracy (at , and at ) than what is reported as the current state-of-the-art in the TensorFlow Privacy tutorial’s README file ( at ).
Figure 1(a) shows results of our DiversePublic method compared with baselines. Starting from a checkpoint of test accuracy 97.5% (), our method can reach 98.8% accuracy with 7,000 extra labels. The DiversePublic method yields higher test accuracy than other active learning baselines in the low-query regime, and performs comparably in the high-query regime.
In Figure 1(b), we compare NearPrivate against DiversePublic. Since NearPrivate adds extra privacy cost, we have to take special care when comparing it to DiversePublic. Therefore, we fine-tune from a starting checkpoint at test accuracy 97.0% with lower privacy cost () and make sure its total privacy cost () in the end is the same as the cost for the DiversePublic model. For this reason, NearPrivate takes some number of labeled data points to ‘catch up’ to DiversePublic for the same DP cost, in this case 2,000. When is large enough, NearPrivate outperforms all other methods. This shows that accessing the original training data in a privacy-aware way can substantially improve performance.
4.2 Experiments on SVHN
We conduct another set of experiments on the SVHN (Netzer et al., 2011) data. We use the set of ‘531,131 additional, somewhat less difficult samples’ as our source of public data. Since a baseline model trained with DP-SGD on the SVHN training set performs quite poorly, we have opted to first pre-train the model on rotated images of predicting only rotations as in Gidaris et al. (2018).
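The rotation-prediction pretext task of Gidaris et al. (2018) trains a network to recognize which of four rotations (0, 90, 180, or 270 degrees) was applied to an image. A minimal sketch of how such a training batch could be built is shown below. It assumes square images in `(n, height, width, channels)` layout, and `make_rotation_batch` is a hypothetical helper name; which images are rotated is not specified here, so the function accepts any image array.

```python
import numpy as np


def make_rotation_batch(images):
    """Build a self-supervised rotation-prediction batch.

    images: array of shape (n, h, w, c) with h == w.
    Returns (rotated_images, labels), where label k means a rotation by
    k * 90 degrees and rotated_images has shape (4 * n, h, w, c).
    """
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    labels = np.repeat(np.arange(4), len(images))
    return np.concatenate(rotated, axis=0), labels
```

The labels are generated from the rotation index alone, so this pre-training stage needs no class labels.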
Broadly speaking, the results presented in Figure 2 are similar to MNIST results, but three differences stand out. First, the improvement given by active learning over the baseline private model is larger. Second, the improvement given by DiversePublic over the basic active learning techniques is also larger. Third, NearPrivate actually underperforms DiversePublic in this case. We hypothesize that the first and second results are due to there being more ‘headroom’ in SVHN accuracy than for MNIST, and that the third result stems from the reported lower difficulty of the extra SVHN data. In the next section, we examine this phenomenon further.
4.3 Experiments with Dataset Pollution
We were intrigued by the under-performance of NearPrivate relative to DiversePublic on SVHN. We wondered whether it was due to the fact that SVHN and its extra data violate the assumption built into NearPrivate – namely that we need to query the private data to ‘throw out’ unhelpful public data. Indeed, the SVHN website describes the extra set as ‘somewhat less difficult’ than the training data. To test this hypothesis, we designed a new experiment to check if NearPrivate can actually select more helpful samples given a mixture of relevant data and irrelevant data as ‘pollution’. We train the DP baseline with 30,000 of the SVHN training images, and treat a combination of another 40,000 SVHN training images and 10,000 CIFAR-10 (Krizhevsky et al., 2009) training images as the extra public dataset. These CIFAR-10 examples act as the unhelpful public data that we would hope NearPrivate could learn to discard. As shown in Figure 3, all baselines perform worse than before with polluted public data. DiversePublic does somewhat better than random selection, but not much, achieving a peak performance improvement of around 1%. On the other hand, the difference between NearPrivate and DiversePublic is more substantial, at over 2% accuracy in some cases. This is especially interesting considering that DiversePublic actually performed better in the results of Section 4.2. Broadly speaking, the results support our claim that NearPrivate helps more relative to DiversePublic when there is ‘unhelpful’ data in the public dataset. This is good to know, since having some unhelpful public data and some helpful public data seems like a more realistic problem setting than the one in which all public data is useful.
5 Ablation Analyses
In order to better understand how the performance of DiversePublic and NearPrivate is affected by various hyper-parameters, we conduct several ablation studies.
5.1 How do Clustering Hyper-Parameters Affect Accuracy?
For DiversePublic, there are two parameters that affect the number of extra data points labeled for fine-tuning: the number of clusters we form () and the number of points we label per cluster (). We write . To study the relative effects of these, we conduct the experiment depicted in Figure 4(a). In this experiment, we fine-tune the same DP MNIST model (test accuracy 97% at ) with varying values of and . We vary from 100 to 500, which is depicted on the x-axis. We vary from 5 to 20, with each value depicted as a different line. The general trend is one of diminishing returns on extra labeled data, as would be predicted by Hestness et al. (2017). We do not notice a strong correspondence between final test accuracy at a fixed number of extra labels and the values of and . This is encouraging, as it suggests that practitioners can use our techniques without worrying too much about these hyper-parameters.
To address the question of which extra data points are being chosen for labeling, we create Figure 4(b) showing the most central example from each cluster, computed using . The chosen examples are quite diverse, with a similar number of representatives from each class and variations in the thickness and shear of the digits. We can also inspect examples labeled incorrectly by the original checkpoint, such as the digit in Row 8, Col 1, which is a ‘7’ that looks a lot like a ‘2’.
5.2 How Does the Starting Checkpoint Affect Results?
Recall that NearPrivate accrues extra privacy cost by accessing the histogram of neighbor counts. This means that achieving a given accuracy under a constraint on the total privacy cost requires choosing how to allocate privacy between NearPrivate and the initial DP-SGD procedure. Making this choice correctly requires a sense of how much benefit can be achieved from applying NearPrivate to different starting checkpoints. Toward that end, we conduct (Figure 5) an ablation experiment on MNIST where we run NearPrivate on many different checkpoints from the same training run. Figure 5(a) shows the test accuracies resulting from fine-tuning checkpoints at different epochs (represented on the x-axis) with fixed extra privacy cost of . Figure 5(b) shows the corresponding for each checkpoint. Figure 5(c) varies other parameters given a fixed total privacy budget of .
In Figure 5(a), the black line with triangle markers shows the initial test accuracies of the checkpoints. The other lines show results with different values of from 1,000 to 10,000 respectively. With , improvements are marginal for later checkpoints. In fact, the improvement from using at checkpoint 80 is not enough to compensate for the additional privacy cost spent by NearPrivate, because the same increase in accuracy could be obtained by training the original model for 20 more epochs, which costs less. On the other hand, the improvements from using or higher are significant and cannot be mimicked by training for longer.
Given a total privacy budget , how should we decide among , , and ? Empirically, we observe that can be set to a small value (e.g., 0.3) without substantially affecting the results. Allocating between and is addressed in Figure 5(c), which varies those parameters with fixed to . When the budget for gathering is low, say is around 1000, it is preferable to pick a later DP checkpoint, consuming a higher and lower . On the other hand, when allowed to label more instances from the public data, we should use an earlier DP checkpoint (with a lower ) and choose better public samples with respect to the private data.
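The allocation question can be made concrete with a small bookkeeping sketch. It assumes the three private-data-dependent costs simply add up (basic sequential composition), as the description of NearPrivate's total cost suggests. The helper names, the argument names `eps_pca` and `eps_selection`, and every number in the example are hypothetical and chosen only for illustration.

```python
def dp_sgd_budget(total_eps, eps_pca, eps_selection):
    """Budget left for DP-SGD after reserving the two NearPrivate steps."""
    remaining = total_eps - eps_pca - eps_selection
    if remaining <= 0:
        raise ValueError("NearPrivate steps exhaust the total privacy budget")
    return remaining


def latest_affordable_checkpoint(checkpoint_eps, budget):
    """Return the latest epoch whose cumulative DP-SGD cost fits the budget.

    checkpoint_eps: list of (epoch, cumulative_eps) pairs sorted by epoch.
    Returns None if no checkpoint is affordable.
    """
    affordable = [epoch for epoch, eps in checkpoint_eps if eps <= budget]
    return affordable[-1] if affordable else None


# Illustrative numbers only, not results from the experiments.
budget = dp_sgd_budget(total_eps=3.0, eps_pca=0.3, eps_selection=1.0)
checkpoints = [(20, 1.0), (40, 1.5), (60, 1.9), (80, 2.3)]
start_epoch = latest_affordable_checkpoint(checkpoints, budget)
```

Spending more on the selection step pushes the starting point to an earlier checkpoint, which is the trade-off explored in the ablation above.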
In addition to creating new baselines for DP image classifiers by fine-tuning on public data, we introduce two algorithms – DiversePublic and NearPrivate – to perform fine-tuning in a privacy-aware way. We conduct experiments showing that these algorithms bring DP object recognition closer to practicality, improving on the aforementioned benchmarks. We hope that this work will encourage further research into techniques for making differential privacy more useful in practice, and we hope that the techniques we propose here will be helpful to existing practitioners.
We would like to thank Kunal Talwar, Abhradeep Guha Thakurta, Nicholas Carlini, Shuang Song, Ulfar Erlingsson, and Mukund Sundararajan for helpful discussions.
- Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §1, §2.2, §4.
- Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (Mar), pp. 1069–1109. Cited by: §2.2.
- Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Cited by: §1, §2.1.
- The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9 (3–4), pp. 211–407. Cited by: §2.1, §3.3.
- Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 11–20. Cited by: §3.3.
- Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, Cited by: §4.2.
- Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409. Cited by: §5.1.
- Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.3.
- Deep learning. Nature 521 (7553), pp. 436. Cited by: §1.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §4.1, §4.
- A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210. Cited by: §4.
- Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning. Cited by: §4.2, §4.
- Active learning using pre-clustering. In Proceedings of the twenty-first international conference on Machine learning, pp. 79. Cited by: §3.2.
- Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235–3246. Cited by: §1, §3.3.
- SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pp. 6076–6085. Cited by: §3.2.
- Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §3.1.
- Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 245–248. Cited by: §2.2.
- Cold case: the lost MNIST digits. Technical report arXiv:1905.10498. Cited by: §4.1.
- Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: footnote 1.