Improving Differentially Private Models with Active Learning

10/02/2019 · Zhengli Zhao et al.

Broad adoption of machine learning techniques has increased privacy concerns for models trained on sensitive data such as medical records. Existing techniques for training differentially private (DP) models give rigorous privacy guarantees, but applying these techniques to neural networks can severely degrade model performance. This performance reduction is an obstacle to deploying private models in the real world. In this work, we improve the performance of DP models by fine-tuning them through active learning on public data. We introduce two new techniques - DIVERSEPUBLIC and NEARPRIVATE - for doing this fine-tuning in a privacy-aware way. For the MNIST and SVHN datasets, these techniques improve state-of-the-art accuracy for DP models while retaining privacy guarantees.


1 Introduction

Privacy concerns have surfaced with the increased adoption of machine learning in domains like healthcare (LeCun et al., 2015). One widely adopted framework for measuring the privacy characteristics of randomized algorithms, such as machine learning techniques, is differential privacy (Dwork et al., 2006). Abadi et al. (2016) introduced an algorithm for differentially private stochastic gradient descent (DP-SGD), which made it feasible to scale differential privacy guarantees to neural networks. DP-SGD is now the de facto algorithm used for training neural networks with privacy guarantees.

There is, however, a crucial problem for models with a large number of parameters: it is difficult to achieve both non-trivial privacy guarantees and good accuracy. The reasons for this are numerous and involved, but the basic intuition is that DP-SGD involves clipping gradients and then adding noise to those gradients, and the relative effect of this clipping and noising grows as the number of model parameters grows. This reduces the applicability of differential privacy to deep learning in practice, where strong performance tends to require a large number of model parameters.¹

¹ For example, the best performing CIFAR-10 classifier from Zagoruyko and Komodakis (2016) has 32.5 million parameters, while the private baseline we describe in Section 4.1 has about 26 thousand.

In this work, we propose a method to mitigate this problem, making DP learning feasible for modern image classifiers. The method is based on the observation that each DP-SGD update consumes privacy budget in proportion to the quotient of the batch size and the training-set size. Thus, by increasing the number of effective training examples, we can improve accuracy while maintaining the same privacy guarantees. Since obtaining more private training samples will generally be nontrivial, we focus on using ‘public’ data to augment the training set. This involves assuming that there will be unlabeled² public data that is sufficiently related to our private data to be useful, but we think this is a reasonable assumption to make, for two reasons: First, it is an assumption made by most of the semi-supervised learning literature (Oliver et al., 2018). Second, one of our techniques will explicitly address the situation where the ‘sufficiently related’ clause partially breaks down.

² If there is labeled public data, the situation is even better, but this is likely to be rare and is not the setting we consider here.

In summary, our contributions are:

  • We introduce an algorithm (DiversePublic) to select diverse representative samples from a public dataset to fine-tune a DP classifier.

  • We then describe another algorithm (NearPrivate) that pays a privacy cost to reference the private training data when querying the public dataset for representative samples.

  • We establish new state-of-the-art results on the MNIST and SVHN datasets by fine-tuning DP models using simple active learning techniques, and then improve upon those results further using DiversePublic and NearPrivate.

  • We open source all of our experimental code.³

³ We will release the repository in the next update.

2 Background

2.1 Differential Privacy

We reason about privacy within the framework of differential privacy (Dwork et al., 2006). In this paper, the randomized algorithm we analyze is the training algorithm, and guarantees are measured with respect to the training data.

Informally, an algorithm is said to be differentially private if its behavior is indistinguishable on pairs of datasets that only differ by one point. That is, an observer cannot tell whether a particular point was included in the model’s training set simply by observing the output of the training algorithm. Formally, for a training algorithm A to be (ε, δ)-differentially private, we require that it satisfies the following property for all pairs of datasets d and d′ differing in exactly one data point and all possible subsets S of outputs:

Pr[A(d) ∈ S] ≤ e^ε · Pr[A(d′) ∈ S] + δ,

where δ is a (small) probability for which we tolerate the property not to be satisfied. The parameter ε measures the strength of the privacy guarantee: the smaller the value of ε is, the stronger the privacy guarantee is.

This guarantee is such that the output of a differentially private algorithm can be post-processed with no impact on the strength of the guarantee provided. If privacy needs to be defined at a granularity other than individual training points, the guarantee degrades by a factor which, naively, is the number of points included in the granularity considered. See Dwork et al. (2014a) for further information.

2.2 Differentially Private Stochastic Gradient Descent (DP-SGD)

Building on earlier work (Chaudhuri et al., 2011; Song et al., 2013), Abadi et al. (2016) introduce a variant of stochastic gradient descent to train deep neural networks with differential privacy guarantees. Two modifications are made to vanilla SGD. First, gradients are computed on a per-example basis and clipped to have a maximum known ℓ2 norm. This bounds the sensitivity of the training procedure with respect to each individual training point. Second, noise calibrated to have a standard deviation proportional to this sensitivity is added to the average gradient. This results in a training algorithm known as differentially private SGD (DP-SGD). Unfortunately (as discussed in Section 1), DP-SGD does not perform well for models with large parameter counts, which motivates the improvements proposed in the next section.
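To make the two modifications concrete, here is a minimal numpy sketch of a single DP-SGD update, assuming per-example gradients are already available as the rows of a matrix; the clipping norm, noise multiplier, and learning rate are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def dp_sgd_update(per_example_grads, params, l2_norm_clip=1.0,
                  noise_multiplier=1.1, learning_rate=0.1):
    """One DP-SGD step: clip each example's gradient, average, add noise.

    per_example_grads: array of shape (batch_size, num_params), one row per example.
    params: flat parameter vector of shape (num_params,).
    """
    # 1. Clip each per-example gradient to a maximum L2 norm (bounds sensitivity).
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, l2_norm_clip / (norms + 1e-12))
    clipped = per_example_grads * scale

    # 2. Sum the clipped gradients, add Gaussian noise calibrated to the clipping
    #    norm, and average over the batch.
    noise = np.random.normal(0.0, noise_multiplier * l2_norm_clip, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)

    # 3. Standard gradient step on the noisy, clipped gradient.
    return params - learning_rate * noisy_mean
```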

3 Improving Differentially Private Models with Active Learning

This paper introduces the following high-level process to improve the performance of an existing DP classifier: First, find a public insensitive dataset containing relevant samples. Second, carefully select a subset of public samples that can improve the performance of the classifier. Third, fine-tune the existing classifier on the selected samples with labels. We want to perform the selection and the fine-tuning in a way which does not compromise the privacy guarantees of the existing classifier.

The first step can be done using domain knowledge about the classifier, e.g., utilizing relevant public genomic datasets for a DP classifier of genomics data. We assume standard fine-tuning techniques for the last step. Therefore, the problem boils down to efficiently selecting samples from the public dataset while preserving privacy. We also assume a limit on the number of selected samples. This limit is relevant when the samples are unlabeled, in which case it controls the cost of labeling (e.g., hiring human annotators to process the selected samples).

We introduce two active learning algorithms, DiversePublic and NearPrivate, for sample selection. These algorithms make different assumptions about access to the private training data but are otherwise drop-in replacements in the end-to-end process.

3.1 Problem Statement and Baseline Methods

We are given a differentially private model M_0 trained and tested on private sensitive data D_train and D_test respectively. The privacy cost of training M_0 on D_train is ε_0 (we omit δ for brevity in this paper, but it composes similarly). We also have an extra set of public insensitive unlabeled data D_extra which can be utilized to further improve M_0. Given N, the number of extra data points that we are allowed to request labels for, and ε*, the total privacy budget of the improved model, we want to efficiently pick D_pick ⊂ D_extra with |D_pick| = N, using which we can fine-tune M_0 into a model M_1 with better performance on D_test. The simplest baseline method we will use is to choose ‘random’ samples out of D_extra for fine-tuning. The other baseline that we use is uncertainty sampling (Settles, 2009), according to either the ‘entropy’ of the logits or the ‘difference’ between the two largest logits. These are widely used active learning methods and serve as strong baselines.
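For reference, a minimal sketch of the two uncertainty scores used by this baseline, assuming `logits` is a (num_examples, num_classes) array produced by the DP model on the public data; higher scores indicate more uncertainty.

```python
import numpy as np

def entropy_scores(logits):
    # Softmax probabilities, then Shannon entropy of the predictive distribution.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def margin_scores(logits):
    # Negative difference between the two largest logits:
    # a small margin (large score) means the model is uncertain.
    top2 = np.sort(logits, axis=1)[:, -2:]
    return -(top2[:, 1] - top2[:, 0])

def select_uncertain(logits, n, score_fn=entropy_scores):
    # Indices of the n most uncertain examples under either criterion.
    return np.argsort(-score_fn(logits))[:n]
```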

3.2 The DiversePublic Method

The baseline methods may select many redundant examples to label. To efficiently select diverse representative samples, we propose the DiversePublic method, adapted from clustering-based active learning methods (Nguyen and Smeulders, 2004). Given a DP model M_0 of cost ε_0, we obtain the ‘embeddings’ (the activations before the logits) of all of D_extra, and perform PCA on them as is done in Raghu et al. (2017). Then we select a number (more than N) of uncertain points (according to, e.g., the logit entropy of the private model) from D_extra, project their embeddings onto the top few principal components, and cluster those projections into groups. Finally, we pick a number of samples from each representative group, up to N in total, and fine-tune the DP model with these. Though this procedure accesses M_0, it adds nothing to ε_0, since no private data is referenced and the output of DP models can be post-processed at no additional privacy cost. It can be applied even to models for which we cannot access the original training data.⁴ Algorithm 1 presents more details of the DiversePublic method.

⁴ E.g., models published to a repository like TensorFlow Hub: https://www.tensorflow.org/hub

Input: DP model M_0 with privacy cost ε_0; public data D_extra; label budget N; number of clusters N_c; points per cluster N_p
Output: M_1, fine-tuned on selected public data, with the same privacy cost ε_0
  Compute ‘embeddings’ of the public data D_extra
  Perform PCA on the embeddings and get the first few PCs
  Project the most ‘uncertain’ public data onto the PCs
  Cluster the projected embeddings to get N_c clusters
  for each cluster do: Label N_p data points from that cluster, up to N in total
  end for
  Fine-tune M_0 on the labeled points to obtain M_1
Algorithm 1: The DiversePublic Algorithm
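A minimal sketch of Algorithm 1, assuming scikit-learn and that the public embeddings and uncertainty scores have already been computed from the DP model; the function name and default hyperparameters are illustrative, not the authors' released implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def diverse_public(embeddings, uncertainty_scores, n_label,
                   n_clusters=200, n_per_cluster=10,
                   n_uncertain=10_000, n_components=32):
    """Pick diverse, uncertain public points to send for labeling.

    embeddings: (num_public, d) penultimate-layer activations of the DP model.
    uncertainty_scores: (num_public,) e.g. entropy of the DP model's logits.
    """
    # PCA on the public embeddings; only post-processes the DP model's outputs,
    # so it consumes no additional privacy budget.
    pca = PCA(n_components=n_components).fit(embeddings)

    # Keep the most uncertain public points and project them onto the top PCs.
    uncertain_idx = np.argsort(-uncertainty_scores)[:n_uncertain]
    proj = pca.transform(embeddings[uncertain_idx])

    # Cluster the projections and take a few representatives per cluster.
    labels = KMeans(n_clusters=n_clusters).fit_predict(proj)
    picked = []
    for c in range(n_clusters):
        members = uncertain_idx[labels == c]
        picked.extend(members[:n_per_cluster].tolist())
        if len(picked) >= n_label:
            break
    return np.asarray(picked[:n_label])  # indices into the public dataset
```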

3.3 The NearPrivate Method

The DiversePublic method works well under the assumption that D_extra has the same distribution as D_train. However, this may not be a reasonable assumption in general (Oliver et al., 2018). For instance, there may be a subset of D_extra about which our pre-trained model has high uncertainty but which cannot improve performance if sampled. This may be because that subset contains corrupted data, or it may be due simply to distribution shift. In order to mitigate this issue, we propose to check query points against our private data and decline to label query points that are too far from any training points (in projected-embedding space). But doing this check while preserving privacy guarantees is nontrivial, since it involves processing the private training data itself in addition to D_extra. Given a DP model M_0 of cost ε_0, we obtain the embeddings of all of D_train and perform differentially private PCA (DP-PCA) (Dwork et al., 2014b) on them at a privacy cost of ε_PCA. We select a number of uncertain points from both D_train and D_extra. Then, in the space of low-dimensional DP-PCA projections, we assign each uncertain private example to exactly one uncertain public example according to Euclidean distance. This yields, for each uncertain public example, a count of ‘nearby’ uncertain private examples. To these counts we add Laplacian noise, and finally we choose the N points with the largest noisy counts from the uncertain public data. This sampling procedure is differentially private with cost ε_Laplace due to the Laplace mechanism (Dwork et al., 2014a). The total privacy cost of NearPrivate is composed of the budgets expended to perform each of the three operations that depend on the private data: ε_0 + ε_PCA + ε_Laplace. More details are given in Algorithm 2.

Input: DP model M_0 with privacy cost ε_0; private data D_train; public data D_extra; label budget N; privacy budgets ε_PCA and ε_Laplace
Output: M_1, fine-tuned on selected public data, with privacy cost ε_0 + ε_PCA + ε_Laplace
  Compute ‘embeddings’ of the private data D_train
  Perform DP-PCA on the embeddings (cost ε_PCA) and get the first few PCs
  Project the most ‘uncertain’ private data onto the PCs
  Project the most ‘uncertain’ public data onto the PCs
  for each uncertain private point do: In PC-space, assign it to exactly one uncertain public point (its nearest neighbor)
  end for
  for each uncertain public point do: Compute its support count, with Laplacian noise (cost ε_Laplace)
  end for
  Label the N public data points with the highest noisy support and fine-tune M_0 on them to obtain M_1
Algorithm 2: The NearPrivate Algorithm
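A minimal sketch of the selection step of Algorithm 2, assuming the projections of the most uncertain private and public points onto the DP-PCA components have already been computed (the DP-PCA step itself is not shown); the function name and the ε_Laplace default are illustrative.

```python
import numpy as np

def near_private_select(priv_proj, pub_proj, n_label, epsilon_laplace=1.0):
    """Pick public points supported by many nearby uncertain private points.

    priv_proj: (P, k) DP-PCA projections of uncertain private points.
    pub_proj:  (Q, k) DP-PCA projections of uncertain public points.
    """
    # Assign each uncertain private point to its single nearest uncertain
    # public point in the projected space.
    d = np.linalg.norm(priv_proj[:, None, :] - pub_proj[None, :, :], axis=-1)
    nearest_pub = d.argmin(axis=1)

    # Histogram of private 'support' per public point; each private point
    # falls into exactly one bin, so the sensitivity is 1 under add/remove
    # adjacency.
    counts = np.bincount(nearest_pub, minlength=len(pub_proj)).astype(float)

    # Laplace mechanism: noise scale = sensitivity / epsilon.
    counts += np.random.laplace(0.0, 1.0 / epsilon_laplace, size=counts.shape)

    # Label the public points with the largest noisy counts.
    return np.argsort(-counts)[:n_label]
```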

4 Experiments

We conduct experiments on the MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011) datasets. These may be seen as ‘toy’ datasets in the object recognition literature, but they are still challenging for DP object recognizers. In fact, at the time of this writing, there are no published examples of a differentially private SVHN classifier with both reasonable accuracy and non-trivial privacy guarantees. The baseline we establish in Section 4.2 is thus a substantial contribution by itself. For both datasets, we use the same model architecture as in the TensorFlow Privacy tutorials.⁵ We obtain M_0 by training on D_train via DP-SGD (Abadi et al., 2016). Unless otherwise specified, we always aggregate results over 5 runs with different random seeds and use error bars to represent the standard deviation. We use the implementation of DP-SGD made available through the TensorFlow Privacy library (McMahan et al., 2018) with a fixed value of δ.

⁵ Tutorials for TensorFlow Privacy are found at: https://github.com/tensorflow/privacy
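For context, a sketch of how the privacy cost of such a DP-SGD run can be estimated with the TensorFlow Privacy accountant; the hyperparameter values here are illustrative placeholders, not the paper's settings, and the exact import path may differ across library versions.

```python
# Illustrative privacy accounting with TensorFlow Privacy; hyperparameters are
# placeholders, not the paper's actual settings.
from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import (
    compute_dp_sgd_privacy,
)

n_train = 60_000          # number of private training examples (MNIST-sized)
batch_size = 256
noise_multiplier = 1.1    # std of the Gaussian noise / clipping norm
epochs = 60
delta = 1e-5              # a common choice of delta for datasets of this size

eps, opt_order = compute_dp_sgd_privacy(
    n=n_train,
    batch_size=batch_size,
    noise_multiplier=noise_multiplier,
    epochs=epochs,
    delta=delta,
)
print(f"DP-SGD for {epochs} epochs gives epsilon = {eps:.2f} at delta = {delta}")
```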

4.1 Experiments on MNIST

We conduct our first set of experiments on the MNIST (LeCun et al., 1998) dataset. We use the Q-MNIST (Yadav and Bottou, 2019) dataset as our source of public data. In particular, we use the 50,000 examples from the Q-MNIST dataset that are reconstructions of the lost MNIST testing digits. We perform a hyperparameter optimization and find a baseline model with higher accuracy than what is reported as the current state-of-the-art in the TensorFlow Privacy tutorial’s README file.

Figure 1(a) shows results of our DiversePublic method compared with baselines. Starting from a checkpoint with test accuracy 97.5%, our method can reach 98.8% accuracy with 7,000 extra labels. The DiversePublic method yields higher test accuracy than the other active learning baselines in the low-query regime, and performs comparably in the high-query regime.

In Figure 1(b), we compare NearPrivate against DiversePublic. Since NearPrivate adds extra privacy cost, we have to take special care when comparing it to DiversePublic. Therefore, we fine-tune from a starting checkpoint at test accuracy 97.0% with a lower privacy cost ε_0 and make sure that its total privacy cost ε_0 + ε_PCA + ε_Laplace in the end is the same as the cost for the DiversePublic model. For this reason, NearPrivate takes some number of labeled data points to ‘catch up’ to DiversePublic for the same DP cost — in this case 2,000. When N is large enough, NearPrivate outperforms all other methods. This shows that accessing the original training data in a privacy-aware way can substantially improve performance.

(a) DiversePublic vs baselines
(b) NearPrivate vs DiversePublic
Figure 1: MNIST experiments. All plots show test-set accuracy vs. the number of extra points labeled. Left: Comparison of our DiversePublic technique with the active learning baselines described above. All models are fine-tuned starting from the same baseline (‘checkpoint 1’, test accuracy 97.5%). Active learning improves the performance of the DP model to as much as 98.8% in the best case with no increase in privacy cost. Right: Comparison of our NearPrivate technique with the DiversePublic technique. Since we spend ε_PCA + ε_Laplace on the NearPrivate technique, we fine-tune the NearPrivate model from ‘checkpoint 0’, which has a lower privacy cost ε_0. Thus, both lines have the same total privacy cost, regardless of the number of extra points used.

4.2 Experiments on SVHN

We conduct another set of experiments on the SVHN (Netzer et al., 2011) dataset. We use the set of ‘531,131 additional, somewhat less difficult samples’ as our source of public data. Since a baseline model trained with DP-SGD on the SVHN training set performs quite poorly, we have opted to first pre-train the model on rotated images, predicting only the rotations, as in Gidaris et al. (2018).

Broadly speaking, the results presented in Figure 2 are similar to the MNIST results, but three differences stand out. First, the improvement given by active learning over the baseline private model is larger. Second, the improvement given by DiversePublic over the basic active learning techniques is also larger. Third, NearPrivate actually underperforms DiversePublic in this case. We hypothesize that the first and second results are due to there being more ‘headroom’ in SVHN accuracy than for MNIST, and that the third result stems from the reported lower difficulty of the extra SVHN data. In the next section, we examine this phenomenon further.

(a) DiversePublic vs baselines
(b) NearPrivate vs DiversePublic
Figure 2: SVHN experiments. Left: Comparison of our DiversePublic technique with the active learning baselines. All models are fine-tuned starting from the same baseline (‘checkpoint 1’, test accuracy 75.0%). Active learning improves the performance of the DP model to as much as 85% in the best case with no increase in privacy cost. Recall also that the 75.0% number is itself a baseline established in this paper. Right: Comparison of NearPrivate with DiversePublic. The setup here is analogous to the one in the MNIST experiment, but DiversePublic performs better in this case. Since we spend ε_PCA + ε_Laplace on NearPrivate, we fine-tune it from ‘checkpoint 0’ (test accuracy 73.5%), which has a lower privacy cost.

4.3 Experiments with Dataset Pollution

We were intrigued by the under-performance of NearPrivate relative to DiversePublic on SVHN. We wondered whether this was because SVHN and its extra data violate the assumption built into NearPrivate, namely that some public data is unhelpful and we need to query the private data to ‘throw it out’. Indeed, the SVHN website describes the extra set as ‘somewhat less difficult’ than the training data. To test this hypothesis, we designed a new experiment to check whether NearPrivate can actually select more helpful samples given a mixture of relevant data and irrelevant data as ‘pollution’. We train the DP baseline with 30,000 of the SVHN training images, and treat a combination of another 40,000 SVHN training images and 10,000 CIFAR-10 (Krizhevsky et al., 2009) training images as the extra public dataset. These CIFAR-10 examples act as the unhelpful public data that we would hope NearPrivate could learn to discard. As shown in Figure 3, all baselines perform worse than before with polluted public data. DiversePublic does somewhat better than random selection, but not much, achieving a peak performance improvement of around 1%. On the other hand, the difference between NearPrivate and DiversePublic is more substantial, at over 2% accuracy in some cases. This is especially interesting considering that DiversePublic actually performed better in the results of Section 4.2. Broadly speaking, the results support our claim that NearPrivate helps more relative to DiversePublic when there is ‘unhelpful’ data in the public dataset. This is good to know, since having some unhelpful public data and some helpful public data seems like a more realistic problem setting than the one in which all public data is useful.

(a) DiversePublic vs baselines
(b) NearPrivate vs DiversePublic
Figure 3: SVHN experiments with dataset pollution. In this experiment, we train the DP baseline with 30,000 of the SVHN training images. The extra public dataset is a combination of 40,000 SVHN training images and 10,000 CIFAR-10 training images. Left: DiversePublic compared against other active learning baseline techniques, as in Figure 2(a). In this case, the active learning techniques do not outperform random selection by very much. Uncertainty by itself is not a sufficient predictor of whether extra data will be helpful here, since the baseline model is also uncertain about the CIFAR-10 images. Right: NearPrivate compared against DiversePublic, but started from different checkpoints (‘checkpoint 0’ and ‘checkpoint 1’, with different initial privacy costs) to keep the total privacy cost constant, as in Figure 2(b). In this case, NearPrivate substantially outperforms DiversePublic by selecting fewer of the CIFAR-10 images.

5 Ablation Analyses

In order to better understand how the performance of DiversePublic and NearPrivate is affected by various hyper-parameters, we conduct several ablation studies.

5.1 How do Clustering Hyper-Parameters Affect Accuracy?

For DiversePublic, there are two parameters that affect the number of extra data points labeled for fine-tuning: the number of clusters we form (N_c) and the number of points we label per cluster (N_p). We write N = N_c · N_p. To study the relative effects of these, we conduct the experiment depicted in Figure 4(a). In this experiment, we fine-tune the same DP MNIST model (test accuracy 97%) with varying values of N_c and N_p. We vary N_c from 100 to 500, which is depicted on the x-axis. We vary N_p from 5 to 20, with each value depicted as a different line. The general trend is one of diminishing returns on extra labeled data, as would be predicted by Hestness et al. (2017). We do not notice a strong correspondence between final test accuracy at a fixed number of extra labels and the particular values of N_c and N_p. This is encouraging, as it suggests that practitioners can use our techniques without worrying too much about these hyper-parameters.
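A hypothetical sweep over these two knobs, reusing the diverse_public sketch from Section 3.2; `emb` and `scores` stand for precomputed public embeddings and uncertainty scores and are assumptions of this illustration.

```python
# Illustrative sweep over N_c (clusters) and N_p (points per cluster), reusing
# the diverse_public sketch above; `emb` and `scores` are the public embeddings
# and uncertainty scores assumed to be computed already.
results = {}
for n_clusters in range(100, 501, 100):       # N_c, the horizontal axis in Figure 4(a)
    for n_per_cluster in (5, 10, 20):         # N_p, one line per value
        n_label = n_clusters * n_per_cluster  # N = N_c * N_p extra labels
        picked = diverse_public(emb, scores, n_label=n_label,
                                n_clusters=n_clusters,
                                n_per_cluster=n_per_cluster)
        results[(n_clusters, n_per_cluster)] = picked
# Each `picked` index set would then be labeled and used to fine-tune the DP model.
```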

(a) DiversePublic with different N_c and N_p.
(b) Visualization of chosen public data points.
Figure 4: DiversePublic analysis. Left: We apply DiversePublic to the same DP checkpoint (dashed horizontal line), varying the number of clusters (horizontal axis) and the number of chosen points from each cluster (lines with error bars). Right: We visualize the most central example from each cluster. Since NearPrivate does not have explicit clustering, we use DiversePublic for this visualization. Green borders mean that the initial checkpoint (dashed horizontal line) predicted correctly; red-bordered examples (originally predicted incorrectly) have dots on the left showing predictions and dots on the right showing true labels.

To address the question of which extra data points are being chosen for labeling, we create Figure 4(b), showing the most central example from each cluster. The chosen examples are quite diverse, with a similar number of representatives from each class and variations in the thickness and shear of the digits. We can also inspect examples labeled incorrectly by the original checkpoint, such as the digit in Row 8, Col 1, which is a 7 that looks a lot like a ‘2’.

5.2 How Does the Starting Checkpoint Affect Results?

Recall that NearPrivate accrues extra privacy cost by accessing the histogram of neighbor counts. This means that achieving a given accuracy under a constraint on the total privacy cost requires choosing how to allocate privacy between NearPrivate and the initial DP-SGD procedure. Making this choice correctly requires a sense of how much benefit can be achieved from applying NearPrivate to different starting checkpoints. Toward that end, we conduct an ablation experiment on MNIST (Figure 5) where we run NearPrivate on many different checkpoints from the same training run. Figure 5(a) shows the test accuracies resulting from fine-tuning checkpoints at different epochs (represented on the x-axis) with a fixed extra privacy cost. Figure 5(b) shows the corresponding ε_0 for each checkpoint. Figure 5(c) varies other parameters given a fixed total privacy budget of 5.0.

In Figure 5(a), the black line with triangle markers shows the initial test accuracies of the checkpoints. The other lines show results with different values of N from 1,000 to 10,000. With the smallest value of N, improvements are marginal for later checkpoints. In fact, the improvement it gives at checkpoint 80 is not enough to compensate for the additional privacy cost spent by NearPrivate, because the same increase in accuracy could have been had by training the original model for 20 more epochs, which costs less than that. On the other hand, the improvements from larger values of N are significant and cannot be mimicked by training for longer.

Given a total privacy budget ε*, how should we decide among ε_0, ε_PCA, and ε_Laplace? Empirically, we observe that ε_PCA can be set to a small value (e.g., 0.3) without substantially affecting the results. Allocating between ε_0 and ε_Laplace is addressed in Figure 5(c), which varies those parameters with ε_PCA fixed to 0.3. When the labeling budget is low, say N is around 1,000, it is preferable to pick a later DP checkpoint, consuming a higher ε_0 and a lower ε_Laplace. On the other hand, when allowed to label more instances from the public data, we should use an earlier DP checkpoint (with a lower ε_0) and choose better public samples with respect to the private data.
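As a toy illustration of this trade-off under simple additive composition, using the total budget of 5.0 and ε_PCA = 0.3 from the figure; the per-checkpoint costs below are made up for the example.

```python
# Toy budget allocation under simple additive composition:
# eps_total = eps_0 (checkpoint) + eps_pca + eps_laplace.
eps_total = 5.0
eps_pca = 0.3                      # small fixed cost for DP-PCA, as in Figure 5(c)

# Hypothetical eps_0 costs of checkpoints saved at different epochs.
checkpoints = {20: 2.0, 40: 2.8, 60: 3.5, 80: 4.2}

for epoch, eps_0 in checkpoints.items():
    eps_laplace = eps_total - eps_0 - eps_pca   # what is left for the noisy counts
    if eps_laplace <= 0:
        print(f"epoch {epoch}: eps_0={eps_0} leaves no budget for NearPrivate")
    else:
        print(f"epoch {epoch}: eps_0={eps_0}, eps_laplace={eps_laplace:.1f}")
```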

(a) Fixed extra privacy cost ε_PCA + ε_Laplace, varying N.
(b) The initial training privacy cost ε_0 for different checkpoints.
(c) Fixed total privacy cost ε* = 5.0 with ε_PCA = 0.3.
Figure 5: NearPrivate analysis. We apply NearPrivate to DP checkpoints (horizontal axis) of different privacy cost ε_0. The black line with triangle markers represents those starting checkpoints. Middle: The initial training privacy cost ε_0 of DP checkpoints at different epochs of a single training run. Left: We fix the extra privacy cost ε_PCA + ε_Laplace, but vary the number of labeled public points N for each line. We achieve large improvements from 1,000 to 4,000 labeled public points. With even larger N, the improvements are not significant and it may be better to start fine-tuning from a DP checkpoint of lower privacy cost. Right: With ε_PCA set to 0.3, we fix the total privacy cost and vary ε_Laplace. In this setting, fine-tuning from the checkpoint at Epoch 60 is the best given a total privacy budget of 5.0.

6 Conclusion

In addition to creating new baselines for DP image classifiers by fine-tuning on public data, we introduce two algorithms – DiversePublic and NearPrivate – to perform fine-tuning in a privacy-aware way. We conduct experiments showing that these algorithms bring DP object recognition closer to practicality, improving on the aforementioned benchmarks. We hope that this work will encourage further research into techniques for making differential privacy more useful in practice, and we hope that the techniques we propose here will be helpful to existing practitioners.

Acknowledgments

We would like to thank Kunal Talwar, Abhradeep Guha Thakurta, Nicholas Carlini, Shuang Song, Ulfar Erlingsson, and Mukund Sundararajan for helpful discussions.

References

  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §1, §2.2, §4.
  • K. Chaudhuri, C. Monteleoni, and A. D. Sarwate (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (Mar), pp. 1069–1109. Cited by: §2.2.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Cited by: §1, §2.1.
  • C. Dwork, A. Roth, et al. (2014a) The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9 (3–4), pp. 211–407. Cited by: §2.1, §3.3.
  • C. Dwork, K. Talwar, A. Thakurta, and L. Zhang (2014b) Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-sixth Annual ACM Symposium on Theory of Computing, pp. 11–20. Cited by: §3.3.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, Cited by: §4.2.
  • J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. Patwary, M. Ali, Y. Yang, and Y. Zhou (2017) Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409. Cited by: §5.1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.3.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §4.1, §4.
  • H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz (2018) A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210. Cited by: §4.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning. Cited by: §4.2, §4.
  • H. T. Nguyen and A. Smeulders (2004) Active learning using pre-clustering. In Proceedings of the twenty-first international conference on Machine learning, pp. 79. Cited by: §3.2.
  • A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow (2018) Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235–3246. Cited by: §1, §3.3.
  • M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein (2017) SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pp. 6076–6085. Cited by: §3.2.
  • B. Settles (2009) Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §3.1.
  • S. Song, K. Chaudhuri, and A. D. Sarwate (2013) Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 245–248. Cited by: §2.2.
  • C. Yadav and L. Bottou (2019) Cold case: the lost MNIST digits. Technical report, arXiv:1905.10498. Cited by: §4.1.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: footnote 1.