1 Introduction
Differential privacy (DP) [13] is the de facto standard for privacy-preserving data analysis, including the training of machine learning models using sensitive data. The strength of DP comes from its use of randomness to hide the contribution of any individual’s data from an adversary with access to arbitrary side knowledge. The price of DP is a loss in utility caused by the need to inject noise into computations. Quantifying the tradeoff between privacy and utility is a central topic in the literature on differential privacy. Formal analysis of such tradeoffs leads to algorithms achieving a pre-specified level of privacy with minimal utility reduction, or, conversely, an a priori acceptable level of utility with maximal privacy. Since the privacy level is generally regarded as a policy decision [37], this step is essential to decision-makers tasked with balancing utility and privacy in real-world deployments [3].
However, analytical treatments of the privacy–utility tradeoff are only available for relatively simple problems amenable to mathematical analysis, and cannot be conducted for most problems of practical interest. Further, differentially private algorithms have more hyperparameters than their non-private counterparts, most of which affect both privacy and utility. In this paper we develop a Bayesian optimization approach for empirically characterizing the privacy–utility tradeoff, and provide a principled, computationally efficient way to tune any differentially private algorithm.
A canonical application of our methods is differentially private deep learning. Differentially private stochastic optimization has been employed to train feed-forward [1], convolutional [8], and recurrent [34] neural networks, showing that reasonable accuracies can be achieved when selecting hyperparameters carefully. These works rely on the gradient perturbation technique, which clips and adds noise to gradient computations, while keeping track of the privacy loss incurred. However, these results do not provide actionable information regarding the privacy–utility tradeoff of the proposed models. For example, private stochastic optimization methods can obtain the same level of privacy in different ways (e.g. by increasing the noise variance and reducing the clipping norm, or vice versa), and it is not generally clear what combinations of these changes yield the best possible utility for a fixed privacy level. Furthermore, increasing the number of hyperparameters makes exhaustive hyperparameter optimization prohibitively expensive.
The goal of this paper is to provide a computationally efficient methodology for this problem by using Bayesian optimization to estimate the privacy–utility Pareto front of a given algorithm. The Pareto fronts obtained by our method can be used to find hyperparameter settings leading to the optimal operating points of any differentially private technique, enabling decision-makers to take informed actions when balancing the privacy–utility tradeoff of an algorithm before deployment. This is in line with the approach taken by the U.S. Census Bureau to calibrate the level of DP that will be used when releasing the results of the upcoming 2020 census [17, 3, 2].
Our contributions are: (1) Characterizing the privacy–utility tradeoff of an algorithm as a function of its hyperparameters as the problem of learning a Pareto front on the privacy vs. utility plane (Sec. 2). (2) Designing DPareto, a multi-objective Bayesian optimization algorithm for learning the privacy–utility Pareto front of any differentially private algorithm (Sec. 3). (3) Instantiating and experimentally evaluating our framework for the case of differentially private stochastic optimization on a variety of learning tasks involving multiple models, optimizers, and datasets (Sec. 4).
2 The Privacy–Utility Pareto Front
This section provides an abstract formulation of the problem we want to address. We start by introducing some basic notation and recalling the definition of differential privacy, after which we will define the key components of our framework. We then formalize the task of quantifying the privacy–utility tradeoff using the notion of Pareto front, and conclude by giving an illustrative example in the context of private logistic regression trained with SGD [44].
General Setup
Let $A : Z^n \to W$ be a randomized algorithm that takes as input a tuple containing $n$ records from $Z$ and outputs a value in some set $W$. Differential privacy formalizes the idea that $A$ preserves the privacy of its inputs when the output distribution is stable under changes in one input.
Definition 1 (Dwork et al. [13], Dwork [12]).
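Given $\varepsilon \geq 0$ and $\delta \in [0, 1]$, we say algorithm $A$ is $(\varepsilon, \delta)$-DP if for any pair of inputs $z, z'$ differing in a single coordinate we have $\mathbb{P}[A(z) \in E] \leq e^{\varepsilon}\, \mathbb{P}[A(z') \in E] + \delta$ for all $E \subseteq W$. (Smaller values of $\varepsilon$ and $\delta$ yield more private algorithms.)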
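As a small illustration of Definition 1, the following sketch checks the inequality $\mathbb{P}[A(z) \in E] \leq e^{\varepsilon} \mathbb{P}[A(z') \in E] + \delta$ over every event for a mechanism with a finite output space. Randomized response on a single bit is used purely as an example mechanism (it is not one of the algorithms studied in this paper), and the function names are hypothetical.

```python
import math
from itertools import chain, combinations

def randomized_response_probs(bit, epsilon):
    """Output distribution of randomized response applied to one private bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def all_events(outcomes):
    """Enumerate every subset (event) of a finite outcome space."""
    outcomes = list(outcomes)
    return chain.from_iterable(
        combinations(outcomes, r) for r in range(len(outcomes) + 1)
    )

def satisfies_dp(probs_a, probs_b, epsilon, delta=0.0, tol=1e-12):
    """Check the (epsilon, delta)-DP inequality in both directions for all events."""
    bound = math.exp(epsilon)
    for event in all_events(probs_a.keys() | probs_b.keys()):
        pa = sum(probs_a.get(o, 0.0) for o in event)
        pb = sum(probs_b.get(o, 0.0) for o in event)
        if pa > bound * pb + delta + tol or pb > bound * pa + delta + tol:
            return False
    return True

# Neighbouring inputs: the private bit is 0 in one dataset and 1 in the other.
p0 = randomized_response_probs(0, epsilon=1.0)
p1 = randomized_response_probs(1, epsilon=1.0)
print(satisfies_dp(p0, p1, epsilon=1.0))  # guarantee holds at the design epsilon
print(satisfies_dp(p0, p1, epsilon=0.5))  # but not at a stricter epsilon
```

Exhaustive enumeration of events is only feasible for tiny output spaces; it is meant to make the quantifiers in the definition concrete rather than to serve as a practical verification tool.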
To analyze the tradeoff between utility and privacy for a given problem, we consider a parametrized family of algorithms $\{A_\lambda : \lambda \in \Lambda\}$. Here, $\lambda \in \Lambda$ indexes the possible choices of hyperparameters, so the family can be interpreted as the set of all possible algorithm configurations for solving a given task. For example, in the context of a machine learning application, the family consists of a set of learning algorithms which take as input a training dataset containing example-label pairs and produce as output the parameters of a predictive model. It is clear that in this context different choices for the hyperparameters might yield different utilities. We further assume each configuration $A_\lambda$ of the algorithm satisfies DP with potentially distinct privacy parameters.
To capture the privacy–utility tradeoff across the family we introduce two oracles to model the effect of hyperparameter changes on the privacy and utility of $A_\lambda$. A privacy oracle is a function $P_\delta : \Lambda \to [0, +\infty]$ that given a choice of hyperparameters $\lambda$ returns a value $\varepsilon$ such that $A_\lambda$ satisfies $(\varepsilon, \delta)$-DP. An instance-specific utility oracle is a function $U_z : \Lambda \to [0, 1]$ that given a choice of hyperparameters $\lambda$ returns some measure of the utility of the output distribution of $A_\lambda(z)$. (Due to the broad applicability of DP, concrete utility measures are generally defined on a per-problem basis. Here we use the conventions that $U_z$ is bounded and that larger utility is better.) These oracles allow us to condense everything about our problem in the tuple $(\Lambda, P_\delta, U_z)$. Given these three objects, our goal is to find hyperparameter settings for $A_\lambda$ that simultaneously achieve maximal privacy and utility on a given input $z$. Next we will formalize this goal using the concept of Pareto front, but we first provide remarks about the definition of our oracles.
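To make the notion of a privacy oracle concrete, the sketch below implements one for a Gaussian mechanism with known global sensitivity, where the only hyperparameter is the noise standard deviation. It relies on the exact characterization of the Gaussian mechanism from Balle and Wang [6], namely that the mechanism is $(\varepsilon, \delta)$-DP if and only if $\Phi(\Delta/2\sigma - \varepsilon\sigma/\Delta) - e^{\varepsilon}\Phi(-\Delta/2\sigma - \varepsilon\sigma/\Delta) \leq \delta$, and finds $\varepsilon$ for a fixed $\delta$ by bisection. The function names, the bisection search, and the cap `eps_max` are illustrative assumptions.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def gaussian_delta(epsilon, sensitivity, sigma):
    """Smallest delta for which the Gaussian mechanism is (epsilon, delta)-DP."""
    a = sensitivity / (2.0 * sigma)
    b = epsilon * sigma / sensitivity
    return std_normal_cdf(a - b) - math.exp(epsilon) * std_normal_cdf(-a - b)

def privacy_oracle(sigma, sensitivity, delta, eps_max=100.0, iters=100):
    """Return (approximately) the smallest epsilon achieving (epsilon, delta)-DP.

    Uses bisection on [0, eps_max]; gaussian_delta is decreasing in epsilon.
    Returns math.inf if even eps_max does not meet the target delta.
    """
    if gaussian_delta(0.0, sensitivity, sigma) <= delta:
        return 0.0
    if gaussian_delta(eps_max, sensitivity, sigma) > delta:
        return math.inf
    lo, hi = 0.0, eps_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gaussian_delta(mid, sensitivity, sigma) > delta:
            lo = mid  # not private enough yet: need a larger epsilon
        else:
            hi = mid  # target met: try a smaller epsilon
    return hi

print(privacy_oracle(sigma=1.0, sensitivity=1.0, delta=1e-6))
print(privacy_oracle(sigma=2.0, sensitivity=1.0, delta=1e-6))
```

As expected, doubling the noise standard deviation for the same sensitivity and $\delta$ yields a smaller $\varepsilon$.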
Remark 1 (Privacy Oracle).
The choice to parametrize our privacy oracle in terms of a fixed $\delta$ stems from the convention that $\varepsilon$ is considered the most important privacy parameter, whereas $\delta$ can be interpreted as a small probability that an $\varepsilon$-DP guarantee fails. This choice is also aligned with recent uses of DP in machine learning where the privacy analysis is conducted under the framework of Rényi DP [35] and the reported privacy is obtained by a posteriori converting the guarantees to standard $(\varepsilon, \delta)$-DP for some fixed $\delta$ [1, 19, 34, 16, 43]. In particular, in our experiments with gradient perturbation for SGD and other stochastic optimization methods (Sec. 4) we implement the privacy oracle using the moments accountant technique proposed in [1] coupled with the tight bounds provided in [43] for Rényi DP amplification by subsampling without replacement.
Remark 2 (Utility Oracle).
Parametrizing the utility oracle by a fixed input $z$ is a choice justified by the type of applications we tackle in our experiments (cf. Sec. 4). Other applications might require variations which our framework can easily accommodate by extending the definition of the utility oracle. We also stress that since the algorithms in the family are randomized, the utility $U_z(\lambda)$ is a property of the output distribution of $A_\lambda(z)$. This means that in practice we might have to implement the oracle approximately, e.g. through sampling. In particular, in our experiments we use a test set to measure the utility of a hyperparameter setting $\lambda$ by running $A_\lambda(z)$ a fixed number of times to obtain model parameters, and then let $U_z(\lambda)$ be the average accuracy of the resulting models on the test set.
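The sampling-based approximation described in Remark 2 can be sketched as follows. The helper `make_utility_oracle` and the toy training and scoring functions are hypothetical stand-ins chosen to keep the example self-contained; they are not the models used in our experiments.

```python
import random
from statistics import mean

def make_utility_oracle(train_fn, accuracy_fn, num_runs=10, seed=0):
    """Build an approximate utility oracle (hypothetical helper).

    train_fn(hyperparams, rng) runs the randomized algorithm once and returns a model;
    accuracy_fn(model) scores that model on a held-out test set.
    """
    def utility_oracle(hyperparams):
        rng = random.Random(seed)  # fixed seed so repeated queries are reproducible
        scores = [accuracy_fn(train_fn(hyperparams, rng)) for _ in range(num_runs)]
        return mean(scores)  # average accuracy over the repeated runs
    return utility_oracle

# Toy stand-ins (assumptions): a threshold classifier whose learned threshold
# is perturbed by Gaussian noise with standard deviation hyperparams["sigma"].
TEST_SET = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]

def toy_train(hyperparams, rng):
    return 0.5 + rng.gauss(0.0, hyperparams["sigma"])

def toy_accuracy(threshold):
    return mean(float((x > threshold) == bool(y)) for x, y in TEST_SET)

oracle = make_utility_oracle(toy_train, toy_accuracy, num_runs=20)
print(oracle({"sigma": 0.0}), oracle({"sigma": 1.0}))
```

Fixing the seed inside the oracle is a design choice that makes the estimate a deterministic function of the hyperparameters; drawing fresh randomness on every call would instead expose the sampling noise to the optimizer.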
The Pareto front of a collection of points $\Gamma \subset \mathbb{R}^p$ contains all the points in $\Gamma$ where none of the coordinates can be decreased further without increasing some of the other coordinates (while remaining inside $\Gamma$).
Definition 2 (Pareto Front).
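Let $\Gamma \subset \mathbb{R}^p$ and $u, v \in \Gamma$. We say that $u$ dominates $v$ if $u_i \leq v_i$ for all $i \in \{1, \ldots, p\}$, and we write $u \preceq v$. The Pareto front of $\Gamma$ is the set of all non-dominated points $PF(\Gamma) = \{u \in \Gamma \mid v \not\preceq u \text{ for all } v \in \Gamma \setminus \{u\}\}$.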
According to this definition, given a privacy–utility tradeoff problem of the form $(\Lambda, P_\delta, U_z)$ we are interested in finding the Pareto front of the 2-dimensional set $\Gamma = \{(P_\delta(\lambda), 1 - U_z(\lambda)) \mid \lambda \in \Lambda\}$. (The use of $1 - U_z$ for the utility coordinate is for notational consistency, since we use the convention that the points in the Pareto front are those that minimize each individual dimension.) Given this Pareto front, a decision-maker looking to deploy DP has all the necessary information to make an informed decision about how to trade off privacy and utility in their particular application.
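Definition 2 translates directly into a few lines of code. The sketch below computes the set of non-dominated points of a finite collection under the minimization convention used here; the sample points, interpreted as pairs ($\varepsilon$, classification error), are made up for illustration.

```python
def dominates(u, v):
    """True if u is no worse than v in every coordinate (minimization)."""
    return all(a <= b for a, b in zip(u, v))

def pareto_front(points):
    """Return the non-dominated points of a finite collection, sorted."""
    unique = set(map(tuple, points))  # duplicates would otherwise dominate each other
    return sorted(
        p for p in unique
        if not any(dominates(q, p) for q in unique if q != p)
    )

# Illustrative (epsilon, 1 - utility) pairs; the values are invented.
evaluations = [(0.5, 0.30), (1.0, 0.20), (1.0, 0.25), (2.0, 0.22), (3.0, 0.15)]
print(pareto_front(evaluations))
# → [(0.5, 0.3), (1.0, 0.2), (3.0, 0.15)]
```

This quadratic-time filter is adequate for the few hundred evaluations considered in this paper; faster sweep-based methods exist for larger collections.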
Example: Logistic Regression
To illustrate the ingredients of our framework we consider a simple private logistic regression model with $\ell_2$ regularization trained on the Adult dataset [25]. In particular, to reduce the number of hyperparameters we privatize the model by training with minibatched projected SGD and applying a Gaussian perturbation at the output using the method from [44, Algorithm 2] with default parameters (these are the smoothness, Lipschitz and strong convexity parameters of the loss, and the learning rate). The only hyperparameters we tune in this experiment are the regularization parameter and the noise standard deviation, while we fix the rest of the hyperparameters (the minibatch size and the number of epochs). Note that both hyperparameters affect privacy and accuracy in this case. To implement the privacy oracle we compute the global sensitivity according to [44, Algorithm 2] and find the $\varepsilon$ for a fixed $\delta$ using the exact analysis of the Gaussian mechanism provided in [6]. To implement the utility oracle we evaluate the accuracy of the model on the test set, averaging over 50 runs for each setting of the hyperparameters. Given the small number of hyperparameters, we can perform a fine grid search over both of them to obtain the exact Pareto front for this problem. The results are displayed in Figure 1, where we illustrate privacy and utility as a function of both hyperparameters, as well as the resulting Pareto front and the corresponding hyperparameter settings.
Threat Model
In the idealized setting presented above, the desired output is the Pareto front $PF(\Gamma)$, which depends on $z$ through the utility oracle; this is also the case for the Bayesian optimization algorithm for approximating the Pareto front presented in Sec. 3. This warrants a discussion about what threat model is appropriate to consider here.
DP guarantees that an adversary observing the output $A(z)$ will not be able to infer too much about any individual record in $z$. The (central) threat model for DP assumes that $z$ is owned by a trusted curator, responsible for running the algorithm and releasing its output to the world. However, the framework described above does not attempt to prevent information about $z$ from being exposed by the Pareto front. This is because our methodology is only meant to provide a substitute for using closed-form utility guarantees when selecting hyperparameters for a given DP algorithm before its deployment. Accordingly, throughout this work we assume the Pareto fronts obtained with our method are only revealed to a small set of trusted individuals, which is the usual scenario in an industrial context.
An alternative approach is to assume the existence of a public dataset $z_0$ following a similar distribution to the private dataset $z$ on which we would like to run the algorithm. Then we can use $z_0$ to compute the Pareto front of the algorithm, select hyperparameters $\lambda$ achieving a desired privacy–utility tradeoff, and release the output of $A_\lambda(z)$. In particular, this is the threat model being used by the U.S. Census Bureau to tune the parameters for their use of DP in the context of the 2020 census (see Sec. 5 for more details).
3 DPareto: Learning the Pareto Front
This section starts by recalling the basic ideas behind multi-objective Bayesian optimization. Then we describe the proposed methodology to learn the privacy–utility Pareto front and revisit the private logistic regression example to illustrate the effectiveness of our method.
Bayesian Optimization for Multiple Objectives
Bayesian optimization (BO) [36] is a strategy for sequential decision making useful for optimizing expensive-to-evaluate black-box objective functions. It has become increasingly relevant in machine learning due to its success in the optimization of model hyperparameters [41, 21].
In its most standard form, BO is used to find the minimum of an objective function $f(\lambda)$ on some subset $\Lambda$ of a Euclidean space of moderate dimension. It works by generating a sequence of evaluations of the objective at locations $\lambda_1, \ldots, \lambda_k$, which is done by (i) building a surrogate model of the objective function using the current data and (ii) applying a pre-specified criterion to select a new location $\lambda_{k+1}$ based on the model. In the single-objective case a common choice is to select the location that, in expectation under the model, gives the best improvement to the current estimate [36].
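For reference, when the surrogate's prediction at a location is Gaussian with mean $\mu$ and standard deviation $\sigma$, the single-objective criterion described above (usually called expected improvement) has the well-known closed form $(f^\star - \mu)\,\Phi(t) + \sigma\,\varphi(t)$ with $t = (f^\star - \mu)/\sigma$, where $f^\star$ is the best value observed so far. A minimal sketch for minimization follows; the function name is an illustrative choice.

```python
import math

def expected_improvement(mu, sigma, best):
    """Expected improvement over `best` for a minimization problem.

    mu, sigma: mean and standard deviation of the Gaussian prediction.
    best: smallest objective value observed so far.
    """
    if sigma <= 0.0:
        # Deterministic prediction: improvement is known exactly.
        return max(best - mu, 0.0)
    t = (best - mu) / sigma
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-t / math.sqrt(2.0))
    return (best - mu) * cdf + sigma * pdf

# A candidate predicted to be slightly worse than the incumbent can still be
# attractive if the model is uncertain about it.
print(expected_improvement(mu=1.1, sigma=0.5, best=1.0))
print(expected_improvement(mu=1.1, sigma=0.0, best=1.0))
```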
In this work, we use BO for learning the privacy–utility Pareto front. When used in multi-objective problems, BO aims to learn the Pareto front with a minimal number of evaluations, which makes it an appealing tool in cases where evaluating the objectives is expensive. Although in this paper we only work with two objective functions, we detail here the general case of minimizing $p$ objectives $f_1, \ldots, f_p$ simultaneously. This generalization could be used, for instance, to introduce the running time of the algorithm as a third objective to be traded off against privacy and utility.
Let $\{\lambda_1, \ldots, \lambda_k\}$ be a set of locations in $\Lambda$ and denote by $V = \{v_1, \ldots, v_k\}$ the set such that each $v_i \in \mathbb{R}^p$ is the vector $(f_1(\lambda_i), \ldots, f_p(\lambda_i))$. In a nutshell, BO works by iterating over the following:
1. Fit a surrogate model of the objectives using the available dataset $D = \{(\lambda_i, v_i)\}_{i=1}^{k}$. The most standard approach is to use a Gaussian process (GP) [38].
2. For each objective $f_j$ calculate the predictive distribution over $\lambda \in \Lambda$ using the surrogate model. If GPs are used, the predictive distribution of each output can be fully characterized by its mean $\mu_j(\lambda)$ and variance $\sigma_j^2(\lambda)$ functions, which can be computed in closed form.
3. Use the posterior distribution of the surrogate model to form an acquisition function $\alpha(\lambda; \mathcal{I})$, where $\mathcal{I}$ represents the dataset $D$ and the GP posterior conditioned on $D$.
4. Collect the next evaluation point $\lambda_{k+1}$ at the (numerically estimated) global maximum of $\alpha(\lambda; \mathcal{I})$.
The process is repeated until the budget to collect new locations is exhausted. There are two key aspects of any BO method: the surrogate model of the objectives and the acquisition function $\alpha$.
Acquisition with Pareto Front Hypervolume
Next we define an acquisition criterion useful to collect new points when learning the Pareto front. Let $P = PF(V)$ be the Pareto front computed with the objective evaluations in $V$ and let $v^\dagger \in \mathbb{R}^p$ be some “anti-ideal” point. (The anti-ideal point must be dominated by all points in $P$. See [11] for further details.) To measure the relative merit of different Pareto fronts we use the hypervolume of the region dominated by the Pareto front bounded by the anti-ideal point. Mathematically this can be expressed as $HV_{v^\dagger}(P) = \mu(\{v \in \mathbb{R}^p \mid v \preceq v^\dagger,\ \exists\, u \in P : u \preceq v\})$, where $\mu$ denotes the standard Lebesgue measure on $\mathbb{R}^p$. Henceforth we assume the anti-ideal point is fixed and drop it from our notation.
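In the two-objective case considered in this paper the hypervolume is simply an area, and it can be computed with a single sweep over the points sorted by their first coordinate. The sketch below assumes minimization in both coordinates; the function name and the sample values are illustrative.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by the anti-ideal point `ref`.

    Both objectives are minimized. Points that do not dominate `ref` are ignored,
    and dominated points contribute nothing.
    """
    inside = sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1])
    area = 0.0
    best_y = ref[1]  # lowest second coordinate seen so far in the sweep
    for x, y in inside:
        if y < best_y:
            # New strip [x, ref_x] x [y, best_y] not covered by earlier points.
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area

front = [(0.5, 0.30), (1.0, 0.20), (3.0, 0.15)]
print(hypervolume_2d(front, ref=(4.0, 1.0)))
```

For this toy front the three strips have areas 2.45, 0.30 and 0.05, so the hypervolume is 2.8 up to floating-point rounding; adding a dominated point such as (2.0, 0.22) leaves it unchanged.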
Larger hypervolume means the points in the Pareto front are closer to the ideal point. Thus, $HV(PF(V))$ provides a way to measure the quality of the Pareto front obtained from the data in $V$. Furthermore, hypervolume can be used to design acquisition functions for selecting hyperparameters that will improve the Pareto front. Start by defining the increment in the hypervolume given a new point $v \in \mathbb{R}^p$: $\Delta_{PF}(v) = HV(PF(V \cup \{v\})) - HV(PF(V))$. This quantity is positive only if $v$ lies in the set $\tilde{\Gamma}$ of points non-dominated by $PF(V)$. Therefore, the probability of improvement (PoI) over the current Pareto front when selecting a new hyperparameter $\lambda$ can be computed using the model trained on $D$ as $PoI(\lambda) = \int_{v \in \tilde{\Gamma}} \prod_{i=1}^{p} \phi_i(\lambda; v_i)\, dv_i$, where $\phi_i(\lambda; \cdot)$ is the predictive Gaussian density for $f_i$ with mean $\mu_i(\lambda)$ and variance $\sigma_i^2(\lambda)$.
The function $PoI(\cdot)$ accounts for the probability that a given $\lambda$ has to improve the Pareto front, and it can be used as a criterion to select new points. However, in this work, we opt for the hypervolume-based PoI (HVPoI) due to its superior computational and practical properties [11]. The HVPoI is given by $HVPoI(\lambda) = \Delta_{PF}(\boldsymbol{\mu}(\lambda)) \cdot PoI(\lambda)$, where $\boldsymbol{\mu}(\lambda) = (\mu_1(\lambda), \ldots, \mu_p(\lambda))$. This acquisition weights the probability of improving the Pareto front with a measure of how much improvement is expected, computed using the means of the outputs. The HVPoI has been shown to work well in practice and efficient implementations exist.
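The structure of the HVPoI criterion can be illustrated with a short two-objective sketch. Note that it estimates the probability of improvement by Monte Carlo sampling from independent Gaussian predictions, which is a simplification made here for brevity and not the exact computation of [11]; all function names and numerical values are illustrative assumptions.

```python
import random

def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    area, best_y = 0.0, ref[1]
    for x, y in sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1]):
        if y < best_y:
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area

def is_dominated(v, front):
    """True if some point of the front is no worse than v in every coordinate."""
    return any(all(a <= b for a, b in zip(u, v)) for u in front)

def hvpoi(mean, std, front, ref, num_samples=10000, seed=0):
    """Hypervolume gain at the predicted mean times a Monte Carlo PoI estimate."""
    rng = random.Random(seed)
    improving = sum(
        not is_dominated((rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1])), front)
        for _ in range(num_samples)
    )
    poi = improving / num_samples
    gain = hypervolume_2d(list(front) + [tuple(mean)], ref) - hypervolume_2d(front, ref)
    return gain * poi

front = [(0.5, 0.30), (1.0, 0.20), (3.0, 0.15)]
ref = (4.0, 1.0)
print(hvpoi(mean=(0.8, 0.10), std=(0.05, 0.02), front=front, ref=ref))  # promising
print(hvpoi(mean=(2.0, 0.50), std=(0.01, 0.01), front=front, ref=ref))  # dominated
```

A candidate whose predicted mean is dominated by the current front receives a score of zero regardless of its uncertainty, since the hypervolume gain is evaluated at the mean only.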
The DPareto Algorithm
The main optimization loop of DPareto is shown in Alg. 1. It combines the two ingredients sketched so far: GPs for surrogate modelling of the objective oracles, and HVPoI as an acquisition function to select new hyperparameters. The basic procedure is to first seed the optimization by selecting an initial set of hyperparameters from the domain at random, and then fit the GP models for the privacy and utility oracles based on these points. We then find the maximum of the HVPoI acquisition function to obtain the next query point, which is then added to the dataset. This is repeated until the optimization budget is used up. Further implementation details are provided in Sec. E.1.
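The control flow just described can be sketched as below. This is only a structural outline under stated assumptions: surrogate fitting and the acquisition function are passed in as callables (in DPareto these would be the GP models and HVPoI), and the acquisition is maximized by scoring random candidates, which is a simplification introduced here. All names are hypothetical.

```python
import random

def dpareto_loop(sample_hyperparams, privacy_oracle, utility_oracle,
                 fit_surrogate, acquisition, num_init, budget,
                 num_candidates=200, seed=0):
    """Seed with random points, then repeatedly query the acquisition maximizer."""
    rng = random.Random(seed)

    def evaluate(lam):
        # Objectives are (epsilon, 1 - utility), both to be minimized.
        return (privacy_oracle(lam), 1.0 - utility_oracle(lam))

    lambdas = [sample_hyperparams(rng) for _ in range(num_init)]
    values = [evaluate(lam) for lam in lambdas]
    for _ in range(budget):
        model = fit_surrogate(lambdas, values)
        candidates = [sample_hyperparams(rng) for _ in range(num_candidates)]
        # Random-candidate search stands in for a numerical optimizer.
        next_lam = max(candidates, key=lambda lam: acquisition(model, lam, values))
        lambdas.append(next_lam)
        values.append(evaluate(next_lam))
    return lambdas, values

# Dummy components, used only to exercise the loop.
lams, vals = dpareto_loop(
    sample_hyperparams=lambda rng: rng.uniform(0.1, 2.0),
    privacy_oracle=lambda sigma: 1.0 / sigma,
    utility_oracle=lambda sigma: 1.0 / (1.0 + sigma),
    fit_surrogate=lambda lambdas, values: None,
    acquisition=lambda model, lam, values: -lam,
    num_init=3, budget=5,
)
print(len(lams), len(vals))
```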
Now we revisit our example on private logistic regression with SGD and output perturbation from Sec. 2 to illustrate how GPs can learn a good model of the privacy and utility oracles from a few random samples, and how that produces an acquisition function for finding the next hyperparameter settings that improve the current empirical Pareto front. This corresponds to the initialization phase of DPareto; results are given in Figure 2.
4 Experiments
We provide experimental evaluation of DPareto on a number of ML tasks, highlighting the advantage of using BO over random or grid search, and showcasing DPareto’s versatility on a variety of models, datasets and optimizers. Due to space limitations, further details (e.g. optimization domains and random sampling distributions) and additional results are given in Appx. C and Appx. D.
Datasets
We tackle two classic problems: multiclass classification of handwritten digits with the mnist dataset, and binary classification of income with the adult dataset. mnist [28] is composed of $28 \times 28$ grayscale images, each representing a single digit 0-9. It has 60k (10k) images in the training (test) set. adult [25] is composed of 123 binary demographic features on various people, with the task of predicting whether income > $50k. It has 40k (1.6k) points in the training (test) set.
Algorithms
Experiments are performed with privatized variants of two popular optimization algorithms – stochastic gradient descent (SGD) [7] and Adam [22] – although our framework can easily accommodate other algorithms. For the privatized version of SGD, we use a minibatched implementation with clipped gradients and Gaussian noise similar to that of [1], where the only difference is that we sample minibatches of a fixed size without replacement instead of sampling minibatches from a Poisson distribution with fixed rate, and use the moments accountant from [43]. Our privatized version of Adam uses the same gradient perturbation technique as SGD. The pseudocode for both of these can be found in Appx. B (Alg. 4 and Alg. 5 respectively).
Models
For the adult dataset, we consider logistic regression (LogReg) and linear support vector machines (SVM), and explore the effect of the choice of model and optimization algorithm (SGD vs. Adam), using the differentially private versions of these algorithms outlined in Appx. B. For mnist, we fix the optimization algorithm as SGD, but use a more expressive multilayer perceptron (MLP) model and explore the choice of network architectures. The first (MLP1) has a single hidden layer with 1000 neurons, which is the same as used by [1] but without PCA dimensionality reduction. The second (MLP2) has two hidden layers with 128 and 64 units. In both cases we use ReLU activations.
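The gradient perturbation step shared by the privatized SGD and Adam variants described under Algorithms can be sketched as follows, in the style of [1]: each per-example gradient is clipped to a fixed norm, the clipped gradients are summed, Gaussian noise proportional to the clipping norm is added, and the result is averaged. This is a minimal illustration with hypothetical names and plain Python lists; the pseudocode in Appx. B remains the reference for the algorithms actually used.

```python
import math
import random

def clip(grad, clip_norm):
    """Rescale a gradient so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]

def private_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    noise_std = noise_multiplier * clip_norm  # noise scales with the sensitivity
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

def sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One noisy SGD update on a minibatch of per-example gradients."""
    grad = private_gradient(per_example_grads, clip_norm, noise_multiplier, rng)
    return [p - lr * g for p, g in zip(params, grad)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]  # toy per-example gradients for one minibatch
print(sgd_step([0.0, 0.0], grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

The clipping norm and the noise multiplier are exactly the kind of coupled hyperparameters discussed in the introduction: both enter the privacy analysis and both affect the quality of the resulting update.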
DPareto vs. Random Sampling
A primary purpose of these experiments is to highlight the efficacy of DPareto at estimating the privacy–utility tradeoff of a given algorithm. As discussed above, the hypervolume is a popular measure for quantifying the quality of a Pareto front. We compare DPareto to the traditional naïve approach of uniform random sampling by computing the hypervolumes of Pareto fronts generated by each method.
In Fig. 3, the first two plots show, for a variety of models, how the hypervolume of the Pareto front expands as new points are sampled. In nearly every experiment, DPareto’s approach yields a greater hypervolume than the experiment’s random sampling analog – a direct indicator that DPareto has better characterized the Pareto front. This can be seen very clearly by examining the center right plot of the figure, which directly shows a Pareto front of the MLP2 model with both sampling methods. Specifically, while the random sampling method only marginally improved over its initially seeded points, DPareto was able to thoroughly explore the high-privacy regime (i.e. small $\varepsilon$). The far right plot of the figure compares the DPareto approach with 256 sampled points against the random sampling approach with significantly more sampled points, 1500. While both approaches yield similar Pareto fronts, the efficiency of DPareto is particularly highlighted by the points that are not actually on the front: nearly all the points chosen by DPareto are close to the actual front, whereas many points chosen by random sampling are nowhere near it. We also ran experiments using grid search with grid sizes of 3 or 4 (corresponding to 243 and 1024 points), both of which performed clearly worse than DPareto. These are shown in Fig. 8 in Appx. D.
To quantify the differences between random sampling and DPareto for the adult dataset, we split the 5000 random samples into 19 parts of size 256 to match the number of BO points, and computed hypervolume differences between the resultant Pareto fronts under the (mild) assumption that DPareto is deterministic. (Whilst this is not strictly true, since BO is seeded with a random set of points, running repetitions would have been an extremely costly exercise, and we would expect the results to be nearly identical.) We then computed the two-sided confidence intervals for these differences, shown in Table 1. We also computed the t-statistics for the hypothesis that these differences are zero, all of which were highly significant (p < 0.001). This demonstrates that the observed differences between Pareto fronts are in fact statistically significant. We did not have enough random samples to run statistical tests for mnist; however, the differences are visually even clearer in this case.
DPareto’s Versatility
The other main purpose of these experiments is to demonstrate the versatility of DPareto by comparing multiple approaches to the same problem. In Fig. 4, the left plot shows Pareto fronts of the adult dataset for multiple optimizers (SGD and Adam) as well as multiple models (LogReg and SVM), and the right plot shows Pareto fronts of the mnist dataset for different architectures (MLP1 and MLP2). With this, we can see that on the adult dataset, the LogReg model optimized using Adam was nearly always better than the other model/optimizer combinations. We can also see that on the mnist dataset, while both architectures performed similarly in the low-privacy regime, the MLP2 architecture significantly outperformed the MLP1 architecture in the high-privacy regime. With DPareto, analysts and practitioners can efficiently create these types of Pareto fronts and use them to perform privacy–utility tradeoff comparisons.
5 Related Work
While this work is the first to examine the privacy–utility tradeoff of differentially private algorithms using multi-objective optimization and Pareto fronts, efficiently computing Pareto fronts without regard to privacy is an active area of research in fields relating to multi-objective optimization. DPareto’s point-selection process most closely aligns with [11], but other approaches (e.g., [45]) may provide promising alternatives for improving DPareto.
The threat model and outputs of the DPareto algorithm are closely aligned with the methodology used by the U.S. Census Bureau to choose the privacy parameter for their deployment of DP to release data from the upcoming 2020 census. In particular, the bureau is combining a graphical approach to represent the privacy–utility tradeoff for their application [17] together with economic theory to pick a particular point to balance the tradeoff [3]. Their graphical approach works with Pareto fronts identical to the ones computed by our algorithm, which they construct using data from previous censuses [2]. However, they do not attempt to optimize or learn the Pareto front.
Several aspects of this paper are related to recent work in single-objective optimization. For non-private single-objective optimization, there is an abundance of recent work in machine learning on hyperparameter selection, typically using BO [23, 20] or other methods [29] to maximize a model’s utility. Recently, several related questions at the intersection of machine learning and differential privacy have emerged regarding hyperparameter selection and utility maximization.
One such question explicitly asks how to do the hyperparameter-tuning process in a privacy-preserving way. Specifically, [27] and subsequently [39] use BO to find near-optimal hyperparameter settings for a given model while preserving the privacy of the data during the utility evaluation stage. Aside from the single-objective focus of this setting, our setting is significantly different in that we are primarily interested in training the models with differential privacy, not in protecting the privacy of the data used to evaluate an already-trained model.
Another question asks how to choose utility-maximizing hyperparameters when privately training models. When privacy is independent of the hyperparameters, this reduces to the non-private hyperparameter optimization task. However, two variants of this question don’t have this trivial reduction. The first variant inverts the stated objective: [30] and [18] each study the problem of maximizing privacy given constraints on the final utility. The second variant, closely aligning with the setting in this paper, studies the problem of choosing utility-maximizing, but privacy-dependent, hyperparameters. This is particularly challenging, since the privacy’s dependence on the hyperparameters may be non-analytical and computationally expensive to determine. [33, 42] provide approaches to this variant; however, the proposed strategies 1) are based on heuristics, 2) are only applicable to the differentially private SGD problem, and 3) do not provide a computationally efficient way to find the Pareto optimal points for the privacy–utility tradeoff of a given model. [44] provides a practical analysis-backed approach to privately training utility-maximizing models (again, for the case of SGD with a fixed privacy constraint), but hyperparameter optimization is naïvely performed using grid search. By contrast, this paper provides a computationally efficient way to directly search for Pareto optimal points for the privacy–utility tradeoff of arbitrary hyperparameterized algorithms.
The final related question revolves around the differentially private “selection” or “maximization” problem [9], which asks: how can an item be chosen (from a predefined universe) to maximize a data-dependent function while still protecting the privacy of that data? Here, [31] recently provided a way to choose hyperparameters that approximately maximize the utility of a given differentially private model in a way that protects the privacy of both the training and test data sets. However, this only optimizes utility with fixed privacy – it doesn’t address our problem of directly optimizing for the selection of hyperparameters that generate privacy–utility points which fall on the Pareto front.
Recent work on data-driven algorithm configuration has considered the problem of tuning the hyperparameters of combinatorial optimization algorithms while maintaining DP [5]. In this setting, problem instances are sampled from a distribution, and this sample’s privacy is protected. A similar problem of data-driven algorithm selection is considered by [26], where the problem is to choose the best algorithm to accomplish a given task while maintaining the privacy of the data used. For both, only the utility objective is being optimized, assuming a fixed constraint on the privacy.
6 Conclusion
In this paper we introduced DPareto, a method to empirically characterize the privacy–utility tradeoff of differentially private algorithms. We use Bayesian optimization (BO), a state-of-the-art method for hyperparameter optimization, to simultaneously optimize for both privacy and utility, forming a Pareto front. Further, we showed that BO allows us to perform useful visualizations to aid the decision making process. There are several directions for future work. We focused on supervised learning, but the method could also be applied to, e.g., stochastic variational inference on probabilistic models, as long as a utility function (e.g. held-out perplexity) is available. DPareto currently uses independent GPs, but an interesting extension would be to use multi-output GPs. While we explored the effect of changing the model (logistic regression vs. SVM) and the optimizer (SGD vs. Adam) on the privacy–utility tradeoff, it would be interesting to optimize over these choices as well. Finally, it may be of interest to optimize over additional criteria, such as model size or running time.
References
 Abadi et al. [2016] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
 Abowd [2018] J. M. Abowd. Disclosure avoidance for block level data and protection of confidentiality in public tabulations. Census Scientific Advisory Committee (Fall Meeting), 2018. URL https://www2.census.gov/cac/sac/meetings/201812/abowddisclosureavoidance.pdf.
 Abowd and Schmutte [2018] J. M. Abowd and I. M. Schmutte. An economic analysis of privacy protection and statistical accuracy as social choices. American Economic Review, Forthcoming, 2018.
 Álvarez et al. [2012] M. A. Álvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A review. Found. Trends Mach. Learn., 4(3):195–266, Mar. 2012. ISSN 1935-8237. doi: 10.1561/2200000036. URL http://dx.doi.org/10.1561/2200000036.
 Balcan et al. [2018] M.-F. Balcan, T. Dick, and E. Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 603–614. IEEE, 2018.
 Balle and Wang [2018] B. Balle and Y. Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 403–412, 2018.
 Bottou [2010] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 Carlini et al. [2018] N. Carlini, C. Liu, J. Kos, Ú. Erlingsson, and D. Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets. CoRR, abs/1802.08232, 2018.
 Chaudhuri et al. [2014] K. Chaudhuri, D. J. Hsu, and S. Song. The large margin mechanism for differentially private maximization. In Advances in Neural Information Processing Systems, pages 1287–1295, 2014.
 Chen et al. [2015] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 Couckuyt et al. [2014] I. Couckuyt, D. Deschrijver, and T. Dhaene. Fast calculation of multiobjective probability of improvement and expected improvement criteria for Pareto optimization. Journal of Global Optimization, 60(3):575–594, 2014.
 Dwork [2006] C. Dwork. Differential privacy. In Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II, pages 1–12, 2006.
 Dwork et al. [2006] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.

 Dwork et al. [2009] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 381–390, 2009.
 Dwork et al. [2014] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
 Feldman et al. [2018] V. Feldman, I. Mironov, K. Talwar, and A. Thakurta. Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018.
 Garfinkel et al. [2018] S. L. Garfinkel, J. M. Abowd, and S. Powazek. Issues encountered deploying differential privacy. In Proceedings of the 2018 Workshop on Privacy in the Electronic Society, 2018.
 Ge et al. [2019] C. Ge, X. He, I. F. Ilyas, and A. Machanavajjhala. Apex: Accuracy-aware differentially private data exploration. 2019.
 Geumlek et al. [2017] J. Geumlek, S. Song, and K. Chaudhuri. Renyi differential privacy mechanisms for posterior sampling. In Advances in Neural Information Processing Systems, pages 5289–5298, 2017.
 Golovin et al. [2017] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
 Jenatton et al. [2017] R. Jenatton, C. Archambeau, J. González, and M. Seeger. Bayesian optimization with tree-structured dependencies. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1655–1664, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations (ICLR), 2015.
 Klein et al. [2016] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079, 2016.
 Knudde et al. [2017] N. Knudde, J. van der Herten, T. Dhaene, and I. Couckuyt. GPflowOpt: A Bayesian Optimization Library using TensorFlow. arXiv preprint – arXiv:1711.03845, 2017. URL https://arxiv.org/abs/1711.03845.

 Kohavi [1996] R. Kohavi. Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In KDD, volume 96, pages 202–207. Citeseer, 1996.
 Kotsogiannis et al. [2017] I. Kotsogiannis, A. Machanavajjhala, M. Hay, and G. Miklau. Pythia: Data dependent differentially private algorithm selection. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1323–1337. ACM, 2017.
 Kusner et al. [2015] M. Kusner, J. Gardner, R. Garnett, and K. Weinberger. Differentially private Bayesian optimization. In International Conference on Machine Learning, pages 918–927, 2015.
 LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 Li et al. [2016] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
 Ligett et al. [2017] K. Ligett, S. Neel, A. Roth, B. Waggoner, and S. Z. Wu. Accuracy first: Selecting a differential privacy level for accuracy constrained ERM. In Advances in Neural Information Processing Systems, pages 2566–2576, 2017.
 Liu and Talwar [2018] J. Liu and K. Talwar. Private selection from private candidates. arXiv preprint arXiv:1811.07971, 2018.
 Lyu et al. [2017] M. Lyu, D. Su, and N. Li. Understanding the sparse vector technique for differential privacy. Proceedings of the VLDB Endowment, 2017.
 McMahan and Andrew [2018] H. B. McMahan and G. Andrew. A general approach to adding differential privacy to iterative training procedures. CoRR, abs/1812.06210, 2018.
 McMahan et al. [2018] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJ0hF1Z0b.
 Mironov [2017] I. Mironov. Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, pages 263–275. IEEE, 2017.
 Močkus [1975] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.
 Nissim et al. [2018] K. Nissim, T. Steinke, A. Wood, M. Altman, A. Bembenek, M. Bun, M. Gaboardi, D. O’Brien, and S. Vadhan. Differential privacy: A primer for a nontechnical audience (preliminary version). Vanderbilt Journal of Entertainment and Technology Law, Forthcoming, 2018.
 Rasmussen and Williams [2005] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.

 Smith et al. [2018] M. Smith, M. Álvarez, M. Zwiessele, and N. Lawrence. Differentially private regression with Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 1195–1203, 2018.
 Snelson et al. [2004] E. Snelson, Z. Ghahramani, and C. E. Rasmussen. Warped Gaussian processes. In Advances in neural information processing systems, pages 337–344, 2004.
 Snoek et al. [2012] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 van der Veen [2018] K. L. van der Veen. A Practical Approach to Differential Private Learning. Master’s thesis, University of Amsterdam, Amsterdam, The Netherlands, 2018.
 Wang et al. [2019] Y. Wang, B. Balle, and S. Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
 Wu et al. [2017] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1307–1322. ACM, 2017.

 Zuluaga et al. [2016] M. Zuluaga, A. Krause, and M. Püschel. ε-PAL: an active learning approach to the multi-objective optimization problem. The Journal of Machine Learning Research, 17(1):3619–3650, 2016.
Appendix A Sparse Vector Technique Analysis
The sparse vector technique Dwork et al. [2009] is a mechanism to privately run queries against a fixed sensitive database and release under DP the indices of those queries which exceed a certain threshold. The naming of the mechanism reflects the fact that it is specifically designed to have good accuracy when only a small number of queries are expected to be above the threshold. The mechanism has found applications in a number of problems, and several variants of the algorithm have been proposed Lyu et al. [2017].
To illustrate our framework we use a non-interactive version of the mechanism proposed in [Lyu et al., 2017, Alg. 7]. The mechanism is described in Alg. 2, and is tailored to answer binary queries with a fixed sensitivity and a fixed threshold. The privacy and utility of the mechanism are controlled by the noise level and by the bound on the number of answers. Increasing the noise level or decreasing the bound on the number of answers yields a more private but less accurate mechanism. Unlike in the usual setting, where the sparse vector technique is parametrized by the target privacy level, we modified the mechanism to take as input a total noise level. This noise level is split across two parameters that control how much noise is added to the threshold and to the query answers, respectively (the split used by the algorithm is based on the privacy budget allocation suggested in [Lyu et al., 2017, Section 4.2]). The standard privacy analysis of the sparse vector technique provides a closed-form privacy oracle for our algorithm (see Appx. A for more details).
As a utility oracle we use the F1 score between the vector of true answers and the vector returned by the algorithm. This measures how well the algorithm identifies the support of the queries that return 1, while penalizing both false positives and false negatives. This is again different from the usual utility analysis of sparse vector technique algorithms, which focuses on providing an interval around the threshold outside which the output is guaranteed to have no false positives or false negatives Dwork et al. [2014]. Our measure of utility is more fine-grained and relevant for practical applications, although to the best of our knowledge no theoretical analysis of the utility of the sparse vector technique in terms of F1 score is available in the literature.
To illustrate the concepts introduced in Sec. 2 we compute the oracles and Pareto front for Alg. 2. In our experiment we set and pick queries at random such that exactly of them return a . Since the accuracy of the algorithm is sensitive to the order of the queries, to evaluate the utility oracle we run the algorithm times with the queries in a random order and return the average utility. Fig. 5 displays the values returned by the privacy and utility oracles across a range of hyperparameters (left two figures), the Pareto front (center right) and a set of pairs that lead to points in the Pareto front (far right).
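The mechanism and the averaged utility oracle described above can be sketched as follows. This is a minimal illustration, not a reproduction of Alg. 2: the function names are hypothetical, the two Laplace noise scales are passed in directly rather than derived from a total noise level (the split from [Lyu et al., 2017, Section 4.2] is not implemented here), and the F1 score is assumed as the utility measure.

```python
import random

def laplace(rng, scale):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sparse_vector(answers, threshold, scale_threshold, scale_query, max_positives, rng):
    """Flag queries whose noisy answer reaches a noisy threshold.

    Stops after `max_positives` flags; unprocessed queries are reported as 0.
    """
    noisy_threshold = threshold + laplace(rng, scale_threshold)  # drawn once
    flags = [0] * len(answers)
    positives = 0
    for i, answer in enumerate(answers):
        if answer + laplace(rng, scale_query) >= noisy_threshold:  # fresh noise per query
            flags[i] = 1
            positives += 1
            if positives >= max_positives:
                break
    return flags

def f1_score(truth, flags):
    """F1 score between two binary vectors (0.0 when there are no true positives)."""
    tp = sum(1 for t, f in zip(truth, flags) if t == 1 and f == 1)
    fp = sum(1 for t, f in zip(truth, flags) if t == 0 and f == 1)
    fn = sum(1 for t, f in zip(truth, flags) if t == 1 and f == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def utility_oracle(truth, threshold, scale_threshold, scale_query, max_positives, runs, seed=0):
    """Average F1 over `runs` executions, each with the queries in a fresh random order."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        order = list(range(len(truth)))
        rng.shuffle(order)
        shuffled = [truth[i] for i in order]
        flags = sparse_vector(shuffled, threshold, scale_threshold,
                              scale_query, max_positives, rng)
        total += f1_score(shuffled, flags)
    return total / runs
```

Averaging over shuffled orders matters because the mechanism halts once the bound on positive answers is reached, so true positives that appear late in the sequence can be missed.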
In this example we were able to compute the Pareto front of Alg. 2 using a simple grid-search procedure on a low-dimensional hyperparameter space. However, this approach might not be computationally feasible in practical applications with more hyperparameters, especially in cases where each evaluation of the utility oracle requires training a machine learning model, which motivates the DPareto algorithm.
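Once the oracles have been evaluated on a grid, extracting the empirical Pareto front is a small filtering step. The sketch below assumes each evaluation is summarized as a pair in which lower is better in both coordinates (for instance, privacy loss and one minus utility); that convention and the function name are illustrative assumptions.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point q dominates p if q is no worse in both coordinates and strictly
    better in at least one (lower is better in both coordinates).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)
```

The quadratic scan is adequate for a few thousand grid points; the cost that makes grid search impractical is evaluating the oracles, not filtering their outputs.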
We now illustrate how DPareto can help to efficiently learn the privacy–utility tradeoff. In this example we initialize the GP models with hyperparameter pairs . The values of are sampled uniformly between 1 and 30. The values of are sampled uniformly in the interval on a logarithmic scale. The oracle values for and the utility are computed for the selected samples using the same oracles as above. The predicted means of the surrogate models for both oracles are shown in Fig. 6. We observe that both models achieve a reasonably good prediction accuracy when comparing directly to the true values in Fig. 5.
Fig. 6 (center right) shows the exact Pareto front of the problem, along with the output values of the initial sample and the corresponding empirical Pareto front. The empirical Pareto front sits close to the true one, which indicates that the selection of points is already quite good. The goal of DPareto is to select new points in the input domain whose outputs will bring the empirical front closer to the true one. The HVPoI function is used for this aim. Fig. 6 (far right) shows the values of the HVPoI for all pairs. The maximizer of this function (marked with a star) is used as the next location to evaluate the oracles. Note that given the current models the HVPoI is making a sensible choice, selecting a point where is predicted to have a medium value while and is predicted to be low, possibly looking to improve the gap in the lower right corner in the Pareto front plot.
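As its name suggests, HVPoI scores candidates through the hypervolume dominated by the Pareto front. The sketch below shows only that deterministic ingredient in two dimensions: the area dominated by a set of points relative to a reference (anti-ideal) point, and the gain a candidate output would bring. The probabilistic weighting by the GP models is not shown, and the function names and the lower-is-better convention are assumptions.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded by `reference` (lower is better)."""
    ref_x, ref_y = reference
    # Only points strictly better than the reference in both coordinates contribute.
    inside = sorted(p for p in points if p[0] < ref_x and p[1] < ref_y)
    area, best_y = 0.0, ref_y
    for x, y in inside:
        if y < best_y:  # dominated points never improve best_y and are skipped
            area += (ref_x - x) * (best_y - y)
            best_y = y
    return area

def hypervolume_improvement(points, candidate, reference):
    """Increase in dominated area if `candidate` were added to `points`."""
    return hypervolume_2d(list(points) + [candidate], reference) - hypervolume_2d(points, reference)
```

A candidate that fills a gap in the current front yields a positive improvement, while a candidate dominated by existing points yields zero.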
A.1 Privacy Proof
This section provides a proof of the privacy bound for Alg. 2 used to implement the privacy oracle . The proof is based on observing that our Alg. 2 is just a simple reparametrization of [Lyu et al., 2017, Alg. 7] where some of the parameters have been fixed upfront. For concreteness, we reproduce [Lyu et al., 2017, Alg. 7] as Alg. 3 below. The result then follows from a direct application of [Lyu et al., 2017, Thm. 4], which shows that Alg. 3 is DP.
Appendix B Differentially Private Stochastic Optimization Algorithms
Stochastic gradient descent (SGD) is a simplification of gradient descent in which, on each iteration, the gradient is not computed on the entire dataset but is instead estimated from a single example (or a small batch of examples) picked uniformly at random (without replacement) Bottou [2010]. Adam Kingma and Ba [2015] is a first-order gradient-based optimization algorithm for stochastic objective functions, based on adaptive estimates of lower-order moments.
As a privatized version of SGD, we use a minibatched implementation with clipped gradients and Gaussian noise similar to that of Abadi et al. [2016]. The pseudocode is given in Alg. 4; the only difference with the algorithm in Abadi et al. [2016] is that we sample minibatches of a fixed size without replacement instead of using minibatches obtained from Poisson sampling with a fixed probability. In the pseudocode, the clipping function acts as the identity if the norm of its argument does not exceed the clipping norm, and otherwise rescales its argument so that its norm equals the clipping norm. This clipping operation ensures that the norm of every clipped gradient is at most the clipping norm, so that the sensitivity of any gradient to a change in one datapoint is always bounded.
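The gradient perturbation step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than a transcription of Alg. 4: the function names are hypothetical, the Euclidean norm is assumed for clipping, and `noise_std` is taken to be the standard deviation of the noise added to each coordinate of the summed gradient (how Alg. 4 scales the noise relative to the clipping norm is not reproduced here).

```python
import math
import random

def clip(vector, max_norm):
    """Identity if the Euclidean norm is at most max_norm, else rescale to norm max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= max_norm:
        return list(vector)
    return [x * max_norm / norm for x in vector]

def sample_minibatch(dataset_size, batch_size, rng):
    """Indices of a fixed-size minibatch drawn without replacement."""
    return rng.sample(range(dataset_size), batch_size)

def private_sgd_step(weights, per_example_grads, learning_rate, max_norm, noise_std, rng):
    """One noisy step: clip each per-example gradient, sum, add Gaussian noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    updated = []
    for j, w in enumerate(weights):
        noisy_sum = sum(g[j] for g in clipped) + rng.gauss(0.0, noise_std)
        updated.append(w - learning_rate * noisy_sum / batch_size)
    return updated
```

Clipping is applied per example, before summation, because the privacy analysis needs a bound on how much any single datapoint can change the quantity that receives noise.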
Our privatized version of Adam is given in Alg. 5, which uses the same gradient perturbation technique as stochastic gradient descent. Here the squared-vector notation denotes the vector obtained by squaring each coordinate. Adam uses three numerical constants that are not present in SGD. To simplify our experiments we fixed those constants to the defaults suggested in Kingma and Ba [2015].
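For reference, a single Adam update on an already perturbed gradient might look like the sketch below. The constants use the defaults from Kingma and Ba [2015] (0.9, 0.999 and 1e-8); the argument names are illustrative, and the gradient passed in is assumed to be the clipped, noised minibatch gradient produced by the same perturbation step as in private SGD.

```python
import math

def adam_step(weights, grad, first_moment, second_moment, step,
              learning_rate, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `step` is the 1-based iteration count used for bias correction."""
    new_weights, new_first, new_second = [], [], []
    for w, g, m, v in zip(weights, grad, first_moment, second_moment):
        m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared coordinates
        m_hat = m / (1.0 - beta1 ** step)        # bias-corrected estimates
        v_hat = v / (1.0 - beta2 ** step)
        new_weights.append(w - learning_rate * m_hat / (math.sqrt(v_hat) + eps))
        new_first.append(m)
        new_second.append(v)
    return new_weights, new_first, new_second
```

Because the moment estimates are computed only from the noisy gradient, they add no privacy cost beyond that of the gradient itself.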
Appendix C Experimental Setup
In all our experiments we used as the anti-ideal point in DPareto.
C.1 Optimization Domains
Table 2 gives the optimization domain for each of the different experiments.
Algorithm | Dataset | Epochs | Lot Size | Learning Rate | Noise Variance | Clipping Norm
LogReg+SGD | adult
LogReg+Adam | adult
SVM+SGD | adult
MLP1+SGD | mnist
MLP2+SGD | mnist
C.2 Random Sampling Distributions
The random sampling distributions for experiments with both the mnist and adult datasets were chosen to make the results of random sampling as favorable as possible. The distributions were chosen both by reviewing the literature (namely, Abadi et al. [2016] and McMahan et al. [2018]) and from the authors’ experience in training these differentially private models. We note that these distributions generated significantly better points (with regard to characterizing the Pareto front) than naïvely sampling from the uniform distribution.
Table 3 lists the distributions for the hyperparameters used in the mnist experiments, and Table 4 lists the distributions for the hyperparameters used in the adult experiments.
Hyperparameter | Base Distribution | Parameters | Round-to-Int | Acceptable Range
Epochs | Uniform | | True |
Lot Size | Normal | | True |
Learning Rate | Shifted Exponential | | False |
Noise Variance | Shifted Exponential | | False |
Clipping Norm | Shifted Exponential | | False |

Hyperparameter | Base Distribution | Parameters | Round-to-Int | Acceptable Range
Epochs | Uniform | | True |
Lot Size | Normal | | True |
Learning Rate | Shifted Exponential | | False |
Noise Variance | Shifted Exponential | | False |
Clipping Norm | Shifted Exponential | | False |
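The columns of the sampling tables suggest a simple recipe: draw from a base distribution, optionally round to an integer, and keep the draw only if it lies in the acceptable range. The sketch below is one plausible reading of that recipe. Resampling on rejection is an assumption (the values could equally be clipped to the range), and all function names and numeric parameters in the usage lines are hypothetical placeholders, not the values used in the experiments.

```python
import random

def uniform(low, high):
    return lambda rng: rng.uniform(low, high)

def normal(mean, std):
    return lambda rng: rng.gauss(mean, std)

def shifted_exponential(shift, rate):
    """Exponential distribution with the given rate, translated to start at `shift`."""
    return lambda rng: shift + rng.expovariate(rate)

def sample_in_range(draw, low, high, round_to_int, rng, max_tries=10_000):
    """Draw until the (optionally rounded) value falls inside [low, high]."""
    for _ in range(max_tries):
        value = draw(rng)
        if round_to_int:
            value = int(round(value))
        if low <= value <= high:
            return value
    raise RuntimeError("no sample fell inside the acceptable range")

# Hypothetical usage with placeholder parameters:
rng = random.Random(0)
lot_size = sample_in_range(normal(128, 64), 16, 512, True, rng)
learning_rate = sample_in_range(shifted_exponential(0.001, 10.0), 0.001, 1.0, False, rng)
```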
Appendix D Further Experimental Results
DPareto also allows us to gather information about the potential variability of the recovered Pareto front. To that end, recall that in our experiments we implemented the utility oracle by repeatedly running the algorithm with a fixed choice of hyperparameters, and then reported the average utility across runs. Using these same runs we can also take the best and worst utilities observed for each choice of hyperparameters. Fig. 7 displays the Pareto fronts recovered from considering the best and worst runs in addition to the Pareto front obtained from the average over runs. In general we observe higher variability in utility in the high-privacy regime (i.e. small ε), which is to be expected since more privacy is achieved by increasing the variance of the noise added to the computation. These types of plots can be useful to decision-makers who want to get an idea of what variability can be expected in practice from a particular choice of hyperparameters.
D.1 Grid Search
For the grid search experiments we defined the parameter ranges as the limits of the parameter values from our random sampling experiment setup (see Table 4). We tried grid size 3 (three values for each of the five hyperparameters), which corresponds to 3^5 = 243 points (approximately the same number of points as DPareto uses), and grid size 4, which corresponds to 4^5 = 1024 points (4 times more than what we used for DPareto). As can be seen in Fig. 8, DPareto clearly outperformed grid search.
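The point counts follow directly from taking a Cartesian product over five hyperparameters. The sketch below builds such a grid; the linear spacing and the numeric ranges are placeholder assumptions for illustration (they are not the ranges of Table 4), and the function names are hypothetical.

```python
import itertools

def linear_grid(low, high, size):
    """`size` evenly spaced values from low to high inclusive (size >= 2)."""
    step = (high - low) / (size - 1)
    return [low + i * step for i in range(size)]

def full_grid(ranges, size):
    """All combinations of `size` values per hyperparameter, as a list of dicts."""
    axes = [linear_grid(low, high, size) for low, high in ranges.values()]
    return [dict(zip(ranges, combo)) for combo in itertools.product(*axes)]

# Placeholder ranges, not the ones used in the experiments.
ranges = {
    "epochs": (1, 64),
    "lot_size": (8, 512),
    "learning_rate": (0.001, 0.1),
    "noise_variance": (0.1, 16.0),
    "clipping_norm": (0.1, 12.0),
}
print(len(full_grid(ranges, 3)), len(full_grid(ranges, 4)))
```

The exponential growth in the number of dimensions is what limits grid search: adding a sixth hyperparameter at grid size 4 would already require 4096 evaluations.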
Appendix E Implementation Details
E.1 DPareto
Hyperparameter optimization was implemented with the GPflowOpt library [Knudde et al., 2017], which offers GP-based Bayesian optimization as well as the HVPoI acquisition function.
Transformed Output Domains
The output domain for accuracy is [0, 1], which would clearly not be modeled well by a GP that models outputs on the entire real line. The output domain for privacy is on the real line, but it is expressed on a logarithmic scale. Hence, in both cases we transform the outputs, so that we are modeling a GP with Gaussian noise in the transformed space. For accuracy, we use a logit transform, logit(x) = log(x) − log(1 − x), which transforms values from [0, 1] to (−∞, ∞). For privacy, we use a simple log transform. Note that it is possible to use Warped GPs [Snelson et al., 2004], where the transformation is learnt. Concretely this amounts to adding an additional Jacobian term to the likelihood function that takes the transformation into account. The advantage of this approach is that the form of both the covariance matrix and the nonlinear transformation are learnt simultaneously under the same probabilistic framework. However, for simplicity and efficiency we choose to use fixed transformations.
E.2 Machine Learning Algorithms and Moments Accountant
Machine learning models used in the paper are implemented with Apache MXNet [Chen et al., 2015]. We made use of the high-level Gluon API whenever possible. However, the privacy accountant implementation that we used (see [Wang et al., 2019]) required low-level changes to the definitions of the models. To keep the MXNet execution graph continuous, and thus ensure fast evaluation of the model, we reverted to pure MXNet model definitions. Even though this approach requires much more effort to implement the models themselves, it allows for more fine-grained control of how the model is executed, and provides a natural way of implementing privacy accounting.