Many data analysis pipelines are adaptive: the choice of which analysis to run next depends on the outcome of previous analyses. Common examples include variable selection for regression problems and hyper-parameter optimization in large-scale machine learning problems: in both cases, common practice involves repeatedly evaluating a series of models on the same dataset. Unfortunately, this kind of adaptive re-use of data invalidates many traditional methods of avoiding over-fitting and false discovery, and has been blamed in part for the recent flood of non-reproducible findings in the empirical sciences (Gelman and Loken, 2014).
There is a simple way around this problem: don’t re-use data. This idea suggests a baseline called data splitting: to perform $k$ analyses on a dataset, randomly partition the dataset into $k$ disjoint parts, and perform each analysis on a fresh part. The standard “holdout method” is the special case of $k = 2$. Unfortunately, this natural baseline makes poor use of data: in particular, the data requirements of this method grow linearly with the number of analyses $k$ to be performed.
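As a rough illustration of the data-splitting baseline, the following Python sketch partitions a dataset into k disjoint parts and answers each adaptively chosen query on a fresh part. The function names (`split_into_parts`, `data_splitting`) and the toy threshold analyst are hypothetical and are not taken from the paper.

```python
import random


def split_into_parts(dataset, k, seed=0):
    """Randomly partition `dataset` into k disjoint, near-equal parts."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    # Part i takes every k-th record starting at offset i, so parts are disjoint.
    return [shuffled[i::k] for i in range(k)]


def data_splitting(dataset, analyst, k, seed=0):
    """Answer k adaptively chosen queries, each on its own fresh part."""
    answers = []
    for part in split_into_parts(dataset, k, seed):
        query = analyst(answers)  # the next query may depend on earlier answers
        answers.append(sum(query(x) for x in part) / len(part))
    return answers


def analyst(answers):
    """Hypothetical analyst: the next threshold is the previous answer."""
    t = 0.5 if not answers else answers[-1]
    return lambda x: float(x <= t)


data = [i / 1000 for i in range(1000)]
answers = data_splitting(data, analyst, k=4)
```

Each query sees only a 1/k fraction of the records, which is why the data requirement of this baseline grows linearly with k for a fixed per-query accuracy.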
A recent literature starting with Dwork et al. (2015b) shows how to give a significant asymptotic improvement over this baseline via a connection to differential privacy: rather than computing and reporting exact sample quantities, perturb these quantities with noise. This line of work established a powerful transfer theorem, which informally says that any analysis that is simultaneously differentially private and accurate in-sample will also be accurate out-of-sample. The best analysis of this technique shows that for a broad class of analyses and a target accuracy goal, the data requirements grow only with the square root of the number of analyses, $\sqrt{k}$, a quadratic improvement over the baseline (Bassily et al., 2016). Moreover, it is known that in the worst case, this cannot be improved asymptotically (Hardt and Ullman, 2014; Steinke and Ullman, 2015).
Unfortunately, thus far this literature has had little impact on practice. One major reason for this is that although the more sophisticated techniques from this literature give asymptotic improvements over the sample-splitting baseline, the concrete bounds do not actually improve on the baseline until the dataset is enormous. This remains true even after optimizing the constants that arise from the arguments of Dwork et al. (2015b) or Bassily et al. (2016), and appears to be a fundamental limitation of their proof techniques (Rogers et al., 2019).
In this paper, we give a new proof of the transfer theorem connecting differential privacy and in-sample accuracy to out-of-sample accuracy. Our proof is based on a simple insight that arises from taking a Bayesian perspective, and in particular yields an improved concrete bound that beats the sample-splitting baseline at dramatically smaller dataset sizes compared to prior work. In fact, at reasonable dataset sizes, the magnitude of the improvement arising from our new theorem is significantly larger than the improvement between the bounds of Bassily et al. (2016) and Dwork et al. (2015b): see Figure 1.
1.1 Proof Techniques
Consider an unknown data distribution over a data-domain , and a dataset consisting of i.i.d. draws from . It is a folklore observation (attributed to Frank McSherry) that if a predicate is selected by an -differentially private algorithm acting on , then it will generalize in expectation (or have low bias) in the sense that . But bounds on bias are not enough to yield confidence intervals (except through Markov’s inequality), and so prior work has focused on strengthening the above observation into a high probability bound. For small , the optimal bound has the asymptotic form: (Bassily et al., 2016). Note that this bound does not refer to the estimated answers supplied to the data analyst: it says only that a differentially private data analyst is unlikely to be able to find a query whose average value on the dataset differs substantially from its expectation. Pairing this with a simultaneous high probability bound on the in-sample accuracy of a mechanism—that it supplies answers such that with high probability the empirical error is small: —yields a bound on out-of-sample accuracy via the triangle inequality.
Dwork et al. (2015b) proved their high probability bound via a direct computation on the moments of empirical query values, but this technique was unable to achieve the optimal rate. Bassily et al. (2016) proved a bound with the optimal rate by introducing the ingenious monitor technique. This important technique has subsequently found other uses (Steinke and Ullman, 2017; Nissim and Stemmer, 2019; Feldman and Vondrak, 2018), but is a heavy hammer that seems unavoidably to yield large constant overhead, even after numeric optimization (Rogers et al., 2019).
We take a fundamentally different approach by directly providing high probability bounds on the out-of-sample accuracy of mechanisms that are both differentially private and accurate in-sample. Our elementary approach is motivated by the following thought experiment: in actuality, the dataset is fixed before any interaction with the mechanism begins. However, imagine that after the entire interaction with the mechanism is complete, the dataset is resampled from the posterior distribution on datasets conditioned on the output of the mechanism. This thought experiment doesn’t alter the joint distribution on datasets and outputs, and so any in-sample accuracy guarantees that the mechanism has continue to hold under this hypothetical re-sampling experiment. But because the empirical values of the queries on the re-sampled dataset are likely to be close to their expected values over the posterior, the only way the mechanism can promise to be sample-accurate with high probability is if it provides answers that are close to their expected values over the posterior distribution with high probability.
This focuses attention on the posterior distribution on datasets induced by differentially private transcripts. But it is not hard to show that a consequence of differential privacy is that the posterior expectation of any query must be close to its expectation over the data distribution with high probability. In contrast to prior work, this argument directly leverages high-probability in-sample accuracy guarantees of a private mechanism to derive high-probability out-of-sample guarantees, without the need for additional machinery like the monitor argument of (Bassily et al., 2016).
1.2 Further Related Work
The study of “adaptive data analysis” was initiated by Dwork et al. (2015b, c), who provided upper bounds via a connection to differential privacy, and Hardt and Ullman (2014), who provided lower bounds via a connection to fingerprinting codes. The upper bounds were subsequently strengthened by Bassily et al. (2016), and the lower bounds by Steinke and Ullman (2015) to be (essentially) matching, asymptotically. The upper bounds were optimized by Rogers et al. (2019), whose bounds we use in our comparisons. Subsequent work proved transfer theorems related to other quantities like description length bounds (Dwork et al., 2015a) and compression schemes (Cummings et al., 2016), and expanded the types of analyses whose generalization properties we could reason about via a connection to a quantity called approximate max information (Dwork et al., 2015a; Rogers et al., 2016). Feldman and Steinke (2017, 2018) gave improved methods that could guarantee out-of-sample accuracy bounds that depended on query variance. Neel and Roth (2018) extend the transfer theorems from this literature to the related problem of adaptive data gathering, which was identified by Nie et al. (2018). Ligett and Shenfeld (2019) give an algorithmic stability notion they call local statistical stability (also defined with respect to a posterior data distribution) that they show asymptotically characterizes the ability of mechanisms to offer high probability out-of-sample generalization guarantees for linear queries. A related line of work initiated by Russo and Zou (2016) and extended by Xu and Raginsky (2017) starts with weaker assumptions on the mechanism (mutual information bounds), and derives weaker conclusions (bounds on bias, rather than high probability generalization guarantees).
A more recent line of work aims at mitigating the fact that the worst-case bounds deriving from transfer theorems do not give non-trivial guarantees on reasonably sized datasets. Zrnic and Hardt (2019) show that better bounds can be derived under the assumption that the data analyst is restricted in various ways to not be fully adaptive. Feldman et al. (2019) showed that overfitting by a classifier because of test-set re-use is mitigated in multi-label prediction problems, compared to binary prediction problems. Rogers et al. (2019) gave a method for certifying the correctness of heuristically guessed confidence intervals, which they show often out-perform the theoretical guarantees by orders of magnitude.
Finally, Elder (2016b, a) proposed a Bayesian reformulation of the adaptive data analysis problem. In the model of (Elder, 2016b), the data distribution is assumed to itself be drawn from a prior that is commonly known to the data analyst and mechanism. In contrast, we work in the standard adversarial setting originally introduced by Dwork et al. (2015b) in which the mechanism must offer guarantees for worst case data distributions and analysts, and use the Bayesian view purely as a proof technique.
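Let $\mathcal{X}$ be an abstract data domain, and let $\mathcal{P}$ be an arbitrary distribution over $\mathcal{X}$. A dataset of size $n$ is a collection of $n$ data records: $S = \{S_i\}_{i=1}^{n} \in \mathcal{X}^n$. We study datasets sampled from $\mathcal{P}$: $S \sim \mathcal{P}^n$. We will write $S$ to denote the random variable and $s$ for realizations of this random variable. A linear query is a function $q : \mathcal{X}^n \to [0,1]$ that takes the following empirical average form when acting on a data set $S \in \mathcal{X}^n$:
$$q(S) = \frac{1}{n} \sum_{i=1}^{n} q(S_i).$$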
We will be interested in estimating the expectations of linear queries over . Abusing notation, given a distribution over datasets, we write to denote the expectation of over datasets drawn from , and write to denote a datapoint sampled uniformly at random from a dataset . Note that for linear queries we have:
We note that for linear queries, when the dataset distribution , we have , which we write as when the notation is clear from context. However, the more general definition will be useful because we will need to evaluate the expectation of over other (non-product) distributions over datasets in our arguments, and we will generalize beyond linear queries in Appendices A.1 and A.2.
Given a family of queries , a statistical estimator is a (possibly stateful) randomized algorithm parameterized by a dataset that interactively takes as input a stream of queries , and provides answers . An analyst is an arbitrary randomized algorithm that generates a stream of queries and receives a stream of answers (which can inform the next queries it generates). When an analyst interacts with a statistical estimator, they generate a transcript of their interaction where is the space of all transcripts. Throughout we write to denote the transcript’s random variable and for its realizations.
The interaction is summarized in Algorithm 1, and we write Interact to refer to it. When and are clear from context, we will abbreviate this notation and write simply . When we refer to an indexed query , this is implicitly a function of the transcript . Given a transcript , write to denote the posterior distribution on datasets conditional on : . Note that will no longer generally be a product distribution. We will be interested in evaluating uniform accuracy bounds, which control the worst-case error over all queries:
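The interaction between an analyst and a statistical estimator can be sketched as a simple loop. This is only an illustrative reading of the description above: the names `interact`, `empirical_mechanism` and `threshold_analyst` are hypothetical, and the transcript is represented here as a list of (query, answer) pairs.

```python
def interact(mechanism, analyst, dataset, k):
    """Run k rounds of query and answer, and return the transcript."""
    transcript = []  # list of (query, answer) pairs
    for _ in range(k):
        query = analyst(transcript)         # may depend on all earlier rounds
        answer = mechanism(dataset, query)  # may be randomized or stateful
        transcript.append((query, answer))
    return transcript


def empirical_mechanism(dataset, query):
    """Hypothetical non-private estimator: the exact empirical average."""
    return sum(query(x) for x in dataset) / len(dataset)


def threshold_analyst(transcript):
    """Hypothetical analyst: the next threshold is the most recent answer."""
    t = 0.5 if not transcript else transcript[-1][1]
    return lambda x: float(x <= t)


data = [i / 100 for i in range(100)]
transcript = interact(empirical_mechanism, threshold_analyst, data, k=3)
```

Because each query is computed from the transcript so far, the indexed queries are functions of the transcript, which is the source of the adaptivity the analysis has to handle.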
satisfies -sample accuracy if for every data analyst and every data distribution ,
We say satisfies -distributional accuracy if for every data analyst and every data distribution ,
We will be interested in interactions that satisfy differential privacy.
Definition 2.2 (Dwork et al. (2006)).
Two datasets are neighbors if they differ in at most one coordinate. An interaction satisfies -differential privacy if for all data analysts , pairs of neighboring datasets , and for all events :
where the operator denotes either a probability density or a probability mass. If satisfies -differential privacy, we will also say that satisfies -differential privacy.
We introduce a novel quantity that will be crucial to our argument: it captures the effect of the transcript on the change in the expectation of a query contained in the transcript.
An interaction is called -posterior sensitive if for every data distribution :
3 An Elementary Proof of the Transfer Theorem
3.1 A General Transfer Theorem
In this section we prove a general transfer theorem for sample accurate mechanisms with low posterior sensitivity. In Section 3.2 we prove that differentially private mechanisms have low posterior sensitivity.
Theorem 3.1 (General Transfer Theorem).
Suppose that is an -sample accurate, -posterior sensitive interaction. Then for every it also satisfies:
i.e. it is -distributionally accurate for and .
The theorem follows easily from a change in perspective driven by an elementary observation. Imagine that after the interaction is run and results in a transcript , the dataset is resampled from its posterior distribution . This does not change the joint distribution on datasets and transcripts. This simple claim is formalized below: its elementary proof appears in Appendix B.
Lemma 3.2 (Bayesian Resampling Lemma).
Let be any event. Then:
The change in perspective suggested by the resampling lemma makes it easy to see why the following must be true: any sample-accurate mechanism must in fact be accurate with respect to the posterior distribution it induces. This is because if it can first commit to answers, and guarantee that they are sample-accurate after the dataset is resampled from the posterior, the answers it committed to must have been close to the posterior means, because it is likely that the empirical answers on the resampled dataset will be. This argument is generic and does not use differential privacy.
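The resampling claim can be checked exactly on a toy example. The sketch below uses hypothetical choices throughout (two Bernoulli records, and a mechanism that reports the count plus a fair-coin offset) and exact rational arithmetic; it compares the joint distribution of (dataset, transcript) with the joint distribution obtained after resampling the dataset from its posterior given the transcript.

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 3)  # hypothetical Bernoulli parameter of the data distribution
n = 2               # dataset size
datasets = list(product([0, 1], repeat=n))


def prior(s):
    """Probability of dataset s under n i.i.d. Bernoulli(p) draws."""
    out = Fraction(1)
    for x in s:
        out *= p if x == 1 else 1 - p
    return out


def mech(s):
    """Transcript distribution: the count of ones plus a fair-coin offset."""
    return {sum(s): Fraction(1, 2), sum(s) + 1: Fraction(1, 2)}


# Joint distribution of (dataset, transcript) in the real experiment.
joint = {(s, pi): prior(s) * pr for s in datasets for pi, pr in mech(s).items()}

transcripts = sorted({pi for (_, pi) in joint})
marginal = {pi: sum(joint.get((s, pi), Fraction(0)) for s in datasets)
            for pi in transcripts}
posterior = {pi: {s: joint.get((s, pi), Fraction(0)) / marginal[pi]
                  for s in datasets}
             for pi in transcripts}

# Joint distribution of (resampled dataset, transcript): draw (s, pi) as
# before, then replace s by a fresh draw from the posterior given pi.
resampled = {}
for (s, pi), pr in joint.items():
    for s2 in datasets:
        key = (s2, pi)
        resampled[key] = resampled.get(key, Fraction(0)) + pr * posterior[pi][s2]

same = all(resampled.get(key, 0) == joint.get(key, 0)
           for key in set(joint) | set(resampled))
```

In this toy setting `same` evaluates to `True`, so any event has the same probability before and after resampling.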
Suppose that is -sample accurate. Then for every it also satisfies:
Denote by . Given and , and expanding the definition of we get:
Here, inequality (1) follows from Markov’s inequality, inequality (2) follows from the fact that , and equality (3) follows from the Bayesian Resampling Lemma (Lemma 3.2). Repeating this argument for yields a symmetric bound, so by combining the two with the guarantee of -sample accuracy we get,
Because sample accuracy implies accuracy with respect to the posterior distribution, together with a bound on posterior sensitivity, the transfer theorem follows immediately:
3.2 A Transfer Theorem for Differential Privacy
In this section we prove a transfer theorem for differentially private mechanisms by demonstrating that they have low posterior sensitivity and applying our general transfer theorem.
We here show that differentially private mechanisms are posterior-sensitive for linear queries. In the Appendix we extend this argument to low-sensitivity and optimization queries.
If is -differentially private, then for any data distribution , any analyst , and any constant :
i.e. it is -posterior sensitive for every data analyst , where and .
Given a transcript , let . Define for an :
Fix any . Suppose that . We must have that either
or . Without loss of generality, assume
Let be the random variable obtained by first sampling and then sampling uniformly at random. We compare the probability measure of under the joint distribution on and with its corresponding measure under the product distribution of and :
On the other hand, using the definition of -differential privacy (see Lemma C.1 for the elementary derivation of the first inequality):
This is a contradiction for . ∎
Since differential privacy is closed under post-processing, this claim can be generalized beyond queries contained in the transcript to any query generated as a function of the transcript.
In the case of -differential privacy, choosing , the claim holds for every query with probability 1.
Combined with our general transfer theorem (Theorem 3.1), this directly yields a transfer theorem for differential privacy:
Theorem 3.5 (Transfer Theorem for -Differential Privacy).
Suppose that is -differentially private and -sample accurate for linear queries. Then for every analyst and it also satisfies:
i.e. it is -distributionally accurate for and .
As we will see in Section 4, the Gaussian mechanism (and many other differentially private mechanisms) has a sample accuracy bound that depends only on the square root of the log of both and . Thus, despite the Markov-like term in the above transfer theorem, together with the sample accuracy bounds of the Gaussian mechanism, it yields Chernoff-like concentration.
4 Applications: The Gaussian Mechanism
We now apply our new transfer theorem to derive the concrete bounds that we plotted in Figure 1. The Gaussian mechanism is extremely simple and has only a single parameter $\sigma$: for each query $q_i$ that arrives, the Gaussian mechanism returns the answer $a_i = q_i(S) + Z_i$, where $Z_i \sim N(0, \sigma^2)$ is drawn from the Gaussian distribution with mean $0$ and standard deviation $\sigma$. First, we recall the differential privacy properties of the Gaussian mechanism.
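A minimal Python sketch of the Gaussian mechanism for linear queries follows, assuming the noise parameter is a standard deviation and each query maps a record to [0, 1]. The class name `GaussianMechanism` and its interface are hypothetical.

```python
import random


class GaussianMechanism:
    """Answer linear queries with the empirical mean plus Gaussian noise."""

    def __init__(self, dataset, sigma, seed=None):
        self.dataset = list(dataset)
        self.sigma = sigma              # noise standard deviation
        self.rng = random.Random(seed)  # private source of randomness

    def answer(self, query):
        empirical = sum(query(x) for x in self.dataset) / len(self.dataset)
        # random.gauss takes the mean and the standard deviation.
        return empirical + self.rng.gauss(0.0, self.sigma)


data = [i / 100 for i in range(100)]
query = lambda x: float(x < 0.3)  # hypothetical linear query
mech = GaussianMechanism(data, sigma=0.05, seed=0)
noisy = mech.answer(query)
```

Fresh, independent noise is drawn for every query, so the sample error of each answer is exactly the noise added to it.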
Theorem 4.1 (Bun and Steinke (2016)).
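When used to answer $k$ linear queries, the Gaussian mechanism with parameter $\sigma$ satisfies $\rho$-zCDP for $\rho = \frac{k}{2 n^2 \sigma^2}$. A consequence of this is that for every $\delta > 0$, it satisfies $(\epsilon, \delta)$-differential privacy for:
$$\epsilon = \frac{k}{2 n^2 \sigma^2} + \frac{\sqrt{2 k \log(1/\delta)}}{n \sigma}.$$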
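The privacy accounting can be sketched numerically under standard facts about concentrated differential privacy, stated here as assumptions: a linear query with values in [0, 1] has sensitivity 1/n, Gaussian noise with standard deviation sigma on a sensitivity-Δ query gives (Δ²/(2σ²))-zCDP, zCDP parameters add up under composition, and ρ-zCDP implies (ρ + 2√(ρ log(1/δ)), δ)-differential privacy. The function names and the example parameter values are hypothetical.

```python
import math


def zcdp_rho(sigma, n, k):
    """zCDP parameter for k linear queries (sensitivity 1/n each)."""
    return k / (2.0 * n**2 * sigma**2)


def zcdp_to_dp_epsilon(rho, delta):
    """Epsilon for which rho-zCDP implies (epsilon, delta)-differential privacy."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


# Hypothetical parameters: 10,000 records, 100 queries, noise scale 0.01.
n, k, sigma, delta = 10_000, 100, 0.01, 1e-6
rho = zcdp_rho(sigma, n, k)
epsilon = zcdp_to_dp_epsilon(rho, delta)
```

Since δ is a free parameter of the conversion, it can later be tuned jointly with the noise scale when optimizing a final accuracy bound.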
It is also easy to see that the sample-accuracy of the Gaussian mechanism is characterized by the CDF of the Gaussian distribution:
For any , the Gaussian mechanism with parameter is -sample accurate for:
Above, is the complementary error function.
For a query , write where . The sample error is . We have that . is the value that solves the equation ∎
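One simple way to compute a sample accuracy level numerically is a union bound over the k noise draws. This is an assumption made for illustration and may be slightly looser than the exact expression in the theorem: it picks the error level so that each Gaussian draw exceeds it in absolute value with probability β/k. The function name `gaussian_sample_accuracy` is hypothetical.

```python
from statistics import NormalDist


def gaussian_sample_accuracy(sigma, k, beta):
    """Error level alpha with Pr[max_i |Z_i| >= alpha] <= beta for k N(0, sigma^2) draws."""
    # Pr[|Z| >= alpha] = 2 * (1 - Phi(alpha / sigma)); set this to beta / k.
    return sigma * NormalDist().inv_cdf(1.0 - beta / (2.0 * k))


# Hypothetical parameters: noise scale 0.01, 100 queries, failure probability 0.05.
sigma, k, beta = 0.01, 100, 0.05
alpha = gaussian_sample_accuracy(sigma, k, beta)
```

The resulting error level grows only like the square root of log(k/β), which is the root-logarithmic dependence that the discussion of the Gaussian mechanism relies on.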
With these quantities in hand, we can now apply Theorem 3.5 to derive distributional accuracy bounds for the Gaussian mechanism:
Fix a desired confidence parameter . When is set optimally, the Gaussian mechanism can be used to answer linear queries while satisfying -distributional accuracy, where is the solution to the following unconstrained minimization problem:
Using Theorem 3.5 and fixing and , we have that an -sample accurate, -differentially private mechanism is -distributionally accurate for and where can be an arbitrary parameter. For any fixed value of , we can take , and see that we obtain -distributional accuracy where . The theorem then follows from plugging in the privacy bound from Theorem 4.1, the sample accuracy bound from Theorem 4.2, and optimizing over the free variables and . ∎
We have given a new proof of the transfer theorem for differential privacy that has several appealing properties. Besides being simpler than previous arguments, it achieves substantially better concrete bounds than previous transfer theorems, and uncovers new structural insights about the role of differential privacy and sample accuracy. In particular, sample accuracy serves to guarantee that the reported answers are close to their posterior means, and differential privacy serves to guarantee that the posterior means are close to their true answers. This focuses attention on the posterior data distribution as a key quantity of interest, which we expect will be fruitful in future work. In particular, it may shed light on what makes certain data analysts overfit less than worst-case bounds would suggest: because they choose queries whose posterior means are closer to the prior than the worst-case query.
There seems to be one remaining place to look for improvement in our transfer theorem: Lemmas 3.3 and 3.4 both exhibit a Markov-like tradeoff between a parameter and and respectively. Although the dependence on and in our ultimate bounds is only root-logarithmic, it would still yield an improvement if this Markov-like dependence could be replaced with a Chernoff-like dependence. It is possible to do this for the parameter: we give an alternative (and even simpler) proof of the transfer theorem for -differential privacy which shows that posterior distributions induced by private mechanisms exhibit Chernoff-like concentration, in Appendix D. But the only way we know to extend this argument to -differential privacy requires dividing by a factor of , which yields a final theorem that is inferior to Theorem 3.5.
We thank Adam Smith for helpful conversations at an early stage of this work.
- Algorithmic stability for adaptive data analysis. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 1046–1059.
- Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635–658.
- Adaptive learning with robust generalization guarantees. In Conference on Learning Theory, pp. 772–814.
- Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pp. 2350–2358.
- Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 117–126.
- The reusable holdout: preserving validity in adaptive data analysis. Science 349 (6248), pp. 636–638.
- Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284.
- Bayesian adaptive data analysis guarantees from subgaussianity. arXiv preprint arXiv:1611.00065.
- Challenges in bayesian adaptive data analysis. arXiv preprint arXiv:1604.02492.
- The advantages of multiple classes for reducing overfitting from test set reuse. In International Conference on Machine Learning, pp. 1892–1900.
- Generalization for adaptively-chosen estimators via stable median. In Conference on Learning Theory, pp. 728–757.
- Calibrating noise to variance in adaptive data analysis. In Conference On Learning Theory, pp. 535–544.
- Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems, pp. 9747–9757.
- The statistical crisis in science. American Scientist 102 (6), pp. 460.
- Preventing false discovery in interactive data analysis is hard. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 454–463.
- A necessary and sufficient stability notion for adaptive generalization. arXiv preprint arXiv:1906.00930.
- Mitigating bias in adaptive data gathering via differential privacy. In International Conference on Machine Learning (ICML).
- Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pp. 1261–1269.
- Concentration bounds for high sensitivity functions through differential privacy. Journal of Privacy and Confidentiality 9 (1).
- Guaranteed validity for empirical approaches to adaptive data analysis. arXiv preprint arXiv:1906.09231.
- Max-information, differential privacy, and post-selection hypothesis testing. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 487–494.
- Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, pp. 1232–1240.
- Interactive fingerprinting codes and the hardness of preventing false discovery. In Conference on Learning Theory, pp. 1588–1628.
- Subgaussian tail bounds via stability arguments. arXiv preprint arXiv:1701.03493.
- Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pp. 2524–2533.
- Natural analysts in adaptive data analysis. In International Conference on Machine Learning, pp. 7703–7711.
Appendix A Extensions
A.1 Low Sensitivity Queries
Our technique extends easily to reason about arbitrary low sensitivity queries. We only need to generalize our lemma about posterior sensitivity.
A query is called -sensitive if for all pairs of neighbouring datasets : . Note that linear queries are -sensitive.
If is an -differentially private mechanism for answering -sensitive queries, then for any data distribution , analyst , and any constant :
i.e. it is -posterior sensitive for every , where .
We introduce a useful bit of notation: . Notice that and . Given a transcript , let . Denote for any
and for any denote
From the definition of differential privacy:
where , and the last equality follows from the observation that and are identically distributed. Since , independently from , we get that , so
Subtracting from both sides we get
We now choose . Suppose that . We must have that either or . Without loss of generality, assume
But this leads to a contradiction, since
We can combine this Lemma with Lemma 3.3 (which holds for any query type) to get our transfer theorem:
Theorem A.2 (Transfer Theorem for Low Sensitivity Queries).
Suppose that is -differentially private and -sample accurate for -sensitive queries. Then for every analyst , it also satisfies:
i.e. it is -distributionally accurate for and .
A.2 Minimization Queries
Minimization queries are specified by a loss function , where is generally known as the “parameter space”. An answer to a minimization query is a parameter . We work with -sensitive minimization queries: for all pairs of neighbouring datasets and all , .
A mechanism is -sample accurate for minimization queries if for every data analyst and every dataset :
We say that satisfies -distributional accuracy for minimization queries if for every data analyst and every data distribution :