Many data analysis workflows are adaptive: they re-use data over a sequence of analyses, where the choice of analysis at any given stage depends on the results from previous stages. Such adaptive re-use of data is an important source of overfitting in machine learning and false discovery in the empirical sciences (Gelman and Loken, 2014). Adaptive workflows arise, for example, when exploratory data analysis is mixed with confirmatory data analysis, when hold-out sets are re-used to search through large hyper-parameter spaces or to perform feature selection, and when datasets are repeatedly re-used within a research community.
A simple solution to this problem, which we can view as a naïve benchmark, is to not re-use data. More precisely, one could use sample splitting: partition the data set into equal-sized pieces, and use a fresh piece for each adaptive interaction with the data. This allows us to treat each analysis as nonadaptive, so that many quantities of interest can be accurately estimated by their empirical values and paired with tight confidence intervals from classical statistics. This approach is wasteful in its use of data, however: the sample size needed to conduct a series of adaptive analyses grows linearly with the number of analyses.
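The sample-splitting baseline can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `analyst` callback interface, and the choice of a Hoeffding interval with a union bound over the rounds are all assumptions made here for concreteness.

```python
import math

def hoeffding_width(m, beta):
    """Half-width of a two-sided Hoeffding interval for the mean of
    m i.i.d. samples in [0, 1], with failure probability beta."""
    return math.sqrt(math.log(2.0 / beta) / (2.0 * m))

def sample_splitting(data, analyst, k, beta):
    """Answer k adaptively chosen queries, each on a fresh piece of `data`.

    `analyst(history)` is a hypothetical callback that returns the next
    query (a function from a record to [0, 1]) given the list of
    (estimate, width) pairs released so far.
    """
    m = len(data) // k  # records per piece; any remainder is unused
    if m == 0:
        raise ValueError("need at least one record per query")
    # Union bound: each of the k intervals may fail with probability beta / k.
    width = hoeffding_width(m, beta / k)
    history = []
    for i in range(k):
        query = analyst(history)
        piece = data[i * m:(i + 1) * m]
        estimate = sum(query(x) for x in piece) / m
        history.append((estimate, width))
    return history
```

Because each piece is used once, every query is nonadaptive with respect to the data it is evaluated on. The cost is visible in `m = len(data) // k`: each answer sees only a 1/k fraction of the sample.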
A line of recent work (Dwork et al., 2015c, a, b; Russo and Zou, 2016; Bassily et al., 2016; Rogers et al., 2016; Feldman and Steinke, 2017a, b; Xu and Raginsky, 2017; Zrnic and Hardt, 2019; Mania et al., 2019) aims to improve on this baseline by using mechanisms which provide “noisy” answers to queries rather than exact empirical answers. The methods coming from this line of work require that the sample size grow only proportionally to the square root of the number of adaptive analyses, dramatically beating the sample splitting baseline asymptotically. Unfortunately, the bounds proven in these papers, even when optimized, only beat the naïve baseline when both the data set size and the number of adaptive rounds are very large; see Figure 1 (left).
The failure of these worst-case bounds to beat simple baselines in practice — despite their attractive asymptotics — has been a major obstacle to the practical adoption of techniques from this literature. There are two difficulties with directly improving this style of bounds. The first is that we are limited by what we can prove: mathematical analyses can often be loose by constants that are significant in practice. The more fundamental difficulty is that these bounds are guaranteed to hold even against a worst-case data analyst, who is adversarially attempting to find queries which over-fit the sample: one would naturally expect that when applied to a real workload of queries, such worst-case bounds would be extremely pessimistic. We address both difficulties in this paper.
In this paper, we move the emphasis from algorithms that provide point estimates to algorithms that explicitly manipulate and output confidence intervals based on the queries and answers so far, providing the analyst with both an estimated value and a measure of its actual accuracy. At a technical level, we have two types of contributions:
First, we give optimized worst-case bounds that carefully combine techniques from different pieces of prior work—plotted in Figure 1 (left). For certain mechanisms, our improved worst-case bounds are within small constant factors of optimal, in that we can come close to saturating their error bounds with a concrete, adversarial query strategy (Section 2). However, even these optimized bounds require extremely large sample sizes to improve over the naive sample splitting baseline, and their pessimism means they are often loose.
Our main result is the development of a simple framework called Guess and Check, which allows an analyst to pair any method for “guessing” point estimates and confidence interval widths for their adaptive queries with a rigorous validation of those guesses on an additional held-out data set. So long as the analyst mostly guesses correctly, this procedure can continue indefinitely. The main benefit of this framework is that it allows the analyst to guess confidence intervals whose guarantees are stronger than what the worst-case theory provides, and still enjoy rigorous validity in the event that they pass the “check”. This makes it possible to take advantage of the non-worst-case nature of natural query strategies, and avoid the need to “pay for” constants that seem difficult to remove from worst-case bounds. Our empirical evaluation demonstrates that our approach can improve on worst-case bounds by orders of magnitude, and that it improves on the naive baseline even for modest sample sizes: see Figure 1 (right), and Section 3 for details. We also provide a Python library (Rogers et al., 2019) containing an implementation of our Guess and Check framework.
Our “Guess and Check” (GnC) framework draws inspiration from the Thresholdout method of Dwork et al. (2015a), which uses a holdout set in a similar way. GnC has several key differences, which turn out to be crucial for practical performance: first, whereas the “guesses” in Thresholdout are simply the empirical query answers on a “training” portion of the dataset, we make use of other heuristic methods for generating guesses (including, in our experiments, Thresholdout itself) that empirically often seem to prevent overfitting to a substantially larger degree than their worst-case guarantees suggest. Second, we make confidence intervals first-order objects: whereas the “guesses” supplied to Thresholdout are simply point estimates, the “guesses” supplied to GnC are point estimates along with confidence intervals. Finally, we use a more sophisticated analysis to track the number of bits leaked from the holdout, which lets us give tighter confidence intervals and avoids the need to set, a priori, an upper bound on the number of times the holdout is used. Gossmann et al. (2018) use a version of Thresholdout to get worst-case accuracy guarantees for values of the area under the receiver operating characteristic curve (AUC) for adaptively obtained queries. However, apart from being limited to binary classification tasks and the dataset being used only to obtain AUC values, their bounds require “unrealistically large” dataset sizes. Our results are complementary to theirs; by using appropriate concentration inequalities, GnC could also be used to provide confidence intervals for AUC values. Their technique could be used to provide the “guesses” to GnC.
Our improved worst-case bounds combine a number of techniques from the existing literature: namely the information theoretic arguments of Russo and Zou (2016); Xu and Raginsky (2017) together with the “monitor” argument of Bassily et al. (2016), and a more refined accounting for the properties of specific mechanisms using concentrated differential privacy (Dwork and Rothblum (2016); Bun and Steinke (2016b)).
Feldman and Steinke (2017a, b) give worst-case bounds that improve with the variance of the queries asked. We show in Section 3.1 how our techniques can also be used to give tighter bounds when the empirical query variance is small.
Mania et al. (2019) give an improved union bound for queries that have high overlap, that can be used to improve bounds for adaptively validating similar models, in combination with description length bounds. Zrnic and Hardt (2019) take a different approach to going beyond worst-case bounds in adaptive data analysis, by proving bounds that apply to data analysts that may only be adaptive in a constrained way. A difficulty with this approach in practice is that it is limited to analysts whose properties can be inspected and verified — but provides a potential explanation why worst-case bounds are not observed to be tight in real settings. Our approach is responsive to the degree to which the analyst actually overfits, and so will also provide relatively tight confidence intervals if the analyst satisfies the assumptions of Zrnic and Hardt (2019).
As in previous work, we assume that there is a data set drawn i.i.d. from an unknown distribution over a universe . This data set is the input to a mechanism that also receives a sequence of queries from an analyst and outputs, for each one, an answer. Each is a statistical query, defined by a bounded function . We denote the expectation of a statistical query over the data distribution by , and the empirical average on a dataset by .
The mechanism’s goal is to give estimates of for query on the unknown . Previous work looked at analysts that produce a single point estimate , and measured error based on the distances . As mentioned above, we propose a shift in focus: we ask mechanisms to produce a confidence interval specified by a point estimate and width . The answer is correct for on if . (Note that the data play no role in the definition of correctness—we measure only population accuracy.)
An interaction between a randomized mechanism and a randomized analyst on a data set consists of an unbounded number of query-answer rounds: in each round, the analyst sends a query and the mechanism replies with an answer. The mechanism receives the data set as input. The analyst receives no direct input, but may select queries adaptively, based on the answers in previous rounds. The interaction ends when either the mechanism or the analyst stops. We say that the mechanism provides simultaneous coverage if, with high probability, all its answers are correct:
Definition 1.1 (Simultaneous Coverage).
Given , we say that has simultaneous coverage if, for all , all distributions on and all randomized algorithms ,
We denote by the (possibly random) number of rounds in a given interaction.
Definition 1.2 (Accuracy).
We say is -accurate, if has simultaneous coverage and its interval widths satisfy with probability 1.
2 Confidence intervals from worst-case bounds
Our emphasis on explicit confidence intervals led us to derive worst-case bounds that are as tight as possible given the techniques in the literature. We discuss the Gaussian mechanism here, defer the application to Thresholdout to Section B.4, and provide pseudocode for Thresholdout in Algorithm 4.
The Gaussian mechanism is defined to be an algorithm that, given input dataset and a query , reports an answer , where is a parameter. Existing analyses establish simultaneous coverage for it (see Dwork et al. (2015d); Bassily et al. (2016)), but they involve large, sub-optimal constants. Here, we provide an improved worst-case analysis by carefully combining existing techniques. We use results from Bun and Steinke (2016a) to bound the mutual information of the Gaussian mechanism. We then apply an argument similar to that of Russo and Zou (2016) to bound the bias of the empirical average of a statistical query selected as a function of the perturbed outputs. Finally, we use Chebyshev’s inequality and the monitor argument from Bassily et al. (2016) to obtain a high-probability accuracy bound. Figure 1 (left) shows the improvement in the number of queries that can be answered with the Gaussian mechanism with -accuracy for . Our guarantee is stated below, with its proof deferred to Appendix C.2.
Given input , confidence parameter , and parameter , the Gaussian mechanism is )-accurate, where
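As a rough sketch of the mechanism being analyzed, the following answers a statistical query with its empirical average plus zero-mean Gaussian noise. How the noise scale is tied to the mechanism's parameter and the sample size is not reproduced here; the standard deviation `sigma` passed directly to the function is an assumption of this sketch, as are the function and argument names.

```python
import random

def gaussian_mechanism(data, query, sigma, rng=random):
    """Return the empirical mean of `query` over `data`, perturbed by
    N(0, sigma^2) noise.

    `query` maps a record to [0, 1]; `rng` is the `random` module or a
    `random.Random` instance (pass a seeded instance for reproducibility).
    """
    empirical = sum(query(x) for x in data) / len(data)
    return empirical + rng.gauss(0.0, sigma)
```

Larger `sigma` makes the released answers depend less on the particular sample, which limits how much an adaptive analyst can overfit, at the price of worse accuracy on the sample itself.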
We now consider the extent to which our analyses are improvable for worst-case queries to the Gaussian and the Thresholdout mechanisms. To do this, we derive the worst query strategy in a particular restricted regime. We call it the “single-adaptive query strategy”, and show that it maximizes the root mean squared error (RMSE) amongst all single query strategies under the assumption that each sample in the data set is drawn u.a.r. from , and the strategy is given knowledge of the empirical correlations of each of the first features with the st feature (which can be obtained e.g. with non-adaptive queries asked prior to the adaptive query). We provide pseudocode for the strategy in Algorithm 5, and prove in Appendix C.1 that our single adaptive query results in maximum error. To make the bounds comparable, we translate our worst-case confidence upper bounds for both the mechanisms to RMSE bounds in Theorem C.8 and Theorem B.10. Figure 2 shows the difference between our best upper bound and the realized RMSE (averaged over 100 executions) for the two mechanisms using and various values of . (For the Gaussian, we set separately for each , to minimize the upper bound.) On the left, we see that the two bounds for the Gaussian mechanism are within a factor of 2.5, even for queries. Our bounds are thus reasonably tight in one important setting. For Thresholdout (right side), however, we see a large gap between the bounds which grows with , even for our best query strategy. (For Thresholdout, we tweak the adaptive query in the single-adaptive query strategy to result in maximum error. We also tried “tracing” attack strategies, adapted from the fingerprinting lower bounds of Bun et al. (2014); Hardt and Ullman (2014); Steinke and Ullman (2015), that contained multiple adaptive queries, but they gave similar results.) This result points to the promise of empirically-based confidence intervals for complex mechanisms that are harder to analyze.
3 The Guess and Check Framework
In light of the inadequacy of worst-case bounds, we present our Guess and Check (GnC) framework, which can go beyond the worst case. It takes as inputs guesses for both the point estimate of a query, and a confidence interval width. If GnC can validate a guess, it releases the guess. Otherwise, at the cost of widening the confidence intervals provided for future guesses, it provides the guessed confidence width along with a point estimate for the query using the holdout set such that the guessed width is valid for the estimate. An instance of GnC, , takes as input a data set , desired confidence level , and a mechanism which operates on inputs of size . randomly splits into two, giving one part to , and reserving the rest as a holdout . For each query , mechanism uses to make a “guess” to , for which conducts a validity check. If the check succeeds, then releases the guess as is, otherwise uses the holdout to provide a response containing a discretized answer that has as a valid confidence interval. This is closely related to Thresholdout. However, an important distinction is that the width of the target confidence interval, rather than just a point estimate, is provided as a guess. Moreover, the guesses themselves can be made by non-trivial algorithms. We provide pseudocode for GnC in Algorithm 1, and a block schematic of how a query is answered by GnC in Figure 3.
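The control flow of a single round can be illustrated with a heavily simplified sketch. It is not Algorithm 1: the Hoeffding tolerance, the fixed per-round failure probability `beta_round`, and the rounding rule are assumptions chosen for readability, and the sketch omits the accounting over transcripts that makes holdout re-use valid under adaptivity. It should therefore not be read as carrying the simultaneous coverage guarantee.

```python
import math

def hoeffding_width(m, beta):
    """Half-width of a two-sided Hoeffding interval for the mean of
    m i.i.d. samples in [0, 1], with failure probability beta."""
    return math.sqrt(math.log(2.0 / beta) / (2.0 * m))

def guess_and_check_round(guess, width, holdout, query, beta_round):
    """One simplified round: validate (guess, width) against the holdout.

    Returns (answer, width, passed). All names are hypothetical.
    """
    holdout_answer = sum(query(x) for x in holdout) / len(holdout)
    tolerance = hoeffding_width(len(holdout), beta_round)
    slack = width - tolerance
    if slack <= 0:
        raise ValueError("guessed width is too narrow for this holdout size")
    if abs(guess - holdout_answer) <= slack:
        # Check passes: the guess is released unchanged, and only the
        # fact that it passed depends on the holdout.
        return guess, width, True
    # Check fails: release the holdout answer rounded to a grid coarse
    # enough that the rounding error stays within the remaining slack.
    grid = 2.0 * slack
    rounded = round(holdout_answer / grid) * grid
    return rounded, width, False
```

In both branches the released point estimate is within `width` of the population mean whenever the holdout average is within `tolerance` of it. The rounding step limits how much information about the holdout each failure reveals.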
We provide coverage guarantees for GnC without any restrictions on the guess mechanism. To get the guarantee, we first show that for query , if function returns a -confidence interval for holdout answer , and GnC’s output is the guess , then is a -confidence interval for . We can get a simple definition for (formally stated in Section C.3.1), but we provide a slightly more sophisticated variant below that uses the guess and holdout answers to get better tolerances, especially under low-variance queries. We defer the proof of Lemma 3.1 to Appendix C.3.2.
If the function in GnC (Algorithm 1) is defined as
then for each query s.t. GnC’s output is , we have
Next, if failure occurs within GnC for query , by applying a Chernoff bound we get that is the maximum possible discretization parameter s.t. is a -confidence interval for the discretized holdout answer . Finally, we get a simultaneous coverage guarantee for GnC by a union bound over the error probabilities of the validity over all possible transcripts between the mechanism and any analyst with adaptive queries . The guarantee is stated below, with its proof deferred to Appendix C.3.1.
The Guess and Check mechanism (Algorithm 1), with inputs data set , confidence parameter , and a mechanism that, using inputs of size , provides responses (“guesses”) of the form for query , has simultaneous coverage .
3.1 Experimental evaluation
Now, we provide details of our empirical evaluation of the Guess and Check framework. In our experiments, we use two mechanisms, namely the Gaussian mechanism and Thresholdout, for providing guesses in GnC. For brevity, we refer to the overall mechanism as GnC Gauss when the Gaussian is used to provide guesses, and GnC Thresh when Thresholdout is used.
Strategy for performance evaluation: Some mechanisms evaluated in our experiments provide worst-case bounds, whereas the performance of others is instance-dependent and relies on the amount of adaptivity present in the querying strategy. To highlight the advantages of instance-dependent bounds, we design a query strategy called the quadratic-adaptive query strategy. It contains both adaptive and non-adaptive queries, where the adaptive queries become more sparsely distributed with time. “Hard” adaptive queries are asked when is a perfect square. They are computed using the answers to all the non-adaptive queries asked in prior rounds, using a strategy similar to that used in Figure 2. We provide pseudocode for the strategy in Algorithm 5.
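The schedule of the quadratic-adaptive strategy is easy to make concrete. The sketch below only marks which rounds are adaptive, assuming rounds are indexed from 1. How the adaptive query is built from earlier answers is left to Algorithm 5 and is not reproduced here. The function names are illustrative.

```python
import math

def is_adaptive_round(i):
    """True when the 1-based round index i is a perfect square."""
    r = math.isqrt(i)
    return r * r == i

def adaptive_rounds(k):
    """List the adaptive rounds among the first k rounds."""
    return [i for i in range(1, k + 1) if is_adaptive_round(i)]
```

Among the first 20 rounds this marks rounds 1, 4, 9, and 16. The gaps between consecutive adaptive rounds are 3, 5, 7, and so on, which is the sense in which adaptive queries become sparser over time.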
Experimental Setup: We run the quadratic-adaptive strategy for up to queries. We tune the hyperparameters of each mechanism to optimize for this query strategy. We fix a confidence parameter and set a target upper bound on the maximum allowable error we can tolerate, given our confidence bound. We evaluate each mechanism by the number of queries it can empirically answer with a confidence width of for our query strategy while providing a simultaneous coverage of : i.e. the largest number of queries it can answer while providing -accuracy. We plot the average and standard deviation of the number of queries answered before it exceeds its target error bound in 20 independent runs over the sampled data and the mechanism’s randomness. When we plot the actual realized error for any mechanism, we denote it by dotted lines, whereas the provably valid error bounds resulting from the confidence intervals produced by GnC are denoted by solid lines. Note that the empirical error denoted by dotted lines is not actually possible to know without access to the distribution, and is plotted just to visualize the tightness of the provable confidence intervals. We compare to two simple baselines: sample splitting, and answer discretization: the better of these two is plotted as the thick solid line. For comparison, the best worst-case bounds for the Gaussian mechanism (Theorem 2.1) are shown as dashed lines. Note that we improve by roughly two orders of magnitude compared to the tightest bounds for the Gaussian. We improve over the baseline at data set sizes .
Boost in performance for low-variance queries: Since all the queries we construct take binary values on a sample , the variance of query is given by , as . Now, is maximized when . Hence, informally we call as low-variance if either , or . We want to be able to adaptively give tighter confidence intervals for low-variance queries (as e.g., the worst-case bounds of Feldman and Steinke (2017a, b) are able to). For instance, in Figure 4 (left), we show that in the presence of low-variance queries, using Lemma 3.1 for (plot labelled “GnC Check:MGF”) results in a significantly better performance for GnC Gauss as compared to using Lemma C.9 (plot labelled “GnC Check:Chern”). We fix , and set for . We can see that as the dataset size grows, using Lemma 3.1 provides an improvement of almost 2 orders of magnitude in terms of the number of queries answered. This is due to Lemma 3.1 providing tighter holdout tolerances for low-variance queries (with guesses close to or ), compared to those obtained via Lemma C.9 (agnostic to the query variance). Thus, we use Lemma 3.1 for in all experiments with GnC below. Note that the worst-case bounds for the Gaussian don’t promise a coverage of even for in the considered parameter ranges. This is representative of a general phenomenon: switching to GnC-based bounds instead of worst-case bounds is often the difference between obtaining useful vs. vacuous guarantees.
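The variance identity used at the start of this paragraph follows in one line. Writing $\phi(\mathcal{D})$ for the population mean of a query $\phi$ that takes values in $\{0,1\}$ (this notation is an assumption made here), we have $\phi^2 = \phi$ pointwise, so:

```latex
\begin{align*}
\operatorname{Var}_{x \sim \mathcal{D}}[\phi(x)]
  &= \mathbb{E}[\phi(x)^2] - \mathbb{E}[\phi(x)]^2 \\
  &= \phi(\mathcal{D}) - \phi(\mathcal{D})^2 \\
  &= \phi(\mathcal{D})\,\bigl(1 - \phi(\mathcal{D})\bigr) \;\le\; \tfrac{1}{4},
\end{align*}
```

with equality exactly when $\phi(\mathcal{D}) = 1/2$. The variance tends to zero as $\phi(\mathcal{D})$ approaches $0$ or $1$, which is the low-variance regime discussed above.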
Performance at high confidence levels: The bounds we prove for the Gaussian mechanism, which are the best known worst-case bounds for the considered sample size regime, have a substantially sub-optimal dependence on the coverage parameter . On the other hand, sample splitting (and the bounds from Dwork et al. (2015d); Bassily et al. (2016) which are asymptotically optimal but vacuous at small sample sizes) have a much better dependence on . Since the coverage bounds of GnC are strategy-dependent, the dependence of on is not clear a priori. In Figure 4 (right), we show the performance of GnC Gauss (labelled “GnC”) when . We see that reducing by a factor of 10 has a negligible effect on GnC’s performance. Note that this is the case even though the guesses are provided by the Gaussian, for which we do not have non-vacuous bounds with a mild dependence on in the considered parameter range (see the worst-case bounds, plotted as “Bnd”) — even though we might conjecture that such bounds exist. This gives an illustration of how GnC can correct deficiencies in our worst-case theory: conjectured improvements to the theory can be made rigorous with GnC’s certified confidence intervals.
GnC with different guess mechanisms: GnC is designed to be modular, enabling it to take advantage of arbitrarily complex mechanisms to make guesses. Here, we compare the performance of two such mechanisms for making guesses, namely the Gaussian mechanism, and Thresholdout. In Figure 5 (left), we first plot the number of queries answered by the Gaussian (“Gauss Emp”) and Thresholdout (“Thresh Emp”) mechanisms, respectively, until the maximum empirical error of the query answers exceeds . It is evident that Thresholdout, which uses an internal holdout set to answer queries that likely overfit to its training set, provides better performance than the Gaussian mechanism. In fact, we see that for , while Thresholdout is always able to answer queries (the maximum number of queries we tried in our experiments), the Gaussian mechanism isn’t able to do so even for the largest data set size we consider. Note that the “empirical” plots are generally un-knowable in practice, since we do not have access to the underlying distributions. But they serve as upper bounds for the best performance a mechanism can provide.
Next, we fix , and plot the performance of GnC Gauss and GnC Thresh. We see that even though GnC Thresh has noticeably higher variance, it provides performance that is close to two orders of magnitude larger than GnC Gauss when . Moreover, for , it is interesting to see that GnC Thresh guarantees -accuracy for our strategy while consistently beating even the empirical performance of the Gaussian. We note that the best bounds for both the Gaussian and Thresholdout mechanisms alone (not used as part of GnC) do not provide any non-trivial guarantees in the considered parameter ranges.
Responsive widths that track the empirical error: The GnC framework is designed to certify guesses which represent both a point estimate and a desired confidence interval width for each query. Rather than having fixed confidence interval widths, this framework also provides the flexibility to incorporate guess mechanisms that provide increased interval widths as failures accumulate within GnC. This allows GnC to be able to re-use the holdout set in perpetuity, and answer an infinite number of queries (albeit with confidence widths that might grow to be vacuous). In Figure 5 (right), we fix , and plot the performance of GnC Gauss such that the guessed confidence width if the “check” for query results in a failure, otherwise . For comparison, we also plot the actual maximum empirical error encountered by the answers provided by GnC (“GnC Gauss Emp”). It corresponds to the maximum empirical error of the answers of the Gaussian mechanism that is used as a guess mechanism within GnC, unless the check for a query results in a failure (which occurs 4 times in 40000 queries), in which case the error corresponds to the discretized answer on the holdout. We see that the statistically valid accuracy guaranteed by GnC is “responsive” to the empirical error of the realized answers produced by the GnC, and is almost always within a factor of 2 of the actual error.
The authors would like to thank Omer Tamuz for his helpful comments regarding a conjecture that existed in a prior version of this work. A.R. acknowledges support in part by a grant from the Sloan Foundation, and NSF grants AF-1763314 and CNS-1253345. A.S. and O.T. were supported in part by a grant from the Sloan foundation and NSF grants IIS-1447700 and AF-1763314. B.W. is supported by the NSF GRFP (award No. 1754881). This work was done in part while R.R., A.S., and O.T. were visiting the Simons Institute for the Theory of Computing.
- Bassily et al.  Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pages 1046–1059. ACM, 2016.
- Bun and Steinke [2016a] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Martin Hirt and Adam Smith, editors, Theory of Cryptography, pages 635–658, Berlin, Heidelberg, 2016a. Springer Berlin Heidelberg. ISBN 978-3-662-53641-4.
- Bun and Steinke [2016b] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. CoRR, abs/1605.02065, 2016b. URL http://arxiv.org/abs/1605.02065.
- Bun et al.  Mark Bun, Jonathan Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, pages 1–10. ACM, May 31 – June 3 2014.
- Dwork and Rothblum  Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.
- Dwork et al. [2006a] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28 - June 1, 2006, Proceedings, pages 486–503, 2006a. doi: 10.1007/11761679_29.
- Dwork et al. [2006b] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006b.
- Dwork et al.  Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pages 51–60, 2010. doi: 10.1109/FOCS.2010.12.
- Dwork et al. [2015a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015a.
- Dwork et al. [2015b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015b. doi: 10.1126/science.aaa9375. URL http://www.sciencemag.org/content/349/6248/636.abstract.
- Dwork et al. [2015c] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 117–126. ACM, 2015c.
- Dwork et al. [2015d] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC ’15, pages 117–126, New York, NY, USA, 2015d. ACM. ISBN 978-1-4503-3536-2. doi: 10.1145/2746539.2746580.
- Feldman and Steinke [2017a] Vitaly Feldman and Thomas Steinke. Generalization for adaptively-chosen estimators via stable median. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 728–757, 2017a. URL http://proceedings.mlr.press/v65/feldman17a.html.
- Feldman and Steinke [2017b] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. CoRR, abs/1712.07196, 2017b. URL http://arxiv.org/abs/1712.07196.
- Gelman and Loken  Andrew Gelman and Eric Loken. The statistical crisis in science. American Scientist, 102(6):460, 2014.
- Gossmann et al.  Alexej Gossmann, Aria Pezeshk, and Berkman Sahiner. Test data reuse for evaluation of adaptive machine learning algorithms: over-fitting to a fixed ‘test’ dataset and a potential solution. In Medical Imaging 2018: Image Perception, Observer Performance, and Technology Assessment, volume 10577, page 105770K. International Society for Optics and Photonics, 2018.
- Gray  Robert M. Gray. Entropy and Information Theory. Springer-Verlag, Berlin, Heidelberg, 1990. ISBN 0-387-97371-0.
- Hardt and Ullman  Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 454–463. IEEE, 2014.
- Kairouz et al.  Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. IEEE Trans. Information Theory, 63(6):4037–4049, 2017.
- Kasiviswanathan and Smith  S.P. Kasiviswanathan and A. Smith. On the ‘Semantics’ of Differential Privacy: A Bayesian Formulation. Journal of Privacy and Confidentiality, Vol. 6: Iss. 1, Article 1, 2014.
- Mania et al.  Horia Mania, John Miller, Ludwig Schmidt, Moritz Hardt, and Benjamin Recht. Model similarity mitigates test set overuse. arXiv preprint arXiv:1905.12580, 2019.
- Rogers et al.  Ryan Rogers, Aaron Roth, Adam Smith, and Om Thakkar. Max-information, differential privacy, and post-selection hypothesis testing. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, 2016.
- Rogers et al.  Ryan Rogers, Aaron Roth, Adam Smith, Nathan Srebro, Om Thakkar, and Blake Woodworth. Repository for empirical adaptive data analysis. https://github.com/omthkkr/empirical_adaptive_data_analysis, 2019.
- Russo and Zou  D. Russo and J. Zou. How much does your data exploration overfit? Controlling bias via information usage. ArXiv e-prints, November 2015.
- Russo and Zou  Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS, 2016.
- Steinke and Ullman  Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In Proceedings of The 28th Conference on Learning Theory, pages 1588–1628, 2015.
- Xu and Raginsky  Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2521–2530, 2017.
- Zrnic and Hardt  Tijana Zrnic and Moritz Hardt. Natural analysts in adaptive data analysis. arXiv preprint arXiv:1901.11143, 2019.
Appendix A Omitted Definitions
Here, we present the definitions that were omitted from the main body due to space constraints.
A.1 Confidence Interval Preliminaries
In our implementation, we are comparing the true average to the answer , which will be the true answer on the sample with additional noise to ensure each query is stably answered. We then use the following string of inequalities to find the width of the confidence interval.
We will then use this connection to get a bound in terms of the accuracy on the sample and the error in the empirical average to the true mean. Many of the results in this line of work use a transfer theorem which states that if a query is selected via a private method, then the query evaluated on the sample is close to the true population answer, thus providing a bound on population accuracy. However, we also need to control the sample accuracy which is affected by the amount of noise that is added to ensure stability. We then seek a balance between the two terms, where too much noise will give terrible sample accuracy but great accuracy on the population – due to the noise making the choice of query essentially independent of the data – and too little noise makes for great sample accuracy but bad accuracy to the population. We will consider Gaussian noise, and use the composition theorems to determine the scale of noise to add to achieve a target accuracy after adaptively selected statistical queries.
Given the size of our dataset $n$, the number of adaptively chosen statistical queries $k$, and confidence level $1-\beta$, we want to find what confidence width $\tau$ ensures that $\mathcal{M} = (\mathcal{M}_1, \dots, \mathcal{M}_k)$ is $(\tau, \beta)$-accurate with respect to the population when each algorithm $\mathcal{M}_i$ adds either Laplace or Gaussian noise to the answers computed on the sample, with some yet to be determined variance. To bound the sample accuracy, we can use the following theorem that gives the accuracy guarantees of the Gaussian mechanism.
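If $\{Z_i : i \in [k]\}$ are i.i.d. $N(0, \sigma^2)$, then for $\beta \in (0, 1]$ we have:
$$\Pr\left[|Z_i| \ge \sigma\sqrt{2\ln(2/\beta)}\right] \le \beta \quad\Longrightarrow\quad \Pr\left[\exists i \in [k] \text{ s.t. } |Z_i| \ge \sigma\sqrt{2\ln(2k/\beta)}\right] \le \beta.$$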
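As a minimal sketch, the sample-accuracy width for Gaussian noise can be computed from the standard Gaussian tail bound $\Pr[|Z| \ge t] \le 2e^{-t^2/(2\sigma^2)}$ combined with a union bound over $k$ queries. The function name and the symbols `sigma`, `k`, and `beta` are illustrative assumptions.

```python
import math


def gaussian_sample_width(sigma: float, k: int, beta: float) -> float:
    """Width t such that all k draws of N(0, sigma^2) noise satisfy |Z_i| < t
    with probability at least 1 - beta (Gaussian tail bound + union bound)."""
    if sigma <= 0 or k < 1 or not (0 < beta <= 1):
        raise ValueError("need sigma > 0, k >= 1 and 0 < beta <= 1")
    return sigma * math.sqrt(2.0 * math.log(2.0 * k / beta))
```

The width grows only as $\sqrt{\ln k}$ in the number of queries, so the union bound over rounds is cheap compared with the noise scale itself.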
A.2 Stability Measures
It turns out that privacy-preserving algorithms give strong stability guarantees, which allows the rich theory of differential privacy to extend to adaptive data analysis [Dwork et al., 2015d, a, Bassily et al., 2016, Rogers et al., 2016]. In order to define these privacy notions, we define two datasets $X = (x_1, \dots, x_n)$ and $X' = (x_1', \dots, x_n')$ to be neighboring if they differ in at most one entry, i.e. there is some $i \in [n]$ where $x_i \neq x_i'$, but $x_j = x_j'$ for all $j \neq i$. We first define differential privacy.
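A randomized algorithm (or mechanism) $\mathcal{M} : \mathcal{X}^n \to \mathcal{Y}$ is $(\epsilon, \delta)$-differentially private (DP) if for all neighboring datasets $X$ and $X'$ and each outcome $S \subseteq \mathcal{Y}$, we have
$$\Pr[\mathcal{M}(X) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(X') \in S] + \delta.$$
If $\delta = 0$, we simply say $\mathcal{M}$ is $\epsilon$-DP or pure DP. Otherwise, for $\delta > 0$, we say $\mathcal{M}$ is approximate DP.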
We then give a more recent notion of privacy, called concentrated differential privacy (CDP), which can be thought of as being “in between” pure and approximate DP. In order to define CDP, we define the privacy loss random variable, which quantifies how much the output distributions of an algorithm on two neighboring datasets can differ.
Definition A.3 (Privacy Loss).
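Let $\mathcal{M} : \mathcal{X}^n \to \mathcal{Y}$ be a randomized algorithm. For neighboring datasets $X, X'$, let $Z(y) = \ln\left(\frac{\Pr[\mathcal{M}(X) = y]}{\Pr[\mathcal{M}(X') = y]}\right)$. We then define the privacy loss variable $\mathrm{PrivLoss}(\mathcal{M}(X) \,\|\, \mathcal{M}(X'))$ to have the same distribution as $Z(\mathcal{M}(X))$.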
Note that if we can bound the privacy loss random variable with certainty over all neighboring datasets, then the algorithm is pure DP. Otherwise, if we can bound the privacy loss with high probability then it is approximate DP (see Kasiviswanathan and Smith  for a more detailed discussion on this connection).
We can now define zero concentrated differential privacy (zCDP), given by Bun and Steinke [2016a]. (Note that Dwork and Rothblum initially gave a definition of CDP, which Bun and Steinke [2016a] then modified.)
Definition A.4 (zCDP).
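An algorithm $\mathcal{M}$ is $\rho$-zero concentrated differentially private (zCDP), if for all neighboring datasets $X, X'$ and all $\lambda > 0$ we have
$$\mathbb{E}\left[\exp\left(\lambda\left(\mathrm{PrivLoss}(\mathcal{M}(X) \,\|\, \mathcal{M}(X')) - \rho\right)\right)\right] \le e^{\lambda^2 \rho},$$
where $\mathrm{PrivLoss}(\mathcal{M}(X) \,\|\, \mathcal{M}(X'))$ is the privacy loss variable of Definition A.3.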
We then give the Laplace and Gaussian mechanism for statistical queries.
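Let $\phi : \mathcal{X} \to [0, 1]$ be a statistical query and $X \in \mathcal{X}^n$. The Laplace mechanism is the following: $\mathcal{M}_{\mathrm{Lap}}(X) = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) + \mathrm{Lap}(b)$, which is $\frac{1}{bn}$-DP. Further, the Gaussian mechanism is the following: $\mathcal{M}_{\mathrm{Gauss}}(X) = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) + N(0, \sigma^2)$, which is $\frac{1}{2n^2\sigma^2}$-zCDP.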
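The two mechanisms can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: queries take values in $[0, 1]$ (so the empirical mean has sensitivity $1/n$), `b` is the Laplace scale, `sigma` is the Gaussian standard deviation, and all function names are hypothetical. Only the Python standard library is used.

```python
import random
from typing import Callable, Sequence


def empirical_answer(phi: Callable[[float], float], sample: Sequence[float]) -> float:
    """Empirical mean of the statistical query phi on the sample."""
    return sum(phi(x) for x in sample) / len(sample)


def laplace_mechanism(phi, sample, b: float, rng: random.Random) -> float:
    """Empirical answer plus Lap(b) noise. The difference of two i.i.d.
    exponentials with mean b is Laplace-distributed with scale b."""
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return empirical_answer(phi, sample) + noise


def gaussian_mechanism(phi, sample, sigma: float, rng: random.Random) -> float:
    """Empirical answer plus N(0, sigma^2) noise."""
    return empirical_answer(phi, sample) + rng.gauss(0.0, sigma)


def laplace_epsilon(b: float, n: int) -> float:
    """Pure-DP parameter of the Laplace mechanism for sensitivity 1/n."""
    return 1.0 / (b * n)


def gaussian_rho(sigma: float, n: int) -> float:
    """zCDP parameter of the Gaussian mechanism for sensitivity 1/n."""
    return 1.0 / (2.0 * n**2 * sigma**2)
```

For example, with $n = 1000$ and $\sigma = 0.01$, each Gaussian answer costs $\rho = 1/(2 \cdot 10^6 \cdot 10^{-4}) = 0.005$.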
We now give the advanced composition theorem for -fold adaptive composition.
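The class of $\epsilon'$-DP algorithms is $(\epsilon, \delta)$-DP under $k$-fold adaptive composition where $\delta > 0$ and
$$\epsilon = \left(\frac{e^{\epsilon'} - 1}{e^{\epsilon'} + 1}\right)\epsilon' k + \epsilon' \sqrt{2k \ln(1/\delta)}.$$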
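To make the benefit of advanced composition concrete, the sketch below evaluates one standard form of the bound for composing $k$ pure-DP algorithms and compares it with basic composition, where the parameters simply add. The exact constants in the theorem above may differ; the formula coded here and the function names are assumptions for illustration.

```python
import math


def basic_composition(eps_each: float, k: int) -> float:
    """Basic composition: privacy parameters add up over k rounds."""
    return k * eps_each


def advanced_composition(eps_each: float, k: int, delta: float) -> float:
    """Overall epsilon for k-fold adaptive composition of eps_each-DP
    algorithms at slack delta, assuming the bound
    k*eps*(e^eps - 1)/(e^eps + 1) + eps*sqrt(2k ln(1/delta)).
    The result is capped by basic composition, which always holds."""
    drift = k * eps_each * (math.exp(eps_each) - 1.0) / (math.exp(eps_each) + 1.0)
    spread = eps_each * math.sqrt(2.0 * k * math.log(1.0 / delta))
    return min(basic_composition(eps_each, k), drift + spread)
```

With `eps_each = 0.01`, `k = 10000`, and `delta = 1e-6`, basic composition gives 100 while this bound gives roughly 5.76, which is the square-root-in-$k$ scaling that drives the sample-size savings.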
We will also use the following results from zCDP.
Theorem A.7 (Bun and Steinke [2016a]).
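The class of $\rho$-zCDP algorithms is $k\rho$-zCDP under $k$-fold adaptive composition. Further, if $\mathcal{M}$ is $\epsilon$-DP then $\mathcal{M}$ is $\frac{\epsilon^2}{2}$-zCDP, and if $\mathcal{M}$ is $\rho$-zCDP then $\mathcal{M}$ is $\left(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta\right)$-DP for any $\delta > 0$.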
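These zCDP facts translate directly into a small budget calculator. The sketch below assumes the basic conversion $\rho \mapsto \rho + 2\sqrt{\rho\ln(1/\delta)}$ (a tighter conversion may be used in the theorem above) and a Gaussian mechanism with sensitivity $1/n$; the function names are hypothetical.

```python
import math


def zcdp_compose(rho_each: float, k: int) -> float:
    """k-fold adaptive composition of rho_each-zCDP algorithms."""
    return k * rho_each


def pure_dp_to_zcdp(eps: float) -> float:
    """An eps-DP algorithm is (eps^2 / 2)-zCDP."""
    return eps**2 / 2.0


def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """Epsilon such that a rho-zCDP algorithm is (epsilon, delta)-DP,
    assuming the basic conversion rho + 2*sqrt(rho * ln(1/delta))."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def gaussian_sigma_for_budget(n: int, k: int, rho_total: float) -> float:
    """Noise std so that k Gaussian answers on n points (sensitivity 1/n)
    compose to rho_total-zCDP: solve k / (2 n^2 sigma^2) = rho_total."""
    return math.sqrt(k / (2.0 * rho_total)) / n
```

For instance, answering $k = 100$ queries on $n = 1000$ points with total budget $\rho = 0.5$ requires $\sigma = 0.01$ per query.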
Another notion of stability that we will use is the mutual information (in nats) between two random variables: the input $X$ and the output $\mathcal{M}(X)$.
Definition A.8 (Mutual Information).
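Consider two random variables $X$ and $Y$ and let $Z(x, y) = \ln\left(\frac{\Pr[(X, Y) = (x, y)]}{\Pr[X = x]\,\Pr[Y = y]}\right)$. We then denote the mutual information as $I(X; Y) = \mathbb{E}[Z(X, Y)]$, where the expectation is taken over the joint distribution of $(X, Y)$.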
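For discrete random variables, the definition can be evaluated directly from a joint probability table. The following is a minimal sketch; the table layout (rows index values of the first variable, columns values of the second) and the function name are assumptions.

```python
import math
from typing import Sequence


def mutual_information(joint: Sequence[Sequence[float]]) -> float:
    """Mutual information in nats of a discrete joint distribution,
    given as a table joint[i][j] = Pr[X = i, Y = j]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:                       # 0 * log(0) is taken as 0
                total += p * math.log(p / (px[i] * py[j]))
    return total
```

Two independent fair bits give zero mutual information, while two perfectly correlated fair bits give $\ln 2$ nats.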
A.3 Monitor Argument
For the population accuracy term in (1), we will use the monitor argument from Bassily et al. [2016]. Roughly, this analysis allows us to obtain a bound on the population accuracy over $k$ rounds of interaction between adversary $\mathcal{A}$ and algorithm $\mathcal{M}$ by only considering the difference $|\phi(\mathcal{D}) - \phi(X)|$ for the two-stage interaction where $\phi$ is chosen by $\mathcal{A}$ based on outcome $\mathcal{M}(X)$. We present the monitor in Algorithm 2.
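The core selection step of a monitor can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: it assumes the transcript of queries and answers has already been produced by simulating the interaction, and that the monitor, as an analysis device, can evaluate each query on the population through a hypothetical `population_value` callable.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Q = TypeVar("Q")  # type of a statistical query


def monitor(
    queries: Sequence[Q],
    answers: Sequence[float],
    population_value: Callable[[Q], float],
) -> Tuple[Q, float]:
    """Return the (query, answer) pair with the largest population error
    |phi(D) - a| among all rounds of the interaction."""
    if len(queries) != len(answers) or len(queries) == 0:
        raise ValueError("need one answer per query and at least one round")
    worst = max(
        range(len(queries)),
        key=lambda i: abs(population_value(queries[i]) - answers[i]),
    )
    return queries[worst], answers[worst]
```

Because the monitor only post-processes the transcript, it inherits whatever stability guarantee the transcript satisfies, which is what reduces the $k$-round analysis to a single selected query.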
Since our stability definitions are closed under post-processing, we can substitute the monitor as our post-processing function in the above theorem. We then get the following result.
Let , where each may be adaptively chosen, satisfy any stability condition that is closed under post-processing. For each , let be the statistical query chosen by adversary based on answers , and let be any function of . Then, we have
From the monitor in Algorithm 2 and the fact that is closed under post-processing, we have
We can then use the above corollary to obtain an accuracy guarantee by union bounding over the sample accuracy for all rounds of interaction and then bounding the population error for a single adaptively chosen statistical query.
Appendix B Omitted Confidence Interval Bounds
Here we present the bounds derived via prior work and provide a comparison of our bounds for the Gaussian mechanism (Theorem 2.1) with prior work.
B.1 Confidence Bounds from Dwork et al. [2015a]
If is -DP where and , and , then .
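We pair this together with the accuracy from either the Gaussian mechanism or the Laplace mechanism, along with Corollary A.9, to get the following result.
Given confidence level $1-\beta$ and using the Laplace or Gaussian mechanism for each algorithm $\mathcal{M}_i$, $\mathcal{M} = (\mathcal{M}_1, \dots, \mathcal{M}_k)$ is $(\tau, \beta)$-accurate, where the width $\tau$ is defined as follows.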
Laplace Mechanism: We define to be the solution to the following program
Gaussian Mechanism: We define to be the solution to the following program
To bound the sample accuracy, we will use the following lemma that gives the accuracy guarantees of the Laplace mechanism.
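If $\{Y_i : i \in [k]\}$ are i.i.d. $\mathrm{Lap}(b)$, then for $\beta \in (0, 1]$ we have:
$$\Pr\left[|Y_i| \ge b\ln(1/\beta)\right] \le \beta \quad\Longrightarrow\quad \Pr\left[\exists i \in [k] \text{ s.t. } |Y_i| \ge b\ln(k/\beta)\right] \le \beta.$$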
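The Laplace sample-accuracy width can be sketched and sanity-checked by simulation. The code assumes the standard Laplace tail $\Pr[|Y| \ge t] = e^{-t/b}$ with a union bound over $k$ queries; the function names and the seed are illustrative.

```python
import math
import random


def laplace_sample_width(b: float, k: int, beta: float) -> float:
    """Width t such that all k draws of Lap(b) noise satisfy |Y_i| < t
    with probability at least 1 - beta (exact tail + union bound)."""
    return b * math.log(k / beta)


def laplace_noise(b: float, rng: random.Random) -> float:
    """One Lap(b) draw, as the difference of two exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def empirical_failure_rate(b: float, k: int, beta: float,
                           trials: int, seed: int = 0) -> float:
    """Fraction of simulated k-query runs in which some noise draw
    reaches or exceeds the width from laplace_sample_width."""
    rng = random.Random(seed)
    width = laplace_sample_width(b, k, beta)
    failures = sum(
        any(abs(laplace_noise(b, rng)) >= width for _ in range(k))
        for _ in range(trials)
    )
    return failures / trials
```

Unlike the Gaussian case, the width here grows as $\ln k$ rather than $\sqrt{\ln k}$, reflecting the heavier Laplace tail.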
Proof of Theorem B.2.
We will focus on the Laplace mechanism part first, so that we add noise to each answer. After adaptively selected queries, the entire sequence of noisy answers is -DP where
For the population accuracy, we need to apply Theorem B.1, which requires us to have the following, where we take a union bound over all selected statistical queries:
We then write in terms of to get:
We are then left to pick to obtain the smallest value of .
B.2 Confidence Bounds from Bassily et al. [2016]
We now go through the argument of Bassily et al. [2016], improving the constants as much as their analysis allows, to get a usable confidence bound on adaptively chosen statistical queries. This requires presenting their monitor, which is similar to the monitor presented in Algorithm 2 but takes as input several independent datasets. We first present the result.
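Given confidence level $1-\beta$ and using the Laplace or Gaussian mechanism for each algorithm $\mathcal{M}_i$, $\mathcal{M} = (\mathcal{M}_1, \dots, \mathcal{M}_k)$ is $(\tau, \beta)$-accurate, where the width $\tau$ is defined as follows.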
Laplace Mechanism: We define to be the following quantity:
Gaussian Mechanism: We define to be the following quantity:
In order to prove this result, we begin with a technical lemma which considers an algorithm that takes as input a collection of samples and outputs both an index in and a statistical query, where we denote as the set of all statistical queries and their negation.
Lemma B.5 ([Bassily et al., 2016]).
Let be -DP. If then
We then define what we will call the extended monitor in Algorithm 3.
We then present a series of lemmas that lead to an accuracy bound from Bassily et al. [2016].
Lemma B.6 ([Bassily et al., 2016]).
For each , if is -DP for adaptively chosen queries from , then for every data distribution and analyst , the monitor is -DP.
Lemma B.7 ([Bassily et al., 2016]).
If fails to be -accurate, then , where is the answer to during the simulation ( can determine from output ) and
The following result is not stated in exactly the same form as in Bassily et al. [2016], but it follows the same analysis; we simply do not simplify the expressions in the inequalities.
If is -accurate on the sample but not -accurate for the population, then
We now put everything together to get our result.
Proof of Theorem B.4.
We first analyze the case when we add noise to each query answer on the sample to preserve -DP of each query and then use advanced composition Theorem A.6 to get a bound on .
Further, we obtain -accuracy on the sample, where for we have We then plug these values into (5) to get the following bound on
We then choose some of the parameters to be the same as in Bassily et al. [2016], and search over the remaining parameters to make the right-hand side as small as possible. Thus, the best confidence width that we can get with this approach is the following:
Using the same analysis but with Gaussian noise added to each statistical query answer with variance (so that is -zCDP), we get the following confidence width ,
B.3 Comparison of Theorem 2.1 with Prior Work
One can also get a high-probability bound on the sample accuracy using Theorem 3 in Xu and Raginsky [2017], resulting in
where i.i.d. Gaussian noise has been added to each query. The proof is similar to the proof of Theorem 2.1. If the mutual information bound , then the first term in the expression of the confidence width in Theorem 2.1 is less than the first term in eq. 6. Furthermore, if , then the first term dominates in the expression of the confidence width in Theorem 2.1, thus making Theorem 2.1 result in a tighter bound for any . For very small values of , there exist sufficiently small for which the result obtained via Xu and Raginsky  is better.
B.4 Confidence Bounds for Thresholdout (Dwork et al. [2015a])
If the Thresholdout mechanism with noise scale , and threshold is used for answering queries , , with reported answers such that uses the holdout set of size to answer at most queries, then given confidence parameter , Thresholdout is )-accurate, where