Consider the problem of running a server that provides the test loss of a model on held out data, e.g. for evaluation in a machine learning challenge. We would like to ensure that all test losses returned by the server are accurate estimates of the true generalization error of the predictors.
Returning the empirical error on held out test data would initially be a good estimate of the generalization error. However, an analyst can use the empirical errors to adjust their model and improve their performance on the test data. In fact, with a number of queries only linear in the amount of test data, one can easily create a predictor that completely overfits, having empirical error on the test data that is artificially small [12, 5]. Even without such intentional overfitting, sequential querying can lead to unintentional adaptation since analysts are biased toward tweaks that lead to improved test errors.
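For intuition, a toy version of this intentional-overfitting attack can be written in a few lines of Python (all constants are ours, for illustration): query random ±1 predictors, keep those whose reported test accuracy happens to beat chance, and aggregate them by majority vote. The ensemble's empirical test accuracy vastly exceeds its true accuracy of 50%, using only linearly many queries.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                      # held-out test points with hidden +/-1 labels
k = 5 * n                    # number of queries: linear in the test set size
labels = rng.choice([-1, 1], size=n)

# Each "query" submits a random predictor and observes its empirical test accuracy.
preds = rng.choice([-1, 1], size=(k, n))
accs = (preds == labels).mean(axis=1)

# Keep the predictors that happened to beat chance and take a majority vote.
# The ensemble's empirical test accuracy is far above its true value of 50%.
good = preds[accs > 0.5]
ensemble = np.sign(good.sum(axis=0))
overfit_acc = (ensemble == labels).mean()
```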
If the queries were non-adaptive, i.e. the sequence of predictors is not influenced by previous test results, then we could handle a much larger number of queries before overfitting: a number exponential in the size of the dataset. Nevertheless, the test set would eventually be “used up,” and estimates of the test error (specifically those of the best performers) might be over-optimistic.
A similar situation arises in other contexts such as validating potential scientific discoveries. One can evaluate potential discoveries using set aside validation data, but if analyses are refined adaptively based on the results, one may again overfit the validation data and arrive at false discoveries [17, 14].
One way to ensure the validity of answers in the face of adaptive querying is to collect all queries before giving any answers, and answer them all at once, e.g. at the end of a competition. However, analysts typically want more immediate feedback, both for ML challenges and in scientific research. Additionally, if we want to answer more queries later, ensuring statistical validity would require collecting a whole new dataset. This might be unnecessarily expensive if few or none of the queries are in fact adaptive. It also raises the question of who should bear the cost of collecting new data.
Alternatively, we could try to limit the number or frequency of queries from each user, forbid adaptive querying, or assume users work independently of each other, remaining oblivious to other users’ queries and answers. However, it is nearly impossible to enforce such restrictions. Determined users can avoid querying restrictions by creating spurious user accounts and working in groups; there is no feasible way to check if queries are chosen adaptively; and information can leak between analysts, intentionally or not, e.g. through explicit collaboration or published results.
In this paper, we address the fundamental challenge of providing statistically valid answers to an arbitrarily long sequence of potentially adaptive queries. We assume that it is possible to collect additional samples from the same data distribution at a fixed cost per sample. To pay for new samples, users of the database will be charged for their queries. We propose a mechanism, EverlastingValidation, that guarantees “everlasting” statistical validity and maintains the following properties:
- Validity: Without any assumptions about the users, and even with arbitrary adaptivity, with high probability, all answers ever returned by the database are accurate.
- Sustainability: The database collects enough revenue to purchase as many new samples as necessary in perpetuity, and can answer an unlimited number of queries.
- Cost for Non-Adaptive Users: With high probability, a user making non-adaptive queries will pay at most , so the average cost per query decreases as .
- Cost for Autonomous Users: With high probability, a user (or group of users) making potentially adaptive queries that depend on each other arbitrarily, but not on any queries made by others, will pay at most , so the average cost per query decreases as .
We emphasize that the database mechanism needs no notion of “user” or “account” when answering the queries; it does not need to know which “user” made which query; and most of all, it does not need to know whether a query was made adaptively or not. Rather, the cost guarantees hold for any collection of queries that are either non-adaptive or autonomous in the sense described above–a “user” could thus refer to a single individual, or if an analyst uses answers from another person’s queries, we can consider them together as an “autonomous user” and get cost guarantees based on their combined number of queries. The database’s cost guarantees are nearly optimal; the cost to non-adaptive users and the cost to autonomous users cannot be improved (beyond log-factors) while still maintaining validity and sustainability (Section 5).
As is indicated by the guarantees above, using the mechanism adaptively may be far more expensive than using it non-adaptively. We view this as a positive feature. Although we cannot enforce non-adaptivity, and it is sometimes unreasonable to expect that analysts are entirely non-adaptive, we intend the mechanism to be used for validation. That is, analysts should do their discovery, training, tuning, development, and adaptive data analysis on unrestricted “training” or “discovery” datasets, and only use the protected database when they wish to receive a stamp of approval on their model, predictor, or discovery. Instead of trying to police or forbid adaptivity, we discourage it with pricing, but in a way that is essentially guaranteed not to affect non-adaptive users. Further, users will need to pay a high price only when their queries explicitly cause overfitting, so only adaptivity that is harmful to statistical validity will be penalized.
Relationship to prior work
Our work is inspired by a number of mechanisms for dealing with potentially adaptive queries that have been proposed and analyzed using techniques from differential privacy and information theory. These mechanisms handle only a pre-determined number of queries using a fixed dataset. We use techniques developed in this literature, in particular addition of noise to ensure that a quadratically larger number of adaptive queries can be answered in the worst case [10, 6]. Our main innovations over this prior work are the self-sustaining nature of the database, as opposed to handling only a pre-determined number of queries of each type, and also the per-query pricing scheme that places the cost burden on the adaptive users. To ensure that the cost burden on non-adaptive users does not grow by more than a constant factor, we need to adapt existing algorithms.
Ladder  and ShakyLadder  are mechanisms tailored to maintaining an ML competition leaderboard. These algorithms reveal the answer to a user’s query for the error of their model only if it is significantly lower than the error of the user’s previous best submission. While these mechanisms can handle an exponential number of arbitrarily adaptive submissions, each user receives answers to only a relatively small number of queries. Our setting is more suitable when we want to validate the errors of all submissions, or for scientific discovery, where there is more than one discovery to be made.
A separate line of work in the statistics literature on “Quality Preserving Databases” (Aharoni and Rosset  and references therein) has suggested schemes for databases that maintain everlasting validity, while charging for use. The fundamental difference from our work is that these schemes do not account for adaptivity and thus are limited to non-adaptive querying. A second difference is that they focus on hypothesis testing for scientific discovery, with pricing schemes that depend on considerations of statistical power, which are not part of our framework. We further compare with existing methods at the end of Section 4.
2 Model formulation
We consider a setting in which a database curator has access to samples from some unknown distribution over a sample space . Multiple analysts submit a sequence of statistical queries , the database responds with answers , and the goal is to ensure that with high probability, all answers satisfy for some fixed accuracy parameter . In a prediction validation application, each query would measure the expected loss of a particular model, while in scientific applications a single query might measure the value of some phenomenon of interest, or compare it to a “null” reference. We denote the set of all possible queries, i.e. measurable functions , and use the shorthand to denote the mean value (desired answer) for each query. Given a data sample , we use as shorthand for the empirical mean of on .
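To make the notation concrete, here is a minimal Python sketch of a statistical query and its empirical mean on a sample (the predictor and data-generating process below are hypothetical examples, not part of the paper's setup):

```python
import numpy as np

def empirical_mean(query, sample):
    """Empirical mean of a [0, 1]-valued statistical query over a data sample."""
    return float(np.mean([query(x) for x in sample]))

# Hypothetical example: the query is the 0-1 loss of a fixed predictor,
# so its true mean is that predictor's generalization error.
rng = np.random.default_rng(1)
data = [(x, int(x > 0)) for x in rng.normal(size=2000)]  # (feature, label) pairs
predict = lambda x: int(x > 0.1)                         # some fixed model
zero_one_loss = lambda xy: float(predict(xy[0]) != xy[1])
est = empirical_mean(zero_one_loss, data)
```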
In our framework, the database can, at any time, acquire new samples from at some fixed cost per sample, e.g. by running more experiments or paying workers to label more data. To answer a given query, the database can use the samples it has already purchased in any way it chooses, and the database is allowed to charge analysts for their queries in order to purchase additional samples. The price of query may be determined by the database after it receives query , allowing the database to charge more for queries that force it to collect more data.
We do not assume the queries are chosen in advance, and instead allow the sequence of queries to depend adaptively on past answers. More formally, we define a “querying rule” as a randomized mapping from the history of all previously made queries and their answers and prices to the statistical query to be made next:
The interaction of users with the database can then be modeled as a sequence of querying rules. The combination of the data distribution, database mechanism, and sequence of querying rules together defines a joint distribution over queries, answers, and prices. All our results will hold for any data distribution and any querying sequence, with high probability over .
We think of the query sequence as representing a combination of queries from multiple users, but the database itself is unaware of the identity or behavior of the users. Our validity guarantees do not assume any particular user structure, nor any constraints on the interactions of the different users. Thus, the guarantees are always valid regardless of what a “user” means, how “users” are allowed to collaborate, how many “users” there are, or how many queries each “user” makes—the guarantees simply hold for any (arbitrarily adaptive) querying sequence.
However, our cost guarantees will, and must, refer to analysts (or perhaps groups of analysts) behaving in specific ways. In particular, we define a non-adaptive user as a subsequence consisting of queries which do not depend on any of the history, i.e. is a fixed (pre-determined) distribution over queries, so is independent of all of the history. We further define an autonomous user of the database as a subsequence of the querying rules that depend only on the history within the subsequence, i.e.
That is, is independent of the overall past history given the past history pertaining to the autonomous user. The “cost to a user” is the total price paid for queries in the subsequence : .
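As a toy formalization of these definitions (our own code, not the paper's notation), a querying rule can be modeled as a callable from the interaction history, a list of (query, answer, price) triples, to the next query; a non-adaptive rule ignores the history, while an adaptive one inspects it:

```python
def non_adaptive_rule(history):
    # Ignores the history entirely: the next query is fixed in advance.
    return lambda x: float(x > 0)

def adaptive_rule(history):
    # Depends on the most recent answer: moves the query's threshold to it.
    last_answer = history[-1][1] if history else 0.5
    return lambda x: float(x > last_answer)
```

An autonomous user's rules would similarly be callables, but taking only the sub-history of that user's own (query, answer, price) triples as input.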
Our mechanism for providing “everlasting” validity guarantees is based on a query answering mechanism which we call ValidationRound. It uses samples from in order to answer non-adaptive and at least adaptive statistical queries (and potentially many more). Our analysis is based on ideas developed in the context of adaptive data analysis  and relies on techniques from differential privacy . Differential privacy is a strong stability property of randomized algorithms that operate on a dataset. Composition properties of differential privacy imply that this form of stability holds even when the same dataset is used by multiple algorithms that can depend on the outputs of preceding algorithms. Most importantly, differential privacy implies generalization with high probability [10, 4].
ValidationRound splits its data into two sets and . Upon receiving each query, it first checks whether the answers on these datasets approximately agree. If so, the query has almost certainly not overfit to the data, and the algorithm simply returns the empirical mean of the query on plus additional random noise. We show that the addition of noise ensures that the algorithm, as a function from the data sample to an answer, satisfies differential privacy. This can be leveraged to show that any query which depends on a limited number of previous queries will have an empirical mean on that is close to the true expectation. This ensures that ValidationRound can accurately answer a large number of queries, while allowing some (unknown) subset of the queries to be adaptive.
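A minimal sketch of this agreement-plus-noise step, with rejection sampling for the truncated noise (the parameter names `tau` and `sigma`, and the choice to truncate the noise at `tau`, are our illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def truncated_gaussian(sigma, bound, rng):
    """N(0, sigma^2) conditioned on the draw lying in [-bound, bound]."""
    while True:
        xi = rng.normal(0.0, sigma)
        if abs(xi) <= bound:
            return xi

def validation_round_step(query, S_T, S_V, tau, sigma, rng):
    """Sketch of a single ValidationRound answer: if the empirical means on
    the two splits agree to within tau, release the mean on S_T plus
    truncated Gaussian noise; otherwise halt, signalling possible overfitting."""
    m_T = np.mean([query(x) for x in S_T])
    m_V = np.mean([query(x) for x in S_V])
    if abs(m_T - m_V) > tau:
        return None          # halt this instance of ValidationRound
    return m_T + truncated_gaussian(sigma, tau, rng)
```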
ValidationRound uses truncated Gaussian noise , i.e. Gaussian noise conditioned on the event . Its density .
Here, is the index of the query that causes the algorithm to halt. If , the maximum allowed number of answers, we say that ValidationRound halted “prematurely.” The following three lemmas characterize the behavior of ValidationRound.
For any , , and , for any sequence of querying rules (with arbitrary adaptivity) and any probability distribution
, for any sequence of querying rules (with arbitrary adaptivity) and any probability distribution, the answers provided by ValidationRound satisfy
where the probability is taken over the randomness in the draw of datasets and from , the querying rules, and ValidationRound.
For any , , and , any sequence of querying rules, and any non-adaptive user interacting with ValidationRound,
For any , , and , any sequence of querying rules, and any autonomous user interacting with ValidationRound, if
Lemma 1 indicates that all returned answers are accurate with high probability, regardless of adaptivity. The proof involves showing that is close to for each query, so any query that is answered must be accurate since and are small. Lemma 2 indicates that with high probability, non-adaptive queries never cause a premature halt, which is a simple application of Hoeffding’s inequality. Finally, Lemma 3 shows that with high probability, an autonomous user who makes queries will not cause a premature halt. This requires showing that is close to despite the potential adaptivity.
The proof of Lemma 3 uses existing results from adaptive data analysis together with a simple argument that noise truncation does not significantly affect the results. For reference, the results we cite are included in Appendix E. While using Gaussian noise to answer queries is mentioned in other work, we are not aware of an explicit analysis, so we analyze the method here. To simplify parts of the derivation, we rely on the notion of concentrated differential privacy, which is particularly well suited for analysis of composition with Gaussian noise addition . Lemmas 1-3 are proven in Appendix A.
4 EverlastingValidation and pricing
ValidationRound uses a fixed number, , of samples and with high probability returns accurate answers for at least non-adaptive queries and adaptive queries. In order to handle infinitely many queries, we chain together multiple instances of ValidationRound: we start with an initial dataset and answer queries using ValidationRound on that data until it halts; at that point, we buy more data and repeat. The used-up data can be released to the public as a “training set,” which can be used without restriction and without affecting any guarantees.
The key ingredient is a pricing system with which we can always afford new data when an instance of ValidationRound halts. Our method has two price types: a low price, which is charged for all queries and decreases like ; and a high price, which is charged for any query that causes an instance of ValidationRound to halt prematurely, which may grow with the size of the current dataset. EverlastingValidation guarantees the following:
Theorem 1 (Validity).
For any sequence of querying rules (with arbitrary adaptivity), EverlastingValidation will provide answers such that
Consider the sequence of querying rules that are answered by the instantiation of the ValidationRound mechanism. By Lemma 1, for any sequence of querying rules, with probability , all of the answers during round are accurate. By a union bound over all rounds, all answers in all rounds are accurate with probability at least . ∎
Theorem 2 (Sustainability).
For any sequence of queries, the revenue collected can pay for all samples ever needed by EverlastingValidation, excluding the initial budget of .
When ValidationRound halts, we charge exactly enough for the next (line 10). ∎
If and queries are answered during round , then at least revenue is collected.
The proof of Lemma 4 involves a straightforward computation. We find an upper bound, , on the number of queries made before round begins and then lower bound the revenue collected in round with . We defer the details to Appendix B.
Theorem 3 (Cost for non-adaptive users).
For any sequence of querying rules and any non-adaptive user indexed by , the cost to the user satisfies
By Lemma 4, if a round ends after queries are answered, then the total revenue collected from queries in that round is at least , so the “high price” at the end of the round is . Consequently, a query from the non-adaptive user costs the low price unless it causes an instantiation of ValidationRound to halt prematurely. By Lemma 2 and a union bound, this never occurs in any round with probability at least , and the cost to the user is
Theorem 4 (Cost for adaptive users).
For any sequence of querying rules and any autonomous user indexed by , there is a fixed constant such that the cost to the user satisfies
Ideally, none of the queries causes a premature halt, and the total cost is at most , but the adaptive user may cause rounds to end prematurely and pay up to . However, by Lemma 3, with probability if one of the adaptive user’s queries causes a round to end prematurely, then the amount of data, , and the number of the user’s queries answered in that round, , must satisfy
Given , there is a largest for which this is possible since and . That is,
which implies . Let be the set of rounds in which the adaptive user pays the high price, then with probability at least , inequality (1) holds for all . In this case, the total cost to the adaptive user is no more than
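The round-chaining and pricing dynamics analyzed above can be illustrated with a toy simulation (the doubling schedule, the low-price formula, and all constants below are our placeholders, not the paper's choices): every query pays a small low price, and a query that ends a round prematurely is charged whatever is still missing so the mechanism can afford the next, larger dataset.

```python
import numpy as np

def run_everlasting(halts, sample_cost=1.0, n0=8):
    """Toy simulation of chained rounds with low/high pricing. halts[t]
    says whether query t causes the current round to halt prematurely.
    All constants are illustrative placeholders, not the paper's values."""
    n, revenue, prices = n0, 0.0, []
    for t, causes_halt in enumerate(halts):
        price = 1.0 / np.sqrt(t + 1)             # placeholder "low price"
        if causes_halt:
            next_cost = sample_cost * 2 * n      # buy a dataset twice as large
            price += max(0.0, next_cost - revenue - price)  # the "high price"
        revenue += price
        if causes_halt:
            revenue -= next_cost                 # spend revenue on new samples
            n *= 2
        prices.append(price)
    return prices, n, revenue
```

By construction the revenue never dips below zero when a new dataset is purchased, mirroring the sustainability guarantee; the halting query absorbs whatever the low prices collected so far do not cover.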
Relationship to prior work on adaptive data analysis
We handle adaptivity using ideas developed in recent work on adaptive data analysis. In this line of work, all queries are typically assumed to be adaptively chosen and the overall number of queries is known in advance. For completeness, we briefly describe several algorithms that have been developed in this context and compare them with ours. Dwork et al.  analyze an algorithm that adds Laplace or Gaussian noise to the empirical mean in order to answer adaptive queries using samples—a method that forms the basis of ValidationRound. However, when answering exponentially many non-adaptive queries, untruncated Laplace or Gaussian noise causes large errors on some answers: the noise variance must be large enough to ensure that the sample mean remains accurate under adaptivity, and the maximum of exponentially many draws at that variance is large. We therefore use truncated Gaussian noise and show that truncation does not substantially affect the analysis for autonomous queries.
Thresholdout  answers verification queries in which the user submits both a query and an estimate of the answer. The algorithm uses samples to answer queries of which at most estimates are far from correct. Similar to our use of the second dataset , this algorithm can be used to detect overfitting and answer adaptive queries (this is the basis of the EffectiveRounds algorithm ). However, in our application this algorithm would have sample complexity of , for autonomous queries in total queries. Consequently, direct use of this mechanism would result in a pricing for non-adaptive users that depends on the number of queries by autonomous users. This is in contrast to samples that suffice for ValidationRound, where the improvement relies on our definition of autonomy and truncation of the noise variables.
One might ask if it is possible to devise a mechanism with similar properties but lower costs. We argue that the prices set by EverlastingValidation are near optimal. The total cost to a non-adaptive user who makes queries is . Even if we knew in advance that we would receive only non-adaptive queries, we would still need samples to answer all of them accurately with high probability. Thus, our price for non-adaptive queries is optimal up to constant factors.
It is also known that answering a sequence of adaptively chosen queries with accuracy requires samples [15, 19]. Hence, the cost to a possibly adaptive autonomous user is nearly optimal in its dependence on (up to log factors). One natural concern is that our guarantee in this case is only for the amortized (or total) cost, and not on the cost of each individual query. Indeed, although the average cost of adaptive queries decreases as , the maximal cost of a single query might increase as . A natural question is whether the maximum price can be reduced, to spread the high price over more queries.
Finally, an individual who queries our mechanism with entirely non-adaptive queries will only pay in the worst case; generally, they will benefit from the economies of scale associated with collecting more and more data. For instance, if there are users each making non-adaptive queries, then the total cost of all queries will be so the average cost to each user is only .
6 An Alternative Approach: EverlastingTO
The EverlastingValidation mechanism provides cost guarantees that are, in certain ways, nearly optimal. The two main shortcomings are that (1) the price guarantees hold only for non-adaptive or autonomous users, not arbitrarily adaptive ones, and (2) the cost of an individual adaptive query cannot be upper bounded. One might also ask whether inventing ValidationRound was necessary in the first place. Another mechanism, Thresholdout , is already well suited to a mix of adaptive and non-adaptive queries, giving accuracy guarantees for quadratically many arbitrarily adaptive queries or exponentially many non-adaptive ones. Perhaps using Thresholdout instead would be better? We now describe an alternative mechanism, EverlastingTO, which allows us to guarantee the price of individual queries, including arbitrarily adaptive ones, but with an increase in the exponent of the cost for both non-adaptive and adaptive queries.
The EverlastingTO mechanism is very similar to EverlastingValidation, except it uses Thresholdout in the place of ValidationRound. In each round, the algorithm determines an overfitting budget, , and a maximum number of queries, , as a function of the tradeoff parameter . It then answers queries using Thresholdout, charging a high price for queries that fail the overfitting check, and charging a low price for all of the other queries. Once Thresholdout cannot answer more queries, the mechanism buys more data, reinitializes Thresholdout, and continues as before.
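A minimal sketch of one Thresholdout-style answer as it would be used inside each round (after Dwork et al.; the parameter names and the Laplace noise scales are our illustrative choices, not the exact calibration):

```python
import numpy as np

def thresholdout(query, S_train, S_hold, T, sigma, rng):
    """One Thresholdout-style answer: return the training-set mean unless it
    deviates from the holdout mean by more than a noisy threshold, in which
    case return a noisy holdout mean and flag that a unit of the overfitting
    budget was spent (which, in EverlastingTO, triggers the high price)."""
    a = np.mean([query(x) for x in S_train])
    b = np.mean([query(x) for x in S_hold])
    if abs(a - b) > T + rng.laplace(0.0, sigma):
        return b + rng.laplace(0.0, sigma), True   # overfitting detected
    return a, False                                # cheap, low-price answer
```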
We analyze EverlastingTO in Appendix D. Theorems 6-9 closely parallel the guarantees of EverlastingValidation and establish the following for any and any : Validity: with high probability, for any sequence of querying rules, all answers provided by EverlastingTO are -accurate. Sustainability: EverlastingTO charges high enough prices to be able to afford new samples as needed, excluding the initial budget. Cost: with high probability, any non-adaptive queries and any adaptive queries cost at most (ignoring the dependence on ).
Unlike EverlastingValidation, which prioritized charging as little as possible for non-adaptive queries, EverlastingTO increases the cost to in order to bound the price of arbitrary adaptive queries. The parameter allows the database manager to control the tradeoff; for near zero, the cost of adaptive queries is roughly the optimal , but non-adaptive queries are extremely expensive. On the other side, for near , the cost of adaptive queries becomes very high, but the cost of non-adaptive queries is relatively small, although it does not approach optimality.
Further details of the mechanism are contained in Appendix D. We also provide a tighter analysis of the Thresholdout algorithm which guarantees accurate answers using a substantially smaller amount of data in Appendix C. This analysis allows us to reduce the exponent in EverlastingTO’s cost guarantee for non-adaptive queries.
7 Potential applications
In the ML challenge scenario, validation results are often displayed on a scoreboard. Although it is often assumed that scoreboards cannot be used for extensive adaptation, such adaptation appears to have played a role in determining the outcome of several well-known competitions, including the Netflix challenge, where final test set performance was significantly worse than performance on the leaderboard data set. EverlastingValidation would guarantee that test errors returned by the validation database are accurate, regardless of adaptation, collusion, the number of queries made by each user, or other intentional or unintentional dependencies. We do charge a price per validation, but as long as users are non-adaptive, the price is very small. Adaptive users, on the other hand, pay what is required to ensure validity (which could be a lot). Nevertheless, even a wealthy user who could afford the higher cost of adaptive queries would still not be able to “cheat” and overfit the scoreboard set, and a poor user could still afford the quickly diminishing costs of validating non-adaptive queries.
Another feature of our mechanism is that once a round is over, we can safely release the datasets and to the public as unrestricted training data. This way, poor analysts also benefit from adaptive queries made by others, as all data is eventually released, and at any given time, a substantial fraction of all the data ever collected is public. Also, the ratio of public data to validation data can easily be adjusted by slightly amending the pricing.
In the context of scientific discovery, one use case is very similar to the ML competition. Scientists can search for interesting phenomena using unprotected data, and then re-evaluate “interesting” discoveries with the database mechanism in order to get an accurate and almost-unbiased estimate of the true value. This could be useful, for example, in building prediction models for scientific phenomena such as genetic risk of disease, which often involve complex modeling.
However, most scientific research is done in the context of hypothesis testing rather than estimation. Declarations of discoveries like the Higgs boson  and genetic associations of disease  are based on performing a potentially large number of hypothesis tests and identifying statistically significant discoveries while controlling for multiplicity. Because of the complexity of the discovery process, it is often quite difficult to properly control for all potential tests, causing many difficulties, the most well known of which is publication bias (cf. “Why Most Published Research Findings are False” ). An alternative approach, which has gained popularity in recent years, is requiring replication of any declared discoveries on new and independent data . Because the new data is used only for replication, it is much easier to control multiplicity and false-discovery concerns.
Our everlasting database can be useful in both the discovery and replication phases. We now briefly explain how its validity guarantees can be used for multiplicity control in testing. Assume we have a collection of hypothesis tests on functionals of with null hypotheses: We employ our scheme to obtain estimates of . Setting , Theorem 1 guarantees: meaning that for any combination of true nulls, the rejection policy reject if makes no false rejections with probability at least , thus controlling the family-wise error rate (FWER) at level . This is easily used in the replication phase, where an entire community (say, type-I diabetes researchers) could share a single replication server using the everlasting database scheme in order to guarantee validity. It could also be used in the discovery phase for analyses that can be described through a set of measurements and tests of the form above.
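A toy numerical check of this union-bound argument (the thresholds and means below are hypothetical): on the event that every estimate is simultaneously tau-accurate, rejecting a null mu_j <= 0 only when its estimate exceeds tau can never falsely reject a true null, so the FWER is bounded by the failure probability of the accuracy guarantee.

```python
import numpy as np

rng = np.random.default_rng(3)
tau = 0.1
true_means = np.array([0.0, 0.0, 0.3])    # the first two nulls (mu <= 0) are true
estimates = true_means + rng.uniform(-tau, tau, size=3)  # simultaneously tau-accurate
reject = estimates > tau                  # reject only when the estimate exceeds tau
false_rejections = int(reject[:2].sum())  # guaranteed zero given tau-accuracy
```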
8 Conclusion and extensions
Our primary contribution is in designing a database mechanism that brings together two important properties that have not been previously combined: everlasting validity and robustness to adaptivity. Furthermore, we do so in an asymptotically efficient manner that guarantees that non-adaptive queries are inexpensive with high probability, and that the potentially high cost of handling adaptivity only falls upon truly adaptive users. Currently, there are large constants in the cost guarantees, but these are pessimistic and can likely be reduced with a tighter analysis and more refined pricing scheme. We believe that with some improvements, our scheme can form the basis of practical implementations for use in ML competitions and scientific discovery. Also, our cost guarantees themselves are worst-case and only guarantee a low price to entirely non-adaptive users. It would be useful to investigate experimentally how much users would actually end up being charged under “typical use,” especially users who are only “slightly adaptive.” However, there is no established framework for understanding what would constitute “typical” or “slightly adaptive” usage of a statistical query answering mechanism, so more work is needed before such experiments would be insightful.
Our mechanism can be improved in several ways. It only provides answers at a fixed additive accuracy , and only answers statistical queries; however, these issues have already been addressed in the adaptive data analysis literature. For example, arbitrary low-sensitivity queries can be handled without any modification to the algorithm, and arbitrary real-valued queries can be answered with error proportional to their standard deviation (instead of as in our analysis) . These approaches can be combined with our algorithms, but we restrict our attention to the basic case since our focus is different.
Finally, one potentially objectionable element of our approach is that it discards samples at the end of each round (although these samples are not wasted since they become part of the public dataset). An alternative approach is to add the new samples to the dataset as they can be purchased. While this might be a more practical approach, existing analysis techniques that are based on differential privacy do not appear to suffice for dealing with such mechanisms. Developing more flexible analysis techniques for this purpose is another natural direction for future work.
BW is supported by the NSF Graduate Research Fellowship under award 1754881.
- Aad et al.  Georges Aad, T. Abajyan, B. Abbott, J. Abdallah, S. Abdel Khalek, A. A. Abdelalim, O. Abdinov, R. Aben, B. Abi, M. Abolins, et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, 2012.
- Aharoni and Rosset  Ehud Aharoni and Saharon Rosset. Generalized alpha-investing: Definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B, 76(4):771–794, 2014.
- Baker  Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature News, 533(7604):452, 2016.
- Bassily et al.  Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In STOC, pages 1046–1059, 2016.
- Blum and Hardt  Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning, pages 1006–1014, 2015.
- Bun and Steinke  Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
- Chatterjee et al.  Nilanjan Chatterjee, Jianxin Shi, and Montserrat García-Closas. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nature Reviews Genetics, 17(7):392, 2016.
- Craddock et al.  Nick Craddock, Matthew E Hurles, Niall Cardin, Richard D Pearson, Vincent Plagnol, Samuel Robson, Damjan Vukcevic, Chris Barnes, Donald F Conrad, Eleni Giannoulatou, et al. Genome-wide association study of cnvs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 464(7289):713, 2010.
- Dwork et al.  C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
- Dwork et al.  Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. CoRR, abs/1411.2664, 2014. Extended abstract in STOC 2015.
- Dwork et al. [2015a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015a.
- Dwork et al. [2015b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015b. doi: 10.1126/science.aaa9375. URL http://www.sciencemag.org/content/349/6248/636.abstract.
- Feldman and Steinke  Vitaly Feldman and Thomas Steinke. Generalization for adaptively-chosen estimators via stable median. In Conference on Learning Theory (COLT), 2017.
- Gelman and Loken  Andrew Gelman and Eric Loken. The statistical crisis in science. The American Statistician, 102(6):460, 2014.
- Hardt and Ullman  M. Hardt and J. Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, pages 454–463, 2014.
- Hardt  Moritz Hardt. Climbing a shaky ladder: Better adaptive risk estimation. CoRR, abs/1706.02733, 2017. URL http://arxiv.org/abs/1706.02733.
- Ioannidis  John PA Ioannidis. Why most published research findings are false. PLoS medicine, 2(8):e124, 2005.
- Nissim and Stemmer  Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.
- Steinke and Ullman  Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, pages 1588–1628, 2015. URL http://jmlr.org/proceedings/papers/v40/Steinke15.html.
Appendix A Proofs from Section 3
For any , , , and any sequence of querying rules (with arbitrary adaptivity) interacting with ValidationRound
Consider any sequence of querying rules (with arbitrary adaptivity). The querying rules and ValidationRound together determine a joint distribution over statistical queries, answers, and prices.
Consider also the interaction of the same sequence of querying rules with an alternative algorithm, which always returns (i.e. it ignores the if-statement in ValidationRound). This generates an infinite sequence of queries, answers, and prices . Now, we retroactively check the condition in the if-statement for each of the queries to calculate what should be, and take the length prefix of the . This sequence has exactly the same distribution as the sequence generated by ValidationRound, and each was chosen independently of by construction. Since has outputs bounded in , we can apply Hoeffding’s inequality:
At most queries are answered by the mechanism, so a union bound completes the proof. ∎
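For reference, the generic form of Hoeffding's inequality invoked in this proof, for a statistical query $q$ with range $[0,1]$ averaged over $n$ i.i.d. samples (here $n$ and the tolerance $\tau$ are generic symbols, not necessarily the paper's parameter names):

```latex
\Pr\left[\,\left|\frac{1}{n}\sum_{i=1}^{n} q(x_i) - \mathop{\mathbb{E}}_{x \sim \mathcal{D}}[q(x)]\right| \ge \tau\,\right] \le 2\, e^{-2 n \tau^2}.
```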
A query is not answered unless , so
By Lemma 5, with probability the final term is at most simultaneously for all . ∎
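The chaining step in this proof is a triangle inequality; writing $a_t$ for the returned answer, $\mathcal{E}_T[q_t]$ for the empirical mean of $q_t$ on the holdout set, and $q_t(\mathcal{D})$ for the population mean (notation introduced here only for illustration), the argument has the shape:

```latex
\left|a_t - q_t(\mathcal{D})\right| \;\le\; \left|a_t - \mathcal{E}_T[q_t]\right| + \left|\mathcal{E}_T[q_t] - q_t(\mathcal{D})\right|.
```

The first term is controlled because a query is only answered when the answer is close to the holdout's empirical mean, and the second by Lemma 5.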
Since the non-adaptive user’s querying rules ignore all of the history, they are each chosen independently of . By Hoeffding’s inequality
and similarly for . If both and , then the algorithm halted upon receiving query because its empirical means on and were too dissimilar and not because it had already answered its maximum allotment of queries. Therefore,
At most queries are answered by the mechanism, so a union bound completes the proof. ∎
For any , , , any sequence of query rules, and any possibly adaptive autonomous user , if and then
Consider a slightly modified version of ValidationRound, where Gaussian noise is added instead of truncated Gaussian noise . Until this modified algorithm halts, all of the answers it provides are released according to the Gaussian mechanism on , which satisfies -zCDP by Proposition 1.6 in . We can view as an (at most) -fold composition of -zCDP mechanisms, which satisfies -zCDP by Lemma 1.7 in . Finally, Proposition 1.3 in  shows us how to convert this concentrated differential privacy guarantee to a regular differential privacy guarantee. In particular, is generated under
Specifically, when , and satisfy:
Furthermore, for . Therefore, the total variation distance between and is
. Consider two random vectors and , the first of which has independent distributed coordinates, and the second of which has coordinates for and for all of the . The total variation distance between these vectors is then at most .
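The privacy accounting and the truncation argument above can be sketched numerically. The formulas below are the standard facts from Bun and Steinke (the Gaussian mechanism's zCDP parameter, additive composition, and the zCDP-to-DP conversion) together with the Gaussian tail bound for the truncation step; the function names and example parameter values are our own illustration, not the paper's.

```python
import math

def gaussian_zcdp(sensitivity, sigma):
    """rho-zCDP of the Gaussian mechanism: rho = Delta^2 / (2 * sigma^2)."""
    return sensitivity ** 2 / (2 * sigma ** 2)

def compose(rhos):
    """zCDP parameters add under composition."""
    return sum(rhos)

def zcdp_to_dp(rho, delta):
    """rho-zCDP implies (rho + 2*sqrt(rho * ln(1/delta)), delta)-DP for any delta > 0."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

def truncation_tv(sigma, c, k):
    """TV distance between k i.i.d. N(0, sigma^2) coordinates and their truncations
    to [-c, c] is at most k * P(|N(0, sigma^2)| > c) = k * erfc(c / (sigma * sqrt(2)))."""
    return k * math.erfc(c / (sigma * math.sqrt(2)))

# Illustrative numbers (not the paper's actual parameters):
n, k, sigma, delta = 10_000, 100, 0.01, 1e-6
rho_total = compose([gaussian_zcdp(1 / n, sigma)] * k)  # k answers, sensitivity 1/n each
eps = zcdp_to_dp(rho_total, delta)                      # resulting (eps, delta)-DP guarantee
tv = truncation_tv(sigma, 5 * sigma, k)                 # truncate at 5 standard deviations
```

With these numbers the total variation cost of truncating at five standard deviations is negligible relative to the DP guarantee, which is the role it plays in the proof.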
Now, for the given sequence of querying rules, , and , view ValidationRound as a function of the random noise which is added into the answers. Then too. Above, we showed that with probability the user’s interaction with ValidationRound has the property that
So their interaction with ValidationRound satisfies
Since this statement only depends on the indices of in , we can replace all of the remaining indices with truncated Gaussians and maintain this property, which recovers ValidationRound. ∎
Proof of Lemma 3.
Appendix B Proof of Lemma 4
The revenue collected in round via the low price depends on how many queries are answered both in and before round . The maximum number of queries answered in a round is (this is enforced by ValidationRound). Let be the total number of queries made before the beginning of round , then
The first inequality holds because every exponent in the sum is at least by our choice of and for any , . The second inequality holds since implies . So, if queries are answered during round , the revenue collected is at least
Appendix C Tighter Thresholdout Analysis
In this section, we provide a tighter analysis of the Thresholdout algorithm . In particular, the previous analysis showed a sample complexity for answering queries with an overfitting budget of of , whereas we prove a bound like . This improvement has important consequences for our application of Thresholdout to the everlasting database setting. We obtain the improvement by applying the “monitor technique” of Bassily et al.
Lemma 7 (Lemma 23 ).
Thresholdout satisfies -differential privacy and also -differential privacy for any .
Lemma 8 (Corollary 7 ).
Let be an algorithm that outputs a statistical query . Let be a random dataset chosen according to distribution and let . If is -differentially private then
Lemma 9 (Theorem 8 ).
Let be an -differentially private algorithm that outputs a statistical query. For dataset drawn from , we let . Then for ,
Theorem 5 (cf. Theorem 25 ).
Let and . Set and . Let denote datasets of size drawn i.i.d. from a distribution . Consider an analyst that is given access to and adaptively chooses functions while interacting with Thresholdout which is given datasets and values . For every let denote the answer of Thresholdout on query . Then whenever
with probability at least , for all before Thresholdout halts and is an adaptive query.
Consider the following post-processing of the output of Thresholdout: look through the sequence of queries and answers and output . Since this procedure does not use the datasets, and since Thresholdout computes the sequence of queries and answers in a differentially private manner, it follows that are also released under differential privacy. So by Lemma 7, is released simultaneously under
With our choice of , in the case that then, using the pure differential privacy guarantee we have so by Lemma 8
Alternatively, in the case that
then, choosing , under the approximate differential privacy guarantee we have
so by Lemma 9
Therefore, in either case