
The Limits of Post-Selection Generalization

While statistics and machine learning offer numerous methods for ensuring generalization, these methods often fail in the presence of adaptivity, the common practice in which the choice of analysis depends on previous interactions with the same dataset. A recent line of work has introduced powerful, general-purpose algorithms that ensure post hoc generalization (also called robust or post-selection generalization), which says that, given the output of the algorithm, it is hard to find any statistic for which the data differs significantly from the population it came from. In this work we show several limitations on the power of algorithms satisfying post hoc generalization. First, we show a tight lower bound on the error of any algorithm that satisfies post hoc generalization and answers adaptively chosen statistical queries, showing a strong barrier to progress in post-selection data analysis. Second, we show that post hoc generalization is not closed under composition, despite many examples of such algorithms exhibiting strong composition properties.



1 Introduction

Consider a dataset consisting of independent samples from some unknown population. How can we ensure that the conclusions drawn from the dataset generalize to the population? Despite decades of research in statistics and machine learning on methods for ensuring generalization, there is an increased recognition that many scientific findings generalize poorly (e.g. Ioannidis [21]). While there are many reasons a conclusion might fail to generalize, one that is receiving increasing attention is adaptivity, which arises when the choice of method for analyzing the dataset depends on previous interactions with the same dataset.

Adaptivity can arise from many common practices, such as exploratory data analysis, using the same data set for feature selection and regression, and the re-use of datasets across research projects. Unfortunately, adaptivity invalidates traditional methods for ensuring generalization and statistical validity, which assume that the method is selected independently of the data. The misinterpretation of adaptively selected results has even been blamed for a “statistical crisis” in empirical science.


Once methods are selected adaptively, data analysts must take selection into account when interpreting results—a problem the statistics literature refers to as selective inference, or post-selection inference. Several approaches have been devised for statistical inference after selection. These generally require conditional inference algorithms, or uniform convergence arguments, tailored to a particular sequence of analyses. Perhaps more fundamentally, this line of work is not prescriptive, in that it does not provide design principles that guide the selection of a sequence of analyses to improve later inference. We discuss that work further in Section 1.2.

A recent line of work initiated by Dwork et al. [9] posed the question: Can we design general-purpose algorithms for ensuring generalization in the presence of adaptivity, together with guarantees on their accuracy? These works identified properties of an algorithm that ensure good generalization of queries selected based on its output, including differential privacy [9, 1], information-theoretic measures [8, 28, 27, 34], and compression [6]. They also identified many powerful general-purpose algorithms satisfying these properties, leading to algorithms for post-selection data analysis with greater statistical power than all previously known approaches.

Each of the aforementioned properties gives incomparable generalization guarantees, and allows for qualitatively different types of algorithms. The common thread in each of these approaches is to establish a notion of post hoc generalization (first named robust generalization by Cummings et al. [6]). Informally, an algorithm satisfies post hoc generalization if there is no way, given only the output of the algorithm, to identify any statistical query [22] such that the value of that query on the dataset is significantly different from its answer on the whole population. (Footnote 1: The definition extends seamlessly to richer classes of statistics, but we specialize to these queries for concreteness. That our negative results hold for such a simple class of queries only makes our results stronger.)

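More formally: a statistical query (or linear functional) is defined by a function $q : \mathcal{X} \to [-1, 1]$, where $\mathcal{X}$ is the set of possible data points. The query’s empirical mean on a data set $X = (x_1, \dots, x_n) \in \mathcal{X}^n$ is its average over the points in $X$, i.e., $q(X) = \frac{1}{n} \sum_{i=1}^{n} q(x_i)$. Given a distribution $P$ (called the population) on $\mathcal{X}$, the query’s population mean is $q(P) = \mathbb{E}_{x \sim P}[q(x)]$, the expectation of $q$ on a fresh sample from $P$.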
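The two quantities defined above can be made concrete with a small sketch. It assumes a finite universe, a population given as a dictionary of probabilities, and queries with values in a bounded range such as [-1, 1]; the function names are illustrative and do not come from the paper.

```python
from typing import Callable, Dict, Hashable, Sequence

# A statistical query maps one data point to a bounded real value.
Query = Callable[[Hashable], float]


def empirical_mean(q: Query, sample: Sequence[Hashable]) -> float:
    """Average of q over the points of the sample."""
    return sum(q(x) for x in sample) / len(sample)


def population_mean(q: Query, population: Dict[Hashable, float]) -> float:
    """Expectation of q under a finite distribution {point: probability}."""
    return sum(prob * q(x) for x, prob in population.items())


def generalization_error(q: Query, sample, population) -> float:
    """Absolute gap between the empirical mean and the population mean."""
    return abs(empirical_mean(q, sample) - population_mean(q, population))
```

The generalization error of a query is the gap that post hoc generalization asks to keep small for every query an analyst can find.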

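Definition 1 (Post Hoc Generalization [6]). An algorithm $M$ satisfies $(\varepsilon, \delta)$-post hoc generalization if for every distribution $P$ over $\mathcal{X}$ and every algorithm $A$ that outputs a bounded function $q : \mathcal{X} \to [-1, 1]$, if $X \sim P^{n}$ and $q \sim A(M(X))$, then

$$\Pr\big[\, |q(P) - q(X)| > \varepsilon \,\big] \le \delta,$$

where $q(P) = \mathbb{E}_{x \sim P}[q(x)]$ and $q(X) = \frac{1}{n} \sum_{i=1}^{n} q(x_i)$, and the probability is over the sampling of $X$ and any randomness of $M$ and $A$.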

We phrase the definition as a tail bound on the post hoc generalization error, but one could also cast the definition in terms of the generalization error’s moments (e.g. mean squared error). For the purpose of proving asymptotic lower bounds, the definition we use is more general.

Post hoc generalization is easily satisfied whenever is large enough to ensure uniform convergence for the class of statistical queries. However, uniform convergence is only satisfied in the unrealistic regime where is much larger than . Algorithms that satisfy post hoc generalization are interesting in the realistic regime where there will exist queries for which and are far, but these queries cannot be found.
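To illustrate why the output of the algorithm matters, the following sketch contrasts a query chosen after seeing the raw sample with a query fixed in advance. It is a toy illustration under assumed conditions (uniform population over a finite universe, a mechanism that simply releases its sample), not a construction from the paper.

```python
import random


def gap(q, sample, universe_size):
    """|empirical mean - population mean| under a uniform population."""
    emp = sum(q(x) for x in sample) / len(sample)
    pop = sum(q(x) for x in range(universe_size)) / universe_size
    return abs(emp - pop)


def leaky_mechanism(sample):
    """Hypothetical mechanism that releases the raw sample unchanged."""
    return list(sample)


def overfitting_analyst(output):
    """Builds a query that is +1 on released points and -1 elsewhere."""
    seen = set(output)
    return lambda x: 1.0 if x in seen else -1.0


def parity_query(x):
    """A query fixed before looking at any data."""
    return 1.0 if x % 2 == 0 else -1.0


def compare(universe_size, n, rng):
    sample = [rng.randrange(universe_size) for _ in range(n)]
    adaptive_q = overfitting_analyst(leaky_mechanism(sample))
    return gap(adaptive_q, sample, universe_size), gap(parity_query, sample, universe_size)
```

When the universe is much larger than the sample, the adaptive query has empirical mean 1 and population mean close to -1, while the fixed query typically has a gap on the order of one over the square root of the sample size.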

Since all existing general-purpose algorithms for post-selection data analysis are analyzed via post hoc generalization, it is crucial to understand what we can achieve with algorithms satisfying post hoc generalization. This work shows two types of strong limitations on post hoc generalization:

  • We show new, almost tight limits on the accuracy of any post hoc generalizing algorithm for estimating the answers to adaptively chosen statistical queries.

  • We show that the composition of algorithms satisfying post hoc generalization does not always satisfy post hoc generalization.

Our results identify natural barriers to progress in this area, and highlight important challenges for future research on post-selection data analysis.

1.1 Our Results

1.1.1 Sample Complexity Bounds for Statistical Queries

Our first contribution is a strong new lower bound on any algorithm that answers a sequence of adaptively chosen statistical queries while satisfying post hoc generalization. State-of-the-art algorithms for this problem do satisfy post hoc generalization, thus our lower bound shows that improving on the state of the art would require a fundamentally different analysis paradigm.

Dwork et al. [9] were the first to study the sample size required to answer adaptive statistical queries; a recent line of work deepened the study of their model [1, 19, 29, 6, 27, 14, 15]. In this model, there is an underlying distribution on a set that an analyst would like to study. An algorithm holds a sample , receives statistical queries , and returns accurate answers such that . To model adaptivity, consider a data analyst that issues a sequence of queries where each query may depend on the answers given by the algorithm in response to previous queries. Our goal is to design mechanisms that answer a sequence of queries selected by an arbitrary (possibly adversarial) analyst at a desired accuracy using as small a sample size as possible.

The simplest algorithm would return the empirical mean in response to each query. One can show that this algorithm answers each query to within if . (Footnote 2: Formally, in order to rule out pathological queries, this sample-complexity guarantee applies only if we round the empirical mean to some suitable precision .) Surprisingly, we can improve the sample complexity to , a quadratic improvement in , by returning the empirical mean perturbed with carefully calibrated noise [9, 1]. The analysis of this approach relies on post hoc generalization: one can show that no matter how the analyst selects queries, each query will satisfy with high probability. The noise distribution also ensures that the difference between each answer and the empirical mean , hence .
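The noise-addition idea can be sketched as follows. This is a minimal illustration, not the calibrated mechanism of [9, 1]: the use of Gaussian noise, the free parameter `noise_std`, and the clamping of answers to [-1, 1] are all assumptions made here for concreteness.

```python
import random


class NoisyMeanMechanism:
    """Answers each statistical query with a perturbed empirical mean."""

    def __init__(self, sample, noise_std, rng=None):
        self.sample = list(sample)
        self.noise_std = noise_std          # assumed free parameter, not the paper's calibration
        self.rng = rng or random.Random()

    def answer(self, q):
        emp = sum(q(x) for x in self.sample) / len(self.sample)
        noisy = emp + self.rng.gauss(0.0, self.noise_std)
        # Clamp to the assumed query range [-1, 1].
        return max(-1.0, min(1.0, noisy))
```

Setting `noise_std` to zero recovers the plain empirical-mean algorithm, which makes it easy to compare the two strategies against the same analyst.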

Our main result shows that the sample complexity is essentially optimal for any algorithm that uses the framework of post hoc generalization. Our construction refines the techniques in Hardt and Ullman [19] and Steinke and Ullman [29]—which yield a lower bound of for .

[Informal] If takes a sample of size , satisfies -post hoc generalization, and for every distribution over and every data analyst who asks statistical queries,

(where the probability is taken over and the coins of and ), then . (Footnote 3: Independently, Wang [32] proved a quantitatively similar bound to Theorem 1.1.1. However, Wang’s lower bound assumes that the algorithm can see only the empirical mean of each query, and not the raw data . Their bound also applies for a slightly different (though closely related) class of statistics, where all possible query values are jointly Gaussian.)

The dimensionality of required in our result is at least as large as ; that dependence is essentially necessary. Indeed, if the support of the distribution is , then there is an algorithm that takes a sample of size just  [9, 1], so the conclusion is simply false if . Even when , the aforementioned algorithms require running time at least per query. [19, 29] also showed that any polynomial time algorithm that answers queries to constant error requires . We improve this result to have the optimal dependence on .

[Informal] Assume pseudorandom generators exist and let be any constant. If takes a sample of size , has polynomial running time, satisfies -post hoc generalization, and for every distribution over and every data analyst who asks statistical queries,

then , where the probability is taken over and the coins of and .

1.1.2 Negative Results for Composition

One of the motivations for studying post hoc generalization is to allow for exploratory data analysis and dataset re-use. In these settings, the same dataset may be analyzed by many different algorithms, each satisfying post hoc generalization. Thus it is important to understand whether the composition of these algorithms also satisfies post hoc generalization. We show that this is not always the case. For every there is a collection of algorithms that take samples from a distribution over such that

  1. for every and for , each of these algorithms satisfies -post hoc generalization, but

  2. the composition does not satisfy -post hoc generalization.

By standard anti-concentration results, no algorithm satisfies -post hoc generalization for . On the other hand, every algorithm trivially satisfies - and -post-hoc generalization. Thus, Theorem 1.1.2 states that there is a set of algorithms that have almost optimal post hoc generalization, but whose composition does not have any non-trivial post hoc generalization. Certain subclasses of post hoc generalizing algorithms—such as differential privacy [7]—do satisfy composition where the parameters grow at worst linearly in . In contrast, for arbitrary post hoc generalization, can grow with . Our results give additional motivation to studying subclasses that do compose.

We first consider a relaxed notion of computational post hoc generalization, for which we show that composition can fail even for just two algorithms. Informally, computational post hoc generalization requires that Definition 1 hold when the analyst runs in polynomial time. Assume pseudorandom generators exist. For every there are two algorithms that take samples from a distribution over such that

  1. for every and for , both algorithms satisfy -computational post hoc generalization, but

  2. the composition is not -computationally post hoc generalizing.

We prove the information-theoretic result (Theorem 1.1.2) in Section 4. Due to space restrictions, we defer the computational result (Theorem 1.1.2) to the full version of this work.

1.2 Related Work

In addition to the upper and lower bounds mentioned above for adaptive linear queries, a number of works have explored variants of the model, notably the models of jointly Gaussian queries [28, 33, 32] and a Bayesian model with symmetric information [12, 13].

In the statistics literature, work on selective inference dates at least to the works of Freedman and Freedman [17], Hurvich and Tsai [20], and Pötscher [26]. The last decade or so has seen a resurgence of interest, for example in [23, 24, 11, 16, 31, 2, 4] (this list is necessarily incomplete). One line of work due to Berk et al. [2] and Buja et al. [4] provides uniform validity results for all analyses in a particular class. As we mentioned above, uniform convergence is not possible in many settings without prohibitively large sample sizes. Another line of work looks at selective inference by explicitly conditioning on the sequence of previous query answers (see Bi et al. [3] for a high-level summary of the approach). Explicit conditioning has the advantage of optimality in several contexts, but requires formulating a prior over possible distributions, and can be computationally infeasible. Perhaps most problematically, the work we are aware of along these lines is more “descriptive” than “prescriptive,” providing no estimates of power or accuracy before the execution of an experiment, and thus no guidance on the optimal design of the experiment. Understanding the connections between that line of work and the work on which we build directly remains an intriguing open problem.

While our work considers biases arising from adaptive analysis of iid data, a recent work of Nie et al. [25] investigates the bias introduced by adaptive sampling of data. These questions have similar motivation, but are technically orthogonal.

2 Lower Bounds for Statistical Queries

2.1 Post Hoc Generalization for Adaptive Statistical Queries

We are interested in the ability of interactive algorithms satisfying post hoc generalization to answer a sequence of statistical queries. Definition 1 applies to such algorithms via the following experiment.

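The adversary chooses a distribution over the universe
A sample of independent draws from this distribution is given to the algorithm (but not to the adversary)
for each round do
       The adversary outputs a statistical query (possibly depending on the previous queries and answers)
       The algorithm outputs an answer (possibly depending on the sample and on all queries and answers so far)
Algorithm 1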
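The interaction in this game can be sketched as a simple loop. The names `run_adaptive_game`, `empirical_mean_mechanism`, and `threshold_analyst` are hypothetical, and the analyst shown is only an example of a strategy whose next query depends on earlier answers; it is not the adversary used in the lower bound.

```python
import random


def run_adaptive_game(draw_point, make_mechanism, analyst, n, k, rng):
    """Plays k rounds; the analyst sees past (query, answer) pairs, never the sample."""
    sample = [draw_point(rng) for _ in range(n)]
    mechanism = make_mechanism(sample)
    transcript = []
    for _ in range(k):
        query = analyst(transcript)     # may depend on all earlier rounds
        answer = mechanism(query)       # may depend on the sample
        transcript.append((query, answer))
    return sample, transcript


def empirical_mean_mechanism(sample):
    """Mechanism that answers with the exact empirical mean."""
    return lambda q: sum(q(x) for x in sample) / len(sample)


def threshold_analyst(transcript):
    """Example analyst: shifts a threshold query based on the last answer."""
    threshold = 0.5
    if transcript:
        _, last_answer = transcript[-1]
        threshold = 0.5 + 0.25 * last_answer
    return lambda x, t=threshold: 1.0 if x >= t else -1.0
```

A usage example would draw points with `lambda r: r.random()` and then compare each recorded answer against the query's population mean to measure accuracy.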

An algorithm is -post hoc generalizing for adaptive queries over given samples if for every adversary ,

An algorithm is -accurate for adaptive queries over given samples if for every adversary ,

2.2 A Lower Bound for Natural Algorithms

We begin with an information-theoretic lower bound for a class of algorithms that we call natural algorithms. These are algorithms that can only evaluate the query on the sample points they are given. That is, an algorithm is natural if, when given a sample and a statistical query, the algorithm returns an answer that is a function only of the query's values on the sample points. In particular, it cannot evaluate the query on data points of its choice. Many algorithms in the literature have this property. Formally, we define natural algorithms via the game in Algorithm 2. This game is identical to the game in Algorithm 1 except that when the adversary outputs a query, the algorithm does not receive the whole query, but instead receives only its values on the sample points.

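The adversary chooses a distribution over the universe
A sample of independent draws from this distribution is given to the algorithm (but not to the adversary)
for each round do
       The adversary outputs a statistical query (possibly depending on the previous queries and answers)
       The algorithm receives only the query's values on the sample points and outputs an answer (possibly depending on those values and on the answers so far)
Algorithm 2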
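The restriction that defines a natural algorithm can be sketched by separating the query from the view the algorithm receives. The helper names below are hypothetical; the point of the sketch is that two queries agreeing on the sample are indistinguishable to a natural algorithm, whatever they do elsewhere.

```python
def restrict_to_sample(q, sample):
    """The only information about q that a natural algorithm receives."""
    return [q(x) for x in sample]


def natural_empirical_mean(values_on_sample):
    """A natural algorithm: its answer depends only on the restricted view."""
    return sum(values_on_sample) / len(values_on_sample)


def answer_as_natural_algorithm(q, sample):
    """Runs the natural algorithm on the restricted view of q."""
    return natural_empirical_mean(restrict_to_sample(q, sample))
```

For instance, a query that is +1 on {2, 5} and one that is +1 on {2, 5, 100} yield the same answer on the sample [2, 5, 7], because they differ only at a point outside the sample.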

[Lower Bound for Natural Algorithms] There is an adversary such that for every natural algorithm , and for universe size , if


The proof uses the analyst described in Algorithm 3. For notational convenience, actually asks queries, but this does not affect the final result.

Parameters: sample size , universe size , number of queries , target accuracy
Let , , and
for  do
       for  do
             Sample and let
      Ask query and receive answer
       for  do
               where takes and returns the nearest point in to .
             Let (N.B. By construction, .)
for  do
       Define and
Let be defined by
Algorithm 3

In order to prove Theorem 2, it suffices to prove that either the answer to one of the initial queries fails to be accurate (in which case the algorithm is not accurate), or that the final query gives significantly different answers on the sample and the population (in which case the algorithm is not robustly generalizing). Formally, we have the following proposition. For an appropriate choice of and sufficiently large, for any natural , with probability at least , either

  1. , or

where the probability is taken over the game

We prove Proposition 3 using a series of claims. The first claim states that none of the values are ever too large in absolute value, which follows immediately from the definition of the set and the fact that each term is bounded. For every , .


Note that for every , we have . Now, if then by definition , and the claim follows. Otherwise, suppose is the minimal value such that , then

The next claim states that, no matter how the mechanism answers, very few of the items not in the sample get “accused” of membership, that is, included in the set . [Few Accusations] .


Fix the biases as well as all the information visible to the mechanism (the query values , as well as the answers ). We prove that the probability of is high conditioned on any setting of these variables.

The main observation is that, once we condition on the biases , the query values at are independent with . This is true because is a natural algorithm (so it sees only the query values for points in ) and, more subtly, because the analyst’s decisions about how to sample the ’s, and which points in to include in the sets , are independent of the query values outside of . By the principle of deferred decisions, we may thus think of the query values as selected after the interaction with the mechanism is complete.

Fix . For every and , we have

By linearity of expectation, we also have

Next, note that , since and . The terms are not independent, since if a partial sum ever exceeds , then subsequent values for will be set to 0. However, we may consider a related sequence given by sums of the terms (the difference from is that we use values regardless of whether item is in ). Once we have conditioned on the biases and mechanism’s outputs,

is a sum of bounded independent random variables. By Hoeffding’s Inequality, the sum is bounded

with high probability, for every

By Etemadi’s Inequality, a related bound holds uniformly over all the intermediate sums:

Finally, notice that by construction, the real scores are all set to 0 when an item is added to , so the sets are nested (), and a bound on partial sums of the applies equally well to the partial sums of the . Thus,

Now, the scores are independent across players (again, because we have fixed the biases and the mechanism’s outputs). We can bound the probability that more than elements are “accused” over the course of the algorithm using Chernoff’s bound:

The claim now follows by averaging over all of the choices we fixed. ∎
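The two concentration tools used in this proof can be written out in their standard textbook forms. The sketch below states Hoeffding's inequality for a sum of independent bounded variables and combines it with Etemadi's inequality to bound the maximum partial sum of zero-mean terms; the specific thresholds used in the paper's proof are not reproduced here, and the function names are illustrative.

```python
import math
import random


def hoeffding_tail_bound(ranges, t):
    """Bound on P(|S - E[S]| >= t), S a sum of independent X_i with X_i in [a_i, b_i]."""
    denom = sum((b - a) ** 2 for a, b in ranges)
    return min(1.0, 2.0 * math.exp(-2.0 * t * t / denom))


def max_partial_sum_bound(ranges, t):
    """Bound on P(max_k |S_k| >= t) for zero-mean terms.

    Etemadi: P(max_k |S_k| >= 3u) <= 3 * max_k P(|S_k| >= u). The Hoeffding
    bound for a partial sum is largest for the full sum, so it is used for all k.
    """
    return min(1.0, 3.0 * hoeffding_tail_bound(ranges, t / 3.0))


def empirical_tail(m, t, trials, rng):
    """Monte Carlo estimate of P(|S| >= t) for a sum of m uniform +/-1 signs."""
    hits = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(m))
        if abs(s) >= t:
            hits += 1
    return hits / trials
```

Comparing `empirical_tail` with `hoeffding_tail_bound` for random signs gives a quick sanity check that the bound holds with room to spare.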

The next claim states that the sum of the scores over all not in the sample is small. With probability at least ,


Fix a choice of , the in-sample query values , and the answers . Conditioned on these, the values for are independent and identically distributed. They have expectation 0 (see the proof of Claim 3) and are bounded by (by Claim 3). By Hoeffding’s inequality, with probability at least as desired. The claim now follows by averaging over all of the choices we fixed. ∎

There exists such that, for all sufficiently small and sufficiently large , with probability at least , either

  1. (large error), or

  2. (high scores in sample).

The proof of Claim 2.2 relies on the following key lemma. The lemma has appeared in various forms [29, 10, 30]; the form we use is [5, Lemma 3.6] (rescaled from to ).

[Fingerprinting Lemma] Let be arbitrary. Sample and sample independently. Then

Proof of Claim 2.2.

To make use of the fingerprinting lemma, we consider a variant of Algorithm 3 that does not truncate the quantity to the range when computing the score for each element . Specifically, we consider scores based on the quantities

We prove two main statements: first, that these untruncated scores are equal to the truncated ones with high probability as long as the mechanism’s answers are accurate. Second, that the expected sum of the untruncated scores is large. This gives us the desired final statement.

To relate the truncated and untruncated scores, consider the following three key events:

  1. (“Few accusations”): Let be the event that, at every round , the set of “accused” items outside of the sample is small: . Since the are nested, event implies the same condition for all in .

  2. (“Low population error”): Let be the event that at every round , the mechanism’s answer satisfies .

  3. (“Representative queries”): Let be the event that for all rounds —that is, each query’s population average is close to the corresponding sampling bias .

Conditioned on , the truncated and untruncated scores are equal. Specifically, for all .


We can bound the difference via the triangle inequality:

The first term is the mechanism’s sample error (bounded when occurs). The second is the distortion of the sample mean introduced by setting the query values of to 0. This distortion is at most . When occurs, has size at most , so the second term is at most . Finally, the last term is bounded by when occurs, by definition. The three terms add to at most when , , and all occur. ∎

We can bound the probability of via a Chernoff bound: The probability that a binomial random variable deviates from its mean by is at most .

The technical core of the proof is the use of the fingerprinting lemma to analyze the difference between the sum of untruncated scores and the summed population errors:


We show that for each round , the expected sum of scores for that round is at least . This is true even when we condition on all the random choices and communication in rounds through . Adding up these expectations over all rounds gives the desired expectation bound for .

First, note that summing over all elements is the same as summing over that round’s unaccused elements (since for ). Thus,

We can now apply the Fingerprinting Lemma, with , , for , and (note that depends implicitly on , but since we condition on the outcome of previous rounds, we may take as fixed for round ). We obtain

Now the difference between and the actual population mean is at most . Thus we can upper-bound the term inside the right-hand side expectation above by . ∎

A direct corollary of Sub-Claim 2.2 is that there is a constant such that, with probability at least , . Let’s call that event .

Conditioned on , we know that each equals the real score (by the first sub-claim above), that for each , and that . If we also consider the intersection with , then we have (for sufficiently small ). By a union bound, the probability of is at most (for sufficiently large ). Thus we get where is positive for sufficiently small . This completes the proof of Claim 2.2. ∎

To complete the proof of the proposition, suppose that for every , so that we can assume . Then, we can show that, when is sufficiently large and , the final query will violate robust generalization. The following calculation shows that for the query that we defined, .

(Claims 3 and 2.2)

Now, if we choose an appropriate , we will have that . By this choice of , the first term in the final line above will be at least . Also, we have , so when is larger than some absolute constant, the term in the final line above is . Thus, by Claims 3 and 2.2, either fails to be accurate, so that , or we find a query such that

3 Lower Bounds for General Algorithms

In this section we show how to “lift” our lower bounds for natural oracles to arbitrary algorithms using techniques from [19].

3.1 Information-Theoretic Lower Bounds via Random Masks

We prove information-theoretic lower bounds by constructing the following transformation from an adversary that defeats all natural algorithms to an adversary (for a much larger universe) that defeats all algorithms. The main idea of the reduction is to use random masks (in cryptographic terminology, a one-time pad) to hide information about the evaluation of the queries at points outside of the dataset. Since the algorithm does not obtain any information about the evaluation of the queries on points outside of its dataset, it is effectively forced to behave like a natural algorithm.

Parameters: sample size , universe size , number of queries , target accuracy .
Oracle: an adversary for natural algorithms with sample size , universe size , number of queries , target accuracy .
for  do

be the uniform distribution over pairs

for  do
       Receive the query from
       Form the query (NB: )
       Send the query to and receive the answer
       Send the answer to
Algorithm 4
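The masking idea can be illustrated with a toy one-time pad over sign-valued queries. This is a sketch under stated assumptions, not a transcription of Algorithm 4: it assumes queries take values in {-1, +1}, that each universe element carries one independent random sign per round, and that a data point is a pair (index, mask). All function names are hypothetical.

```python
import random


def make_masks(universe_size, k, rng):
    """One independent uniform sign per universe element and per round."""
    return [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(universe_size)]


def mask_query(q_values, masks, j):
    """Published table for round j: entry i is masks[i][j] * q_values[i].

    Without masks[i][j], entry i is a uniform sign and reveals nothing
    about q_values[i].
    """
    return [masks[i][j] * q_values[i] for i in range(len(q_values))]


def evaluate_masked_query(table, point, j):
    """Value of the masked query at a data point (index, mask)."""
    i, mask = point
    return mask[j] * table[i]
```

An algorithm that holds the pair (i, mask) for a sample point recovers the original query value there, since a sign multiplied by itself is 1; for every other index it sees only a uniformly random sign, which is what pushes it toward behaving like a natural algorithm.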

We now prove the following lemma, which states that if no natural algorithm can be robustly generalizing and accurate against , then no algorithm of any kind can be robustly generalizing and accurate against .

For every algorithm , and every adversary for natural algorithms given as an oracle to , the adversary satisfies

The following corollary is immediate by combining Theorem 2 with Lemma 3.1. There is an adversary such that for every algorithm , if

then .

We now return to proving Lemma 3.1.

Proof of Lemma 3.1.

Fix any algorithm . We claim that there is an algorithm such that


We construct the algorithm as follows:

Input: a sample
Oracle: an algorithm
for  do
for  do
for  do
       Receive the (partial) query from the adversary
       Form the query