1 Introduction
Modern data is messy and high-dimensional, and it is often unclear a priori which analysis is the right one to perform. To extract the most insight, the analyst typically needs to perform exploratory analysis to make sense of the data and identify interesting hypotheses. This is invariably an adaptive process: patterns observed in the first stages of analysis inform which tests are run next, and the process iterates. Ultimately, the data itself may influence which results the analyst chooses to report, introducing researcher degrees of freedom: an additional source of overfitting that is not accounted for in reported statistical estimates [28]. Even if the analyst is well-intentioned, this exploration can lead to false discovery or large bias in reported estimates.

The practice of data exploration is largely outside the domain of classical statistical theory. Standard tools of multiple hypothesis testing and false discovery rate (FDR) control assume that all the hypotheses to be tested, and the procedure for testing them, are chosen independently of the dataset. Any "peeking" at the data before committing to an analysis procedure renders classical statistical theory invalid. Nevertheless, data exploration is ubiquitous, and folklore and experience suggest that the risk of false discovery differs substantially depending on how the analyst explores the data. This creates a glaring gap between the messy practice of data analysis and the standard theoretical frameworks used to understand statistical procedures.

In this paper, we aim to narrow this gap. We develop a general framework based on the concept of information usage and systematically study the degree of bias introduced by different forms of exploratory analysis, in which the choice of which function of the data to report is made after observing and analyzing the dataset.
To concretely illustrate the challenges of data exploration, consider two data scientists Alice and Bob.
Example 1.
Alice has a dataset of 1000 individuals for a weight-loss biomarker study. For each individual, she has their weight measured at 3 time points and the current expression values of 2000 genes assayed from blood samples. There are three possible weight changes that Alice could have looked at—the difference between time points 1 and 2, 2 and 3, or 1 and 3—but Alice decides ahead of time to only analyze the weight change between times 1 and 3. She computes the correlation across individuals between the expression of each gene and the weight change, and reports the gene with the highest correlation along with its value. This is a canonical setting where we have tools for controlling error in multiple-hypothesis testing and the false discovery rate (FDR). It is well recognized that even if the reported gene passes the multiple-testing threshold, its correlation in independent replication studies tends to be smaller than the reported correlation in the current study. This phenomenon is also called the Winner's Curse, a form of selection bias.
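Alice's Winner's Curse is easy to reproduce in a few lines of simulation. The sketch below is our own illustration, not part of Alice's actual study: the dataset sizes match the example, but the pure-noise model (every true correlation is zero) and all variable names are illustrative assumptions. The reported top correlation is systematically inflated, while a replication of the winning gene on fresh data hovers near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 2000  # individuals, genes (matching Alice's study)

# Null model: expression is pure noise, so every true correlation is zero.
weight = rng.standard_normal(n)
expr = rng.standard_normal((n, m))

# Pearson correlation of each gene with the weight change, vectorised.
xc = expr - expr.mean(axis=0)
yc = weight - weight.mean()
corrs = xc.T @ yc / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc))

winner = int(np.argmax(corrs))
reported = corrs[winner]

# A replication study draws fresh, independent data for the winning gene.
rep = np.corrcoef(rng.standard_normal(n), rng.standard_normal(n))[0, 1]

print(f"reported correlation   : {reported:+.3f}")  # inflated by selection
print(f"replication correlation: {rep:+.3f}")       # near zero
```

With these sizes, null correlations have standard deviation of about $1/\sqrt{n} \approx 0.03$, so the maximum over 2000 genes is typically around 0.12 even though every true correlation is zero.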
Example 2.
Bob has the same data, and he performs some simple data exploration. He first uses data visualization to investigate the average expression of all the genes across all the individuals at each of the time points, and observes that there is very little difference between times 1 and 2 and a large jump between times 2 and 3 in the average expression. So he decides to focus on these latter two time points. Next, he realizes that half of the genes always have low expression values and decides to simply filter them out. Finally, he computes the correlations between the expression of the 1000 post-filtered genes and the weight change between times 2 and 3. He selects the gene with the largest correlation and reports its value. Bob's analysis consists of three steps, and the results of each step depend on the results and decisions made in the previous steps. This adaptivity in Bob's exploration makes it difficult to apply standard statistical frameworks. We suspect there is also a selection bias here, leading to the reported correlation being systematically larger than the real correlation if the gene is tested again. How do we think about and quantify the selection bias and overfitting due to this more complex data exploration? When is it larger or smaller than Alice's selection bias?
The toy examples of Alice and Bob illustrate several subtleties of bias due to data exploration. First, the adaptivity of Bob's analysis makes it more difficult to quantify its bias compared to Alice's analysis. Second, for the same analysis procedure, the amount of selection bias depends on the dataset. Take Alice for example: if across the population one gene is substantially more correlated with weight change than all other genes, then we expect the magnitude of the Winner's Curse to decrease. Third, different steps of data exploration introduce different amounts of selection bias. Intuitively, Bob's visualization of aggregate expression values at the beginning should not introduce as much selection bias as his selection of the top gene at the last step.
This paper introduces a mathematical framework to formalize these intuitions and to study selection bias from data exploration. The main tool we develop is a metric of the bad information usage in data exploration. The true signal in a dataset is the signal that is preserved in a replication dataset; the noise is what changes across replications. Using Shannon's mutual information, we quantify the degree of dependence between the noise in the data and the choice of which result is reported. We then prove that the bias of an arbitrary data-exploration process is bounded by this measure of its bad information usage. This bound provides a quantitative measure of researcher degrees of freedom, and offers a single lens through which we investigate different forms of exploration.
In Section 2, we present a general model of exploratory data analysis that encompasses the procedures used by Alice and Bob. In Section 3, we define information usage and show how it upper and lower bounds various measures of bias and estimation error due to data exploration. In Section 4, we study specific examples of data exploration through the lens of information usage, which gives insight into Bob's practices of filtering, visualization, and maximum selection. Information usage naturally motivates randomization approaches to reduce bias, and we explore these in Section 5. In Section 5, we also study a model of a data analyst who, like Bob, interacts adaptively with the data many times before selecting values to report.
2 A Model of Data Exploration
We consider a general framework in which a dataset $\mathcal{D}$ is drawn from a probability distribution $\mathbb{P}$ over a set of possible datasets. The analyst is considering a large number of possible analyses on the data, but wants to report only the most interesting results. She decides to report the result of a single analysis, and chooses which one after observing the realized dataset $\mathcal{D}$, or some summary statistics of $\mathcal{D}$. More formally, the data analyst considers $m$ functions of the data $\phi_1, \dots, \phi_m$, where $\phi_i(\mathcal{D})$ denotes the output of the $i$th analysis on the realization $\mathcal{D}$. Each function $\phi_i$ is typically called an estimator; each $\phi_i(\mathcal{D})$ is an estimate or statistic calculated from the sampled data, and is a random variable due to the randomness in the realization of $\mathcal{D}$. After observing the sampled data, the analyst chooses $T \in \{1, \dots, m\}$ and reports the value $\phi_T(\mathcal{D})$. The selection rule $T$ captures how the analyst uses the data and chooses which result to report. Because the choice made by $T$ is itself a function of the sampled data, the reported value $\phi_T$ may be significantly biased. For example, $\mathbb{E}[\phi_T]$ could be very far from zero even if each fixed function $\phi_i$ has zero mean.

Note that although the number of estimators $m$ is assumed to be finite, it could be arbitrarily large; in particular, $m$ can be exponential in the number of samples in the dataset. The $\phi_i$'s represent the set of all estimators that the analyst potentially could have considered during the course of exploration. Also, while for simplicity we focus on the case where exactly one estimate is selected and reported, our results apply in settings where the analyst selects and reports many estimates.^1

^1 For example, if the analyst chooses to report $k$ results, our framework can be used to bound the average bias of the reported values by letting $T$ be a random draw from the $k$ selected analyses.
Example 1.
For Alice, $\mathcal{D}$ is a 1000-by-2003 matrix, where the rows are the individuals and the columns are the 2000 genes plus the three possible weight changes. Here there are $m = 2000$ potential estimators, and $\phi_i(\mathcal{D})$ is the correlation between the expression of the $i$th gene and the weight change between times 1 and 3. Alice's analysis corresponds to the selection procedure $T = \arg\max_i \phi_i(\mathcal{D})$.
Example 2.
Bob has the same dataset $\mathcal{D}$. Because his exploration could have led him to use any of the three possible weight-change measures, the set of potential estimators $\phi_i$ are the correlations between the expression of one gene and one of the three weight changes, and there are $m = 6000$ such $\phi_i$'s. Bob's adaptive exploration also corresponds to a selection procedure $T$ that takes the dataset and picks out a particular correlation value to report.
Selection Bias.
Denote the true value of estimator $\phi_i$ as $\mu_i = \mathbb{E}[\phi_i(\mathcal{D})]$; this is the value that we expect if we apply $\phi_i$ on multiple independent replication datasets. On a particular dataset $\mathcal{D}$, if $T$ is the selected test, the output of data exploration is the value $\phi_T(\mathcal{D})$. The output and true value can be written more concisely as $\phi_T$ and $\mu_T$. The difference $\phi_T - \mu_T$ captures the error in the reported value. We are interested in quantifying the bias due to data exploration, which is defined as the average error $\mathbb{E}[\phi_T - \mu_T]$. We will also quantify other metrics of error, such as the expected absolute error $\mathbb{E}|\phi_T - \mu_T|$ or the squared error $\mathbb{E}[(\phi_T - \mu_T)^2]$. In each case, the expectation is over all the randomness in the dataset and any intrinsic randomness in $T$.
Related work.
There is a large body of work on methods for providing meaningful statistical inference and preventing false discovery. Much of this literature has focused on controlling the false discovery rate in multiple-hypothesis testing where the hypotheses are not adaptively chosen [2, 3]. Another line of work studies confidence intervals and significance tests for parameter estimates in sparse high-dimensional linear regression (see [1, 31, 20, 23] and the references therein).

One recent line of work [16, 29] proposes a framework for assigning significance and confidence intervals in selective inference, where model selection and significance testing are performed on the same dataset. These papers correct for selection bias by explicitly conditioning on the event that a particular model was chosen. While some powerful results can be derived in the selective inference framework (e.g. [30, 22]), it requires that the conditional distribution is known and can be directly analyzed. This in turn requires that the candidate models and the selection procedure are mathematically tractable and specified by the analyst before looking at the data. Our approach does not explicitly adjust for selection bias, but it enables us to formalize insights that apply to very general selection procedures. For example, the selection rule $T$ could represent the choice made by a data analyst, like Bob, after performing several rounds of exploratory analysis.
A powerful line of work in computer science and learning theory [6, 26, 27] has explored the role of algorithmic stability in preventing overfitting. Related to stability is PAC-Bayes analysis, which provides powerful generalization bounds in terms of KL-divergence [25]. There are two key differences between stability and our framework of information usage. First, stability is typically defined in the worst-case setting and is agnostic of the data distribution: an algorithm is stable if, no matter the data distribution, changing one training point does not affect the predictions too much. Information usage gives more fine-grained bias bounds that depend on the data distribution. For example, in Section 4.3 we show that the same learning algorithm has lower bias and lower information usage as the signal in the data increases. The second difference is that stability analysis has traditionally been applied to prediction problems, i.e. to bounding generalization loss in prediction tasks. Information usage applies to prediction (e.g. $\phi_i$ could be the squared loss of a classifier), but it also applies to model estimation, where $\phi_i$ could be the value of the $i$th parameter.

Exciting recent work in computer science [4, 19, 13, 14] has leveraged the connection between algorithmic stability and differential privacy to design specific differentially private mechanisms that reduce bias in adaptive data analysis. In this framework, the data analyst interacts with a dataset indirectly, and sees only the noisy output of a differentially private mechanism. In Section 5, we discuss how information usage also motivates using various forms of randomization to reduce bias. In the Appendix, we discuss the connections between mutual information and a recently introduced measure called max-information [14]. The results from this privacy literature are designed for worst-case, adversarial data analysts. We provide guarantees that vary with the selection rule, but apply to all possible selection procedures, including ones that are not differentially private. The results in algorithmic stability and differential privacy are complementary to our framework: these approaches are specific techniques that guarantee low bias for worst-case analysts, while our framework quantifies the bias of any general data analyst.
Finally, it is also important to note the various practical approaches used in specific settings to quantify or reduce bias from exploration. Using random subsets of data for validation is a common prescription against overfitting. This is feasible if the data points are independent and identically distributed samples. However, for structured data (e.g. time-series or network data) it is not clear how to create a validation set. The bounds on overfitting we derive based on information usage do not assume independence and apply to structured data. Special cases of selection procedures, corresponding to filtering by summary statistics of biomarkers [5] and selecting matrix factorizations based on a stability criterion [33], have been studied. The insights from these specific settings agree with our general result that low information usage limits selection bias.
3 Controlling Exploration Bias via Information Usage
Information usage upper bounds bias.
In this paper, we bound the degree of bias in terms of an information-theoretic quantity: the mutual information between the choice $T$ of which estimate to report and the actual realized values of the estimates $\phi = (\phi_1, \dots, \phi_m)$. We state this result in a general framework, where $T$ and $\phi$ are any random variables defined on a common probability space. Let $\mu_i$ denote the mean of $\phi_i$. Recall that a real-valued random variable $X$ is $\sigma$-sub-Gaussian if for all $\lambda \in \mathbb{R}$,

$$\mathbb{E}\left[e^{\lambda (X - \mathbb{E}[X])}\right] \le e^{\lambda^2 \sigma^2 / 2},$$

so that the moment generating function of $X$ is dominated by that of a normal random variable with variance $\sigma^2$. Zero-mean Gaussian random variables with standard deviation $\sigma$ are $\sigma$-sub-Gaussian, as are bounded random variables.

Proposition 3.1.
If $\phi_i$ is $\sigma$-sub-Gaussian for each $i$, then

$$\left|\mathbb{E}\left[\phi_T - \mu_T\right]\right| \le \sigma \sqrt{2\, I(T; \phi)},$$

where $I(\cdot\,;\cdot)$ denotes mutual information.^2

^2 The mutual information between two random variables $X$ and $Y$ with joint density $p_{X,Y}$ and marginal densities $p_X$ and $p_Y$ is defined as $I(X;Y) = \mathbb{E}\left[\log \frac{p_{X,Y}(X,Y)}{p_X(X)\, p_Y(Y)}\right]$.
The randomness of $\phi = (\phi_1, \dots, \phi_m)$ is due to the randomness in the realization of the data $\mathcal{D}$. This captures how each estimate varies if a replication dataset is collected, and hence captures the noise in the statistics. The mutual information $I(T; \phi)$, which we call information usage, then quantifies the dependence of the selection process on the noise in the estimates. Intuitively, a selection process that is more sensitive to the noise (high $I(T;\phi)$) is at a greater risk of bias. We will also refer to $I(T;\phi)$ as bad information usage to highlight the intuition that it captures how much information about the noise in the data goes into selecting which estimate to report. We normally think of data analysis as trying to extract the good information, i.e. the true signal, from data. The more bad information is used, the more likely the analysis procedure is to overfit.

When $T$ is determined entirely by the values $\phi_1, \dots, \phi_m$, the mutual information $I(T;\phi)$ is equal to the entropy $H(T)$. This quantifies how much $T$ varies over different independent replications of the data.

The parameter $\sigma$ provides the natural scaling for the values of $\phi_i$. The condition that $\phi_i$ is $\sigma$-sub-Gaussian ensures that its tail is not too heavy. In the Supplementary Information, we show how this condition can be relaxed to treat cases where $\phi_i$ is a sub-exponential random variable (Proposition A.2), as well as settings where the $\phi_i$'s have different scaling parameters $\sigma_i$ (Proposition A.1).
Proposition 3.1 applies in a very general setting. The magnitude of overfitting depends on the generating distribution of the dataset and on the size of the dataset, and this is all implicitly captured by the mutual information $I(T;\phi)$. For example, a common type of estimate of interest is the sample average $\phi_i = \frac{1}{n}\sum_{j=1}^{n} f_i(X_j)$ of some function $f_i$ based on an i.i.d. sequence $X_1, \dots, X_n$. Note that if $f_i(X_j)$ is sub-Gaussian with parameter $\sigma$, then $\phi_i$ is sub-Gaussian with parameter $\sigma/\sqrt{n}$, and therefore

$$\left|\mathbb{E}\left[\phi_T - \mu_T\right]\right| \le \sigma \sqrt{\frac{2\, I(T;\phi)}{n}}.$$
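The $1/\sqrt{n}$ scaling of this sample-average setting can be checked numerically. The sketch below is our own illustration with parameters chosen for speed: since a sample mean of $n$ standard normals is distributed as $N(0, 1/n)$, we draw the estimates directly rather than averaging raw samples, and compare the bias of max-selection against the bound $\sigma\sqrt{2\log m / n}$ (here $I(T;\phi) = \log m$ by symmetry).

```python
import numpy as np

rng = np.random.default_rng(1)
m, trials = 200, 2000  # candidate estimators; Monte Carlo repetitions

def selection_bias(n):
    # A sample mean of n i.i.d. N(0,1) points is N(0, 1/n), so we draw
    # the m estimates directly; all true means mu_i are zero.
    phis = rng.standard_normal((trials, m)) / np.sqrt(n)
    return phis.max(axis=1).mean()  # E[phi_T - mu_T] under T = argmax

for n in (10, 40, 160):
    bound = np.sqrt(2 * np.log(m) / n)  # sigma * sqrt(2 log(m) / n)
    print(f"n={n:4d}  bias={selection_bias(n):.3f}  bound={bound:.3f}")
```

Quadrupling the sample size halves both the empirical selection bias and the information-usage bound, as the proposition predicts.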
To illustrate Proposition 3.1, we consider two extreme settings: one where $T$ is chosen independently of the data and one where $T$ depends heavily on the values of all the $\phi_i$'s. The subsequent sections investigate applications of information usage in depth, in settings that interpolate between these two extremes.
Example: data-agnostic exploration.
Suppose $T$ is independent of $\phi$. This may happen if the choice of which estimate to report is decided ahead of time and cannot change based on the actual data. It may also occur when the dataset can be split into two statistically independent parts, with one part reserved for data exploration and the other for estimation. In such cases, one expects no bias because the selection does not depend on the actual values of the estimates. This is reflected in our bound: since $T$ is independent of $\phi$, $I(T;\phi) = 0$ and therefore $\mathbb{E}[\phi_T - \mu_T] = 0$.
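A small simulation (our own sketch, with all parameters chosen for illustration) contrasts the two regimes: selecting and estimating on the same data incurs the familiar max-selection bias, while sample splitting makes the selection independent of the reported estimate and drives the bias to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, trials = 200, 500, 500
bias_same, bias_split = [], []

for _ in range(trials):
    X = rng.standard_normal((n, m))            # all true means mu_i are zero
    # Same data for selection and estimation: biased.
    T = int(np.argmax(X.mean(axis=0)))
    bias_same.append(X[:, T].mean())
    # Split: select on the first half, report on the held-out second half.
    T2 = int(np.argmax(X[: n // 2].mean(axis=0)))
    bias_split.append(X[n // 2:, T2].mean())

print(f"same-data bias : {np.mean(bias_same):+.3f}")   # roughly sqrt(2 log m / n)
print(f"split-data bias: {np.mean(bias_split):+.3f}")  # ~ 0
```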
Example: maximum of Gaussians.
Suppose each $\phi_i$ is an independent sample from the zero-mean normal $N(0, \sigma^2)$. If $T = \arg\max_i \phi_i$, then $I(T;\phi) = H(T) = \log m$, because all $\phi_i$'s are symmetric and have an equal chance of being selected by $T$. Applying Proposition 3.1 gives

$$\mathbb{E}\left[\phi_T\right] \le \sigma \sqrt{2 \log m}.$$

This is the well-known inequality for the maximum of $m$ Gaussian random variables. Moreover, it is also known that this inequality approaches equality as the number of Gaussians $m$ increases, implying that the information usage precisely measures the bias of max-selection in this setting. It is illustrative to also consider a more general selection rule which first ranks the $\phi_i$'s from the largest to the smallest and then uniformly randomly selects one of the $k$ largest to report. Here $I(T;\phi) = H(T) - H(T \mid \phi)$, where $H(T) = \log m$ (by the symmetry of $T$ as before) and $H(T \mid \phi) = \log k$ (since given the values of the $\phi_i$'s there is still uniform randomness over which of the top $k$ is selected). We immediately have the following corollary.
Corollary 1.
Suppose for each $i$, $\phi_i$ is a zero-centered sub-Gaussian random variable with parameter $\sigma$. Let $\phi_{(1)} \ge \dots \ge \phi_{(m)}$ denote the values of $\phi_1, \dots, \phi_m$ sorted from the largest to the smallest. Then

$$\mathbb{E}\left[\frac{1}{k} \sum_{j=1}^{k} \phi_{(j)}\right] \le \sigma \sqrt{2 \log \frac{m}{k}}.$$
In Appendix B, we show that this bound is asymptotically tight as $m$ and $k$ increase.
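Corollary 1 is easy to probe numerically. In this sketch (our own, with $\sigma = 1$ and illustrative sizes), the average of the $k$ largest of $m$ standard normals stays below $\sqrt{2\log(m/k)}$ and shrinks as $k$ grows, reflecting the lower information usage of randomizing over the top $k$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, trials = 1000, 4000
phis = rng.standard_normal((trials, m))     # sigma = 1, all mu_i = 0
top = np.sort(phis, axis=1)[:, ::-1]        # each row sorted large -> small

results = {}
for k in (1, 10, 100):
    bias = top[:, :k].mean()                # E[(1/k) * sum of the k largest]
    bound = np.sqrt(2 * np.log(m / k))      # sigma * sqrt(2 log(m/k))
    results[k] = (bias, bound)
    print(f"k={k:4d}  bias={bias:.3f}  bound={bound:.3f}")
```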
Information usage bounds other metrics of exploration error.
So far we have discussed how mutual information upper bounds the bias $\mathbb{E}[\phi_T - \mu_T]$. In different application settings, it may be useful to control other measures of exploration error, such as the absolute error $\mathbb{E}|\phi_T - \mu_T|$ and the squared error $\mathbb{E}[(\phi_T - \mu_T)^2]$. Here we extend Proposition 3.1 and show how $I(T;\phi)$ can be used to bound both. Note that, due to inherent noise, even in the absence of selection bias the absolute or squared error can be of order $\sigma$ or $\sigma^2$, respectively. The next result effectively bounds the additional error introduced by data exploration in terms of information usage.
Proposition 3.2.
Suppose for each $i$, $\phi_i - \mu_i$ is $\sigma$-sub-Gaussian. Then

$$\mathbb{E}\left|\phi_T - \mu_T\right| \le \sigma \left( c_1 + \sqrt{2\, I(T;\phi)} \right)$$

and

$$\mathbb{E}\left[(\phi_T - \mu_T)^2\right] \le \sigma^2 \left( c_2 + 2\, I(T;\phi) \right),$$

where $c_1$ and $c_2$ are universal constants.
Information usage also lower bounds error.
In the maximum-of-Gaussians example, we have already seen a setting where information usage precisely quantifies bias. Here we show that this is a more general phenomenon by exhibiting a much broader setting in which mutual information lower bounds expected error. This complements the upper bounds of Propositions 3.1 and 3.2.
Suppose $T = \arg\max_i \phi_i$, where $\phi \sim N(\mu, \sigma^2 I_m)$. Because $T$ is a deterministic function of $\phi$, the mutual information $I(T;\phi)$ is equal to the entropy $H(T)$. The probability $\mathbb{P}(T = i)$ is a complicated function of the mean vector $\mu = (\mu_1, \dots, \mu_m)$, and the entropy $H(T)$ provides a single number measuring the uncertainty in the selection process. Proposition 3.2 upper bounds the average squared distance between $\phi_T$ and $\mu_T$ by entropy. The next proposition provides a matching lower bound, and therefore establishes a fundamental link between information usage and selection risk in a natural family of models.

Proposition 3.3.
Let $T = \arg\max_i \phi_i$, where $\phi \sim N(\mu, \sigma^2 I_m)$. There exist universal numerical constants $c_1$, $c_2$, $c_3$, and $c_4$ such that for any $m$ and any mean vector $\mu$,

$$c_1 \sigma^2 \left( H(T) - c_2 \right) \le \mathbb{E}\left[(\phi_T - \mu_T)^2\right] \le c_3 \sigma^2 \left( H(T) + c_4 \right).$$
Recall that the entropy of $T$ is defined as

$$H(T) = \sum_{i=1}^{m} \mathbb{P}(T = i) \log \frac{1}{\mathbb{P}(T = i)}.$$

Here $\log \frac{1}{\mathbb{P}(T = i)}$ is often interpreted as the "surprise" associated with the event $T = i$, and entropy is interpreted as the expected surprise in the realization of $T$. Proposition 3.3 relies on a link between the surprise associated with the selection of statistic $i$, and the squared error on events when it is selected.
To understand this result, it is instructive to consider a simpler setting. Imagine $m = 2$, $\phi_2 = 0$ always, $\mu_1 < 0$, and the selection rule is $T_c = 1$ if $\phi_1 \ge c$ and $T_c = 2$ otherwise, for a threshold $c > 0$. When $c$ is large,

$$\log \frac{1}{\mathbb{P}(T_c = 1)} \approx \frac{(c - \mu_1)^2}{2 \sigma^2},$$

and so the surprise associated with the event $T_c = 1$ scales with the squared gap between the selection threshold and the true mean of $\phi_1$. One can show that as $c \to \infty$,

$$\mathbb{E}\left[(\phi_{T_c} - \mu_{T_c})^2\right] \sim 2 \sigma^2 H(T_c),$$

where $T_c$ denotes the selection rule with threshold $c$, and $f(c) \sim g(c)$ means $f(c)/g(c) \to 1$ as $c \to \infty$.
In the Supplement, we investigate additional threshold-based selection policies applied to Gaussian and exponential random variables, allowing for arbitrary correlation among the $\phi_i$'s, and show that $H(T)$ also provides a natural lower bound on estimation error.
4 When is bias large or small? The view from information usage
In this section, we consider several simple but commonly used procedures for feature selection and parameter estimation. In many applications, such feature selection and estimation are performed on the same dataset. Information usage provides a unified framework for understanding selection bias in these settings. Our results inform when these procedures introduce significant selection bias and when they do not. The key idea is to understand which structures in the data and the selection procedure make the mutual information $I(T;\phi)$ significantly smaller than its worst-case value of $\log m$. We provide several simulation experiments as illustrations.

4.1 Filtering by marginal statistics
Imagine that $T$ is chosen after observing some dataset $\mathcal{D}$. This dataset determines the values of $\phi = (\phi_1, \dots, \phi_m)$, but may also contain a great deal of other information. Manipulating the mutual information shows

$$I(T; \phi) = I(T; \mathcal{D}) - I(T; \mathcal{D} \mid \phi),$$

where $I(T; \mathcal{D} \mid \phi)$ captures the fraction of the uncertainty in $T$ that is explained by the data in $\mathcal{D}$ beyond the values $\phi$. In many cases, instead of being a function of $\phi$, the choice $T$ is a function of data that is more loosely coupled with $\phi$, and therefore we expect that $I(T;\phi)$ is much smaller than $I(T;\mathcal{D})$ (which itself can be less than $H(T)$).
One setting where the selection of $T$ depends on statistics of $\mathcal{D}$ that are only loosely coupled with $\phi$ is variance-based feature selection [34, 21]. Suppose we have $n$ samples and $m$ biomarkers. Let $X_j^{(i)}$ denote the value of the $i$th biomarker on sample $j$, and let $\phi_i = \frac{1}{n}\sum_{j=1}^{n} X_j^{(i)}$ be the empirical mean of the $i$th biomarker. We are interested in identifying the markers that show a significantly nonzero mean. Many studies first perform a filtering step that selects only the markers with high variance and removes the rest. The rationale is that markers that do not vary could be measurement errors or are likely to be less important. A natural question is whether such variance filtering introduces bias.

In our framework, variance selection is exemplified by the selection rule $T = \arg\max_i \hat{\sigma}_i^2$, where $\hat{\sigma}_i^2$ is the empirical variance of the $i$th marker. Here we consider the case where only the marker with the largest variance is selected, but the discussion extends to softer selection rules that keep the $k$ markers with the largest variance. The resulting bias is $\mathbb{E}[\phi_T - \mu_T]$. Proposition 3.1 states that variance selection has low bias if $I(T;\phi)$ is small, which is the case if the empirical means and variances, $\phi_i$ and $\hat{\sigma}_i^2$, are not too dependent. In fact, when the $X_j^{(i)}$ are i.i.d. Gaussian samples, the empirical variances are independent of the empirical means. Therefore $I(T;\phi) = 0$ and we can guarantee that there is no bias from variance selection.
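The Gaussian independence of empirical means and variances can be seen directly in simulation. The sketch below (our own illustration; sizes and seeds are arbitrary) compares the bias of variance-based selection, which is essentially zero, against rank selection on the means themselves, which inherits the full Winner's Curse.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, trials = 50, 500, 1000
var_select, mean_select = [], []

for _ in range(trials):
    X = rng.standard_normal((n, m))        # i.i.d. Gaussian, all mu_i = 0
    # Filter-style rule: pick the marker with the largest empirical variance.
    T = int(np.argmax(X.var(axis=0, ddof=1)))
    var_select.append(X[:, T].mean())
    # Contrast: rank selection on the empirical means themselves.
    T2 = int(np.argmax(X.mean(axis=0)))
    mean_select.append(X[:, T2].mean())

print(f"bias of variance selection: {np.mean(var_select):+.4f}")  # ~ 0
print(f"bias of mean selection    : {np.mean(mean_select):+.4f}")
```

For Gaussian data the mean and variance statistics are independent, so the variance filter uses no "bad information" about the reported means; mean selection, by contrast, is biased by roughly $\sqrt{2\log m / n}$.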
This illustrates an important point: the bias bound depends on $I(T;\phi)$ rather than on $I(T;\mathcal{D})$. The selection process may depend heavily on the dataset, and $I(T;\mathcal{D})$ could be large. However, as long as the statistics of the data used for selection have low mutual information with the estimators $\phi$, there is low bias in the reported values.
We can apply our framework to analyze biases that arise from feature filtering more generally. A common practice in data analysis is to reduce the multiple-hypothesis-testing burden and increase discovery power by first filtering out covariates or features that are unlikely to be relevant or interesting [5]. This can be viewed as a two-step procedure. For each feature $i$, two marginal statistics, $u_i$ and $\phi_i$, are computed from the data. Filtering corresponds to a selection protocol based on $u = (u_1, \dots, u_m)$. Since $I(T;\phi) \le I(u;\phi)$, if the $u_i$'s do not reveal too much information about the $\phi_i$'s, then the filtering step does not create much bias. In our example above, $u_i$ is the sample variance and $\phi_i$ is the sample mean of feature $i$. General principles for creating independent $u_i$ and $\phi_i$ are given in [5].
4.2 Bias due to data visualization
Data visualization, using clustering for example, is a common technique to explore data, and it can inform subsequent analysis. How much selection bias can be introduced by such visualization? While in principle a visualization could reveal details about every data point, a human analyst typically only extracts certain salient features from plots. For concreteness, we use clustering as an example, and imagine the analyst extracts the number of clusters $K$ from the plot. In our framework, the natural object of study is the information usage $I(K;\phi)$, since if the final selection $T$ is a function of $K$, then $I(T;\phi) \le I(K;\phi)$ by the data-processing inequality. In general, $K$ is a random variable that can take on values 1 to $n$ (if each point is assigned its own cluster). When there is structure in the data and the clustering algorithm captures it, $K$ can be strongly concentrated around a specific number of clusters, and $I(K;\phi) \le H(K) \approx 0$. In this setting, clustering is informative to the analyst but does not lead to "bad information usage" and therefore does not increase exploration bias.
4.3 Rank selection with signal
Rank selection is the procedure that selects the $\phi_i$ with the largest value (or the top $k$ $\phi_i$'s with the largest values). It is the simplest selection policy and the one that we are instinctively most likely to use. We have seen previously how rank selection can introduce significant bias. In the biomarker example of Subsection 4.1, suppose there is no signal in the data, so that $\mu_i = 0$ and $\phi_i \sim N(0, \sigma^2)$ for every $i$. Under rank selection, the reported value would have a bias close to $\sigma\sqrt{2\log m}$.
What is the bias of rank selection when there is signal in the data? Our framework cleanly illustrates how signal in the data can reduce rank-selection bias. As before, this insight follows transparently from studying the mutual information $I(T;\phi)$. Recall that mutual information is bounded by entropy: $I(T;\phi) \le H(T)$. When the data provides a strong signal about which $i$ to select, the distribution of $T$ is far from uniform, and $H(T)$ is much smaller than its worst-case value of $\log m$.
Consider the following simple example. Assume $\phi_i \sim N(\mu_i, 1)$, where $\mu = (\mu_1, 0, \dots, 0)$ with $\mu_1 \ge 0$. The data analyst would like to identify and report the value of the largest $\mu_i$. To do this, she selects $T = \arg\max_i \phi_i$. When $\mu_1 = 0$, there is no true signal in the data and $T$ is equally likely to take on any value in $\{1, \dots, m\}$, so $H(T) = \log m$. As $\mu_1$ increases, however, $T$ concentrates on 1, causing $H(T)$ and the bias to diminish. We simulated this example with $m$ $\phi_i$'s, all but one of which are i.i.d. samples from $N(0,1)$, with $\phi_1 \sim N(\mu_1, 1)$ for increasing values of $\mu_1$. The simulation results, averaged over 1000 independent runs, are shown in Figure 1.
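The experiment behind Figure 1 is easy to reproduce in spirit. The following sketch uses our own parameter choices, not necessarily the paper's exact settings: it estimates a plug-in entropy $H(T)$ from the empirical selection frequencies and the bias of max-selection as the signal $\mu_1$ grows. Both quantities shrink together, and the bias stays below the bound $\sqrt{2 H(T)}$.

```python
import numpy as np

rng = np.random.default_rng(5)
m, trials = 500, 4000
signals = (0.0, 2.0, 4.0, 6.0)
biases, entropies = [], []

for mu1 in signals:
    mu = np.zeros(m)
    mu[0] = mu1                              # one truly large effect
    phis = mu + rng.standard_normal((trials, m))
    T = np.argmax(phis, axis=1)              # rank selection
    biases.append((phis[np.arange(trials), T] - mu[T]).mean())
    # Plug-in estimate of H(T) from the empirical selection frequencies.
    p = np.bincount(T, minlength=m) / trials
    entropies.append(-(p[p > 0] * np.log(p[p > 0])).sum())

for mu1, H, b in zip(signals, entropies, biases):
    print(f"mu1={mu1:.0f}  H(T)~{H:5.2f}  bias={b:.3f}  bound={np.sqrt(2*H):.3f}")
```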
4.4 Information usage along the Least Angle Regression path
We have seen that, both in theory and in simulation, information usage tightly bounds the bias of maximum-based selection. Here we show that information usage also accurately captures the bias of a more complex selection procedure corresponding to Least Angle Regression (LARS) [15]. LARS is an interesting example for two reasons. First, it is widely used as a practical tool for sparse regression and is closely related to the LASSO. Second, LARS composes a sequence of maximum selections and thus provides a more complex example of selection. In Figure 2, we show simulation results for LARS under three data settings corresponding to low, medium, and high signal-to-noise ratios. We use bootstrapping to empirically estimate the information usage, and since we know the ground truth of the experiment, we can easily compute the bias of LARS. As the signal in the data increases, the information usage of LARS decreases and, consistent with the predictions of our theory, the bias of LARS also decreases. Moreover, as the number of selected features increases, the average (per-feature) information usage of LARS decreases and, consistent with this, the average bias of LARS also decreases monotonically. Details of the experiment are in the Supplementary Information.
4.5 Differentially private algorithms
Recent papers [12, 14] have shown that techniques from differential privacy, initially developed to protect the security and privacy of datasets, can be used to develop adaptive data analysis algorithms with provable bounds on overfitting. These differentially private algorithms satisfy worst-case bounds on certain likelihood ratios, and are guaranteed to have low information usage. On the other hand, many algorithms have low information usage without being differentially private. Moreover, as we have seen, the exploration bias of an algorithm could be large or small depending on the particular dataset (e.g. its signal-to-noise ratio), and information usage captures this. Differentially private algorithms have low information usage for all datasets, including datasets designed adversarially to induce overfitting, so differential privacy is a much stricter condition. In [14], the authors also define and study a notion of max-information, which can be viewed as a worst-case analogue of mutual information. We discuss the relationship between these measures further in the Supplementary Information.
4.6 Information usage and classification overfitting
This section applies our framework to the problem of overfitting in classification. A classifier is trained on a dataset consisting of $n$ examples, with input features $x_1, \dots, x_n$ and corresponding labels $y_1, \dots, y_n$. We consider here a setting where the features of the training examples are fixed, and study overfitting of the noisy labels: each label $y_j$ is drawn independently of the other labels from an unknown distribution $p(\cdot \mid x_j)$. A classifier $f$ associates a label $f(x_j)$ with each input $x_j$. The training error of a fixed classifier $f$ is

$$\mathrm{err}_{\mathrm{train}}(f) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}\{f(x_j) \neq y_j\},$$

while its true error rate

$$\mathrm{err}(f) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{P}\left(f(x_j) \neq y_j\right)$$

is the expected fraction of examples it misclassifies on a random draw of the labels $y_1, \dots, y_n$. The process of training a classifier corresponds to selecting, as a function of the observed data, a particular classification rule $\hat{f}$ from a large family of possible rules $\mathcal{F}$. Such a procedure may overfit the training data, causing the average training error $\mathbb{E}[\mathrm{err}_{\mathrm{train}}(\hat{f})]$ to be much smaller than the true error rate $\mathbb{E}[\mathrm{err}(\hat{f})]$.
As an example, suppose each $x_j$ is a $d$-dimensional feature vector, and $\mathcal{F}$ consists of all linear classifiers of the form $f(x) = \mathbb{1}\{\theta^\top x \ge 0\}$. A training algorithm might set $\hat{f}$ by choosing the parameter vector $\theta$ that minimizes the number of misclassifications on the training set. This procedure tends to overfit the noise in the training data, and as a result the average training error of $\hat{f}$ can be much smaller than its true error rate. The risk of overfitting tends to increase with the dimension $d$, since higher-dimensional models allow the algorithm to fit more complicated, but spurious, patterns in the training set.
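The selection view of training can be made concrete with a toy experiment of our own devising: instead of exactly minimizing training error over all linear rules, we "train" by picking, out of many randomly drawn linear rules, the one with the lowest training error on pure-noise labels. Every fixed rule has true error rate 0.5, yet the selected rule's training error is far lower.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, candidates = 40, 10, 5000

X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n)   # pure-noise labels: true error rate is 0.5

# "Training" = selecting, from many random linear rules, the one with the
# lowest training error; this is a selection rule applied to the labels.
W = rng.standard_normal((candidates, d))
preds = (W @ X.T > 0).astype(int)          # candidates x n predictions
train_err = (preds != y).mean(axis=1)
best = int(np.argmin(train_err))

print(f"training error of selected rule: {train_err[best]:.3f}")  # << 0.5
print(f"average error over fixed rules : {train_err.mean():.3f}") # ~ 0.5
```

The gap between the selected rule's training error and 0.5 is exactly the overfitting that Proposition 4.1 below bounds via the mutual information between the selected rule and the labels.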
The field of statistical learning provides numerous bounds on the magnitude of overfitting based on more general notions of the complexity of an arbitrary function class $\mathcal{F}$, with the most influential being the Vapnik-Chervonenkis dimension, or VC-dimension.^4

^4 The VC-dimension of $\mathcal{F}$ is the size of the largest set it shatters. A set $\{x_1, \dots, x_k\}$ is shattered by $\mathcal{F}$ if for any choice of labels $y_1, \dots, y_k$, there is some $f \in \mathcal{F}$ with $f(x_j) = y_j$ for all $j$.
The next proposition first bounds the degree of overfitting in terms of information usage, and then shows that this mutual information is upper-bounded by the VC-dimension of the function class. Therefore, information usage is always constrained by function-class complexity.
Proposition 4.1.
Let , , and . Then,
If has VC-dimension , then
The proof of the information-usage bound follows by an easy reduction to Proposition 3.1. The proof of the second claim relies on a known link between VC-dimension and the log-covering numbers of the function class.
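As a hypothetical illustration of the shattering definition above (not from the paper), one can check by enumeration that one-dimensional threshold classifiers shatter any single point but no pair of points, so their VC-dimension is 1:

```python
import itertools
import numpy as np

def shattered(points, classifiers):
    """True if every +/-1 labeling of `points` is realized by some classifier."""
    realized = {tuple(np.sign(c(points)).astype(int)) for c in classifiers}
    return all(lab in realized
               for lab in itertools.product([-1, 1], repeat=len(points)))

# 1-D threshold rules h_t(x) = sign(x - t); a finite grid of thresholds suffices here
thresholds = np.linspace(-2.0, 2.0, 41)
classifiers = [lambda x, t=t: x - t for t in thresholds]

print(shattered(np.array([0.5]), classifiers))        # one point: shattered
print(shattered(np.array([-0.5, 0.5]), classifiers))  # a pair: labeling (+1, -1) is unrealizable
```

Because a threshold rule is monotone in its input, the left point can never be labeled +1 while the right point is labeled -1, which is why no two-point set is shattered.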
It is worth highlighting that because VC-dimension depends only on the class of functions, bounds based on this measure cannot shed light on which types of data-generating distributions and fitting procedures allow for effective generalization. Information usage depends on both, and as a result it can be much smaller than the VC-dimension; for example, this occurs when some classifiers in the class are much more likely to be selected after training than others. This can happen naturally due to properties of the training procedure, such as regularization, or properties of the data-generating distribution.
5 Limiting information usage and bias via randomization
We have seen how information usage provides a unified framework for investigating the magnitude of exploration bias across different analysis procedures and datasets. It also suggests that methods that reduce the relevant mutual information can reduce bias. In this section, we explore simple procedures that leverage randomization to reduce information usage, and hence bias, while still preserving the utility of the data analysis.
We first revisit the rank-selection policy considered in the previous subsection, and derive a variant of this scheme that uses randomization to limit information usage. We then consider a model of a human data analyst who interacts sequentially with the data. We use a stylized model to show that, even if the analyst's procedure is unknown or difficult to describe, adding noise during the data-exploration process can provably limit the bias incurred. Many authors have investigated adding noise as a technique to reduce selection bias in specialized settings [12, 10]. The main goal of this section is to illustrate how the effects of adding noise become transparent through the lens of information usage.
5.1 Regularization via randomized selection
Subsection 4.3 illustrates how signal in the data intrinsically reduces the bias of rank selection by reducing the entropy term in the bound. A complementary approach to reducing bias is to increase the conditional entropy by adding randomization to the selection policy. It is easy to maximize conditional entropy by choosing the selection uniformly at random, independently of the data. Imagine, however, that we want not only to ensure that conditional entropy is large, but also to choose the selection so that the selected value is large. After observing the data, it is natural then to set the selection probabilities by solving a maximization problem
subject to 
The solution to this problem is the maximum entropy or “Gibbs” distribution, which sets
(1) 
for chosen so that the constraint holds. This procedure effectively adds stability, or a kind of regularization, to the selection strategy through randomization. Whereas tiny perturbations to the observed values may change the identity of the maximizer, the Gibbs distribution is relatively insensitive to small changes. Note that the strategy (1) is one of the most widely studied algorithms in the field of online learning [9], where it is often called exponential weights. It is also known as the exponential mechanism in differential privacy. In our framework, it is transparent how this randomization reduces bias.
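The Gibbs selection rule and its insensitivity to small perturbations can be sketched as follows (the values are illustrative; `beta` plays the role of the inverse temperature):

```python
import numpy as np

def gibbs(T, beta):
    """Maximum-entropy selection: P(select i) proportional to exp(beta * T_i)."""
    z = np.exp(beta * (T - T.max()))   # subtract the max for numerical stability
    return z / z.sum()

T = np.array([1.00, 0.99, 0.30])
p = gibbs(T, beta=2.0)

# A tiny perturbation flips the argmax but barely moves the Gibbs distribution
T2 = np.array([0.99, 1.00, 0.30])
p2 = gibbs(T2, beta=2.0)
print(np.argmax(T), np.argmax(T2))    # 0 1 : hard-max selection is unstable
print(np.abs(p - p2).max() < 0.02)    # True: the randomized selection is nearly unchanged
```

This is the stability property noted above: hard argmax selection changes discontinuously under a 0.01 perturbation, while the exponential-weights probabilities move by less than 0.01.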
To illustrate the effect of randomized selection, we use simulations to explore the trade-off between bias and accuracy. We consider the following simple, max-entropy randomization scheme:

Take as input parameters and , and observations . Here is the inverse temperature in the Gibbs distribution and is the number of values we need to select.

Sample, without replacement, indices from the distribution given in (1). Report the corresponding values.
We consider settings with two groups of signals: after relabeling, assume the first group carries a true signal and the rest do not. We define the bias of the selection to be the average difference between the reported values and their true signals, and the accuracy of the selection to be the fraction of reported values with true signal. In Figure 3, we illustrate the trade-off between accuracy and bias when there are many more false signals than true signals, for a fixed randomization strength, with the signal strength varying from 1 to 5. Consistent with the theoretical analysis, max-entropy selection significantly decreases bias. In the low-signal regime, both rank selection and max-entropy selection have low accuracy because the signal is overwhelmed by the large number of false positives. In the high-signal regime, both selection methods have accuracy close to one, and max-entropy selection has significantly less bias. In the intermediate regime, max-entropy selection has substantially less bias but is less accurate than rank selection.
5.2 Randomization for a multi-step analyst
We next study how randomization can decrease information usage and bias even when we have very little knowledge of what the analyst is doing. To illustrate this idea, we analyze in detail a simple example of a very flexible data analyst who performs multiple steps of analysis. Flexibility in multi-step data analysis presents a challenge to current statistical approaches for quantifying selection bias. Recent developments in post-selection inference have focused on settings where the selection rule is simple and analytically tractable, and the full analysis procedure is fixed and specified before any data analysis is performed. While powerful results can be derived in this framework (including exact bias corrections and valid post-selection confidence intervals [16, 29]), these methods do not apply to exploratory analysis, where the procedure can be quite flexible.
In this section, we show how our mutual information framework can be used to analyze bias for a flexible multi-step analyst. We show that even if one does not know, or cannot fully describe, the selection procedure, one can control its bias by controlling the information it uses. The main idea is to inject a small amount of randomization at each step of the analysis. This randomization is guaranteed to keep the information usage low no matter what the analyst does.
The idea of adding randomization during data analysis to reduce overfitting has been implemented as a practical rule of thumb in several communities. Particle physicists, for example, have advocated blind data analysis: when deciding which results to report, the analyst interacts with a dataset that has been obfuscated through various means, such as adding noise to observations, removing some data points, or switching data labels. The raw, uncorrupted dataset is only used in computing the final reported values [24]. Adding noise is also closely related to a recent line of work inspired by differential privacy [4, 13, 14, 19].
A model of a flexible, multi-step analyst.
We consider a model of adaptive data analysis similar to that of [14, 13]. In this setting, the analyst learns about the data by running a series of analyses on the dataset. Each analysis is modeled by a function of the data, and the choice of which analysis to run may depend on the results of all the earlier analyses. More formally, we define the model as follows:

At step 1, the analyst selects a statistic to query and observes a result .

In the th iteration, the analyst chooses a statistic as a function of the results she has received so far, and receives a result.

After all iterations, the analyst selects the statistic to report as a function of the full history of results.
The simplest setting is when the result of each analysis is just the empirical value of the queried statistic on the data. An example of this is the rank selection considered before: at the th step, the th statistic is queried (i.e. the order is fixed and does not depend on previous results) and its empirical value is returned. The analyst queries all the statistics and reports the one with maximal value.
In general, we allow the analysis output to differ from the empirical value of the statistic; a particularly useful form is a noisy response. This captures blind-analysis settings, where the analyst intentionally adds noise throughout the data analysis in order to reduce overfitting. A natural goal is to ensure that, for every query used in the adaptive analysis, the reported result is close to the true value. We will show, by analyzing the information usage, that noise addition can indeed guarantee such accuracy.
This adaptive analysis protocol can be viewed as a Markov chain
By the data processing inequality [11], . Therefore, a procedure that controls the mutual information between the history of feedback and the statistics will automatically control the information usage. By exploiting the structure of the adaptive analysis model, we can decompose the cumulative mutual information into a sum of per-step terms. This is formalized in the following composition lemma for mutual information.
Lemma 1.
Let denote the history of interaction up to time . Then, under the adaptive analysis model
The important takeaway from this lemma is that by bounding the conditional mutual information between the response and the queried value at each step, we can bound the cumulative mutual information, and hence bound the bias after many rounds of adaptive queries. Given a dataset, we can imagine the analyst having a (mutual) information budget, which is decided a priori based on the size of the data and her tolerance for bias. At each step of the adaptive data analysis, the analyst's choice of statistic to query next (as a function of her analysis history) incurs an information cost. The information costs accumulate additively over the analysis steps until the budget is exhausted, at which point the guarantee on bias requires the analysis to stop.
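This bookkeeping can be sketched as follows. The `InformationBudget` class is hypothetical, and the illustrative per-step cost uses the Gaussian formula of Lemma 2; the budget value itself is arbitrary.

```python
import math

class InformationBudget:
    """Hypothetical bookkeeping for the additive information budget described above.

    Per-step costs (in nats) must be supplied by the caller; by the composition
    lemma (Lemma 1) they accumulate additively until the budget is exhausted."""

    def __init__(self, budget_nats):
        self.budget = budget_nats
        self.spent = 0.0

    def charge(self, step_cost_nats):
        if self.spent + step_cost_nats > self.budget:
            raise RuntimeError("information budget exhausted: stop the analysis")
        self.spent += step_cost_nats
        return self.budget - self.spent   # remaining budget

acct = InformationBudget(budget_nats=1.0)
# e.g. a Gaussian response with signal-to-noise ratio 0.1 costs 0.5*log(1 + 0.1) nats
cost = 0.5 * math.log(1.1)
steps = 0
try:
    while True:
        acct.charge(cost)
        steps += 1
except RuntimeError:
    pass
print(steps)   # 20 queries fit within a 1-nat budget at this per-step cost
```

Raising the distortion (lowering the per-step signal-to-noise ratio) lowers the cost of each query, allowing more queries before the budget forces the analysis to stop.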
A trivial way to reduce mutual information is to return a response that is independent of the query, in which case the analyst learns nothing about the data and incurs no bias. However, in order for the data to be useful to the analyst, we would like the query responses to also be accurate.
Adding randomization to reduce bias.
As before, let denote the true answer of the th query. If each is sub-Gaussian, then, using Proposition 3.2, we can bound the average excess error of the response by the sum of two terms,
Response accuracy degrades with distortion, a measure of the magnitude of the noise added to responses, but this distortion also controls the degree of selection bias in future rounds. We will explicitly analyze the tradeoff between these terms in a stylized case of the general model.
Gaussian noise protocol.
We analyze the following special case.

Suppose and is jointly Gaussian for any .

For the th query, the protocol returns a distorted response obtained by adding Gaussian noise. Note that, unlike the true answers, the noise terms are drawn independently across queries.
The term can be thought of as the number of samples in the dataset. Indeed, if the statistic is the empirical average of samples from a distribution, then its variance scales inversely with the sample size. The ratio is the signal-to-noise ratio of the th response. We want to choose the distortion levels so as to guarantee that a large number of queries can be answered accurately. To do this, we will use the next lemma to relate the distortion levels to the information provided by a response. The lemma gives a closed form for the mutual information when the statistic and the added noise are independent Gaussian random variables. As one would expect, it shows that the mutual information is very small when the variance of the noise is much larger than the variance of the statistic. Lemma 3, provided in the Supplementary Information, gives a similar result when the statistic is a general (not necessarily Gaussian) random variable.
Lemma 2.
If and where is independent of , then
where is the signal to noise ratio.
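Lemma 2 can be checked numerically in the jointly Gaussian case, using the identity that for a Gaussian pair the mutual information equals -(1/2)log(1 - rho^2), where rho is the correlation (the variance values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, s2 = 1.0, 4.0                    # var(statistic) and var(noise); SNR = 0.25

T = rng.normal(0.0, np.sqrt(sigma2), 200_000)
Y = T + rng.normal(0.0, np.sqrt(s2), T.size)   # distorted response

# For jointly Gaussian (T, Y): I(T; Y) = -0.5*log(1 - rho^2) = 0.5*log(1 + SNR)
rho = np.corrcoef(T, Y)[0, 1]
I_hat = -0.5 * np.log(1.0 - rho**2)
I_formula = 0.5 * np.log(1.0 + sigma2 / s2)
print(abs(I_hat - I_formula) < 0.01)     # True: the estimate matches Lemma 2
```

With a signal-to-noise ratio of 0.25, each response reveals only about 0.11 nats, consistent with the intuition that heavy distortion makes each query cheap in information.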
Using Lemma 2, we provide an explicit bound on the accuracy of the responses as a function of and . Note that this result places no restriction on the procedure that generates the queries, except that each choice can depend on the data only through what is available at that time.
Proposition 5.1.
Suppose and is jointly Gaussian for any . If for the th query, where and is independent of , then for every
where denotes a universal constant that is independent of and .
If the sequence of choices were non-adaptive, simply returning responses without any noise would already guarantee accurate answers. In the adaptive model, the first few queries are still answered with accuracy of order , but the error increases for later queries. This illustrates a fundamental tension: the longer the analyst explores the data, the more likely the later analyses are to overfit.
The factor can roughly be viewed as the worst-case price of adaptivity. It is worth emphasizing that this price would be more severe if the system returned responses without any noise: in that case the error can be as large as , as demonstrated in Example 1 in the Supplementary Information. Therefore, adding noise offers a fundamental improvement in attainable performance.
A similar insight was obtained by [12], who noted that by adding Laplace noise it is possible to answer up to queries accurately, whereas without noise accuracy degrades after queries. In the Gaussian case, it is clear from our bound that as , all queries will be answered accurately as long as .
6 Discussion
We have introduced a general information usage approach to quantify bias that arises from data exploration. While we focus on bias, we show that our mutual-information-based metric can be used to bound other error metrics of interest, such as the average absolute error. It is interesting to note that the same information-usage quantity also appears naturally in lower bounds on error, suggesting it may be fundamentally linked to exploration bias. This paper established lower bounds when the selection process corresponds to solving optimization problems. An interesting direction for research is to understand more general exploration procedures in which information usage provides a tight approximation to bias.
One advantage of using mutual information to bound bias is that many tools are available for analyzing and computing mutual information. This conceptual framework allows us to extract insight into when common data analysis procedures lead to severe bias and when they do not. In particular, we show how signal in the data can reduce selection bias. Information usage also suggests engineering approaches that reduce mutual information (and hence bias) by adding randomization to each step of the data exploration. Another important project is to investigate implementations of such randomization approaches in practical analytic settings.
As discussed before, the information usage framework proposed here is very much complementary to the exciting developments in post-selection inference and differential privacy. Post-selection inference, for very specific settings, is able to exactly characterize and correct for exploration biases; in that case the exploration is feature and model selection. Differential privacy lies at the other extreme, in that it derives powerful but potentially conservative results that apply to an adversarial data analyst. The modern practice of data science often lies between these two extremes: the analyst has more flexibility than assumed in post-selection inference, but is also interested in finding true signals and hence is much less adversarial than the worst case. Information usage provides a bound on exploration bias in all of these settings. It is also important that this bound is data-dependent. In practice, the same analyst may be much less prone to false discoveries when exploring a high-signal dataset than a low-signal dataset, and this should be reflected in the bias metric. An interesting goal is to develop approaches that combine the sharpness of post-selection inference and differential privacy with the generality of information usage.
References
 Belloni et al. [2014] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(287):608–650, 2014.
 Benjamini and Hochberg [1995] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.
 Benjamini and Yekutieli [2001] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165–1188, 2001.

Blum and Hardt [2015] Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. In ICML 2015, 2015.
 Bourgon et al. [2010] Richard Bourgon, Robert Gentleman, and Wolfgang Huber. Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences, 107(21):9546–9551, 2010.
 Bousquet and Elisseeff [2002] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

Bousquet et al. [2004] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced lectures on machine learning, pages 169–207. Springer, 2004.
 Buldygin and Moskvichova [2013] V. Buldygin and K. Moskvichova. The sub-Gaussian norm of a binary random variable. Theory of Probability and Mathematical Statistics, 86:33–49, 2013.
 CesaBianchi and Lugosi [2006] Nicolo CesaBianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
 Chaudhuri et al. [2011] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
 Cover and Thomas [2012] T.M. Cover and J.A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Dwork et al. [2014] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In STOC 2015. ACM, 2014.
 Dwork et al. [2015a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015a.
 Dwork et al. [2015b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015b.
 Efron et al. [2004] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
 Fithian et al. [2014] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
 Gelman et al. [2014] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis, volume 2. Taylor & Francis, 2014.
 Gray [2011] R.M. Gray. Entropy and information theory. Springer, 2011.
 Hardt and Ullman [2014] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 454–463. IEEE, 2014.
 Javanmard and Montanari [2014] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for highdimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
 Lee et al. [2011] I Lee, G Lushington, and M Visvanathan. A filterbased feature selection approach for identifying potential biomarkers for lung cancer. Journal of Clinical Bioinformatics, 2011.
 Lee et al. [2016] Jason D Lee, Dennis L Sun, Yuekai Sun, Jonathan E Taylor, et al. Exact postselection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
 Lockhart et al. [2014] Richard Lockhart, Jonathan Taylor, Ryan J Tibshirani, and Robert Tibshirani. A significance test for the lasso. Annals of statistics, 42(2):413, 2014.
 MacCoun and Perlmutter [2015] Robert MacCoun and Saul Perlmutter. Blind analysis: Hide results to seek the truth. Nature, 526(7572):187–189, 2015.
 McAllester [2013] David McAllester. A pacbayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.
 Poggio et al. [2004] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.
 ShalevShwartz et al. [2010] Shai ShalevShwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.
 Simmons et al. [2011] Joseph P Simmons, Leif D Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, page 0956797611417632, 2011.
 Taylor and Tibshirani [2015] Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
 Taylor et al. [2014] Jonathan Taylor, Richard Lockhart, Ryan J Tibshirani, and Robert Tibshirani. Exact postselection inference for forward stepwise and least angle regression. arXiv preprint arXiv:1401.3889, 2014.
 Van de Geer et al. [2014] Sara Van de Geer, Peter Buhlmann, Yaacov Ritov, Ruben Dezeure, et al. On asymptotically optimal confidence regions and tests for highdimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
 Wainwright [2015] Martin Wainwright. Basic tail and concentration bounds. 2015.
 Wu et al. [2016] Siqi Wu, Antony Joseph, Ann S Hammonds, Susan E Celniker, Bin Yu, and Erwin Frise. Stabilitydriven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proceedings of the National Academy of Sciences, page 201521171, 2016.
 Zou et al. [2014] James Zou, Christoph Lippert, David Heckerman, Martin Aryee, and Jennifer Listgarten. Epigenomewide association studies without the need for celltype composition. Nature Methods, pages 309–11, 2014.
Appendix A Proofs of Information Usage Upper Bounds
A.1 Information Usage Upper Bounds Bias: Proof of Proposition 3.1
The proof of Proposition 3.1 relies on the following variational form of the Kullback–Leibler divergence, which is given in Theorem 5.2.1 of Robert Gray's textbook Entropy and Information Theory [18].
Fact 1.
Fix two probability measures and defined on a common measurable space. Suppose that is absolutely continuous with respect to . Then
where the supremum is taken over all random variables such that the expectation of under is well defined, and is integrable under .
Proof of Proposition 3.1.
Applying Fact 1 with , , and , we have
where . Taking the derivative with respect to , we find that the optimizer is . This gives
By the tower property of conditional expectation and Jensen’s inequality
∎
Remark. In the first step of the proof of Proposition 3.1, we used the fact that, for all ,
which follows from the data processing inequality. The application of this inequality is not tight in general and can lead to gaps between the actual bias and our upper bound based on mutual information. Consider the following scenario: suppose the selection is a deterministic function that uses the realized value of one statistic to decide which other statistic to select, choosing one statistic if the first takes one value, another if it takes a different value, and so on. Here the selection is deterministic, and this is manifested in a positive mutual information. However, if the first statistic is independent of each of the others, then the bias is 0. The upper bound of Proposition 3.1 is tight in other settings; it is also useful in general because mutual information is amenable to analysis and explicit calculation. In cases where there is a gap, one may study the bias directly.
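The tight regime can also be seen numerically: selecting the maximum of n independent standard normal statistics (all with true mean zero) produces a selection bias that approaches, but never exceeds, the bound of order sqrt(2 log n) obtained from Proposition 3.1 by bounding the mutual information with the entropy of the selection. A small sketch (parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 2000

# Report the max of n independent N(0,1) statistics. Every true mean is 0,
# so the average reported value is pure selection bias.
bias = rng.standard_normal((reps, n)).max(axis=1).mean()
bound = np.sqrt(2 * np.log(n))    # sqrt(2 * sigma^2 * I) with I <= H(selection) <= log n
print(0 < bias < bound)           # True: the bias approaches but stays below the bound
```

For n = 1000 the simulated bias is roughly 3.2 against a bound of about 3.7, illustrating that the mutual-information bound can be nearly achieved by hard max-selection over exchangeable nulls.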
A.2 Extension to Unequal Variances
We can prove a generalization of Proposition 3.1 for settings when the estimates have unequal variances.
Proposition A.1.
Suppose that for each , the estimate is sub-Gaussian. Then,