How much does your data exploration overfit? Controlling bias via information usage

11/16/2015
by Daniel Russo, et al.

Modern data is messy and high-dimensional, and it is often not clear a priori what are the right questions to ask. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, where the choice of analysis to be performed next depends on the results of the previous analyses on the same data. Ultimately, which results are reported can be heavily influenced by the data. It is widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while this adaptivity renders standard statistical theory invalid, experience suggests that different types of exploratory analysis can lead to disparate levels of bias, and the degree of bias also depends on the particulars of the data set. In this paper, we propose a general information usage framework to quantify and provably bound the bias and other error metrics of an arbitrary exploratory analysis. We prove that our mutual information based bound is tight in natural settings, and then use it to give rigorous insights into when commonly used procedures do or do not lead to substantially biased estimation. Through the lens of information usage, we analyze the bias of specific exploration procedures such as filtering, rank selection and clustering. Our general framework also naturally motivates randomization techniques that provably reduce exploration bias while preserving the utility of the data analysis. We discuss the connections between our approach and related ideas from differential privacy and blinded data analysis, and supplement our results with illustrative simulations.


1 Introduction

Modern data is messy and high dimensional, and it is often not clear a priori what is the right analysis to perform. To extract the most insight, the analyst typically needs to perform exploratory analysis to make sense of the data and identify interesting hypotheses. This is invariably an adaptive process; patterns observed in the first stages of analysis inform which tests are run next, and the process iterates. Ultimately, the data itself may influence which results the analyst chooses to report, introducing researcher degrees of freedom: an additional source of over-fitting that is not accounted for in reported statistical estimates [28]. Even if the analyst is well-intentioned, this exploration can lead to false discovery or large bias in reported estimates.

The practice of data-exploration is largely outside the domain of classical statistical theory. Standard tools of multiple hypothesis testing and false discovery rate (FDR) control assume that all the hypotheses to be tested, and the procedure for testing them, are chosen independently of the dataset. Any “peeking” at the data before committing to an analysis procedure renders classical statistical theory invalid. Nevertheless, data exploration is ubiquitous, and folklore and experience suggest the risk of false discoveries differs substantially depending on how the analyst explores the data. This creates a glaring gap between the messy practice of data analysis, and the standard theoretical frameworks used to understand statistical procedures. In this paper, we aim to narrow this gap. We develop a general framework based on the concept of information usage and systematically study the degree of bias introduced by different forms of exploratory analysis, in which the choice of which function of the data to report is made after observing and analyzing the dataset.

To concretely illustrate the challenges of data exploration, consider two data scientists Alice and Bob.

Example 1.

Alice has a dataset of 1000 individuals for a weight-loss biomarker study. For each individual, she has their weight measured at 3 time points and the current expression values of 2000 genes assayed from blood samples. There are three possible weight changes that Alice could have looked at—the difference between time points 1 and 2, 2 and 3, or 1 and 3—but Alice decides ahead of time to only analyze the weight change between 1 and 3. She computes the correlation across individuals between the expression of each gene and the weight change, and reports the gene with the highest correlation along with its value. This is a canonical setting where we have tools for controlling error in multiple-hypothesis testing and the false-discovery rate (FDR). It is well-recognized that even if the reported gene passes the multiple-testing threshold, its correlation in independent replication studies tends to be smaller than the reported correlation in the current study. This phenomenon is also called the Winner's Curse selection bias.
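To make the Winner's Curse concrete, here is a minimal simulation sketch (hypothetical all-null setting, not Alice's actual data): every gene has zero true correlation with weight change, yet the largest reported correlation is systematically positive while a replication of the same gene is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_genes, n_runs = 1000, 2000, 20

def all_corrs(expr, y):
    """Sample correlation of every gene's expression with the outcome y."""
    ez = (expr - expr.mean(0)) / expr.std(0)
    yz = (y - y.mean()) / y.std()
    return ez.T @ yz / len(y)

reported, replicated = [], []
for _ in range(n_runs):
    # All genes are null: the true correlation with weight change is zero.
    expr = rng.standard_normal((n_individuals, n_genes))
    weight_change = rng.standard_normal(n_individuals)
    corrs = all_corrs(expr, weight_change)
    top = int(np.argmax(corrs))
    reported.append(corrs[top])
    # Replication study: fresh data, same selected gene.
    expr2 = rng.standard_normal((n_individuals, n_genes))
    wc2 = rng.standard_normal(n_individuals)
    replicated.append(all_corrs(expr2, wc2)[top])

print(f"mean reported correlation:    {np.mean(reported):.3f}")   # systematically > 0
print(f"mean replication correlation: {np.mean(replicated):.3f}") # close to 0
```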

Example 2.

Bob has the same data, and he performs some simple data exploration. He first uses data visualization to investigate the average expression of all the genes across all the individuals at each of the time points, and observes that there is very little difference between times 1 and 2 and a large jump between times 2 and 3 in the average expression. So he decides to focus on these latter two time points. Next, he realizes that half of the genes always have low expression values and decides to simply filter them out. Finally, he computes the correlations between the expression of the 1000 post-filtered genes and the weight change between times 2 and 3. He selects the gene with the largest correlation and reports its value. Bob's analysis consists of three steps, and the results of each step depend on the results and decisions made in the previous steps. This adaptivity in Bob's exploration makes it difficult to apply standard statistical frameworks. We suspect there is also a selection bias here, leading to the reported correlation being systematically larger than the real correlation if the selected gene is tested again. How do we think about and quantify the selection bias and overfitting due to this more complex data exploration? When is it larger or smaller than Alice's selection bias?

The toy examples of Alice and Bob illustrate several subtleties of bias due to data exploration. First, the adaptivity of Bob's analysis makes it more difficult to quantify its bias compared to Alice's analysis. Second, for the same analysis procedure, the amount of selection bias depends on the dataset. Take Alice for example: if across the population one gene is substantially more correlated with weight change than all other genes, then we expect the magnitude of the Winner's Curse to decrease. Third, different steps of data exploration introduce different amounts of selection bias. Intuitively, Bob's visualization of aggregate expression values in the beginning should not introduce as much selection bias as his selection of the top gene at the last step.

This paper introduces a mathematical framework to formalize these intuitions and to study selection bias from data exploration. The main tool we develop is a metric of the bad information usage in the data exploration. The true signal in a dataset is the signal that is preserved in a replication dataset, and the noise is what changes across different replications. Using Shannon’s mutual information, we quantify the degree of dependence between the noise in the data and the choice of which result is reported. We then prove that the bias of an arbitrary data-exploration process is bounded by this measure of its bad information usage. This bound provides a quantitative measure of researcher degrees of freedom, and offers a single lens through which we investigate different forms of exploration.

In Section 2, we present a general model of exploratory data analysis that encompasses the procedures used by Alice and Bob. Then we define information usage and show how it upper and lower bounds various measures of bias and estimation error due to data exploration in Section 3. In Section 4, we study specific examples of data exploration through the lens of information usage, which gives insight into Bob's practices of filtering, visualization, and maximum selection. Information usage naturally motivates randomization approaches to reduce bias, and we explore this in Section 5. In Section 5, we also study a model of a data analyst who, like Bob, interacts adaptively with the data many times before selecting values to report.

2 A Model of Data Exploration

We consider a general framework in which a dataset $X$ is drawn from a probability distribution $P$ over a set of possible datasets $\mathcal{X}$. The analyst is considering a large number of possible analyses on the data, but wants to report only the most interesting results. She decides to report the result of a single analysis, and chooses which one after observing the realized dataset $X$, or some summary statistics of $X$. More formally, the data analyst considers $m$ functions of the data, $\phi_1, \dots, \phi_m : \mathcal{X} \to \mathbb{R}$, where $\phi_i(x)$ denotes the output of the $i$th analysis on the realization $X = x$. Each function $\phi_i$ is typically called an estimator; each $\phi_i(X)$ is an estimate or statistic calculated from the sampled data, and is a random variable due to the randomness in the realization of $X$. After observing the sampled data, the analyst chooses to report the value $\phi_T(X)$ for a selected index $T = T(X) \in \{1,\dots,m\}$. The selection rule $T$ captures how the analyst uses the data and chooses which result to report. Because the choice made by $T$ is itself a function of the sampled data, the reported value $\phi_T(X)$ may be significantly biased. For example, $\mathbb{E}[\phi_T(X)]$ could be very far from zero even if each fixed function $\phi_i$ has zero mean.

Note that although the number of estimators $m$ is assumed to be finite, it could be arbitrarily large; in particular, $m$ can be exponential in the number of samples in the dataset. The $\phi_i$'s represent the set of all estimators that the analyst potentially could have considered during the course of exploration. Also, while for simplicity we focus on the case where exactly one estimate is selected and reported, our results apply in settings where the analyst selects and reports many estimates. (For example, if the analyst chooses to report $k$ results, our framework can be used to bound the average bias of the $k$ reported values by letting $T$ be a random draw from the $k$ selected analyses.)

Example 1.

For Alice, $X$ is a 1000-by-2003 matrix, where the rows are the individuals and the columns are the 2000 genes plus the three possible weight changes. Here there are $m = 2000$ potential estimators, and $\phi_i(X)$ is the correlation between the $i$th gene and the weight change between times 1 and 3. Alice's analysis corresponds to the selection procedure $T = \arg\max_i \phi_i(X)$.

Example 2.

Bob has the same dataset $X$. Because his exploration could have led him to use any of the three possible weight-change measures, the set of potential estimators are the correlations between the expression of one gene and one of the three weight changes, and there are $m = 2000 \times 3 = 6000$ such $\phi_i$'s. Bob's adaptive exploration also corresponds to a selection procedure $T$ that takes the dataset and picks out a particular correlation value to report.

Selection Bias.

Denote the true value of estimator $\phi_i$ as $\mu_i = \mathbb{E}[\phi_i(X)]$; this is the value that we expect if we apply $\phi_i$ to multiple independent replication datasets. On a particular dataset $X$, if $T$ is the selected test, the output of data exploration is the value $\phi_T(X)$. The output and true value can be written more concisely as $\phi_T(X)$ and $\mu_T$. The difference $\phi_T(X) - \mu_T$ captures the error in the reported value. We are interested in quantifying the bias due to data exploration, which is defined as the average error $\mathbb{E}[\phi_T(X) - \mu_T]$. We will also quantify other metrics of error, such as the expected absolute error $\mathbb{E}\,|\phi_T(X) - \mu_T|$ and the squared error $\mathbb{E}[(\phi_T(X) - \mu_T)^2]$. In each case, the expectation is over all the randomness in the dataset and any intrinsic randomness in $T$.
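As a rough illustration of these definitions, the sketch below estimates the bias, absolute error and squared error of a selection rule by Monte Carlo; the sample-mean estimators, the rank-selection rule and all parameters are hypothetical choices of ours used only for illustration.

```python
import numpy as np

def exploration_error(sample_data, estimators, true_values, select, n_reps=2000, seed=0):
    """Monte Carlo estimates of E[phi_T - mu_T], E|phi_T - mu_T| and E[(phi_T - mu_T)^2]."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_reps):
        X = sample_data(rng)
        phi = np.array([est(X) for est in estimators])
        t = select(phi)                       # the selection may depend on the data
        errs.append(phi[t] - true_values[t])
    errs = np.array(errs)
    return errs.mean(), np.abs(errs).mean(), (errs ** 2).mean()

# Example: m null sample means over n points; report the largest one (rank selection).
m, n = 20, 100
bias, abs_err, sq_err = exploration_error(
    sample_data=lambda rng: rng.standard_normal((n, m)),
    estimators=[lambda X, j=j: X[:, j].mean() for j in range(m)],
    true_values=np.zeros(m),
    select=np.argmax,
)
print(bias, abs_err, sq_err)   # bias is roughly sqrt(2*log(m)/n), not 0
```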

Related work.

There is a large body of work on methods for providing meaningful statistical inference and preventing false discovery. Much of this literature has focused on controlling the false discovery rate in multiple-hypothesis testing where the hypotheses are not adaptively chosen [2, 3]. Another line of work studies confidence intervals and significance tests for parameter estimates in sparse high-dimensional linear regression (see [1, 31, 20, 23] and the references therein).

One recent line of work [16, 29] proposes a framework for assigning significance and confidence intervals in selective inference, where model selection and significance testing are performed on the same dataset. These papers correct for selection bias by explicitly conditioning on the event that a particular model was chosen. While some powerful results can be derived in the selective inference framework (e.g. [30, 22]), it requires that the conditional distribution is known and can be directly analyzed. This requires that the candidate models and the selection procedure are mathematically tractable and specified by the analyst before looking at the data. Our approach does not explicitly adjust for selection bias, but it enables us to formalize insights that apply to very general selection procedures. For example, the selection rule could represent the choice made by a data-analyst, like Bob, after performing several rounds of exploratory analysis.

A powerful line of work in computer science and learning theory [6, 26, 27] has explored the role of algorithmic stability in preventing overfitting. Related to stability is PAC-Bayes analysis, which provides powerful generalization bounds in terms of KL-divergence [25]. There are two key differences between stability and our framework of information usage. First, stability is typically defined in the worst-case setting and is agnostic of the data distribution. An algorithm is stable if, no matter the data distribution, changing one training point does not affect the predictions too much. Information usage gives more fine-grained bias bounds that depend on the data distribution. For example, in Section 4.3 we show that the same learning algorithm has lower bias and lower information usage as the signal in the data increases. The second difference is that stability analysis has traditionally been applied to prediction problems, i.e. to bounding generalization loss in prediction tasks. Information usage applies to prediction (e.g. $\phi_i(X)$ could be the squared loss of a classifier), but it also applies to model estimation, where $\phi_i(X)$ could be the value of the $i$th parameter.

Exciting recent work in computer science [4, 19, 13, 14] has leveraged the connection between algorithmic stability and differential privacy to design specific differentially private mechanisms that reduce bias in adaptive data analysis. In this framework, the data analyst interacts with a dataset indirectly, and sees only the noisy output of a differentially private mechanism. In Section 5, we discuss how information usage also motivates using various forms of randomization to reduce bias. In the Appendix, we discuss the connections between mutual information and a recently introduced measure called max-information [14]. The results from this privacy literature are designed for worst-case, adversarial data analysts. We provide guarantees that vary with the selection rule, but apply to all possible selection procedures, including ones that are not differentially private. The results in algorithmic stability and differential privacy are complementary to our framework: these approaches are specific techniques that guarantee low bias for worst-case analysts, while our framework quantifies the bias of any general data-analyst.

Finally, it is also important to note the various practical approaches used in specific settings to quantify or reduce bias from exploration. Using random subsets of data for validation is a common prescription against overfitting. This is feasible if the data points are independent and identically distributed samples. However, for structured data—e.g. time-series or network data—it is not clear how to create a validation set. The bounds on overfitting we derive based on information usage do not assume independence and apply to structured data. Special cases of selection procedures corresponding to filtering by summary statistics of biomarkers [5] and selecting matrix factorizations based on a stability criterion [33] have been studied. The insights from these specific settings agree with our general result that low information usage limits selection bias.

3 Controlling Exploration Bias via Information Usage

Information usage upper bounds bias.

In this paper, we bound the degree of bias in terms of an information-theoretic quantity: the mutual information between the choice $T$ of which estimate to report and the actual realized values of the estimates $\phi(X) = (\phi_1(X),\dots,\phi_m(X))$. We state this result in a general framework, where $T$ and $\phi(X)$ are random variables defined on a common probability space. Let $\mu_i = \mathbb{E}[\phi_i(X)]$ denote the mean of $\phi_i(X)$. Recall that a real-valued random variable $Z$ is $\sigma$–sub-Gaussian if $\mathbb{E}[e^{\lambda(Z - \mathbb{E}[Z])}] \le e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb{R}$, so that the moment generating function of $Z - \mathbb{E}[Z]$ is dominated by that of a normal random variable with variance $\sigma^2$. Zero-mean Gaussian random variables are sub-Gaussian, as are bounded random variables.

Proposition 3.1.

If $\phi_i(X)$ is $\sigma$–sub-Gaussian for each $i$, then

$$\bigl|\,\mathbb{E}[\phi_T(X) - \mu_T]\,\bigr| \;\le\; \sigma\sqrt{2\,I\bigl(T;\,\phi(X)\bigr)},$$

where $I(\cdot\,;\cdot)$ denotes mutual information. (The mutual information between two random variables $A$ and $B$ is defined as $I(A;B) = D_{\mathrm{KL}}\bigl(P_{(A,B)}\,\|\,P_A \otimes P_B\bigr)$, the Kullback–Leibler divergence between the joint distribution and the product of the marginals.)

The randomness of $\phi_i(X)$ is due to the randomness in the realization of the data $X$. This captures how each estimate varies if a replication dataset is collected, and hence captures the noise in the statistics. The mutual information $I(T;\phi(X))$, which we call information usage, then quantifies the dependence of the selection process on the noise in the estimates. Intuitively, a selection process that is more sensitive to the noise (higher $I(T;\phi(X))$) is at a greater risk for bias. We will also refer to $I(T;\phi(X))$ as bad information usage to highlight the intuition that it really captures how much information about the noise in the data goes into selecting which estimate to report. We normally think of data analysis as trying to extract the good information, i.e. the true signal, from data. The more bad information is used, the more likely the analysis procedure is to overfit.

When $T$ is determined entirely from the values $\phi(X) = (\phi_1(X),\dots,\phi_m(X))$, the mutual information $I(T;\phi(X))$ is equal to the entropy $H(T)$. This quantifies how much $T$ varies over different independent replications of the data.

The parameter $\sigma$ provides the natural scaling for the values of $\phi_i(X)$. The condition that $\phi_i(X)$ is $\sigma$-sub-Gaussian ensures that its tail is not too heavy. In the Supplementary Information, we show how this condition can be relaxed to treat cases where $\phi_i(X)$ is a sub-exponential random variable (Proposition A.2), as well as settings where the $\phi_i(X)$'s have different scalings $\sigma_i$ (Proposition A.1).

Proposition 3.1 applies in a very general setting. The magnitude of overfitting depends on the generating distribution of the dataset and on the size of the dataset, and this is all implicitly captured by the mutual information $I(T;\phi(X))$. For example, a common type of estimate of interest is the sample average $\phi_i(X) = \frac{1}{n}\sum_{j=1}^{n} f_i(x_j)$ of some function $f_i$ based on an i.i.d. sequence $X = (x_1,\dots,x_n)$. Note that if $f_i(x_j)$ is sub-Gaussian with parameter $\sigma$, then $\phi_i(X)$ is sub-Gaussian with parameter $\sigma/\sqrt{n}$ and therefore

$$\bigl|\,\mathbb{E}[\phi_T(X) - \mu_T]\,\bigr| \;\le\; \sigma\sqrt{\frac{2\,I(T;\phi(X))}{n}}.$$

To illustrate Proposition 3.1, we consider two extreme settings: one where $T$ is chosen independently of the data and one where $T$ heavily depends on the values of all the $\phi_i(X)$'s. The subsequent sections will investigate the applications of information usage in depth in settings that interpolate between these two extremes.

Example: data-agnostic exploration.

Suppose $T$ is independent of $\phi(X)$. This may happen if the choice of which estimate to report is decided ahead of time and cannot change based on the actual data. It may also occur when the dataset can be split into two statistically independent parts, and separate parts are reserved for data exploration and estimation. In such cases, one expects there is no bias because the selection does not depend on the actual values of the estimates. This is reflected in our bound: since $T$ is independent of $\phi(X)$, $I(T;\phi(X)) = 0$ and therefore $\mathbb{E}[\phi_T(X) - \mu_T] = 0$.

Example: maximum of Gaussians.

Suppose each $\phi_i(X)$ is an independent sample from the zero-mean normal $N(0,\sigma^2)$. If $T = \arg\max_i \phi_i(X)$, then $I(T;\phi(X)) = H(T) = \log m$ because all $\phi_i$'s are symmetric and have an equal chance of being selected by $T$. Applying Proposition 3.1 gives $\mathbb{E}[\max_i \phi_i(X)] \le \sigma\sqrt{2\log m}$. This is the well-known inequality for the maximum of Gaussian random variables. Moreover, it is also known that this inequality approaches equality as the number of Gaussians, $m$, increases, implying that information usage precisely measures the bias of max-selection in this setting. It is illustrative to also consider a more general selection rule $T$ which first ranks the $\phi_i(X)$'s from the largest to the smallest and then uniformly randomly selects one of the $k$ largest to report. Here $I(T;\phi(X)) = H(T) - H(T\mid\phi(X))$, where $H(T) = \log m$ (by the symmetry of the $\phi_i$'s, as before) and $H(T\mid\phi(X)) = \log k$ (since given the values of the $\phi_i$'s there is still uniform randomness over which of the top $k$ is selected). We immediately have the following corollary.

Corollary 1.

Suppose that for each $i$, $\phi_i(X)$ is a zero-centered sub-Gaussian random variable with parameter $\sigma$. Let $\phi_{(1)}(X) \ge \dots \ge \phi_{(m)}(X)$ denote the values of $\phi_1(X),\dots,\phi_m(X)$ sorted from the largest to the smallest. Then

$$\mathbb{E}\Bigl[\frac{1}{k}\sum_{j=1}^{k}\phi_{(j)}(X)\Bigr] \;\le\; \sigma\sqrt{2\log(m/k)}.$$

In Appendix B, we show that this bound is tight as $m$ and $k$ increase.
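A quick numerical check of these two bounds, with hypothetical values of $\sigma$, $m$ and $k$, is sketched below: the simulated expectations of the maximum and of the top-$k$ average sit just below $\sigma\sqrt{2\log m}$ and $\sigma\sqrt{2\log(m/k)}$, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m, k, n_reps = 1.0, 1000, 10, 5000

samples = sigma * rng.standard_normal((n_reps, m))
top_k = -np.sort(-samples, axis=1)[:, :k]            # k largest values in each run

print(f"E[max]         ~ {top_k[:, 0].mean():.3f}"
      f"   vs bound {sigma * np.sqrt(2 * np.log(m)):.3f}")
print(f"E[top-{k} mean] ~ {top_k.mean():.3f}"
      f"   vs bound {sigma * np.sqrt(2 * np.log(m / k)):.3f}")
```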

Information usage bounds other metrics of exploration error.

So far we have discussed how mutual information upper bounds the bias $\mathbb{E}[\phi_T(X) - \mu_T]$. In different application settings, it might be useful to control other measures of exploration error, such as the absolute error $\mathbb{E}\,|\phi_T(X) - \mu_T|$ and the squared error $\mathbb{E}[(\phi_T(X) - \mu_T)^2]$.

Here we extend Proposition 3.1 and show how $I(T;\phi(X))$ and $H(T)$ can be used to bound the absolute error and the squared error. Note that, due to inherent noise, even in the absence of selection bias the absolute or squared error can be of order $\sigma$ or $\sigma^2$, respectively. The next result effectively bounds the additional error introduced by data exploration in terms of information usage.

Proposition 3.2.

Suppose that for each $i$, $\phi_i(X)$ is $\sigma$–sub-Gaussian. Then

and

where $c_1$ and $c_2$ are universal constants.

Information usage also lower bounds error.

In the maximum-of-Gaussians example, we have already seen a setting where information usage precisely quantifies bias. Here we show that this is a more general phenomenon by exhibiting a much broader setting in which mutual information lower bounds expected error. This complements the upper bounds of Proposition 3.1 and Proposition 3.2.

Suppose $T = \arg\max_i \phi_i(X)$, where $\phi(X) \sim N(\mu, \sigma^2 I)$ for a mean vector $\mu = (\mu_1,\dots,\mu_m)$. Because $T$ is a deterministic function of $\phi(X)$, the mutual information $I(T;\phi(X))$ is equal to the entropy $H(T)$. The probability $\mathbb{P}(T = i)$ is a complicated function of the mean vector $\mu$, and the entropy $H(T)$ provides a single number measuring the uncertainty in the selection process. Proposition 3.2 upper bounds the average squared distance between $\phi_T(X)$ and $\mu_T$ by entropy. The next proposition provides a matching lower bound, and therefore establishes a fundamental link between information usage and selection risk in a natural family of models.

Proposition 3.3.

Let $T = \arg\max_i \phi_i(X)$, where $\phi(X) \sim N(\mu, \sigma^2 I)$. There exist universal numerical constants $c_1$, $c_2$, $c_3$, and $c_4$ such that for any $\mu$ and $\sigma$,

Recall that the entropy of $T$ is defined as

$$H(T) = \sum_{i=1}^{m} \mathbb{P}(T = i)\,\log\frac{1}{\mathbb{P}(T = i)}.$$

Here $\log\frac{1}{\mathbb{P}(T=i)}$ is often interpreted as the "surprise" associated with the event $T = i$, and entropy is interpreted as the expected surprise in the realization of $T$. Proposition 3.3 relies on a link between the surprise associated with the selection of statistic $i$, and the squared error on events when it is selected.

To understand this result, it is instructive to instead consider a simpler setting: a threshold rule $T_c$ that selects statistic 1 only when $\phi_1(X)$ exceeds a threshold $c$. When $c$ is large, the Gaussian tail probability $\mathbb{P}(\phi_1(X) \ge c)$ is exponentially small in the squared gap between the threshold and the true mean of $\phi_1(X)$, and so the surprise associated with the selection event scales with that squared gap. One can show that as $c \to \infty$, the squared error on the selection event and the surprise $\log\frac{1}{\mathbb{P}(T_c = 1)}$ grow at the same rate, where $T_c$ denotes the selection rule with threshold $c$.

In the Supplement, we investigate additional threshold-based selection policies applied to Gaussian and exponential random variables, allowing for arbitrary correlation among the $\phi_i(X)$'s, and show that $I(T;\phi(X))$ also provides a natural lower bound on estimation error.

4 When is bias large or small? The view from information usage

In this section, we consider several simple but commonly used procedures for feature selection and parameter estimation. In many applications, such feature selection and estimation are performed on the same dataset. Information usage provides a unified framework to understand selection bias in these settings. Our results inform when these procedures introduce significant selection bias and when they do not. The key idea is to understand which structures in the data and the selection procedure make the mutual information $I(T;\phi(X))$ significantly smaller than the worst-case value of $\log m$. We provide several simulation experiments as illustrations.

4.1 Filtering by marginal statistics

Imagine that $T$ is chosen after observing some dataset $X$. This dataset determines the values of $\phi(X)$, but may also contain a great deal of other information. Manipulating the mutual information shows

$$I(T;X) \;=\; I(T;\phi(X)) + I(T;X\mid\phi(X)),$$

where $I(T;X\mid\phi(X))$ captures the fraction of the uncertainty in $T$ that is explained by the data in $X$ beyond the values $\phi(X)$. In many cases, instead of being a function of $\phi(X)$, the choice $T$ is a function of data that is more loosely coupled with $\phi(X)$, and therefore we expect that $I(T;\phi(X))$ is much smaller than $I(T;X)$ (which itself can be less than $H(T)$).

One setting where the selection of $T$ depends on statistics of $X$ that are only loosely coupled with $\phi(X)$ is variance-based feature selection [34, 21]. Suppose we have $n$ samples and $m$ biomarkers. Let $X_{ij}$ denote the value of the $j$-th biomarker on sample $i$, so $X = \{X_{ij}\}$. Let $\phi_j(X) = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$ be the empirical mean value of the $j$-th biomarker. We are interested in identifying the markers that show a significant non-zero mean. Many studies first perform a filtering step to select only the markers that have high variance and remove the rest. The rationale is that markers that do not vary could be measurement errors or are likely to be less important. A natural question is whether such variance filtering introduces bias.

In our framework, variance selection is exemplified by the selection rule $T = \arg\max_j \hat{\sigma}_j^2$, where $\hat{\sigma}_j^2$ is the empirical variance of the $j$-th marker. Here we consider the case where only the marker with the largest variance is selected, but all the discussion applies to softer selection where we select the $k$ markers with the largest variance. The resulting bias is $\mathbb{E}[\phi_T(X) - \mu_T]$. Proposition 3.1 states that variance selection has low bias if $I(T;\phi(X))$ is small, which is the case if the empirical means and variances, $\phi_j(X)$ and $\hat{\sigma}_j^2$, are not too dependent. In fact, when the $X_{ij}$ are i.i.d. Gaussian samples, the empirical variances are independent of the empirical means. Therefore $I(T;\phi(X)) = 0$ and we can guarantee that there is no bias from variance selection.
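The following small simulation (hypothetical sizes $n$ and $m$) contrasts variance filtering with rank selection on the means for i.i.d. Gaussian markers: because sample means and sample variances are independent in this case, selecting by variance introduces essentially no bias, while selecting by the mean itself does.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, n_reps = 50, 500, 2000
bias_var_sel, bias_rank_sel = [], []

for _ in range(n_reps):
    X = rng.standard_normal((n, m))              # all true means are zero
    means = X.mean(axis=0)
    variances = X.var(axis=0, ddof=1)
    bias_var_sel.append(means[np.argmax(variances)])   # select the marker by variance
    bias_rank_sel.append(means[np.argmax(means)])       # select the marker by mean

print(f"bias of variance selection: {np.mean(bias_var_sel):+.3f}")  # close to 0
print(f"bias of rank selection:     {np.mean(bias_rank_sel):+.3f}") # roughly sqrt(2*log(m)/n)
```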

This illustrates an important point: the bias bound depends on $I(T;\phi(X))$ rather than on $I(T;X)$. The selection process may depend heavily on the dataset, so $I(T;X)$ could be large. However, as long as the statistics of the data used for selection have low mutual information with the estimates $\phi(X)$, there is low bias in the reported values.

We can apply our framework to analyze biases that arise from feature filtering more generally. A common practice in data analysis is to reduce the multiple-hypothesis-testing burden and increase discovery power by first filtering out covariates or features that are unlikely to be relevant or interesting [5]. This can be viewed as a two-step procedure. For each feature $i$, two marginal statistics are computed from the data: the estimate of interest $\phi_i(X)$ and a filtering statistic $s_i(X)$. Filtering corresponds to a selection protocol based on the $s_i(X)$'s. Since $T$ is then determined by the $s_i(X)$'s, $I(T;\phi(X)) \le I(s_1(X),\dots,s_m(X);\phi(X))$; if the $s_i$'s do not reveal too much information about the $\phi_i$'s, then the filtering step does not create too much bias. In our example above, $s_i$ is the sample variance and $\phi_i$ is the sample mean of feature $i$. General principles for creating independent $s_i$ and $\phi_i$ are given in [5].

4.2 Bias due to data visualization

Data visualization, using clustering for example, is a common technique to explore data, and it can inform subsequent analysis. How much selection bias can be introduced by such visualization? While in principle a visualization could reveal details about every data point, a human analyst typically only extracts certain salient features from plots. For concreteness, we use clustering as an example, and imagine the analyst extracts the number of clusters $K$ from the analysis. In our framework the natural object of study is the information usage $I(K;\phi(X))$, since if the final selection $T$ is a function of $K$, then $I(T;\phi(X)) \le I(K;\phi(X))$ by the data-processing inequality. In general, $K$ is a random variable that can take on values $1$ to $n$ (if each point is assigned its own cluster). When there is structure in the data and the clustering algorithm captures it, then $K$ can be strongly concentrated around a specific number of clusters, and $I(K;\phi(X)) \le H(K)$ is close to zero. In this setting, clustering is informative to the analyst but does not lead to "bad information usage" and therefore does not increase exploration bias.
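A rough sketch of this point is below; it assumes a particular cluster-number criterion (silhouette score with scikit-learn's KMeans), which is our illustrative choice rather than anything prescribed here. When the cluster structure is strong, the chosen number of clusters $K$ is nearly constant across replications, so its empirical entropy, and hence the information usage of this visualization step, is close to zero.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def chosen_k(X, k_range=range(2, 7), seed=0):
    """Pick the number of clusters that maximizes the silhouette score."""
    scores = [silhouette_score(X, KMeans(k, n_init=5, random_state=seed).fit_predict(X))
              for k in k_range]
    return list(k_range)[int(np.argmax(scores))]

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])     # strong 3-cluster signal
ks = []
for rep in range(30):                                         # independent replications
    X = np.vstack([c + rng.standard_normal((40, 2)) for c in centers])
    ks.append(chosen_k(X))

values, counts = np.unique(ks, return_counts=True)
p = counts / counts.sum()
print("chosen K over replications:", dict(zip(values.tolist(), counts.tolist())))
print("empirical entropy H(K) =", float(-(p * np.log(p)).sum()), "nats")  # ~ 0 with strong signal
```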

4.3 Rank selection with signal

Rank selection is the procedure of selecting the $\phi_i(X)$ with the largest value (or the top $k$ $\phi_i(X)$'s with the largest values). It is the simplest selection policy and the one that we are instinctively most likely to use. We have seen previously how rank selection can introduce significant bias. In the biomarker example in Subsection 4.1, suppose there is no signal in the data, so the $X_{ij}$ are i.i.d. standard Gaussians and every true mean is zero. Under rank selection, the reported value $\phi_T(X)$ would have a bias close to the worst-case value $\sqrt{2\log(m)/n}$.

What is the bias of rank selection when there is signal in the data? Our framework cleanly illustrates how signal in the data can reduce rank-selection bias. As before, this insight follows transparently from studying the mutual information $I(T;\phi(X))$. Recall that mutual information is bounded by entropy: $I(T;\phi(X)) \le H(T) \le \log m$. When the data provides a strong signal of which $\phi_i$ to select, the distribution of $T$ is far from uniform, and $H(T)$ is much smaller than its worst-case value of $\log m$.

Consider the following simple example. Assume $\phi_i(X) \sim N(\mu_i, 1)$, where exactly one coordinate has mean $\mu_i = \mu \ge 0$ and the rest have mean zero. The data analyst would like to identify and report the value of the signal coordinate. To do this, she selects $T = \arg\max_i \phi_i(X)$. When $\mu = 0$, there is no true signal in the data and $T$ is equally likely to take on any value in $\{1,\dots,m\}$, so $H(T) = \log m$. As $\mu$ increases, however, $T$ concentrates on the signal coordinate, causing $H(T)$ and the bias to diminish. We simulated this example with $m$ $\phi_i$'s, all but one of which are i.i.d. samples from $N(0,1)$, and one drawn from $N(\mu,1)$ for increasing values of $\mu$. The simulation results, averaged over 1000 independent runs, are shown in Figure 1.

Figure 1: As the signal strength increases ($\mu$ increases), the entropy of the selection decreases, causing the information upper bound to also decrease. The bias of the selection decreases as well.
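A sketch of a simulation in the spirit of Figure 1 is given below, with a hypothetical number of estimators and grid of signal strengths; it tracks how the entropy of the selection, the resulting information bound $\sqrt{2H(T)}$, and the bias all shrink as $\mu$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_reps = 50, 5000

for mu in [0.0, 1.0, 2.0, 4.0]:
    means = np.zeros(m)
    means[0] = mu                                   # one signal coordinate, m - 1 nulls
    phi = means + rng.standard_normal((n_reps, m))  # phi_i ~ N(mu_i, 1)
    T = phi.argmax(axis=1)
    bias = (phi[np.arange(n_reps), T] - means[T]).mean()
    p = np.bincount(T, minlength=m) / n_reps
    H = float(-(p[p > 0] * np.log(p[p > 0])).sum())
    print(f"mu={mu:3.1f}  H(T)={H:4.2f} nats  info bound={np.sqrt(2 * H):4.2f}  bias={bias:4.2f}")
```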

4.4 Information usage along the Least Angle Regression path

We have seen that, both in theory and in practice, information usage tightly bounds the bias of selection by maximization. Here we show that information usage also accurately captures the bias of a more complex selection procedure corresponding to Least Angle Regression (LARS) [15]. LARS is an interesting example for two reasons. First, it is widely used as a practical tool for sparse regression and is closely related to the LASSO. Second, LARS composes a sequence of maximum selections and thus provides a more complex example of selection. In Figure 2, we show the simulation results for LARS under three data settings corresponding to low, medium and high signal-to-noise ratios. We use bootstrapping to empirically estimate the information usage, and since we know the ground truth of the experiment, we can easily compute the bias of LARS. As the signal in the data increases, the information usage of LARS decreases and, consistent with the predictions of our theory, the bias of LARS also decreases. Moreover, as the number of selected features increases, the average (per-feature) information usage of LARS decreases and, consistent with this, the average bias of LARS also decreases monotonically. Details of the experiment are in the Supplementary Information.

4.5 Differentially private algorithms

Recent papers [12, 14] have shown that techniques from differential privacy, which were initially inspired by the need to protect the security and privacy of datasets, can be used to develop adaptive data analysis algorithms with provable bounds on overfitting. These differentially private algorithms satisfy worst-case bounds on certain likelihood ratios, and are guaranteed to have low information usage. On the other hand, many algorithms have low information usage without being differentially private. Moreover, as we have seen, the exploration bias of an algorithm could be large or small depending on the particular dataset (e.g. the signal-to-noise ratio of the data), and information usage captures this. Differentially private algorithms must have low information usage for all datasets, even against an analyst designed adversarially to exploit the dataset, so this is a much stricter condition. In [14], the authors also define and study a notion of max-information, which can be viewed as a worst-case analogue of mutual information. We discuss the relationship between these measures further in the Supplementary Information.

Figure 2: Information bound (dotted lines) and bias of Least Angle Regression (solid lines). Results are shown for low (red), medium (blue) and high (green) signal-to-noise settings. The $x$-axis indicates the number of features selected by LARS and the $y$-axis corresponds to the average information usage and bias in the selected features.

4.6 Information usage and classification overfitting

This section applies our framework to the problem of overfitting in classification. A classifier is trained on a dataset consisting of $n$ examples, with input features $x_1,\dots,x_n$ and corresponding labels $y_1,\dots,y_n$. We consider here a setting where the features of the training examples are fixed, and study overfitting of the noisy labels. Each label $y_j$ is drawn independently of the other labels from an unknown distribution. A classifier $f$ associates a label $f(x)$ with each input $x$. The training error of a fixed classifier $f$ is

$$\widehat{\mathrm{err}}(f) = \frac{1}{n}\sum_{j=1}^{n} \mathbb{1}\{f(x_j) \ne y_j\},$$

while its true error rate

$$\mathrm{err}(f) = \frac{1}{n}\sum_{j=1}^{n} \mathbb{P}\bigl(f(x_j) \ne y_j\bigr)$$

is the expected fraction of examples it mis-classifies on a random draw of the labels $y_1,\dots,y_n$. The process of training a classifier corresponds to selecting, as a function of the observed data, a particular classification rule $\hat{f}$ from a large family of possible rules $\mathcal{F}$. Such a procedure may overfit the training data, causing the average training error $\mathbb{E}[\widehat{\mathrm{err}}(\hat{f})]$ to be much smaller than its true error rate $\mathbb{E}[\mathrm{err}(\hat{f})]$.

As an example, suppose each $x_j$ is a $d$-dimensional feature vector, and $\mathcal{F}$ consists of all linear classifiers of the form $f_\theta(x) = \mathbb{1}\{\theta^\top x \ge 0\}$. A training algorithm might choose the parameter vector $\theta$ that minimizes the number of mis-classifications on the training set. This procedure tends to overfit the noise in the training data, and as a result the average training error of the fitted classifier can be much smaller than its true error rate. The risk of overfitting tends to increase with the dimension $d$, since higher dimensional models allow the algorithm to fit more complicated, but spurious, patterns in the training set.
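The following toy experiment illustrates this effect; it uses essentially unregularized logistic regression as a convenient stand-in for minimizing training misclassifications (an assumption of this sketch, not the procedure described above), with pure-noise labels so that the true error rate of every classifier is 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100
for d in [2, 10, 50, 90]:
    X = rng.standard_normal((n, d))          # fixed features
    y = rng.integers(0, 2, size=n)           # pure-noise labels: err(f) = 0.5 for every f
    clf = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)   # C large ~ no regularization
    train_err = (clf.predict(X) != y).mean()
    print(f"d={d:3d}  training error = {train_err:.2f}   true error = 0.50")
```

As the dimension grows toward the sample size, the training error collapses toward zero even though no classifier can do better than chance on fresh labels.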

The field of statistical learning provides numerous bounds on the magnitude of overfitting based on more general notions of the complexity of an arbitrary function class $\mathcal{F}$, with the most influential being the Vapnik-Chervonenkis dimension, or VC-dimension. (The VC-dimension of $\mathcal{F}$ is the size of the largest set it shatters. A set of points is shattered by $\mathcal{F}$ if, for any choice of labels on those points, there is some $f \in \mathcal{F}$ that realizes those labels.)

The next proposition first provides an information-usage bound on the degree of overfitting, and then shows that the mutual information is upper-bounded by the VC-dimension of $\mathcal{F}$. Therefore, information usage is always constrained by function-class complexity.

Proposition 4.1.

Let , , and . Then,

If $\mathcal{F}$ has VC-dimension $d$, then

The proof of the information usage bound follows by an easy reduction to Proposition 3.1. The proof of the second claim relies on a known link between VC-dimension and a notion of the log-covering numbers of the function-class.

It is worth highlighting that, because VC-dimension depends only on the class of functions $\mathcal{F}$, bounds based on this measure cannot shed light on which types of data-generating distributions and fitting procedures allow for effective generalization. Information usage depends on both, and as a result could be much smaller than the VC-dimension; for example, this occurs when some classifiers in $\mathcal{F}$ are much more likely to be selected after training than others. This can occur naturally due to properties of the training procedure, like regularization, or properties of the data-generating distribution.

5 Limiting information usage and bias via randomization

We have seen how information usage provides a unified framework to investigate the magnitude of exploration bias across different analysis procedures and datasets. It also suggests that methods which reduce the mutual information between $T$ and $\phi(X)$ can reduce bias. In this section, we explore simple procedures that leverage randomization to reduce information usage, and hence bias, while still preserving the utility of the data analysis.

We first revisit the rank-selection policy considered in the previous section, and derive a variant of this scheme that uses randomization to limit information usage. We then consider a model of a human data analyst who interacts sequentially with the data. We use a stylized model to show that, even if the analyst's procedure is unknown or difficult to describe, adding noise during the data-exploration process can provably limit the bias incurred. Many authors have investigated adding noise as a technique to reduce selection bias in specialized settings [12, 10]. The main goal of this section is to illustrate how the effects of adding noise are transparent through the lens of information usage.

5.1 Regularization via randomized selection

Subsection 4.3 illustrates how signal in the data intrinsically reduces the bias of rank selection by reducing the entropy term $H(T)$ in $I(T;\phi(X)) = H(T) - H(T\mid\phi(X))$. A complementary approach to reduce bias is to increase the conditional entropy $H(T\mid\phi(X))$ by adding randomization to the selection policy $T$. It is easy to maximize conditional entropy by choosing $T$ uniformly at random from $\{1,\dots,m\}$, independently of $\phi(X)$. Imagine, however, that we want to not only ensure that conditional entropy is large, but also to choose $T$ such that the selected value $\phi_T(X)$ is large. After observing $\phi(X)$, it is natural then to choose the conditional distribution of $T$ by maximizing the conditional entropy subject to a constraint that the expected selected value is large.

The solution to this problem is the maximum entropy or "Gibbs" distribution, which sets

$$\mathbb{P}\bigl(T = i \mid \phi(X)\bigr) \;=\; \frac{e^{\gamma\,\phi_i(X)}}{\sum_{j=1}^{m} e^{\gamma\,\phi_j(X)}} \qquad (1)$$

for an inverse temperature $\gamma \ge 0$ chosen so that the constraint is satisfied. This procedure effectively adds stability, or a kind of regularization, to the selection strategy by adding randomization. Whereas tiny perturbations to $\phi(X)$ may change the identity of $\arg\max_i \phi_i(X)$, the distribution (1) is relatively insensitive to small changes in $\phi(X)$. Note that the strategy (1) is one of the most widely studied algorithms in the field of online learning [9], where it is often called exponential weights. It is also known as the exponential mechanism in differential privacy. In our framework it is transparent how it reduces bias.
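A minimal sketch of the selection rule (1) is shown below; the helper name and parameter values are hypothetical. With all-null statistics, the exponential-weights report is noticeably smaller (less upwardly biased) than the plain maximum.

```python
import numpy as np

def gibbs_select(phi, gamma, rng):
    """Sample T with P(T = i) proportional to exp(gamma * phi[i]), as in equation (1)."""
    logits = gamma * (phi - phi.max())        # subtract the max for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    t = rng.choice(len(phi), p=p)
    return t, phi[t]

rng = np.random.default_rng(0)
phi = rng.standard_normal(1000)               # all-null statistics: every true mean is 0
argmax_report = phi.max()
gibbs_report = np.mean([gibbs_select(phi, gamma=2.0, rng=rng)[1] for _ in range(200)])
print(f"argmax report: {argmax_report:.2f}   randomized report (gamma=2): {gibbs_report:.2f}")
```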

To illustrate the effect of randomized selection, we use simulations to explore the tradeoff between bias and accuracy. We consider the following simple, max-entropy randomization scheme:

  • Take as input parameters $\gamma$ and $k$, and the observations $\phi_1(X),\dots,\phi_m(X)$. Here $\gamma$ is the inverse temperature in the Gibbs distribution and $k$ is the number of $\phi_i$'s we need to select.

  • Sample without replacement $k$ indices $T_1,\dots,T_k$ from the distribution given in (1). Report the corresponding values $\phi_{T_1}(X),\dots,\phi_{T_k}(X)$.

We consider settings where we have two groups of $\phi_i$'s: after relabeling, assume that the first group has true mean $\mu_i = \mu > 0$ and the rest have $\mu_i = 0$. We define the bias of the selection to be $\frac{1}{k}\sum_{j=1}^{k}\mathbb{E}[\phi_{T_j}(X) - \mu_{T_j}]$ and the accuracy of the selection to be the fraction of reported $\phi_{T_j}$'s with true signal $\mu_{T_j} = \mu$. In Figure 3, we illustrate the tradeoff between accuracy and bias in a setting where there are many more false signals than true signals, with a fixed randomization strength $\gamma$ and the signal strength $\mu$ varying from 1 to 5. Consistent with the theoretical analysis, max-entropy selection significantly decreased bias. In the low-signal regime, both rank selection and max-entropy selection have low accuracy because the signal is overwhelmed by the large number of false positives. In the high-signal regime, both selection methods have accuracy close to one and max-entropy selection has significantly less bias. In the intermediate regime, max-entropy selection has substantially less bias but is less accurate than rank selection.

Figure 3: Tradeoff between accuracy and bias as the signal strength increases. The two curves illustrate the tradeoff for the maximum selection (i.e. reporting the $k$ largest values of $\phi_i(X)$) and the max-entropy randomized selection procedures.

5.2 Randomization for a multi-step analyst

We next study how randomization can decrease information usage and bias even when we have very little knowledge of what the analyst is doing. To illustrate this idea, we analyze in detail a simple example of a very flexible data analyst who performs multiple steps of analysis. Flexibility in multi-step data analysis presents a challenge to current statistical approaches for quantifying selection bias. Recent developments in post-selection inference have focused on settings where the selection rule is simple and analytically tractable, and the full analysis procedure is fixed and specified before any data analysis is performed. While powerful results can be derived in this framework, including exact bias corrections and valid post-selection confidence intervals [16, 29], these methods do not apply to exploratory analysis where the procedure can be quite flexible.

In this section, we show how our mutual information framework can be used to analyze bias for a flexible multi-step analyst. We show that even if one does not know, or cannot fully describe, the selection procedure $T$, one can control its bias by controlling the information it uses. The main idea is to inject a small amount of randomization at each step of the analysis. This randomization is guaranteed to keep the bad information usage low no matter what the analyst does.

The idea of adding randomization during data analysis to reduce overfitting has been implemented as practical rule-of-thumb in several communities. Particle physicists, for example, have advocated blind data analysis: when deciding which results to report, the analyst interacts with a dataset that has been obfuscated through various means, such as adding noise to observations, removing some data points, or switching data-labels. The raw, uncorrupted, dataset is only used in computing the final reported values [24]. Adding noise is also closely related to a recent line of work inspired by differential privacy [4, 13, 14, 19].

A model of flexible, multi-step analyst.

We consider a model of adaptive data analysis similar to that of [14, 13]. In this setting, the analyst learns about the data by running a series of analyses on the dataset. Each analysis is modeled by a function of the data, and the choice of which analysis to run may depend on the results from all the earlier analyses. More formally, we define the model as follows:

  1. At step 1, the analyst selects a statistic $\phi_{i_1}$ to query and observes a result $R_1$.

  2. In the $t$-th iteration, the analyst chooses a statistic $\phi_{i_t}$ as a function of the results that she has received so far, $(R_1,\dots,R_{t-1})$, and receives the result $R_t$.

  3. After $k$ iterations, the analyst selects $T$ as a function of $(R_1,\dots,R_k)$.

The simplest setting is when the result of the analysis is just the value of the queried statistic on the data: $R_t = \phi_{i_t}(X)$. An example of this is the rank selection considered before. At the $t$-th step, $\phi_t$ is queried (i.e. the order is fixed and does not depend on the previous results) and $R_t = \phi_t(X)$ is returned. The analyst queries all $m$ $\phi_i$'s and returns the one with maximal value.

In general, we allow the analysis output to differ from the empirical value of the test, and a particularly useful form is $R_t = \phi_{i_t}(X) + W_t$ for some added noise $W_t$. This captures blind analysis settings, where the analyst intentionally adds noise throughout the data analysis in order to reduce overfitting. A natural goal is to ensure that, for every query used in the adaptive analysis, the reported result $R_t$ is close to the true value $\mu_{i_t}$. We will show through analyzing the information usage that noise addition can indeed guarantee such accuracy.

This adaptive analysis protocol can be viewed as a Markov chain

$$\phi(X) \;\longrightarrow\; (R_1,\dots,R_k) \;\longrightarrow\; T.$$

By the information processing inequality [11], $I(T;\phi(X)) \le I(R_1,\dots,R_k;\phi(X))$. Therefore, a procedure that controls the mutual information between the history of feedback and the statistics $\phi(X)$ will automatically control the mutual information $I(T;\phi(X))$. By exploiting the structure of the adaptive analysis model, we can decompose the cumulative mutual information $I(R_1,\dots,R_k;\phi(X))$ into a sum of per-step terms. This is formalized in the following composition lemma for mutual information.

Lemma 1.

Let $H_t = (R_1,\dots,R_t)$ denote the history of interaction up to time $t$. Then, under the adaptive analysis model,

$$I(R_1,\dots,R_k;\,\phi(X)) \;\le\; \sum_{t=1}^{k} I\bigl(R_t;\,\phi_{i_t}(X) \mid H_{t-1}\bigr).$$

The important takeaway from this lemma is that by bounding the conditional mutual information between the response and the queried value at each step, $I(R_t;\phi_{i_t}(X)\mid H_{t-1})$, we can bound $I(T;\phi(X))$ and hence bound the bias after $k$ rounds of adaptive queries. Given a dataset, we can imagine the analyst having a (mutual) information budget $B$, which is decided a priori based on the size of the data and her tolerance for bias. At each step of the adaptive data analysis, the analyst's choice of which statistic to query next (as a function of her analysis history) incurs an information cost quantified by $I(R_t;\phi_{i_t}(X)\mid H_{t-1})$. The information costs accumulate additively over the analysis steps until the total reaches $B$, at which point the guarantee on bias requires the analysis to stop.
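A toy sketch of this budget accounting is given below. It assumes the Gaussian noise protocol analyzed later in this section, so that each query's cost can be charged via Lemma 2 as $\frac{1}{2}\log(1+\mathrm{SNR})$; the budget value and the variances are hypothetical.

```python
import numpy as np

def queries_within_budget(B, signal_var, noise_var):
    """Count how many noisy Gaussian queries fit inside an information budget B (in nats)."""
    per_query_cost = 0.5 * np.log(1.0 + signal_var / noise_var)   # Lemma 2, Gaussian case
    used, answered = 0.0, 0
    while used + per_query_cost <= B:        # accumulate costs until the budget is exhausted
        used += per_query_cost
        answered += 1
    return answered, used

n = 10_000                                   # sample size; each phi_i has variance sigma^2 / n
answered, used = queries_within_budget(B=1.0, signal_var=1.0 / n, noise_var=10.0 / n)
print(f"queries answered: {answered}, budget used: {used:.3f} nats")
```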

A trivial way to reduce mutual information is to return a response $R_t$ that is independent of the query, in which case the analyst learns nothing about the data and incurs no bias. However, in order for the data to be useful to the analyst, we would like the results of the queries to also be accurate.

Adding randomization to reduce bias.

As before, let $\mu_{i_t}$ denote the true answer to query $t$. If each $\phi_i(X)$ is $\sigma$–sub-Gaussian, then so is each queried statistic. Using Proposition 3.2, we can bound the average excess error of the response, $\mathbb{E}\,|R_t - \mu_{i_t}|$, by the sum of two terms: a distortion term and an information-usage term.

Response accuracy degrades with distortion, a measure of the magnitude of the noise added to responses, but this distortion also controls the degree of selection bias in future rounds. We will explicitly analyze the tradeoff between these terms in a stylized case of the general model.

Gaussian noise protocol.

We analyze the following special case.

  1. Suppose $\phi_i(X) \sim N(\mu_i, \sigma^2/n)$ for each $i$, and that the $\phi_i(X)$'s are jointly Gaussian.

  2. For the $t$-th query $\phi_{i_t}$, the protocol returns a distorted response $R_t = \phi_{i_t}(X) + W_t$, where $W_t \sim N(0, \sigma_t^2)$. Note that unlike the $\phi_{i_t}(X)$'s, the noise sequence $(W_1,\dots,W_k)$ is independent.

The term $n$ can be thought of as the number of samples in the dataset. Indeed, if $\phi_i(X)$ is the empirical average of $n$ samples from a distribution with variance $\sigma^2$, then $\mathrm{Var}(\phi_i(X)) = \sigma^2/n$. The ratio $(\sigma^2/n)/\sigma_t^2$ is the signal-to-noise ratio of the $t$-th response. We want to choose the distortion levels $\sigma_t$ so as to guarantee that a large number of queries can be answered accurately. In order to do this, we will use the next lemma to relate the distortion levels to the information provided by a response. The lemma gives a form for the mutual information $I(\phi;\phi+W)$ where $\phi$ and $W$ are independent Gaussian random variables. As one would expect, this shows that the mutual information is very small when the variance of $W$ is much larger than the variance of $\phi$. Lemma 3, provided in the Supplementary Information, provides a similar result when $\phi$ is a general (not necessarily Gaussian) random variable.

Lemma 2.

If $\phi \sim N(\mu, \sigma_\phi^2)$ and $R = \phi + W$, where $W \sim N(0, \sigma_W^2)$ is independent of $\phi$, then

$$I(\phi; R) \;=\; \tfrac{1}{2}\log\bigl(1 + \mathrm{SNR}\bigr),$$

where $\mathrm{SNR} = \sigma_\phi^2 / \sigma_W^2$ is the signal-to-noise ratio.

Using Lemma 2, we provide an explicit bound on the accuracy of the responses as a function of the number of queries, the sample size, and the distortion levels. Note that this result places no restriction on the procedure that generates the queries, except that the choice of the $t$-th query can depend on the data only through the responses available at time $t$.

Proposition 5.1.

Suppose $\phi_i(X) \sim N(\mu_i, \sigma^2/n)$ for each $i$, and that the $\phi_i(X)$'s are jointly Gaussian. If for the $t$-th query the protocol returns $R_t = \phi_{i_t}(X) + W_t$, where $W_t \sim N(0, \sigma_t^2)$ is independent of the data, then for every query $t$,

where $C$ denotes a universal numerical constant.

If the sequence of queries were non-adaptive, simply returning responses without any noise ($R_t = \phi_{i_t}(X)$) would guarantee accuracy of order $\sigma/\sqrt{n}$ for every response. In the adaptive model, the first few queries are still answered with accuracy of order $\sigma/\sqrt{n}$, but the error increases for the later queries. This illustrates the fundamental tension that the longer the analyst explores the data, the more likely the later analyses are to overfit.

This growth factor can roughly be viewed as the worst-case price of adaptivity. It is worth emphasizing that this price would be more severe if the system returned responses without any noise: when no noise is added, the error of later queries can be far larger, as is demonstrated in Example 1 in the Supplementary Information. Therefore, adding noise offers a fundamental improvement in attainable performance.

A similar insight was attained by [12], who noted that by adding Laplacian noise it is possible to answer up to queries accurately, whereas without noise accuracy degrades after queries. In the Gaussian case, it’s clear from our bound that as , all queries will be answered accurately as long as .
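The sketch below illustrates this improvement with a simple hypothetical adaptive strategy (not one from the text): the analyst queries $k$ null sample means, aggregates the signs of the exact or noised responses into one final query, and that final query is strongly biased when responses are exact but far less so when Gaussian noise is added.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, n_reps = 100, 200, 200   # n samples, k adaptive queries, n_reps repetitions

def run(noise_sd):
    """Average true value of the analyst's final adaptively chosen statistic."""
    final_vals = []
    for _ in range(n_reps):
        X = rng.standard_normal((n, k))                    # k null features, true means are 0
        responses = X.mean(axis=0) + noise_sd * rng.standard_normal(k)
        w = np.sign(responses)                             # adaptive choice built from responses
        phi_final = (X @ w).mean() / np.sqrt(k)            # final query; its true value is 0
        final_vals.append(phi_final)
    return np.mean(final_vals)

print(f"bias of final query, exact responses: {run(0.0):.3f}")                 # clearly > 0
print(f"bias of final query, noisy responses: {run(5.0 / np.sqrt(n)):.3f}")    # much smaller
```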

6 Discussion

We have introduced a general information usage approach to quantify the bias that arises from data exploration. While we focus on bias, we show that our mutual-information-based metric can be used to bound other error metrics of interest, such as the average absolute error $\mathbb{E}\,|\phi_T(X) - \mu_T|$. It is interesting to note that the same information usage also naturally appears in the lower bound on error, suggesting it may be fundamentally linked to exploration bias. This paper established lower bounds when the selection process corresponds to solving optimization problems, i.e. $T = \arg\max_i \phi_i(X)$. An interesting direction of research is to understand more general exploration procedures in which information usage provides a tight approximation to bias.

One advantage of using mutual information to bound bias is that we have many tools to analyze and compute mutual information. This conceptual framework allows us to extract insight into settings where common data analysis procedures lead to severe bias and where they do not. In particular, we show how signal in the data can reduce selection bias. Information usage also suggests engineering approaches to reduce mutual information (and hence bias) by adding randomization to each step of the data exploration. Another important project is to investigate implementations of such randomization approaches in practical analytic settings.

As discussed before, the information usage framework proposed here is very much complementary to the exciting developments in post-selection inference and differential privacy. Post-selection inference, for very specific settings, is able to exactly characterize and correct for exploration biases—in this case exploration is feature and model selection. Differential privacy lies at the other extreme in that it derives powerful but potentially conservative results that apply to an adversarial data-analyst. The modern practice of data science often lies in between these two extremes—the analyst has more flexibility than assumed in post-selection inference, but is also interested in finding true signals and hence is much less adversarial than the worst-case. Information usage provides a bound on exploration bias in all settings. It is also important that this bound is data-dependent. In practice, the same analyst may be much less prone to false discoveries when exploring a high-signal dataset versus a low-signal dataset, and this should be reflected in the bias metric. An interesting goal is to develop approaches that combine the sharpness of post-selection inference and differential privacy with the generality of information usage.

References

  • Belloni et al. [2014] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. Review of economic studies, 81(287):608–650, 2014.
  • Benjamini and Hochberg [1995] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.
  • Benjamini and Yekutieli [2001] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165–1188, 2001.
  • Blum and Hardt [2015] Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. In ICML 2015, 2015.
  • Bourgon et al. [2010] Richard Bourgon, Robert Gentleman, and Wolfgang Huber. Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences, 107(21):9546–9551, 2010.
  • Bousquet and Elisseeff [2002] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
  • Bousquet et al. [2004] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced lectures on machine learning, pages 169–207. Springer, 2004.
  • Buldygin and Moskvichova [2013] V Buldygin and K Moskvichova. The sub-gaussian norm of a binary random variable. Theory of Probability and Mathematical Statistics, 86:33–49, 2013.
  • Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
  • Chaudhuri et al. [2011] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
  • Cover and Thomas [2012] T.M. Cover and J.A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • Dwork et al. [2014] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In STOC 2015. ACM, 2014.
  • Dwork et al. [2015a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015a.
  • Dwork et al. [2015b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015b.
  • Efron et al. [2004] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
  • Fithian et al. [2014] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
  • Gelman et al. [2014] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis, volume 2. Taylor & Francis, 2014.
  • Gray [2011] R.M. Gray. Entropy and information theory. Springer, 2011.
  • Hardt and Ullman [2014] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 454–463. IEEE, 2014.
  • Javanmard and Montanari [2014] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
  • Lee et al. [2011] I Lee, G Lushington, and M Visvanathan. A filter-based feature selection approach for identifying potential biomarkers for lung cancer. Journal of Clinical Bioinformatics, 2011.
  • Lee et al. [2016] Jason D Lee, Dennis L Sun, Yuekai Sun, Jonathan E Taylor, et al. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
  • Lockhart et al. [2014] Richard Lockhart, Jonathan Taylor, Ryan J Tibshirani, and Robert Tibshirani. A significance test for the lasso. Annals of statistics, 42(2):413, 2014.
  • MacCoun and Perlmutter [2015] Robert MacCoun and Saul Perlmutter. Blind analysis: Hide results to seek the truth. Nature, 526(7572):187–189, 2015.
  • McAllester [2013] David McAllester. A pac-bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.
  • Poggio et al. [2004] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.
  • Shalev-Shwartz et al. [2010] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.
  • Simmons et al. [2011] Joseph P Simmons, Leif D Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological science, page 0956797611417632, 2011.
  • Taylor and Tibshirani [2015] Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
  • Taylor et al. [2014] Jonathan Taylor, Richard Lockhart, Ryan J Tibshirani, and Robert Tibshirani. Exact post-selection inference for forward stepwise and least angle regression. arXiv preprint arXiv:1401.3889, 2014.
  • Van de Geer et al. [2014] Sara Van de Geer, Peter Buhlmann, Yaacov Ritov, Ruben Dezeure, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
  • Wainwright [2015] Martin Wainwright. Basic tail and concentration bounds. 2015.
  • Wu et al. [2016] Siqi Wu, Antony Joseph, Ann S Hammonds, Susan E Celniker, Bin Yu, and Erwin Frise. Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proceedings of the National Academy of Sciences, page 201521171, 2016.
  • Zou et al. [2014] James Zou, Christoph Lippert, David Heckerman, Martin Aryee, and Jennifer Listgarten. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, pages 309–11, 2014.

Appendix A Proofs of Information Usage Upper Bounds

A.1 Information Usage Upper Bounds Bias: Proof of Proposition 3.1

The proof of Proposition 3.1 relies on the following variational form of the Kullback–Leibler divergence, which is given in Theorem 5.2.1 of Robert Gray's textbook Entropy and Information Theory [18].

Fact 1.

Fix two probability measures $P$ and $Q$ defined on a common measurable space. Suppose that $P$ is absolutely continuous with respect to $Q$. Then

$$D_{\mathrm{KL}}(P\,\|\,Q) \;=\; \sup_{Z}\;\Bigl\{\mathbb{E}_P[Z] - \log\mathbb{E}_Q\bigl[e^{Z}\bigr]\Bigr\},$$

where the supremum is taken over all random variables $Z$ such that the expectation of $Z$ under $P$ is well defined, and $e^{Z}$ is integrable under $Q$.

Proof of Proposition 3.1.

Applying Fact 1 with ,   , and , we have

where . Taking the derivative with respect to , we find that the optimizer is . This gives

By the tower property of conditional expectation and Jensen’s inequality

Remark. In the first step of the proof of Proposition 3.1, we used an inequality, holding for all $i$, that follows from the information processing inequality. The application of this inequality is not tight in general and can lead to gaps between the actual bias and our upper bound based on $I(T;\phi(X))$. Consider the following scenario. Suppose $T$ is a deterministic function of $\phi_1(X)$ alone, i.e. $T$ uses the realized value of $\phi_1(X)$ to decide which other $\phi_i$ to select: for example, $T = 2$ if $\phi_1(X)$ falls in one interval, $T = 3$ if it falls in another, and so on. Here $T$ is deterministic given the data, and this is manifested in a positive information usage $I(T;\phi(X))$. However, if $\phi_1(X)$ is independent of each other $\phi_i(X)$, then the selection is independent of the selected estimate and the bias is also 0. The upper bound of Proposition 3.1 is tight in other settings; it is also useful in general because the mutual information is amenable to analysis and explicit calculation. In cases where there is a gap, we may study the bias directly.

A.2 Extension to Unequal Variances

We can prove a generalization of Proposition 3.1 for settings when the estimates have unequal variances.

Proposition A.1.

Suppose that for each $i$, $\phi_i(X)$ is $\sigma_i$–sub-Gaussian. Then,