How much does your data exploration overfit? Controlling bias via information usage

11/16/2015
by   Daniel Russo, et al.

Modern data is messy and high-dimensional, and it is often not clear a priori what the right questions to ask are. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, where the choice of the next analysis depends on the results of previous analyses of the same data. Ultimately, which results are reported can be heavily influenced by the data. It is widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while adaptivity renders standard statistical theory invalid, experience suggests that different types of exploratory analysis can lead to disparate levels of bias, and that the degree of bias also depends on the particulars of the data set. In this paper, we propose a general information usage framework to quantify and provably bound the bias and other error metrics of an arbitrary exploratory analysis. We prove that our mutual-information-based bound is tight in natural settings, and then use it to give rigorous insights into when commonly used procedures do or do not lead to substantially biased estimation. Through the lens of information usage, we analyze the bias of specific exploration procedures such as filtering, rank selection and clustering. Our general framework also naturally motivates randomization techniques that provably reduce exploration bias while preserving the utility of the data analysis. We discuss the connections between our approach and related ideas from differential privacy and blinded data analysis, and supplement our results with illustrative simulations.
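To make the rank-selection case concrete, the following is a minimal simulation sketch, not the authors' code. It assumes m candidate estimates whose true means are all zero, each a sample mean of n unit-variance Gaussian observations, and an analyst who reports only the largest estimate. The function names `rank_selection_bias` and `entropy_ceiling` are hypothetical. The ceiling σ·sqrt(2 log m) is the standard bound on the expected maximum of m Gaussian estimates with standard deviation σ; it is shown only as a plausible point of comparison for an information-based bound (the selected index carries at most log m nats), not as the paper's stated result.

```python
import math
import random


def rank_selection_bias(m, n, trials, seed=0):
    """Average of (reported estimate - true mean) when the analyst
    reports the largest of m sample means. All true means are 0."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(n)  # std. dev. of a mean of n unit-variance draws
    total = 0.0
    for _ in range(trials):
        # Draw each sample mean directly from N(0, 1/n) instead of
        # simulating the n underlying observations.
        estimates = [rng.gauss(0.0, sigma) for _ in range(m)]
        total += max(estimates)  # data-dependent selection step
    return total / trials


def entropy_ceiling(m, n):
    """sigma * sqrt(2 * log m): standard bound on the expected maximum
    of m Gaussian estimates with std. dev. sigma = 1 / sqrt(n)."""
    sigma = 1.0 / math.sqrt(n)
    return sigma * math.sqrt(2.0 * math.log(m))


for m in (1, 10, 100):
    bias = rank_selection_bias(m, n=25, trials=2000)
    print(f"m={m:3d}  empirical bias={bias:.3f}  ceiling={entropy_ceiling(m, 25):.3f}")
```

With m = 1 there is no selection, so the average bias should be close to zero. As m grows, the reported estimate drifts upward even though every true mean is zero, which is the kind of exploration bias the abstract describes.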
