Generalization in the Face of Adaptivity: A Bayesian Perspective

by Moshe Shenfeld, et al.

Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting, wherein the issued queries yield answers on the sample that differ wildly from the values of those queries on the underlying data distribution. Differential privacy provides a tool to ensure generalization despite adaptively-chosen queries, but its worst-case nature means that it cannot, for example, yield improved results for low-variance queries. In this paper, we give a simple new characterization that illuminates the core problem of adaptive data analysis. We show explicitly that the harms of adaptivity come from the covariance between the behavior of future queries and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries. We leverage this intuition to introduce a new stability notion; we then use it to prove new generalization results for the most basic noise-addition mechanisms (Laplace and Gaussian noise addition), with guarantees that scale with the variance of the queries rather than the square of their range. Our characterization opens the door to new insights and new algorithms for the fundamental problem of achieving generalization in adaptive data analysis.
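To make the overfitting problem and the noise-addition remedy concrete, here is a minimal sketch in Python. It is not the paper's analysis or algorithm: the function names (`noisy_answer`, `adaptive_gap`), the toy analyst, and all parameter values are assumptions chosen for illustration. The analyst asks for the sample mean of each of `d` coordinates, then builds one final query from the signs of the answers it received. Every coordinate has true mean 0, so the final query's true value is 0 and its value on the sample is exactly the generalization gap.

```python
import numpy as np


def noisy_answer(sample, query, sigma, rng, noise="gaussian"):
    """Answer a statistical query on the sample, perturbed by zero-mean noise.

    `sigma` is the standard deviation of the added noise (an illustrative
    parameter, not a calibration taken from the paper).
    """
    exact = float(np.mean(query(sample)))
    if noise == "gaussian":
        return exact + rng.normal(0.0, sigma)
    if noise == "laplace":
        # A Laplace variable with scale b has variance 2 * b**2,
        # so scale = sigma / sqrt(2) gives standard deviation sigma.
        return exact + rng.laplace(0.0, sigma / np.sqrt(2.0))
    raise ValueError(f"unknown noise type: {noise}")


def adaptive_gap(n=200, d=500, sigma=0.0, seed=0, noise="gaussian"):
    """Generalization gap of a hypothetical adaptive analyst.

    Each coordinate is uniform on {-1, +1}, so every query below has
    true (distributional) value 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=(n, d))

    # Round 1: one query per coordinate, answered through the mechanism.
    answers = np.array([
        noisy_answer(x, lambda s, j=j: s[:, j], sigma, rng, noise)
        for j in range(d)
    ])

    # Round 2: the analyst adapts, aligning each coordinate with the sign
    # of the answer it saw. The true value of this query is still 0.
    signs = np.sign(answers)

    def final_query(s):
        return s @ signs / d

    # Sample value minus true value (0) is the generalization gap.
    return float(np.mean(final_query(x)))


if __name__ == "__main__":
    print("gap with exact answers:   ", adaptive_gap(sigma=0.0))
    print("gap with Gaussian noise:  ", adaptive_gap(sigma=1.0))
    print("gap with Laplace noise:   ", adaptive_gap(sigma=1.0, noise="laplace"))
```

With exact answers the final query inherits the sampling error of every earlier query, so its gap is far larger than a single non-adaptive query of the same range would show. Adding noise weakens the link between the responses and the particular sample, which shrinks the gap at the cost of less accurate individual answers. The noise level here is fixed by hand; how to scale it with the variance of the queries rather than their range is the subject of the paper's results and is not reproduced in this sketch.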



Calibrating Noise to Variance in Adaptive Data Analysis

Datasets are often used multiple times and each successive analysis may ...

Generalization for Adaptively-chosen Estimators via Stable Median

Datasets are often reused to perform multiple statistical analyses in an...

Linear Queries Estimation with Local Differential Privacy

We study the problem of estimating a set of d linear queries with respec...

A necessary and sufficient stability notion for adaptive generalization

We introduce a new notion of the stability of computations, which holds ...

Natural Analysts in Adaptive Data Analysis

Adaptive data analysis is frequently criticized for its pessimistic gene...

The advantages of multiple classes for reducing overfitting from test set reuse

Excessive reuse of holdout data can lead to overfitting. However, there ...

Observations on the Bias of Nonnegative Mechanisms for Differential Privacy

We study two methods for differentially private analysis of bounded data...