How bad is worst-case data if you know where it comes from?

11/09/2019
by Justin Y. Chen, et al.

We introduce a framework for studying how distributional assumptions on the process by which data is partitioned into a training and test set can be leveraged to provide accurate estimation or learning algorithms, even for worst-case datasets. We consider a setting of n datapoints, x_1,...,x_n, together with a specified distribution, P, over partitions of these datapoints into a training set, test set, and irrelevant set. An algorithm takes as input a description of P (or sample access), the indices of the test and training sets, and the datapoints in the training set, and returns a model or estimate that will be evaluated on the datapoints in the test set. We evaluate an algorithm in terms of its worst-case expected performance: the expected performance over potential test/training sets, for worst-case datapoints, x_1,...,x_n. This framework is a departure from more typical distributional assumptions on the datapoints (e.g. that data is drawn independently, or according to an exchangeable process), and can model a number of natural data collection processes, including processes with dependencies such as "snowball sampling" and "chain sampling", and settings where test and training sets satisfy chronological constraints (e.g. the test instances were observed after the training instances). Within this framework, we consider the setting where datapoints are bounded real numbers, and the goal is to estimate the mean of the test set. We give an efficient algorithm that returns a weighted combination of the training set—whose weights depend on the distribution, P, and on the training and test set indices—and show that the worst-case expected error achieved by this algorithm is at most a multiplicative π/2 factor worse than the optimal of such algorithms. The algorithm, and its proof, leverage a surprising connection to the Grothendieck problem.
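To make the setup concrete, here is a minimal Monte Carlo sketch of the framework: sample test/training partitions from a toy distribution P, apply a weighted estimator to the training points, and measure the expected error against the test-set mean for a fixed dataset. The particular partition distribution, the uniform weighting rule, and the squared-error metric are all illustrative assumptions for this sketch, not the paper's Grothendieck-based algorithm.

```python
import numpy as np

# Minimal sketch of the framework in the abstract. The partition distribution,
# the uniform weighting rule, and the squared-error metric below are
# illustrative assumptions, not the paper's Grothendieck-based algorithm.

rng = np.random.default_rng(0)
n = 20  # datapoints x_1, ..., x_n, assumed to lie in [-1, 1]

def sample_partition():
    # Toy distribution P with a chronological flavor: a random cut point
    # splits the indices; earlier points form the training set, later ones the test set.
    cut = int(rng.integers(5, n - 4))
    return np.arange(cut), np.arange(cut, n)

def uniform_weights(train, test):
    # Baseline weighting: simply average the training set (ignores P).
    return np.full(len(train), 1.0 / len(train))

def expected_error(x, weight_fn, trials=10_000):
    # Monte Carlo estimate, for a fixed dataset x, of the expected squared
    # error of the weighted estimator over partitions drawn from P.
    errs = []
    for _ in range(trials):
        train, test = sample_partition()
        estimate = np.dot(weight_fn(train, test), x[train])
        errs.append((estimate - x[test].mean()) ** 2)
    return float(np.mean(errs))

# The paper's guarantee concerns the worst case over all bounded datasets;
# here we only probe two fixed sign patterns, since optimizing over the cube
# is exactly where the Grothendieck connection enters.
for x in (np.ones(n), np.where(np.arange(n) < n // 2, -1.0, 1.0)):
    print(expected_error(x, uniform_weights))
```

In the paper's setting, the uniform weighting above would be replaced by weights computed from the distribution P and the specific training and test indices, which is what yields the worst-case expected error guarantee within a π/2 factor of optimal.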


