Dependency Leakage: Analysis and Scalable Estimators

07/18/2018
by Matt Barnes, et al.

In this paper, we prove the first theoretical results on dependency leakage, a phenomenon in which learning on noisy clusters biases cross-validation and model selection results. This is a major concern for domains involving human record databases (e.g., medical, census, advertising), which are almost always noisy due to the effects of record linkage and which require special attention to machine learning bias. The proposed theoretical properties justify regularization choices in several existing statistical estimators and allow us to construct the first hypothesis test for cross-validation bias due to dependency leakage. Furthermore, we propose a novel matrix sketching technique which, along with standard function approximation techniques, dramatically improves the sample and computational scalability of existing estimators. Empirical results on several benchmark datasets validate our theoretical results and proposed methods.
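To make the phenomenon concrete, here is a minimal sketch in Python. It is an illustration only and is not the paper's estimator or hypothesis test. It assumes each record has a true entity label and a noisy cluster label produced by record linkage, and that the train/test split is made on the noisy labels. The function name `leaked_entities` and the toy data are hypothetical.

```python
from typing import Hashable, Sequence, Set


def leaked_entities(
    true_ids: Sequence[Hashable],
    noisy_ids: Sequence[Hashable],
    test_clusters: Set[Hashable],
) -> Set[Hashable]:
    """Return the true entities that have records on both sides of a
    train/test split made on (possibly noisy) cluster labels."""
    train_entities = set()
    test_entities = set()
    for entity, cluster in zip(true_ids, noisy_ids):
        # A record goes to the test fold if its cluster label was held out.
        if cluster in test_clusters:
            test_entities.add(entity)
        else:
            train_entities.add(entity)
    # Entities present in both folds are the source of dependency leakage.
    return train_entities & test_entities


# Hypothetical toy data: 3 people, 2 records each.
true_ids = ["a", "a", "b", "b", "c", "c"]
# Assumed linkage error: person "b" was wrongly split into clusters 1 and 2.
noisy_ids = [0, 0, 1, 2, 3, 3]

# Splitting on the true labels keeps every person in a single fold.
clean_split = leaked_entities(true_ids, true_ids, {"b"})
# Splitting on the noisy labels puts one record of "b" in each fold.
noisy_split = leaked_entities(true_ids, noisy_ids, {2, 3})

print(clean_split)
print(noisy_split)
```

With clean labels no person straddles the split, while the single linkage error places dependent records of the same person in both the training and test folds, which is the kind of dependence the abstract says biases cross-validation.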


