Model Similarity Mitigates Test Set Overuse

05/29/2019
by Horia Mania, et al.
Excessive reuse of test data has become commonplace in today's machine learning workflows. Popular benchmarks, competitions, and industrial-scale tuning, among other applications, all involve test data reuse beyond the guidance of statistical confidence bounds. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse. We proffer a new explanation for the apparent longevity of test data: many proposed models are similar in their predictions, and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large-scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings.
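
To make the phrase "agree in their predictions well beyond what we can conclude from their accuracy levels alone" concrete, here is a minimal sketch of one way such a comparison could be set up. It is not taken from the paper: the function names, the toy data, and the choice of baseline (the agreement expected if the two models made errors independently) are all assumptions for illustration.

```python
# Illustrative sketch only: names, toy data, and the independence baseline
# are assumptions, not the paper's code or exact definitions.

def accuracy(preds, labels):
    """Fraction of examples the model labels correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def correctness_agreement(preds_a, preds_b, labels):
    """Fraction of examples on which both models are right or both are wrong."""
    same = sum((a == y) == (b == y)
               for a, b, y in zip(preds_a, preds_b, labels))
    return same / len(labels)

def independent_baseline(acc_a, acc_b):
    """Agreement expected if the two models' errors were independent."""
    return acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

# Hypothetical toy test set with three classes.
labels  = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
preds_a = [0, 1, 2, 1, 0, 2, 1, 0, 0, 0]  # wrong on the last two examples
preds_b = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2]  # wrong on the same two examples

acc_a = accuracy(preds_a, labels)
acc_b = accuracy(preds_b, labels)
observed = correctness_agreement(preds_a, preds_b, labels)
baseline = independent_baseline(acc_a, acc_b)

print(f"accuracies: {acc_a:.2f}, {acc_b:.2f}")
print(f"observed agreement: {observed:.2f}")
print(f"agreement under independent errors: {baseline:.2f}")
```

In this toy case both models err on exactly the same examples, so the observed agreement exceeds the independence baseline implied by their accuracies. A gap in that direction is the kind of excess similarity the abstract describes.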


