Do we train on test data? Purging CIFAR of near-duplicates

02/01/2019
by Björn Barz, et al.

We find that 3.3% and 10% of the images from the CIFAR-10 and CIFAR-100 test sets, respectively, have duplicates in the training set. This may incur a bias on the comparison of image recognition techniques with respect to their generalization capability on these heavily benchmarked datasets. To eliminate this bias, we provide the "fair CIFAR" (ciFAIR) dataset, where we replaced all duplicates in the test sets with new images sampled from the same domain. The training set remains unchanged, in order not to invalidate pre-trained models. We then re-evaluate the classification performance of various popular state-of-the-art CNN architectures on these new test sets to investigate whether recent research has overfitted to memorizing data instead of learning abstract concepts. Fortunately, this does not seem to be the case yet. The ciFAIR dataset and pre-trained models are available at https://cvjena.github.io/cifair/, where we also maintain a leaderboard.
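
To make the idea of searching a test set for duplicates of training images more concrete, the following is a minimal sketch and not the authors' actual procedure. It assumes each image has already been mapped to a feature vector (for example by some CNN), and it flags test images whose nearest training image exceeds a cosine-similarity threshold. The function names, the threshold of 0.95, and the use of cosine similarity are all illustrative assumptions; flagged pairs would still need to be inspected by a human.

```python
import numpy as np


def nearest_train_neighbors(test_feats, train_feats):
    """Return, for each test vector, the index of and cosine similarity to
    its most similar training vector. (Hypothetical helper.)"""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    test_n = test_feats / np.maximum(
        np.linalg.norm(test_feats, axis=1, keepdims=True), eps)
    train_n = train_feats / np.maximum(
        np.linalg.norm(train_feats, axis=1, keepdims=True), eps)
    sims = test_n @ train_n.T            # shape: (num_test, num_train)
    idx = sims.argmax(axis=1)            # closest training item per test item
    return idx, sims[np.arange(len(test_feats)), idx]


def flag_duplicate_candidates(test_feats, train_feats, threshold=0.95):
    """List (test_index, train_index, similarity) for every test item whose
    nearest training item is at least `threshold` similar.
    The threshold is an assumed value, not one taken from the paper."""
    idx, sim = nearest_train_neighbors(test_feats, train_feats)
    return [(int(i), int(idx[i]), float(sim[i]))
            for i in np.flatnonzero(sim >= threshold)]


if __name__ == "__main__":
    # Toy 2-D "features"; real features would come from an image model.
    train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    test = np.array([[2.0, 0.01], [-1.0, 0.5]])
    for t, r, s in flag_duplicate_candidates(test, train):
        print(f"test {t} resembles train {r} (cosine similarity {s:.3f})")
```

The dense similarity matrix is only practical for small sets; for datasets the size of CIFAR, one would likely process the test features in batches or use an approximate nearest-neighbor index.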


