From ImageNet to Image Classification: Contextualizing Progress on Benchmarks

05/22/2020
by Dimitris Tsipras, et al.

Building rich machine learning datasets in a scalable manner often necessitates a crowd-sourced data collection pipeline. In this work, we use human studies to investigate the consequences of employing such a pipeline, focusing on the popular ImageNet dataset. We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset—including the introduction of biases that state-of-the-art models exploit. Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for. Finally, our findings emphasize the need to augment our current model training and evaluation toolkit to take such misalignments into account. To facilitate further research, we release our refined ImageNet annotations at https://github.com/MadryLab/ImageNetMultiLabel.



Code Repositories

ImageNetMultiLabel

Fine-grained ImageNet annotations

