Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets

04/26/2021
by Yuan-Hong Liao, et al.

Data is the engine of modern computer vision, which necessitates collecting large-scale datasets. This is expensive, and guaranteeing the quality of the labels is a major challenge. In this paper, we investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images. While methods that exploit learnt models for labeling exist, a surprisingly prevalent approach is to query humans for a fixed number of labels per datum and aggregate them, which is expensive. Building on prior work on online joint probabilistic modeling of human annotations and machine-generated beliefs, we propose modifications and best practices aimed at minimizing human labeling effort. Specifically, we make use of advances in self-supervised learning, view annotation as a semi-supervised learning problem, identify and mitigate pitfalls, and ablate several key design choices to propose effective guidelines for labeling. Our analysis is done in a more realistic simulation that involves querying human labelers, which uncovers issues with evaluation using existing worker simulation methods. Simulated experiments on a 125k-image subset of ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average, a 2.7x and 6.7x improvement over prior work and manual annotation, respectively. Project page: https://fidler-lab.github.io/efficient-annotation-cookbook
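To make the contrast in the abstract concrete, the following is a minimal sketch of the two labeling styles it mentions: aggregating a fixed number of human labels by majority vote, versus updating a machine-generated belief online and stopping once it is confident enough. This is not the authors' implementation. The function names, the symmetric-noise worker model, and the `accuracy` and `threshold` values are all assumptions chosen for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Fixed-budget baseline: return the most common human label."""
    return Counter(labels).most_common(1)[0][0]


def update_posterior(prior, vote, accuracy):
    """One Bayes update of a class belief given a single worker vote.

    Assumed worker model (hypothetical, for illustration only): the worker
    reports the true class with probability `accuracy`, and otherwise picks
    one of the remaining classes uniformly at random.
    """
    num_classes = len(prior)
    wrong = (1.0 - accuracy) / (num_classes - 1)
    unnormalized = [
        p * (accuracy if c == vote else wrong) for c, p in enumerate(prior)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def annotate_online(prior, votes, accuracy=0.8, threshold=0.9):
    """Consume votes one at a time until the belief is confident enough.

    `prior` stands in for a machine-generated belief over classes.
    Returns (predicted_label, number_of_votes_used).
    """
    belief = list(prior)
    used = 0
    for vote in votes:
        if max(belief) >= threshold:
            break  # confident already: stop querying humans
        belief = update_posterior(belief, vote, accuracy)
        used += 1
    label = max(range(len(belief)), key=belief.__getitem__)
    return label, used


if __name__ == "__main__":
    # A confident model belief needs no human votes at all.
    print(annotate_online([0.95, 0.03, 0.02], [0, 0, 0]))
    # An uninformative belief needs votes until the threshold is reached.
    print(annotate_online([1 / 3, 1 / 3, 1 / 3], [1, 1, 1]))
```

Under these assumptions, images on which the model is already confident consume no human annotations, which is how the average number of annotations per image can fall below one.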
