Good Data from Bad Models: Foundations of Threshold-based Auto-labeling

11/22/2022
by   Harit Vishwakarma, et al.

Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, where validation data obtained from humans is used to find a threshold for confidence above which the data is machine-labeled, is emerging as a popular solution used widely in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.
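The thresholding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a classifier that emits one confidence score per point, picks the smallest threshold whose plain empirical validation error stays within a target, and omits any confidence-interval correction of the kind the sample complexity bounds concern. The names `find_threshold`, `auto_label`, `max_error`, and `min_count` are hypothetical and introduced only for illustration.

```python
import numpy as np


def find_threshold(val_conf, val_correct, max_error=0.05, min_count=10):
    """Return the smallest candidate threshold whose empirical validation
    error is <= max_error, or None if no candidate qualifies.

    val_conf    : model confidence on each human-labeled validation point
    val_correct : whether the model's prediction matched the human label
    min_count   : (assumed) minimum number of validation points above the
                  threshold before the error estimate is considered at all
    """
    val_conf = np.asarray(val_conf, dtype=float)
    val_correct = np.asarray(val_correct, dtype=bool)
    best = None
    # Scan candidate thresholds from most to least confident.
    for t in np.unique(val_conf)[::-1]:
        above = val_conf >= t
        if above.sum() < min_count:
            continue  # too few validation points to estimate the error
        error = 1.0 - val_correct[above].mean()
        if error <= max_error:
            best = t  # keep lowering the threshold while the target holds
    return best


def auto_label(pool_conf, pool_pred, threshold):
    """Machine-label the unlabeled points whose confidence is >= threshold.

    Returns the indices of the auto-labeled points and their labels.
    """
    pool_conf = np.asarray(pool_conf, dtype=float)
    pool_pred = np.asarray(pool_pred)
    idx = np.flatnonzero(pool_conf >= threshold)
    return idx, pool_pred[idx]
```

Points that fall below the threshold would remain for human labeling. Because the error estimate above uses only the validation points that land above each candidate threshold, a tight guarantee can require many such points, which is the validation-data cost the abstract warns about.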

research
01/21/2022

Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization

Localizing keypoints of an object is a basic visual problem. However, su...
research
02/27/2023

Revisiting Self-Training with Regularized Pseudo-Labeling for Tabular Data

Recent progress in semi- and self-supervised learning has caused a rift ...
research
09/23/2021

Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming

A critical bottleneck in supervised machine learning is the need for lar...
research
03/03/2020

FLAME: A Self-Adaptive Auto-labeling System for Heterogeneous Mobile Processors

How to accurately and efficiently label data on a mobile device is criti...
research
03/27/2022

OneLabeler: A Flexible System for Building Data Labeling Tools

Labeled datasets are essential for supervised machine learning. Various ...
research
06/06/2019

Gradual Machine Learning for Aspect-level Sentiment Analysis

The state-of-the-art solutions for Aspect-Level Sentiment Analysis (ALSA...
research
08/06/2020

Functional Regularization for Representation Learning: A Unified Theoretical Perspective

Unsupervised and self-supervised learning approaches have become a cruci...
