Data Shapley Valuation for Efficient Batch Active Learning

04/16/2021
by Amirata Ghorbani, et al.

Annotating the right set of data among all available data points is a key challenge in many machine learning applications. Batch active learning is a popular approach to address this: batches of unlabeled data points are selected for annotation, and the underlying learning algorithm is subsequently updated. Increasingly large batches are particularly appealing in settings where data can be annotated in parallel and model training is computationally expensive. A key challenge here is scale: typical active learning methods rely on diversity techniques, which select a diverse set of data points to annotate from an unlabeled pool. In this work, we introduce Active Data Shapley (ADS), a filtering layer for batch active learning that significantly increases the efficiency of active learning by pre-selecting, using a linear-time computation, the highest-value points from an unlabeled dataset. Using the notion of the Shapley value of data, our method estimates the value of unlabeled data points with regard to the prediction task at hand. We show that ADS is particularly effective when the pool of unlabeled data exhibits real-world caveats: noise, heterogeneity, and domain shift. We run experiments demonstrating that when ADS is used to pre-select the highest-ranking portion of an unlabeled dataset, the efficiency of state-of-the-art batch active learning methods increases by an average factor of 6x, while preserving performance.
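To make the "value, rank, pre-select" idea concrete, the following is a minimal sketch, not the ADS procedure itself. It uses the generic Monte Carlo permutation estimator of the data Shapley value on labeled points, with a toy 1-nearest-neighbour accuracy as the utility, and then keeps the top-ranked fraction. The abstract does not say how ADS extends values to unlabeled points or how its linear-time computation works, so neither is reproduced here; the function names `knn1_accuracy`, `monte_carlo_shapley`, and `preselect` are hypothetical.

```python
import random


def knn1_accuracy(train, valid):
    """Toy utility: 1-nearest-neighbour accuracy on `valid`.

    `train` and `valid` are lists of (scalar_feature, label) pairs.
    An empty training set is assigned utility 0.0 (an assumption).
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += nearest[1] == y
    return correct / len(valid)


def monte_carlo_shapley(train, valid, num_permutations=200, seed=0):
    """Estimate each training point's Shapley value by averaging its
    marginal contribution to the utility over random permutations."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        prev = knn1_accuracy(subset, valid)
        for i in order:
            subset.append(train[i])
            cur = knn1_accuracy(subset, valid)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / num_permutations for v in values]


def preselect(values, keep_fraction):
    """Return the (sorted) indices of the highest-valued fraction of points."""
    k = max(1, int(len(values) * keep_fraction))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(ranked[:k])
```

In a pipeline of the kind the abstract describes, the indices returned by `preselect` would define the reduced pool handed to a batch active learning method. Note that this estimator retrains the utility for every prefix of every permutation, so it is far more expensive than a linear-time computation and serves only as an illustration of the ranking step.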

research
07/19/2023

Novel Batch Active Learning Approach and Its Application to Synthetic Aperture Radar Datasets

Active learning improves the performance of machine learning methods by ...
research
08/06/2019

Bayesian Batch Active Learning as Sparse Subset Approximation

Leveraging the wealth of unlabeled data produced in recent years provide...
research
12/06/2018

Active Learning Methods based on Statistical Leverage Scores

In many real-world machine learning applications, unlabeled data are abu...
research
11/27/2020

Active Learning in CNNs via Expected Improvement Maximization

Deep learning models such as Convolutional Neural Networks (CNNs) have d...
research
04/06/2021

Low-Regret Active learning

We develop an online learning algorithm for identifying unlabeled data p...
research
05/23/2020

Active Learning for Skewed Data Sets

Consider a sequential active learning problem where, at each round, an a...
research
07/20/2022

Stream-based active learning with linear models

The proliferation of automated data collection schemes and the advances ...