Similarity Search for Efficient Active Learning and Search of Rare Concepts

06/30/2020
by Cody Coleman, et al.

Many active learning and search approaches are intractable for industrial settings with billions of unlabeled examples. Existing approaches, such as uncertainty sampling or information density, search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In practice, however, data is often heavily skewed: only a small fraction of collected data is relevant for a given learning task. For example, when identifying rare classes, detecting malicious content, or debugging model performance, the ratio of positive to negative examples can be 1 to 1,000 or more. In this work, we exploit this skew in large training datasets to reduce the number of unlabeled examples considered in each selection round by only looking at the nearest neighbors of the labeled examples. Empirically, we observe that learned representations effectively cluster unseen concepts, making active learning very effective and substantially reducing the number of viable unlabeled examples. We evaluate several active learning and search techniques in this setting on three large-scale datasets: ImageNet, Goodreads spoiler detection, and OpenImages. For rare classes, active learning methods need as little as 0.31% of the labeled data to match the average precision of full supervision. By limiting active learning methods to only consider the immediate neighbors of the labeled data as candidates for labeling, we need only process as little as 1% of the unlabeled data while achieving similar reductions in labeling costs as the traditional global approach. This process of expanding the candidate pool with the nearest neighbors of the labeled set can be done efficiently and reduces the computational complexity of selection by orders of magnitude.
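The selection scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D embeddings, the brute-force neighbor search, and the fixed stand-in scoring function `toy_prob` are all assumptions made for illustration. A real system would presumably use learned embeddings, an approximate nearest-neighbor index, and a classifier retrained after each round.

```python
import math


def nearest_neighbors(embeddings, query_idx, k):
    """Brute-force k nearest neighbors of one point (Euclidean distance)."""
    q = embeddings[query_idx]
    dists = [(math.dist(q, e), i)
             for i, e in enumerate(embeddings) if i != query_idx]
    dists.sort()  # ties are broken by the smaller index
    return [i for _, i in dists[:k]]


def expand_candidates(embeddings, labeled, newly_labeled, candidates, k):
    """Grow the candidate pool with neighbors of newly labeled examples."""
    pool = set(candidates)
    for idx in newly_labeled:
        pool.update(nearest_neighbors(embeddings, idx, k))
    return pool - set(labeled)  # never propose an already labeled example


def select_most_uncertain(candidates, prob_positive, batch_size):
    """Uncertainty sampling restricted to the candidate pool."""
    ranked = sorted(candidates, key=lambda i: (abs(prob_positive(i) - 0.5), i))
    return ranked[:batch_size]


def run_rounds(embeddings, oracle, seed, k, batch_size, rounds, prob_positive):
    """Label the seed, then alternate selection and pool expansion."""
    labeled = {i: oracle[i] for i in seed}
    candidates = expand_candidates(embeddings, labeled, seed, set(), k)
    for _ in range(rounds):
        if not candidates:
            break
        batch = select_most_uncertain(candidates, prob_positive, batch_size)
        for i in batch:
            labeled[i] = oracle[i]  # query the (simulated) annotator
        candidates = expand_candidates(embeddings, labeled, batch, candidates, k)
    return labeled, candidates


# Toy data (assumed): 5 rare positives in a tight cluster, 49 negatives on a grid.
positives = [(100, 100), (101, 100), (100, 101), (101, 101), (102, 100)]
negatives = [(x, y) for x in range(0, 70, 10) for y in range(0, 70, 10)]
embeddings = positives + negatives
oracle = [1] * len(positives) + [0] * len(negatives)


def toy_prob(i):
    """Hypothetical stand-in for a model score; a real model would be retrained."""
    return 1.0 / (1.0 + math.dist(embeddings[i], (100, 100)))


first_pool = expand_candidates(embeddings, {0: 1}, [0], set(), k=3)
labeled, remaining = run_rounds(embeddings, oracle, seed=[0], k=3,
                                batch_size=2, rounds=2, prob_positive=toy_prob)
print(len(first_pool), "initial candidates out of", len(embeddings), "examples")
print(sum(labeled.values()), "positives found with", len(labeled), "labels")
```

In this toy run the selection step only ever scores the handful of points in the candidate pool rather than all 54 examples, which is the source of the computational savings the abstract refers to. The sketch also shows one design trade-off: with a small `k`, the pool can empty out once a cluster is exhausted, so the choice of `k` controls how far the search spreads from the labeled set.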


