Similarity Search for Efficient Active Learning and Search of Rare Concepts

by Cody Coleman, et al.

Many active learning and search approaches are intractable for industrial settings with billions of unlabeled examples. Existing approaches, such as uncertainty sampling or information density, search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In practice, however, data is often heavily skewed: only a small fraction of collected data is relevant to a given learning task. For example, when identifying rare classes, detecting malicious content, or debugging model performance, the ratio of positive to negative examples can be 1 to 1,000 or more. In this work, we exploit this skew in large training datasets to reduce the number of unlabeled examples considered in each selection round by looking only at the nearest neighbors of the labeled examples. Empirically, we observe that learned representations effectively cluster unseen concepts, making active learning very effective and substantially reducing the number of viable unlabeled examples. We evaluate several active learning and search techniques in this setting on three large-scale datasets: ImageNet, Goodreads spoiler detection, and OpenImages. For rare classes, active learning methods need as little as 0.31% of the labeled data to match the average precision of full supervision. By limiting active learning methods to consider only the immediate neighbors of the labeled data as candidates for labeling, we need to process as little as 1% of the unlabeled data while achieving similar reductions in labeling costs as the traditional global approach. This process of expanding the candidate pool with the nearest neighbors of the labeled set can be done efficiently and reduces the computational complexity of selection by orders of magnitude.
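The candidate-pool restriction described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `expand_candidate_pool` and `select_most_uncertain` are hypothetical, the neighbor search is brute-force cosine similarity in NumPy rather than whatever similarity-search index the authors use, and the selection rule is plain uncertainty sampling for a binary classifier.

```python
import numpy as np


def expand_candidate_pool(embeddings, labeled_idx, k=5):
    """Return sorted indices of unlabeled points that are among the k nearest
    neighbors (by cosine similarity) of at least one labeled point.

    Brute-force search is an illustrative stand-in for a real index.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labeled_idx = np.unique(np.asarray(labeled_idx, dtype=int))

    # Cap k at the number of unlabeled points that actually exist.
    k = min(k, embeddings.shape[0] - labeled_idx.size)
    if k <= 0:
        return np.array([], dtype=int)

    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    sims = unit[labeled_idx] @ unit.T      # shape: (n_labeled, n_total)
    sims[:, labeled_idx] = -np.inf         # never propose already-labeled points

    # Top-k columns per row, found without a full sort.
    nearest = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return np.unique(nearest)


def select_most_uncertain(probs, candidates, batch_size):
    """Among the candidates only, pick those whose predicted positive-class
    probability is closest to 0.5 (uncertainty sampling)."""
    probs = np.asarray(probs, dtype=float)
    candidates = np.asarray(candidates, dtype=int)
    distance_from_boundary = np.abs(probs[candidates] - 0.5)
    order = np.argsort(distance_from_boundary, kind="stable")[:batch_size]
    return candidates[order]
```

In each selection round, a loop built on this sketch would score only the indices returned by `expand_candidate_pool` instead of the full unlabeled set, label the selected batch, add it to `labeled_idx`, and repeat, so the pool grows outward from the labeled examples. At the scale the abstract mentions, the brute-force similarity matrix would have to be replaced by an approximate nearest-neighbor index.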



