- ALiPy: Active Learning in Python. Supervised machine learning methods usually require a large set of label...
- Active Learning for Skewed Data Sets. Consider a sequential active learning problem where, at each round, an a...
- Exemplar Guided Active Learning. We consider the problem of wisely using a limited budget to label a smal...
- Active Learning with Importance Sampling. We consider an active learning setting where the algorithm has access to...
- Active Learning Methods based on Statistical Leverage Scores. In many real-world machine learning applications, unlabeled data are abu...
- Active Learning in the Overparameterized and Interpolating Regime. Overparameterized models that interpolate training data often display su...
- Cost-Sensitive Active Learning for Intracranial Hemorrhage Detection. Deep learning for clinical applications is subject to stringent performa...
Similarity Search for Efficient Active Learning and Search of Rare Concepts
Many active learning and search approaches are intractable for industrial settings with billions of unlabeled examples. Existing approaches, such as uncertainty sampling or information density, search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. However, in practice, data is often heavily skewed; only a small fraction of collected data will be relevant for a given learning task. For example, when identifying rare classes, detecting malicious content, or debugging model performance, the ratio of positive to negative examples can be 1 to 1,000 or more. In this work, we exploit this skew in large training datasets to reduce the number of unlabeled examples considered in each selection round by only looking at the nearest neighbors to the labeled examples. Empirically, we observe that learned representations effectively cluster unseen concepts, making active learning very effective and substantially reducing the number of viable unlabeled examples. We evaluate several active learning and search techniques in this setting on three large-scale datasets: ImageNet, Goodreads spoiler detection, and OpenImages. For rare classes, active learning methods need as little as 0.31% of the labels required by full supervision. By limiting active learning methods to only consider the immediate neighbors of the labeled data as candidates for labeling, we need only process as little as 1% of the unlabeled data while achieving similar reductions in labeling costs as the traditional global approach. This process of expanding the candidate pool with the nearest neighbors of the labeled set can be done efficiently and reduces the computational complexity of selection by orders of magnitude.
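The core idea described above (restricting each selection round to the nearest neighbors of the already-labeled examples, instead of scoring the entire unlabeled pool) is easy to sketch. The following is a minimal illustration, not the authors' implementation: the synthetic embeddings, the choice of 10 neighbors, the batch size of 25, and the use of scikit-learn's NearestNeighbors with margin-based uncertainty sampling are all assumptions made for the example.

```python
# Minimal sketch of nearest-neighbor-restricted active learning for a rare
# class, assuming precomputed embeddings and a cheap linear model.
# Parameter choices and data are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic "learned representations": a small positive cluster inside a much
# larger negative pool (roughly 1 positive per 100 negatives).
neg = rng.normal(0.0, 1.0, size=(20_000, 32))
pos = rng.normal(3.0, 0.5, size=(200, 32))
X = np.vstack([neg, pos])
y = np.concatenate([np.zeros(len(neg)), np.ones(len(pos))]).astype(int)

# Neighbor index over the full pool; only neighbor lookups touch all of it.
index = NearestNeighbors(n_neighbors=10).fit(X)

# Small labeled seed set with a few positives and negatives.
labeled = set(rng.choice(np.flatnonzero(y == 1), size=5, replace=False))
labeled |= set(rng.choice(np.flatnonzero(y == 0), size=5, replace=False))

for round_ in range(5):
    # 1. Expand the candidate pool: nearest neighbors of the labeled set only.
    _, nbrs = index.kneighbors(X[sorted(labeled)])
    candidates = np.array(sorted(set(nbrs.ravel()) - labeled))

    # 2. Fit a cheap model on the currently labeled examples.
    idx = sorted(labeled)
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

    # 3. Uncertainty sampling, but only over the restricted candidate pool.
    margin = np.abs(clf.predict_proba(X[candidates])[:, 1] - 0.5)
    batch = candidates[np.argsort(margin)[:25]]  # 25 most uncertain candidates

    labeled |= set(batch.tolist())               # "oracle" labels come from y
    print(f"round {round_}: candidates={len(candidates)}, "
          f"labeled={len(labeled)}, positives found={int(y[sorted(labeled)].sum())}")
```

In a production setting the neighbor lookups would run against an approximate similarity-search index (e.g., Faiss or a comparable library) built once over precomputed embeddings, so each round scores only the labeled set's neighborhood rather than the full unlabeled pool; that restriction is what yields the orders-of-magnitude reduction in selection cost described in the abstract.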