Committee-Based Sample Selection for Probabilistic Classifiers

06/01/2011
by S. Argamon-Engelson, et al.

In many real-world learning tasks, it is expensive to acquire a sufficient number of labeled examples for training. This paper investigates methods for reducing annotation cost by "sample selection". In this approach, during training the learning program examines many unlabeled examples and selects for labeling only those that are most informative at each stage. This avoids redundantly labeling examples that contribute little new information. Our work follows on previous research on Query By Committee, extending the committee-based paradigm to the context of probabilistic classification. We describe a family of empirical methods for committee-based sample selection in probabilistic classification models, which evaluate the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned by the training set labeled so far. The method was applied to the real-world natural language processing task of stochastic part-of-speech tagging. We find that all variants of the method achieve a significant reduction in annotation cost, although their computational efficiency differs. In particular, the simplest variant, a two-member committee with no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
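The selection procedure the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a toy unigram tagging model in which each word type's tag distribution has a Dirichlet posterior over the tag counts labeled so far, draws a two-member committee from that posterior, and selects an example for labeling when the members' chosen tags disagree. The word list, counts, and tag inventory are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-word tag counts from the examples labeled so far
# (3 tags; the words and numbers are illustrative only).
counts = {
    "run": np.array([3.0, 3.0, 0.0]),   # genuinely ambiguous word
    "the": np.array([0.0, 0.0, 20.0]),  # unambiguous word
}

def committee_votes(word, k=2):
    """Draw k model variants from the Dirichlet posterior over the word's
    tag distribution (flat prior, +1) and return each variant's chosen tag."""
    return [int(np.argmax(rng.dirichlet(counts[word] + 1.0)))
            for _ in range(k)]

def selected(word):
    """Two-member committee: label the example iff the members disagree."""
    a, b = committee_votes(word, k=2)
    return a != b

def selection_rate(word, trials=500):
    """Fraction of trials on which the committee asks for a label."""
    return sum(selected(word) for _ in range(trials)) / trials
```

Under this sketch, `selection_rate("run")` comes out far higher than `selection_rate("the")`: committee members sampled from a peaked posterior almost always agree, so unambiguous, already-well-learned examples are rarely sent for annotation, which is exactly the cost saving the paper measures.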


