Sampling Bias in Deep Active Classification: An Empirical Study

09/20/2019 ∙ by Ameya Prabhu, et al. ∙ University of Oxford ∙ Verisk Analytics

The exploding cost and time needed for data labeling and model training are bottlenecks for training DNN models on large datasets. Identifying smaller representative data samples with strategies like active learning can help mitigate such bottlenecks. Previous works on active learning in NLP identify the problem of sampling bias in the samples acquired by uncertainty-based querying and develop costly approaches to address it. Using a large empirical study, we demonstrate that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices (query size and strategies), unlike that suggested by traditional literature. We also show that the FTZ based query strategy produces sample sets similar to those from more sophisticated approaches (e.g., ensemble networks). Finally, we show the effectiveness of the selected samples by creating tiny high-quality datasets and utilizing them for fast and cheap training of large models. Based on the above, we propose a simple baseline for deep active text classification that outperforms the state-of-the-art. We expect the presented work to be useful and informative for dataset compression and for problems involving active, semi-supervised or online learning scenarios. Code and models are available at:







1 Introduction

Deep neural networks (DNNs) trained on large datasets provide state-of-the-art results on various NLP problems

Devlin et al. (2019), including text classification Howard and Ruder (2018). However, the cost and time needed to obtain labeled data and to train models are serious impediments to creating new and/or better models. This problem can be mitigated by creating smaller representative datasets with active learning, which can be used to train DNNs to a test accuracy similar to that achieved with the full training dataset. In other words, the smaller sample can be considered a surrogate for the full data.

However, there is a lack of clarity in the active learning literature regarding sampling bias in such surrogate datasets created using active learning Settles (2009): its dependence on the models, functions and parameters used to acquire the sample. Indeed, what constitutes a good sample? In this paper, we perform an empirical investigation using active text classification as the application.

Early work in active text classification Lewis and Gale (1994) suggests that greedy query generation using label uncertainty may lead to efficient representative samples (i.e., smaller samples achieving the same test accuracy). Subsequent concerns regarding sampling bias have led to the explicit use of expensive diversity measures Brinker (2003); Hoi et al. (2006) in acquisition functions, or to ensemble approaches Liere and Tadepalli (1997); McCallum and Nigam (1998) that improve diversity implicitly.

Deep active learning approaches adapt the framework discussed above to train DNNs on large data. However, it is not clear whether the properties of deep approaches mirror those of their shallow counterparts, and whether the theory and empirical evidence regarding sampling efficiency and bias translate from shallow to deep models. For example, Sener and Savarese (2018) and Ducoffe and Precioso (2018) find that uncertainty based strategies perform no better than random sampling even if ensembles are used, and that diversity based measures outperform both. On the other hand, Beluch et al. (2018); Gissin and Shalev-Shwartz (2019) find that uncertainty measures computed with ensembles outperform diversity based approaches, while Gal et al. (2017); Beluch et al. (2018); Siddhant and Lipton (2018) find them to outperform uncertainty measures computed using single models. A recent empirical study Siddhant and Lipton (2018) investigating active learning in NLP suggests that Bayesian active learning outperforms classical uncertainty sampling across all settings. However, these approaches have been limited to relatively small datasets.

1.1 Sampling Bias in Active Classification

In this paper, we investigate the issues of sampling bias and sample efficiency, the stability of the actively collected query and train sets, and the impact of algorithmic factors - i.e., the setup chosen while training the algorithm - in the context of deep active text classification on large datasets. In particular, we consider two sampling biases (label and distributional bias); three algorithmic factors (initial set selection, query size and query strategy); two trained models; and four acquisition functions, on eight large datasets.

To isolate and evaluate the impact of the above (combinatorial) factors, a large experimental study was necessary. Consequently, we conducted over 2.3K experiments on 8 popular, large datasets of sizes ranging from 120K to 3.6M samples. Note that the current trend in deep learning is to train large models on very large datasets, yet the aforementioned issues have not been investigated in the literature in such a setup. As shown in Table 1, the datasets used in the latest such analysis on active text classification, by Siddhant and Lipton (2018), are quite small in comparison. The datasets used by us are two orders of magnitude larger - our query samples are often the size of the entire datasets used by previous works - and the presented empirical study is more extensive (20x the experiments).

Our findings are as follows:

(i) We find that utilizing the uncertainty query strategy with a deep model like FastText.zip (FTZ) (chosen to optimize the time and resources needed for this study) to actively construct a representative sample provides query and train sets with remarkably good sampling properties.

(ii) We find that a single deep model (FTZ) used for querying provides a sample set similar to those of more expensive approaches using ensembles of models. Additionally, the sample set has a large overlap with the support vectors of an SVM trained on the entire dataset, and is largely invariant to a variety of algorithmic factors, indicating the robustness of the acquired sample set.

(iii) We demonstrate that the actively acquired training datasets can be utilized as small, surrogate training sets with 5x-40x compression for training large, deep text classification models. In particular, we can train the ULMFiT Howard and Ruder (2018) model to state-of-the-art accuracy at 25x-200x speedups.

(iv) Finally, we create a novel, state-of-the-art baseline for active text classification which outperforms recent work Siddhant and Lipton (2018) using Bayesian dropout, while utilizing 4x less training data. We also outperform Sener and Savarese (2018), which uses an expensive diversity based query strategy (coreset sampling), at all training data sizes.

The rest of the paper is organized as follows: in Section 2, the experimental methodology and setup are described. Section 3 presents the experimental study on sampling biases as well as the impact of various algorithmic factors. In Section 4, we compare with prior literature in active text classification. Section 5 presents a downstream use case - fast bootstrapping of the training of very large models like ULMFiT. Finally, we discuss the current literature in light of our work in Section 6 and summarize the conclusions in Section 7.

2 Methodology

This section describes the experimental approach and the setup used to empirically investigate the issues of (i) sampling bias and (ii) sampling efficiency in creating small samples to train deep models.

2.1 Approach

A labelled training set is incrementally built from a pool of unlabeled data by selecting samples and acquiring their labels from an oracle in sequential increments. In this, we follow the standard approach found in the active learning literature. We use the following terminology:

Queries & Query Strategy: We refer to the (incremental) set of points selected to be labeled and added to the train set as the query, and the (acquisition) function used to select the samples as the query strategy.

Pool & Train Sets: The pool is the unlabeled data from which queries are iteratively selected, labeled and added to the (labeled) train set.

Let D = {(x_i, y_i)}, i = 1..N, denote a dataset consisting of N i.i.d. samples of data/label pairs, where N denotes the cardinality. Let S_0 denote an initial randomly drawn sample from the initial pool. At each iteration t, we train the model on the current train set S_t and use a model-dependent query strategy to acquire b new samples from the pool, get them labeled by an oracle and add them to the train set. Thus, a sequence of training sets S_0 ⊂ S_1 ⊂ ... ⊂ S_T is created by sampling T queries from the pool set, each of size b. The queries are given by Q_t = S_t \ S_{t-1}. Note that |Q_t| = b and |S_T| = |S_0| + T·b.
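The incremental acquisition procedure above can be sketched as a generic pool-based loop. This is a hypothetical sketch, not the paper's implementation: train_fn and score_fn stand in for any model (e.g., FTZ) and any uncertainty query strategy.

```python
import numpy as np

def active_learning_loop(pool_x, pool_y, init_size, query_size, n_queries,
                         train_fn, score_fn, rng):
    """Pool-based active learning: grow a labeled train set in increments.

    train_fn(x, y) -> model; score_fn(model, x) -> per-sample uncertainty.
    Returns the pool indices of the acquired train set S_T.
    """
    n = len(pool_x)
    unlabeled = set(range(n))
    # S_0: random initial sample
    train_idx = list(rng.choice(n, size=init_size, replace=False))
    unlabeled -= set(train_idx)
    for _ in range(n_queries):
        model = train_fn(pool_x[train_idx], pool_y[train_idx])
        cand = np.array(sorted(unlabeled))
        scores = score_fn(model, pool_x[cand])
        # Q_t: the b most uncertain pool points
        query = cand[np.argsort(-scores)[:query_size]]
        train_idx.extend(query.tolist())
        unlabeled -= set(query.tolist())
    return train_idx
```

With T queries of size b and an initial set of size |S_0|, the returned set has |S_0| + T·b indices, matching the notation above.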

In this paper, we investigate the efficiency and bias of the sample sets obtained by different query strategies. We exclude the randomly acquired initial set S_0 and perform comparisons on the actively acquired sample sets, defined as A_T = S_T \ S_0.

2.2 Experimental Setup

In this section, we share details of the experimental setup, and present and explain the choice of the datasets, models and query strategies used.

Datasets: We used eight large, representative datasets widely used for text classification: AG-News (AGN), DBPedia (DBP), Amazon Review Polarity (AMZP), Amazon Review Full (AMZF), Yelp Review Polarity (YRP), Yelp Review Full (YRF), Yahoo Answers (YHA) and Sogou News (SGN). Please refer to Section 4 of Zhang et al. (2015) for details regarding the collection and characteristics of these datasets. Table 1 compares the choice of datasets, models and number of experiments between our study and Siddhant and Lipton (2018), which investigates a variety of NLP tasks including text classification, while we focus only on the latter.


We selected two text classification models as representatives of classical and deep learning approaches, respectively, which were fast to train and performed well on text classification: Multinomial Naive Bayes (MNB) with TF-IDF features Wang and Manning (2012), a popular baseline for text classification, and FastText.zip (FTZ) Joulin et al. (2016). The FTZ model provides results competitive with VDCNNs (a 29-layer CNN) Conneau et al. (2017) but with over 15,000x speedup Joulin et al. (2017). This allowed us to conduct a thorough empirical study on large datasets.

Query Strategies: Uncertainty based query strategies are widely used and well studied in the active learning literature. These strategies typically apply a scoring function to the (softmax) output of a single model. We evaluate the following ones: Least Confidence (LC) and Entropy (Ent). Independently training ensembles of models Lakshminarayanan et al. (2017) is another principled approach to obtain uncertainties associated with the output estimate. We thus evaluated four query strategies - LC and Ent computed using single and ensemble models - against random sampling (chance) as a baseline. For ensembles, we used five FTZ models Lakshminarayanan et al. (2017). In contrast, Siddhant and Lipton (2018) used Bayesian ensembles via Dropout, proposed in Gal et al. (2017). Please refer to Section 4 for a comparison.
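The acquisition functions named above can be sketched as follows. This is a minimal illustration (not the paper's code); probs denotes a softmax posterior matrix with one row per pool sample.

```python
import numpy as np

def least_confidence(probs):
    # LC: 1 - max class probability; higher means more uncertain
    return 1.0 - probs.max(axis=1)

def entropy(probs, eps=1e-12):
    # Ent: Shannon entropy of the softmax posterior
    return -(probs * np.log(probs + eps)).sum(axis=1)

def ensemble_entropy(prob_list, eps=1e-12):
    # Deep-ensemble style: average the member posteriors, then score
    mean_probs = np.mean(prob_list, axis=0)
    return entropy(mean_probs, eps)
```

In each case the query selects the b pool samples with the highest score; random sampling (chance) ignores the scores entirely.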

Paper | #Exp | Datasets (#Train) | Models (Full Acc)
DAL | 120 | TREC-QA (6k), MAReview (10.5k) | SVM (89%), CNN (91%), LSTM (92%)
Ours | 2.3K | AGN (120k), SGN (450k), DBP (560k), YRP (560k), YRF (650k), YHA (1400k), AMZP (3600k), AMZF (3000k) | FTZ (97%), MNB (90%)
Table 1: Comparison of active text classification datasets and models (Acc on TREC-QA) used in Siddhant and Lipton (2018) and our work. We use significantly larger datasets (two orders of magnitude larger), perform 20x more experiments, and use more efficient and accurate models.

Implementation Details: We performed 2304 active learning experiments. We obtained our results with three random initial sets and three runs per seed (to account for stochasticity in FTZ) for each of the eight datasets. The query sizes were fixed percentages of the dataset - one setting for AGN, AMZF, YRF and YHA and a smaller one for SGN, DBP, YRP and AMZP - for the sequential, active queries (see Section 3.2.2 for the sizes considered). We also experimented with different query sizes while keeping the size of the final training data constant. The default query strategy uses a single model with output Entropy (Ent) unless explicitly stated otherwise. Results in the chance column are obtained using the random query strategy.

We used the Scikit-Learn Pedregosa et al. (2011) implementation for MNB and the original implementation for FTZ. We required 3 weeks of running time for all FTZ experiments on an x1.16xlarge AWS instance with Intel Xeon E7-8880 v3 processors and 1TB RAM to obtain the results presented in this work. The experiments are deterministic apart from the stochasticity involved in training the FTZ model: random initialization and SGD updates. The entire list of hyperparameters, and metrics affecting uncertainty such as calibration error Guo et al. (2017), is given in the supplementary material. The experimental logs and models are available on our GitHub.

3 Results

Dsets | Limit (ln C) | FTZ (Q) | MNB (Q) | FTZ (S) | MNB (S)
SGN 1.61
DBP 2.64
YHA 2.30
YRP 0.69
YRF 1.61
AGN 1.39
AMZP 0.69
AMZF 1.61
Table 2: Label entropy with a large query size. Q denotes label entropy averaged across the queries of a single run; S denotes the label entropy of the final collected sample, averaged across seeds. Naive Bayes (Q) has biased (inefficient) queries, while FastText (Q) shows stable, high label entropy, indicating rich class diversity despite the large query size. Overall, the resultant sample (S) becomes balanced in both cases.

In this section, we study several aspects of sampling bias (class bias, feature bias) and the impact of relevant algorithmic factors (initial set selection, query size and query strategy).

We evaluated the actively acquired queries and sample sets for sampling bias, and for stability as measured by the % intersection of collected sets across a critical influencing factor. Higher sample intersections indicate more stability with respect to the chosen influencing factor.
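The stability measure can be sketched as a small helper (a hypothetical illustration; set_a and set_b are the index sets acquired in two runs, and the chance baseline is the expected overlap of two independent random samples):

```python
def pct_intersection(set_a, set_b):
    """% of acquired indices shared by two runs (size-normalized)."""
    a, b = set(set_a), set(set_b)
    return 100.0 * len(a & b) / max(len(a), len(b))

def chance_intersection(sample_size, pool_size):
    """Expected % overlap of two independent equal-size random samples:
    E[|A ∩ B|] = k^2/n, so as a fraction of k it is k/n."""
    return 100.0 * sample_size / pool_size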

3.1 Aspects of Sampling Bias

We study two types of sampling biases: (a) Class Bias and (b) Feature Bias.

3.1.1 Class Bias

Greedy uncertainty based query strategies are said to pick disproportionately from a subset of classes per query Sener and Savarese (2018); Ebert et al. (2012), developing a lopsided representation in each query. However, the effect on the resulting sample set is not clear. We test this by measuring the Kullback-Leibler (KL) divergence between the ground-truth label distribution and the distribution obtained per query as one experiment (Q), and over the resulting sample (S) as the second. Let p denote the true distribution of labels, q the sample distribution and C the total number of classes. Since p follows a uniform distribution, we can use label entropy instead (H(q) = -Σ_c q_c log q_c). Label entropy is an intuitive measure. The maximum label entropy is reached when sampling is uniform, q_c = 1/C, i.e. H(q) = log C.
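The label entropy of an acquired query or sample can be computed directly from its empirical label distribution, as in this small sketch (an illustration, not the paper's code):

```python
import numpy as np

def label_entropy(labels, n_classes):
    """H(q) = -sum_c q_c log q_c for the empirical label distribution q."""
    counts = np.bincount(labels, minlength=n_classes)
    q = counts / counts.sum()
    nz = q > 0                       # 0 log 0 is taken as 0
    return float(-(q[nz] * np.log(q[nz])).sum())
```

A perfectly balanced query over C classes reaches the limit log C (e.g., log 4 ≈ 1.39), while a lopsided query scores lower.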

We present our results in Table 2. We observe that across queries (Q), FTZ with the entropy strategy has a balanced representation from all classes (high mean) with high probability (low std), while Multinomial Naive Bayes (MNB) results in more biased queries (lower mean) with high probability (high std), as studied previously. However, we did not find evidence of class bias in the resulting sample (S) for either model: FastText or Naive Bayes (columns 5 and 6 of Table 2).

We conclude that entropy as a query strategy can be robust to class bias even with large query sizes.

3.1.2 Feature Bias

Uncertainty sampling can lead to an undesirable sampling bias in feature space Settles (2009) by repeating redundant samples and picking outliers Zhu et al. (2008). Diversity-based query strategies Sener and Savarese (2018) are used to address this issue by selecting a representative subset of the data. In the context of active classification, it is desirable to pick the most informative samples, namely the ones closer to class boundaries. (In this work, we assume ergodicity in the setup; we do not consider incremental, online modeling scenarios where new modes or new classes are sequentially encountered.)

Figure 1: Accuracy across different numbers of queries for FastText and Naive Bayes, with the total sampled data held constant. FastText is robust to increases in query size and significantly outperforms random in all cases. Naive Bayes: (Left) all query sizes, including b=39, perform worse than random; (Center) all query sizes eventually perform better than random; (Right) the smallest query size performs better than random but larger query sizes perform worse. Uncertainty sampling with Naive Bayes suffers from sampling size bias.

Indeed, recent work suggests that learning in deep classification networks may focus on a small part of the data closer to class boundaries, resembling support vectors Xu et al. (2018); Toneva et al. (2019). To investigate whether uncertainty sampling also exhibits this behavior, we perform a direct comparison with the support vectors of an SVM. We train a FTZ model on the full training data, train an SVM on the resulting features (sentence embeddings) to obtain the support vectors, and compute the intersection of the support vectors with each selected set. The percentage intersections are shown in Table 3. The high percentage overlap is a surprising result: it shows that the sampling is indeed biased, but in a desirable way. Since the support vectors represent the class boundaries, a large percentage of the selected data consists of samples around the class boundaries. This overlap indicates that the actively acquired training sample covers the support vectors, which are important for good classification performance. The overlap with the support vectors of an SVM (a different learning algorithm) also suggests that uncertainty sampling using deep models might generalize beyond FastText to other learning algorithms.
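The overlap computation can be sketched as below. Note the assumption: the paper uses a GPU linear-kernel SVM (Wen et al., 2018); here we substitute scikit-learn's SVC with a linear kernel, which exposes support vector indices via the support_ attribute, purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def support_overlap(embeddings, labels, selected_idx):
    """% of an SVM's support vectors contained in the actively
    selected set (sketch; embeddings are frozen sentence features)."""
    svm = SVC(kernel="linear").fit(embeddings, labels)
    supports = set(svm.support_.tolist())   # indices of support vectors
    selected = set(selected_idx)
    if not supports:
        return 0.0
    return 100.0 * len(supports & selected) / len(supports)
```

A value near 100% means nearly all boundary-defining points were captured by the active selection, as reported in Table 3.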

Dsets Common% Chance%
SGN 13184
DBP 1479
YRP 31750
AGN 1032
Table 3: Proportion of support vectors intersecting with our actively selected set, calculated as the percentage of support vectors contained in the selected set. Actively selected sets share a large overlap with the supports of an SVM (critical for classification).
Dsets Chance% FTZ-D FTZ-S MNB-D MNB-S
SGN 0.8 77.8 81.0 55.5 100.0
DBP 0.9 79.7 81.3 79.7 100.0
YHA 3.7 69.0 73.6 89.5 100.0
YRP 0.9 42.9 43.7 16.0 100.0
YRF 3.6 67.7 71.6 13.6 100.0
AGN 3.7 68.7 70.1 79.8 100.0
AMZP 0.9 48.4 48.8 15.0 100.0
AMZF 3.6 56.8 63.1 57.8 100.0
Table 4: % intersection of samples obtained with different seeds (Model-D) compared to the same seeds (Model-S), and the chance intersection. We see that FastText is initialization independent (FTZ-D ≈ FTZ-S >> Chance). Naive Bayes sometimes shows significant dependency on the initial set, while at other times it performs comparably to FastText.
Dsets Chance% FTZ (diff. b) FTZ (same b) MNB (diff. b) MNB (same b)
SGN 0.83
DBP 0.9
YHA 3.7
YRP 0.9
YRF 3.6
AGN 3.7
AMZP 0.9
AMZF 3.6
Table 5: Intersection of samples obtained with different query sizes b. For FastText, the intersection of samples selected with different query sizes is comparable to the highest achievable (same size, different seeds) and far higher than the chance intersection. This indicates that similar samples are selected regardless of query size. Naive Bayes does not show clear trends: occasionally the intersection drops significantly as the number of iterations increases, occasionally it remains unaffected.

Experimental Details: We used a fast GPU implementation for training an SVM with a linear kernel Wen et al. (2018) with default hyperparameters. Please refer to supplementary material for additional details. We ensured the SVM achieves similar accuracies as original FTZ model.

3.2 Algorithmic Factors

We analyze three algorithmic factors of relevance to sampling bias: (a) Initial set selection (b) Query size, and, (c) Query strategy.

Dsets Chance FTZ Ent-Ent FTZ Ent-LC FTZ Ent-DelEnt FTZ DelEnt-DelLC FTZ DelEnt-DelEnt
Table 6: Intersection of query strategies across acquisition functions. We observe that the % intersection among samples in Ent-LC is comparable to that in Ent-Ent. Similarly, Ent-DelEnt (entropy with deletion) is comparable to both DelEnt-DelLC and DelEnt-DelEnt, showing the robustness of FastText to query functions (beyond minor variation). DelEnt-DelEnt obtains similar intersections to Ent-Ent, showing the robustness of the acquired samples to deletion.

3.2.1 Initial Set Selection

To investigate the dependence of the actively acquired train set on the initial set, we compare the overlap (intersection) of the incrementally constructed sets built from different random initial sets versus the same initial set. The results are shown in Table 4. We first observe that chance overlaps (column 2) are very low - less than 4%. Columns 3 and 5 present overlaps from different initial sets, while columns 4 and 6 are from the same initial sets. We note from columns 4 and 6 that, due to the stochasticity of training in FTZ, we expect non-identical final sets even with the same initial samples. The results demonstrate that samples obtained using FastText are largely initialization independent (low variation between columns 3 and 4) consistently across datasets, while the samples obtained with Naive Bayes can be vastly different, showing a relatively heavy dependence on the initial seed. This indicates the relative stability of the train set obtained with the posterior uncertainty of the actively trained FTZ as an acquisition function.

3.2.2 Query size

Since the sampled data is sequentially constructed by training models on previously sampled data, large query sizes were expected to impact the samples collected by uncertainty sampling and the performance thereof Hoi et al. (2006). We experiment with various query sizes - (0.25%, 0.5%, 1%) for DBP, SGN, YRP and AMZP and (0.5%, 1%, 2%) for the rest - corresponding to 39, 19 and 9 iterations respectively. Figure 1 shows that FastText (top row) has very stable performance across query sizes, while MNB (bottom row) shows more erratic performance. Table 5 presents the intersection of samples obtained with different query sizes across multiple runs. We observe a high overlap of the acquired samples across different query sizes for FastText, indicating that the selection is independent of the query size (compare column 3 to column 4, where the size is held constant), while MNB results in lower overlap and more erratic behavior as the query size changes (compare column 5 to column 6).

3.2.3 Query strategy

We now investigate the impact of various query strategies using FastText by evaluating and comparing the correlation between the respective actively selected sample sets.

Acquisition Functions: We compare four uncertainty query strategies: Least Confidence (LC) and Entropy (Ent), with and without deletion of the least uncertain samples from the training set. Deletion of the least uncertain samples reduces the dependence on the initial randomly selected set. The results are presented in Table 6. We present five of the ten possible combinations and again observe a high degree of overlap in the collected samples. It can be concluded that the approach is fairly robust to these variations in the query strategy.

Ensembles versus Single Models: A similar experiment was conducted to investigate the overlap between a single FTZ model and a probabilistic committee of models (a 5-model ensemble with FTZ Lakshminarayanan et al. (2017)) to identify the comparative advantages of ensemble methods. The results are presented in Table 7, showing little to no difference in sample overlaps. (The ensembles were too costly to run on the larger datasets, so results for YHA, AMZP and AMZF could not be obtained.) We conclude that more expensive sampling strategies commonly used, like ensembling, may offer little benefit compared to using a single FTZ model with posterior uncertainty as a query function.

The experiments in this section demonstrate that uncertainty based sampling using deep models like FTZ shows no class bias and no undesirable feature bias (rather, a favorable bias toward class boundaries). There is also a high degree of robustness to algorithmic factors, especially query size, a surprisingly high degree of overlap in the resulting training samples, and stable performance (classification accuracy). Additionally, all uncertainty query strategies perform well, and expensive sampling strategies like ensembling offer little benefit. We conclude that the sampling biases demonstrated in the active learning literature do hold for traditional models; however, they do not seem to translate to deep models like FTZ using (posterior) uncertainty.

Dsets Chance FTZ-Ent
Table 7: Intersection of query strategies across a single FTZ model and an ensemble of 5 FTZ models. We observe that the % intersection of samples selected by ensembles and single models is comparable to the intersection among either. The 5-model committee does not seem to add any additional value over selection by a single model.
Dsets Chance FTZ Ent-Ent FTZ Ent-LC SV Chance% SV Common%
Table 8: Results of sample selection from previous investigations, on a small dataset (TREC-QA).

4 Application: Active Text Classification

Experimental results from the previous sections suggest that the entropy function with a single FTZ model is a good baseline for active text classification. We compare our baseline with the latest work in deep active learning for text classification - BALD Siddhant and Lipton (2018) - and with the recent diversity based Coreset query function Sener and Savarese (2018), which uses a costly K-center algorithm to build the query. Experiments are performed on TREC-QA (used by Siddhant and Lipton (2018)) for a fair comparison. Table 8 shows that the results of our study generalize to small datasets like TREC-QA.

Figure 2: Active text classification: comparison with the K-Center Coreset, BALD and SVM algorithms. Accuracy is plotted against the percentage of data sampled. We reach full-train accuracy using 12% of the data, compared to BALD, which requires 50% of the data and performs significantly worse in terms of accuracy. We also outperform K-center greedy Coreset at all sampling percentages without utilizing additional diversity-based augmentation.
VDCNN Conneau et al. (2017) 91.3 98.7 96.8 64.7 95.7 73.4 95.7 63.0
DPCNN Johnson and Zhang (2017) 93.1 99.1 98.1 69.4 97.3 76.1 96.7 65.2
WC-Reg Qiao et al. (2018) 92.8 98.9 97.6 64.9 96.4 73.7 95.1 60.9
DC+MFA Wang et al. (2018) 93.6 99.2 - 66.0 96.5 - - 63.0
DRNN Wang (2018) 94.5 99.2 - 69.1 97.3 70.3 96.5 64.4
ULMFiT Howard and Ruder (2018) 95.0 99.2 - 70.0 97.8 - - -
EXAM Du et al. (2019) 93.0 99.0 - - - 74.8 95.5 61.9
Ours: ULMFiT (Small data) 93.7 (20) 99.2 (10) 97.0 (10) 67.6 (20) 97.1 (10) 74.3 (20) 96.1 (10) 64.1 (20)
Ours: ULMFiT (Tiny data) 91.7 (8) 98.6 (2.3) 97.4 (6.3) 66.3 (8) 96.7 (4) 73.3 (8) 95.8 (4) 62.9 (8)
Table 9: Comparison of accuracies with state-of-the-art approaches (earliest-latest) for text classification (%dataset in brackets). We are competitive with state-of-the-art models while using 5x-40x compressed datasets.
Full 95.0 99.2 97.8 70.0
Ours-Small 93.7 (20) 99.2 (10) 97.1 (10) 67.6 (20)
Ours-Tiny 91.7 (8) 98.6 (2.3) 96.7 (4) 66.3 (8)
Table 10: ULMFiT: resulting sample compared to the accuracies reported in Howard and Ruder (2018) (%dataset in brackets). We observe that using our cheaply obtained compressed datasets, we can achieve similar accuracies with 25x-200x speedup (5x fewer epochs, 5x-40x less data). Transferability to other models is evidence of the generalizability of the subset collected using FTZ to other deep models.

The results are shown in Figure 2, using the baseline with a query size of 2% of the full dataset (b=9 queries). Note that uncertainty sampling converges to full accuracy using just 12% of the data, whereas Siddhant and Lipton (2018) required 50% of the data. There is also a remarkable accuracy improvement over Siddhant and Lipton (2018), which can be largely attributed to the models used (FastText versus 1-layer CNN/BiLSTM). Also, uncertainty sampling outperforms diversity-based augmentations like Coreset Sampling Sener and Savarese (2018) before convergence. Thus, we establish a new state-of-the-art baseline for further research in deep active text classification.

5 Application: Training of Large Models

The cost and time needed to obtain and label vast amounts of data to train large DNNs is a serious impediment to creating new and/or better models. Our study suggests that the training samples collected with uncertainty sampling (entropy) on a single FTZ model may provide a good representation (surrogate) for the entire dataset. Buoyed by this, we investigate whether we can speed up the training of ULMFiT Howard and Ruder (2018) using the surrogate dataset. We show these results in Table 10. We achieve 25x-200x speedup (5x fewer epochs, 5x-40x smaller training size); the cost of acquiring the training data using FTZ-Ent is negligible in comparison. We also benchmark the performance against the state-of-the-art on text classification, as shown in Table 9. We conclude that we can significantly compress training datasets and speed up classifier training time with little tradeoff in accuracy.

Implementation Details: We used the official GitHub repository for ULMFiT, used default hyperparameters, and trained on one NVIDIA Tesla V100 16GB GPU. Further details are provided in the supplementary material.

6 Related Work

We now expand on the brief literature review in Section 1 to better contextualize our work. We divide the past works into (i) Traditional Models and (ii) Deep Models.

Sampling Bias in Classical AL in NLP:

Active learning (AL) in text classification started with a greedy uncertainty query strategy from a pool using decision trees Lewis and Gale (1994), which was shown to be effective and led to widespread adoption with classifiers like SVMs Tong and Koller (2001), Naive Bayes Roy and McCallum (2001) and KNN Fujii et al. (1998). This strategy was also applied to other NLP tasks like parse selection Baldridge and Osborne (2004), sequence labeling Settles and Craven (2008) and information extraction Thompson and Mooney (1999). These early papers popularized two greedy uncertainty query methods: Least Confident and Entropy.

Issues of lack of diversity (large redundancy in sampling) Zhang and Oles (2000) and lack of robustness (high variance in sample quality) Krogh and Vedelsby (1994) guided subsequent efforts. The two most popular directions were: (i) augmenting uncertainty with diversity measures Hoi et al. (2006); Brinker (2003); Tang et al. (2002) and (ii) using query-by-committee McCallum and Nigam (1998); Liere and Tadepalli (1997). For a comprehensive survey of classical AL methods for NLP, please refer to Settles (2009).

Sampling Bias in Deep AL: Deep active learning approaches adapt the above framework to the training of DNNs on large data. Two main query strategies are used: (i) ensemble based greedy uncertainty, which represents a probabilistic query-by-committee paradigm Gal et al. (2017); Beluch et al. (2018), and (ii) diversity based measures Sener and Savarese (2018); Ducoffe and Precioso (2018). Papers proposing diversity based approaches find that greedy uncertainty based sampling (using ensemble and single models) performs significantly worse than random (see Figures 4 and 2, respectively, in Sener and Savarese (2018) and Ducoffe and Precioso (2018)). They attribute the poor performance to the redundant, highly correlated samples selected by uncertainty based methods, and use this to justify prohibitively expensive diversity-based approaches (refer to Section 2 of Sener and Savarese (2018) for details on the expensiveness of various diversity sampling methods). However, K-center greedy coreset sampling scales poorly: we were only able to use it on TREC-QA (a small dataset). On the other hand, works on ensemble-based greedy uncertainty find that probabilistic averaging from a committee Gal et al. (2017); Beluch et al. (2018) performs better than a single model, as well as better than diversity based methods like coreset Gissin and Shalev-Shwartz (2019); Beluch et al. (2018). Current approaches in the text classification literature mostly adopt the ensemble based greedy uncertainty framework Siddhant and Lipton (2018); Lowell et al. (2018); Zhang et al. (2017).

However, our work demonstrates that the problems of sampling bias and efficiency may not translate from shallow to deep approaches. Recent evidence from the image domain Gissin and Shalev-Shwartz (2019) demonstrates that at least a subset of our findings (on class bias and query functions) generalizes to other DNNs. Uncertainty sampling with a deep model like FTZ demonstrates surprisingly good sampling properties without using ensembles or Bayesian methods, and ensembles do not seem to significantly affect the sampling. Whether this behavior generalizes to other deep models and tasks remains to be seen.

Other Related Works: An interesting set of papers Soudry et al. (2018); Xu et al. (2018) shows that deep neural networks trained with SGD converge to the maximum-margin solution in the linearly separable case. Several works investigate the possibility that deep networks give high importance to a subset of the training dataset Toneva et al. (2019); Vodrahalli et al. (2018); Birodkar et al. (2019), resembling the supports in support vector machines. In our experiments, we find that active learning with uncertainty sampling using deep models like FTZ has a (surprisingly) large overlap with the support vectors of an SVM. Thus, it seems to have an inductive bias toward class boundaries, similar to the above works. Whether this property generalizes to other deep models remains to be seen.
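Such an overlap measurement can be sketched as follows; the synthetic data, the scikit-learn SVC, and the margin-based stand-in for the actively acquired set are all illustrative assumptions, not the paper's FTZ pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy stand-in for the measurement: how much do actively acquired points
# overlap with the support vectors of an SVM trained on the full set?
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

svm = SVC(kernel="linear").fit(X, y)
supports = set(svm.support_)                 # indices of the support vectors

# 'acquired' would hold indices chosen by an active learner; here we mock it
# with the margin-nearest points, a proxy for uncertainty sampling.
margins = np.abs(svm.decision_function(X))
acquired = set(np.argsort(margins)[:len(supports)])

overlap = len(acquired & supports) / len(supports)
print(f"overlap with supports: {overlap:.2f}")
```

Because points strictly inside the margin are always support vectors, a margin-based acquirer recovers a large fraction of the supports on this toy problem.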

7 Conclusion

We conducted a large empirical study of sampling bias and efficiency, along with the algorithmic factors impacting active text classification. We conclude that uncertainty sampling with deep models like FTZ exhibits negligible class bias, appears favorably biased toward sampling data points near class boundaries, and is robust to various algorithmic factors, while expensive sampling strategies like ensembling offer little additional benefit. We also find a surprisingly large overlap between the actively acquired points and the supports of an SVM. We additionally show that uncertainty sampling can be used to effectively bootstrap the training of large DNN models by generating compact surrogate datasets (5x-40x compression). Finally, FTZ-Ent provides a strong baseline for deep active text classification, outperforming previous results while using 4x less data.

The current work opens up several directions for future investigations. To list a few: (a) a deeper look into the nature of sampled data - their distribution in the feature space, as well as their importance for the task at hand; (b) the creation of surrogate datasets for a variety of applications, including hyperparameter and architecture search, etc; (c) an extension to other deep models (beyond FTZ) and beyond classification models; and, (d) an extension to semi-supervised, online and continual learning.

Acknowledgements: We thank Prof. Vineeth Balasubramanian, IIT Hyderabad, India, for the many helpful suggestions and discussions.


  • Baldridge and Osborne (2004) Jason Baldridge and Miles Osborne. 2004. Active learning and the total cost of annotation. In EMNLP.
  • Beluch et al. (2018) William H. Beluch, Tim Genewein, Andreas Nurnberger, and Jan M. Kohler. 2018. The power of ensembles for active learning in image classification. In CVPR.
  • Birodkar et al. (2019) Vighnesh Birodkar, Hossein Mobahi, and Samy Bengio. 2019. Semantic redundancies in image-classification datasets: The 10% you don’t need. arXiv preprint arXiv:1901.11409.
  • Brinker (2003) Klaus Brinker. 2003. Incorporating diversity in active learning with support vector machines. In ICML.
  • Conneau et al. (2017) Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In EACL.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Du et al. (2019) Cunxiao Du, Zhaozheng Chin, Fuli Feng, Lei Zhu, Tian Gan, and Liqiang Nie. 2019. Explicit interaction model towards text classification. In AAAI.
  • Ducoffe and Precioso (2018) Melanie Ducoffe and Frederic Precioso. 2018. Adversarial active learning for deep networks: a margin based approach. In ICML.
  • Ebert et al. (2012) Sandra Ebert, Mario Fritz, and Bernt Schiele. 2012. Ralf: A reinforced active learning formulation for object class recognition. In CVPR.
  • Fujii et al. (1998) Atsushi Fujii, Takenobu Tokunaga, Kentaro Inui, and Hozumi Tanaka. 1998. Selective sampling for example-based word sense disambiguation. In Computational Linguistics.
  • Gal et al. (2017) Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep bayesian active learning with image data. In ICML.
  • Gissin and Shalev-Shwartz (2019) Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning. arXiv preprint arXiv:1907.06347v1.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In ICML.
  • Hoi et al. (2006) Steven C. H. Hoi, Rong Jin, and Michael R. Lyu. 2006. Large-scale text categorization by batch mode active learning. In WWW.
  • Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL.
  • Johnson and Zhang (2017) Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In ACL.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In EACL.
  • Krogh and Vedelsby (1994) Anders Krogh and Jesper Vedelsby. 1994. Neural network ensembles, cross validation and active learning. In NeurIPS.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS.
  • Lewis and Gale (1994) David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR.
  • Liere and Tadepalli (1997) Ray Liere and Prasad Tadepalli. 1997. Active learning with committees for text categorization. In AAAI.
  • Lowell et al. (2018) David Lowell, Zachary C Lipton, and Byron C Wallace. 2018. How transferable are the datasets collected by active learners? arXiv preprint arXiv:1807.04801.
  • McCallum and Nigam (1998) Andrew McCallum and Kamal Nigam. 1998. Employing em and pool-based active learning for text classification. In ICML.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. JMLR.
  • Qiao et al. (2018) Chao Qiao, Bo Huang, Guocheng Niu, Daren Li, Daxiang Dong, Wei He, Dianhai Yu, and Hua Wu. 2018. A new method of region embedding for text classification. In ICLR.
  • Roy and McCallum (2001) Nicholas Roy and Andrew McCallum. 2001. Toward optimal active learning through sampling estimation of error reduction. In ICML.
  • Sener and Savarese (2018) Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In ICLR.
  • Settles (2009) Burr Settles. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison.
  • Settles and Craven (2008) Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In EMNLP.
  • Siddhant and Lipton (2018) Aditya Siddhant and Zachary C Lipton. 2018. Deep bayesian active learning for natural language processing: Results of a large-scale empirical study. In EMNLP.
  • Soudry et al. (2018) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. 2018. The implicit bias of gradient descent on separable data. JMLR.
  • Tang et al. (2002) Min Tang, Xiaoqiang Luo, and Salim Roukos. 2002. Active learning for statistical natural language parsing. In ACL.
  • Thompson and Mooney (1999) Cynthia A Thompson and Raymond J Mooney. 1999. Active learning for natural language parsing and information extraction. In ICML.
  • Toneva et al. (2019) Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. 2019. An empirical study of example forgetting during deep neural network learning. In ICLR.
  • Tong and Koller (2001) Simon Tong and Daphne Koller. 2001. Support vector machine active learning with applications to text classification. JMLR.
  • Vodrahalli et al. (2018) Kailas Vodrahalli, Ke Li, and Jitendra Malik. 2018. Are all training examples created equal? an empirical study. arXiv preprint arXiv:1811.12569.
  • Wang (2018) Baoxin Wang. 2018. Disconnected recurrent neural networks for text categorization. In ACL.
  • Wang et al. (2018) Shiyao Wang, Minlie Huang, and Zhidong Deng. 2018. Densely connected CNN with multi-scale feature attention for text classification. In IJCAI.
  • Wang and Manning (2012) Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL.
  • Wen et al. (2018) Zeyi Wen, Jiashuai Shi, Qinbin Li, Bingsheng He, and Jian Chen. 2018. ThunderSVM: A fast SVM library on GPUs and CPUs. In JMLR.
  • Xu et al. (2018) Tengyu Xu, Yi Zhou, Kaiyi Ji, and Yingbin Liang. 2018. Convergence of SGD in learning ReLU models with separable data. arXiv preprint arXiv:1806.04339.
  • Zhang and Oles (2000) Tong Zhang and F Oles. 2000. The value of unlabeled data for classification problems. In ICML.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NeurIPS.
  • Zhang et al. (2017) Ye Zhang, Matthew Lease, and Byron C Wallace. 2017. Active discriminative text representation learning. In AAAI.
  • Zhu et al. (2008) Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K Tsou. 2008. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In ACL.

Appendix A Dataset Details

Details of the train, test sizes and number of classes for each dataset can be found in Table 11.

Dataset AGN SGN DBP YHA YRP YRF AMZP AMZF
#Class 4 5 14 10 2 5 2 5
#Train 120k 450k 560k 1.4M 560k 650k 3.6M 3.0M
#Test 7.6k 60k 70k 60k 38k 50k 400k 650k
Table 11: Details of the dataset sizes (train and test) along with the number of classes.

Appendix B Experiment Hyperparameters

In this section, we detail the complete list of hyperparameters for reproducibility. We will release our code on GitHub.

b.1 Models

We describe the hyperparameters used for the four models: (i) FastText, (ii) SVM, (iii) ULMFiT, and (iv) Multinomial Naive Bayes.

b.1.1 FastText

We use the original FastText implementation. The hyperparameters used for each dataset can be found in Table 12. We chose the zipped version of FastText for optimized memory usage, without loss of accuracy or speed.

Dsets Emb Dim NGrams Epochs LR Acc Full
SGN 25 2 10 0.25 96.9
TQA 25 2 20 0.75 97.2
DBP 25 2 10 1 98.6
YHA 25 2 10 0.02 72.1
YRP 25 2 10 0.05 95.6
YRF 25 2 10 0.05 63.6
AGN 25 2 10 0.25 92.1
AMZP 25 2 10 0.01 94.2
AMZF 25 2 10 0.01 59.6
Table 12: Hyperparameters used for FastText: embedding dimension, number of n-grams, number of epochs, learning rate, and accuracy obtained using the full train set.

Dsets NLL BrierL ECE VarR ENT STD
SGN 0.14 0.01 0.01 0.02 0.07 0.39
DBP 0.07 0.0 0.01 0.0 0.02 0.26
YHA 1.37 0.05 0.16 0.12 0.5 0.27
YRP 0.16 0.04 0.02 0.03 0.11 0.47
YRF 1.15 0.11 0.17 0.21 0.73 0.31
AGN 0.46 0.03 0.04 0.02 0.08 0.42
AMZP 0.26 0.05 0.04 0.02 0.08 0.48
AMZF 1.32 0.12 0.21 0.22 0.77 0.31
Table 13: Metrics measured after training the FastText (FTZ-Ent) model on the resulting sample, with 39 queries, using the entropy query strategy. We observe that the NLL and multiclass Brier score remain low. The model is also well calibrated, i.e., it gives calibrated uncertainty estimates.
Dsets Chance FTZ Ent-Ent FTZ Ent-LC MNB Ent-Ent MNB Ent-LC FTZ Ent-Ent FTZ Ent-LC MNB Ent-Ent MNB Ent-LC
Table 14: Intersection across query strategies using 19 and 9 iterations (mean ± std across runs) and different seeds.
Datasets Limit FTZ (39) MNB (39) FTZ (19) MNB (19) FTZ (4) MNB (4)
SGN 1.6
DBP 2.6
YA 2.3
YRP 0.7
YRF 1.6
AGN 1.4
AMZP 0.7
AMZF 1.6
Table 15: Class Bias Experiments: Average label entropy (mean ± std) across query iterations, for 39, 19 and 4 query iterations each.

b.1.2 ULMFiT

For ULMFiT, we used the default hyperparameters from the authors' implementation, except the batch size, which we set to 32. We recall that ULMFiT has two steps: fine-tuning the language model and fine-tuning the classifier. We initialized the language model with the pre-trained weights released by the authors, which result from pre-training on Wikitext-103, consisting of 28,595 pre-processed Wikipedia articles and 103 million words. For each compressed dataset (small and very small), we fine-tuned the language model and the classifier for 10 epochs. For fine-tuning both the language model and the classifier, we used an NVIDIA Tesla V100 16GB.

The hyperparameters for the language model are: batch size of , learning rate of 4e-3, bptt of , embedding size of , hidden units per hidden layer and hidden layers. Adam Optimizer with and . The dropout rates are: 0.15 between LSTM layers, 0.25 for the input layer, 0.02 for the embedding layer, 0.2 for the internal LSTM recurrent weights.

The hyperparameters for the classifier are: batch size of , learning rate of , embedding size of , hidden units per hidden layer and hidden layers. Adam Optimizer with and . The dropout rates are: 0.3 between LSTM layers, 0.4 for the input layer, 0.05 for the embedding layer, 0.5 for the internal LSTM recurrent weights.

b.1.3 Multinomial Naive Bayes (MNB)

We use the scikit-learn implementation of Multinomial Naive Bayes with the default hyperparameters: smoothing parameter , fit prior set to True, and class prior set to None. As input to the MNB, we use the scikit-learn implementation of the TF-IDF vectorizer. All default hyperparameters remain unchanged, except that we use a maximum feature threshold of , remove all stop words contained in the default 'english' list, and set sublinear tf to True.
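Assuming the setup above, the MNB baseline can be sketched as a scikit-learn pipeline (the feature threshold and smoothing value, elided above, are left at illustrative defaults here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sketch of the MNB baseline: TF-IDF features feeding Multinomial Naive Bayes.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english", sublinear_tf=True),
    MultinomialNB(),  # defaults: alpha=1.0, fit_prior=True, class_prior=None
)

# Tiny illustrative corpus (not from the paper's datasets)
texts = ["great movie, loved it", "terrible film, awful acting",
         "wonderful and fun", "boring and bad"]
labels = [1, 0, 1, 0]
clf.fit(texts, labels)

# predict_proba yields the posteriors used by the entropy query strategy
probs = clf.predict_proba(["loved this wonderful film"])
```

The class posteriors returned by `predict_proba` are what the entropy-based query strategy consumes.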

b.1.4 SVM

To compute the support vectors of the datasets, we used ThunderSVM, a fast SVM library running on a V100 GPU. We used the SVC with a linear kernel, degree = 3, gamma = auto, coef0 = 0.0, C = 1.0, tol = 0.001, probability = False, class_weight = None, shrinking = False, cache_size = None, verbose = False, max_iter = -1, gpu_id = 0, maximum memory size = -1, random_state = None and decision function = 'ovo'.

Appendix C Experiments

c.1 Class Bias

We provide in Table 15 the complete results of our class bias experiments with 39, 19 and 4 iterations using the entropy query strategy.
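The label entropy reported in these experiments can be sketched as follows (a minimal illustration; note that its maximum, the natural log of the number of classes, corresponds to the Limit column of Table 15, e.g. ln 14 ≈ 2.6 for DBP):

```python
import numpy as np

def label_entropy(labels, n_classes):
    """Entropy of the class distribution of an acquired batch. Values near
    the maximum ln(n_classes) indicate negligible class bias."""
    counts = np.bincount(labels, minlength=n_classes)
    p = counts / counts.sum()
    nz = p > 0                    # treat 0 * log(0) as 0
    return float(-(p[nz] * np.log(p[nz])).sum())

balanced = np.array([0, 1, 2, 3] * 25)     # perfectly balanced batch
skewed = np.array([0] * 97 + [1, 2, 3])    # heavily class-biased batch

e_bal = label_entropy(balanced, 4)   # reaches the ln(4) ≈ 1.386 limit
e_skew = label_entropy(skewed, 4)    # far below the limit
```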

c.1.1 Results Across Iterations

We provide in Table 14 the results of our intersection experiments for 19 and 9 iterations using the entropy query strategy, for FastText (FTZ) and Multinomial Naive Bayes (MNB).
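A minimal sketch of the intersection measurement, with the chance-level baseline corresponding to Table 14's Chance column, on mock index sets:

```python
import numpy as np

def intersection_fraction(idx_a, idx_b, pool_size):
    """Fraction of acquired points shared by two runs/strategies, together
    with the expected overlap of two random subsets of the same size."""
    a, b = set(idx_a), set(idx_b)
    observed = len(a & b) / len(a)
    chance = len(b) / pool_size   # expected overlap for a random subset
    return observed, chance

# Mock: two 500-point acquisitions from a 10,000-point pool
rng = np.random.default_rng(0)
run1 = rng.choice(10000, size=500, replace=False)
run2 = rng.choice(10000, size=500, replace=False)
obs, chance = intersection_fraction(run1, run2, pool_size=10000)
# For two random subsets, the observed overlap sits near the 5% chance level;
# the paper's acquired sets overlap far above chance.
```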

c.2 Metrics Affecting Uncertainty

We provide in Table 13 several metrics measured on the resulting samples of each dataset after 39 queries using the entropy query strategy. NLL denotes the negative log-likelihood, BrierL the Brier score loss, ECE the expected calibration error, VarR the variation ratio, ENT the entropy, and STD the standard deviation. We measure these properties of the predicted samples and average them over the dataset. We observe that the FastText model is well calibrated except on YRF and AMZF. Similar trends are observed in the average uncertainty measures.
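The ECE reported in Table 13 can be sketched with a standard equal-width binning estimator (the exact binning used here is an assumption):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence and average |accuracy - confidence|
    over the bins, weighted by bin size."""
    conf = probs.max(axis=1)                     # top-class confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Three binary predictions, all correct but under-confident: ECE = 1 - 0.8
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 1, 0])
ece = expected_calibration_error(probs, labels)
```

A well-calibrated model (e.g. FTZ on most datasets in Table 13) has confidence close to accuracy in every bin, driving ECE toward zero.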

c.3 Accuracy Plots for Remaining Datasets

We show in Figure 3 the accuracy curves for FastText and Naive Bayes for 4, 9, 19 and 39 iterations using the entropy query strategy vs. random.

Figure 3: Accuracy across different numbers of queries for FastText and Naive Bayes, with the total number of acquired samples held constant. FastText is robust to increases in query size and significantly outperforms random in all cases.