On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

by Stephen Mussmann, et al.

Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
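To make the idea of retrieving uncertain pairs from an embedding model concrete, here is a minimal sketch, not the authors' implementation. It assumes each utterance already has a fixed-size embedding (random vectors stand in for BERT outputs), that the pairwise score is a logistic function of cosine similarity, and that "uncertain" means a predicted probability close to 0.5. The function name `select_uncertain_pairs` and the parameters `weight` and `bias` are hypothetical, and the search is brute force over all pairs rather than the efficient retrieval the abstract refers to.

```python
import numpy as np


def select_uncertain_pairs(emb_a, emb_b, weight, bias, k):
    """Return the k (i, j) pairs whose predicted probability is closest to 0.5.

    emb_a, emb_b: 2-D arrays of utterance embeddings (stand-ins for BERT outputs).
    weight, bias: hypothetical parameters of a logistic model on cosine similarity.
    """
    # Normalize rows so that dot products are cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T

    # Assumed scoring model: P(positive) = sigmoid(weight * cosine + bias).
    probs = 1.0 / (1.0 + np.exp(-(weight * sims + bias)))

    # Uncertainty sampling: smallest distance from the 0.5 decision boundary.
    gaps = np.abs(probs - 0.5)
    flat = np.argsort(gaps, axis=None)[:k]
    rows, cols = np.unravel_index(flat, probs.shape)
    return list(zip(rows.tolist(), cols.tolist())), probs


# Toy pool: 200 x 300 candidate pairs with 16-dimensional random embeddings.
rng = np.random.default_rng(0)
pool_a = rng.normal(size=(200, 16))
pool_b = rng.normal(size=(300, 16))
to_label, probs = select_uncertain_pairs(pool_a, pool_b, weight=8.0, bias=-4.0, k=5)
for i, j in to_label:
    print(f"pair ({i}, {j}) -> p = {probs[i, j]:.3f}")
```

In this sketch the negative bias pushes most pairs toward a confident "negative" prediction, so the selected pairs are the comparatively similar ones near the boundary. That is one way to picture how uncertainty-based selection can surface more informative negatives than uniform sampling from an extremely imbalanced pool.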



Deep Active Learning via Open Set Recognition

In many applications, data is easy to acquire but expensive and time con...

Minority Class Oriented Active Learning for Imbalanced Datasets

Active learning aims to optimize the dataset annotation process when res...

BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets

Current semi-supervised learning (SSL) methods assume a balance between ...

Cold Start Active Learning Strategies in the Context of Imbalanced Classification

We present novel active learning strategies dedicated to providing a sol...

Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data

The process of collecting and annotating training data may introduce dis...

Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

We propose a new application of embedding techniques for problem retriev...

BALanCe: Deep Bayesian Active Learning via Equivalence Class Annealing

Active learning has demonstrated data efficiency in many fields. Existin...