On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

10/10/2020
by Stephen Mussmann, et al.
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
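
The two ideas the abstract relies on, average precision as the evaluation metric and uncertainty-based selection of points to label, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `average_precision` and `select_uncertain` are hypothetical, the selection step scans a plain list of predicted probabilities rather than using a BERT-based embedding model for retrieval, and "closest to 0.5" is only one common way to define uncertainty.

```python
def average_precision(labels, scores):
    """Average of precision@k over every rank k at which a positive appears.

    labels: list of 0/1 ground-truth labels (1 = positive pair).
    scores: list of model scores, higher meaning "more likely positive".
    """
    # Rank examples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Convention assumed here: return 0.0 when there are no positives.
    return sum(precisions) / len(precisions) if precisions else 0.0


def select_uncertain(probs, k):
    """Indices of the k pool items whose predicted probability is closest to 0.5.

    Illustrative brute-force stand-in for uncertainty sampling; a very large
    pool would need an efficient retrieval step instead of a full sort.
    """
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:k]


if __name__ == "__main__":
    # One negative ranked above a positive is enough to lower the metric.
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
    # Pick the two most uncertain items from a toy pool of predictions.
    print(select_uncertain([0.01, 0.48, 0.95, 0.6, 0.02], 2))
```

Under extreme imbalance, even a small fraction of negatives scored above the positives pushes each positive far down the ranking, which is why average precision can be very low for a model that looks accurate on balanced data.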

07/04/2020

Deep Active Learning via Open Set Recognition

In many applications, data is easy to acquire but expensive and time con...
02/01/2022

Minority Class Oriented Active Learning for Imbalanced Datasets

Active learning aims to optimize the dataset annotation process when res...
03/10/2022

BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets

Current semi-supervised learning (SSL) methods assume a balance between ...
01/25/2022

Cold Start Active Learning Strategies in the Context of Imbalanced Classification

We present novel active learning strategies dedicated to providing a sol...
10/07/2020

Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data

The process of collecting and annotating training data may introduce dis...
03/21/2020

Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

We propose a new application of embedding techniques for problem retriev...
12/27/2021

BALanCe: Deep Bayesian Active Learning via Equivalence Class Annealing

Active learning has demonstrated data efficiency in many fields. Existin...