Active clustering for labeling training data

10/27/2021
by Quentin Lutz, et al.

Gathering training data is a key step of any supervised learning task, and it is both critical and expensive. Critical, because the quantity and quality of the training data have a high impact on the performance of the learned function. Expensive, because most practical cases rely on humans in the loop to label the data. Determining the correct label of an item is much more expensive than comparing two items to see whether they belong to the same class. Thus motivated, we propose a setting for training data gathering where the human experts perform the comparatively cheap task of answering pairwise queries, and the computer groups the items into classes (which can be labeled cheaply at the very end of the process). Given the items, we consider two random models for the classes: one where the set partition they form is drawn uniformly, and another where each item chooses its class independently following a fixed distribution. In the first model, we characterize the algorithms that minimize the average number of queries required to cluster the items and analyze their complexity. In the second model, we analyze a specific family of algorithms, conjecture that its members reach the minimum average number of queries, and compare their performance to a random approach. We also propose solutions to handle errors or inconsistencies in the experts' answers.
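To make the query setting concrete, here is a minimal Python sketch of clustering with pairwise same-class queries. It is an illustration under stated assumptions, not the algorithms characterized in the paper: the function name `cluster_by_pairwise_queries`, the `same_class` oracle, and the `hidden_labels` toy data are all hypothetical, and the strategy shown (compare each new item against one representative per known class) is a simple baseline that assumes the expert always answers consistently.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def cluster_by_pairwise_queries(
    items: Sequence[Hashable],
    same_class: Callable[[Hashable, Hashable], bool],
) -> Tuple[List[List[Hashable]], int]:
    """Group items into classes using only pairwise same-class queries.

    Baseline strategy (an assumption of this sketch, not the paper's method):
    each new item is compared with one representative of every class found
    so far, stopping at the first match. Returns the clusters and the number
    of queries asked.
    """
    clusters: List[List[Hashable]] = []
    queries = 0
    for item in items:
        for cluster in clusters:
            queries += 1  # one question put to the human expert
            if same_class(item, cluster[0]):
                cluster.append(item)
                break
        else:
            # No existing class matched: the item starts a new class.
            clusters.append([item])
    return clusters, queries


# Toy ground truth standing in for the human expert (hypothetical data).
hidden_labels: Dict[str, int] = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}


def oracle(x: str, y: str) -> bool:
    """Answer a pairwise query from the hidden labels."""
    return hidden_labels[x] == hidden_labels[y]


clusters, n_queries = cluster_by_pairwise_queries(list("abcde"), oracle)
print(clusters, n_queries)
```

Once the clusters are formed, an expert only needs to name each class once, which is the cheap final labeling step the abstract describes. The number of queries this baseline uses depends on the order in which items and classes are examined, which is the kind of choice the paper's analysis addresses.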

research
10/05/2021

How to Query An Oracle? Efficient Strategies to Label Data

We consider the basic problem of querying an expert oracle for labeling ...
research
08/09/2018

Training De-Confusion: An Interactive, Network-Supported Visual Analysis System for Resolving Errors in Image Classification Training Data

Convolutional neural networks gain more and more popularity in image cla...
research
03/31/2019

Semisupervised Clustering by Queries and Locally Encodable Source Coding

Source coding is the canonical problem of data compression in informatio...
research
12/11/2019

Identifying Mislabeled Instances in Classification Datasets

A key requirement for supervised machine learning is labeled training da...
research
01/25/2022

DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning

Keeping track of and managing Self-Admitted Technical Debts (SATDs) is i...
research
11/01/2017

Active Tolerant Testing

In this work, we give the first algorithms for tolerant testing of nontr...
research
03/24/2020

A Pitfall of Learning from User-generated Data: In-depth Analysis of Subjective Class Problem

Research in the supervised learning algorithms field implicitly assumes ...