Stratified Sampling for Extreme Multi-Label Data

03/05/2021
by Maximillian Merrillees, et al.

Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided train-test splits that (1) are not always representative of the entire dataset and (2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multi-class settings. This paper therefore presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data and demonstrate the importance of using stratified partitions for training and evaluation.
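
To make the two problems named in the abstract concrete, the following is a minimal sketch of how one might audit an existing split for missing labels and for mismatched label frequencies. It is not the paper's stratification algorithm; the function names `label_counts` and `split_report` and the toy data are hypothetical, and the sketch assumes each example is a list of label identifiers and that both splits are non-empty.

```python
from collections import Counter


def label_counts(rows):
    """Count how many examples carry each label (hypothetical helper)."""
    counts = Counter()
    for labels in rows:
        counts.update(set(labels))  # count a label at most once per example
    return counts


def split_report(train, test):
    """Report labels absent from each split and the largest frequency gap.

    Assumes both splits are non-empty lists of label lists.
    """
    train_counts = label_counts(train)
    test_counts = label_counts(test)
    all_labels = set(train_counts) | set(test_counts)

    missing_from_train = sorted(all_labels - set(train_counts))
    missing_from_test = sorted(all_labels - set(test_counts))

    # Largest absolute difference between a label's share of training
    # examples and its share of test examples.
    max_gap = max(
        abs(train_counts[label] / len(train) - test_counts[label] / len(test))
        for label in all_labels
    )
    return missing_from_train, missing_from_test, max_gap


# Toy data, invented for illustration only.
train = [["a", "b"], ["a"], ["a", "c"], ["b"]]
test = [["a"], ["d"]]

report = split_report(train, test)
print(report)
```

In this toy split, label `d` never appears in training and labels `b` and `c` never appear in testing, which is the kind of mismatch a stratified partition is meant to reduce.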

research
10/27/2018

A no-regret generalization of hierarchical softmax to extreme multi-label classification

Extreme multi-label classification (XMLC) is a problem of tagging an ins...
research
08/31/2016

A High Speed Multi-label Classifier based on Extreme Learning Machines

In this paper a high speed neural network classifier based on extreme le...
research
07/26/2022

On Missing Labels, Long-tails and Propensities in Extreme Multi-label Classification

The propensity model introduced by Jain et al. 2016 has become a standar...
research
11/23/2020

The Emerging Trends of Multi-Label Learning

Exabytes of data are generated daily by humans, leading to the growing n...
research
04/17/2019

Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification

Extreme multi-label classification refers to supervised multi-label lear...
research
08/01/2021

DECAF: Deep Extreme Classification with Label Features

Extreme multi-label classification (XML) involves tagging a data point w...
research
05/31/2022

A Reduction to Binary Approach for Debiasing Multiclass Datasets

We propose a novel reduction-to-binary (R2B) approach that enforces demo...
