Throwing Away Data Improves Worst-Class Error in Imbalanced Classification

05/23/2022
by Martin Arjovsky, et al.

Class imbalances pervade classification problems, yet their treatment differs in theory and practice. On the one hand, learning theory instructs us that more data is better, as sample size relates inversely to the average test error over the entire data distribution. On the other hand, practitioners have long developed a plethora of tricks to improve the performance of learning machines over imbalanced data. These include data reweighting and subsampling, synthetic construction of additional samples from minority classes, ensembling expensive one-versus-all architectures, and tweaking classification losses and thresholds. All of these are efforts to minimize the worst-class error, which is often associated with the minority class in the training data; this objective finds additional motivation in the robustness, fairness, and out-of-distribution literatures. Here we take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data when fitted either on (i) the full training set, or (ii) a subset where the majority class is subsampled to match the size of the minority class. We borrow tools from extreme value theory to show that, under distributions with certain tail properties, throwing away most data from the majority class leads to a lower worst-class error.
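The contrast described in the abstract, fitting on the full imbalanced sample versus on a majority-subsampled balanced subset and comparing worst-class errors, can be made concrete with a small simulation. The sketch below is only illustrative and is not the paper's setup: it uses scikit-learn's LogisticRegression as a stand-in linear classifier, Gaussian class-conditionals rather than the tail families analyzed in the paper, and arbitrary sample sizes.

```python
# Minimal illustrative sketch -- NOT the paper's experimental setup.
# It contrasts (i) a linear classifier fit on the full imbalanced sample with
# (ii) the same classifier fit on a subset where the majority class is
# subsampled to the minority-class size, and compares their worst-class errors.
# LogisticRegression stands in for a generic linear classifier, and the data are
# two Gaussian blobs; all sizes and parameters below are arbitrary choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n_majority, n_minority):
    """Two 2D Gaussian blobs with a large size imbalance (class 0 is the majority)."""
    X_maj = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n_majority, 2))
    X_min = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(n_minority, 2))
    X = np.vstack([X_maj, X_min])
    y = np.concatenate([np.zeros(n_majority), np.ones(n_minority)])
    return X, y

def worst_class_error(clf, X, y):
    """Maximum over classes of the per-class misclassification rate."""
    pred = clf.predict(X)
    return max(np.mean(pred[y == c] != c) for c in np.unique(y))

# Imbalanced training set; large balanced test set for estimating per-class errors.
X_train, y_train = sample(n_majority=10_000, n_minority=100)
X_test, y_test = sample(n_majority=50_000, n_minority=50_000)

# (i) Fit on the full training set.
clf_full = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# (ii) Subsample the majority class down to the minority-class size, then fit.
maj_idx = np.flatnonzero(y_train == 0)
min_idx = np.flatnonzero(y_train == 1)
keep = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx])
clf_sub = LogisticRegression(max_iter=1_000).fit(X_train[keep], y_train[keep])

print("worst-class error, full data  :", worst_class_error(clf_full, X_test, y_test))
print("worst-class error, subsampled :", worst_class_error(clf_sub, X_test, y_test))
```

Under this kind of heavy imbalance, fitting on all the data typically pushes the decision boundary toward the minority class and inflates that class's error, while the balanced subsample trades away most of the majority data for a more even per-class error; the paper's contribution is a theoretical account of when this trade is worthwhile.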
