An Exploration of How Training Set Composition Bias in Machine Learning Affects Identifying Rare Objects

07/07/2022
by Sean E. Lake, et al.

When a machine learning classifier is trained on data in which one class is intrinsically rare, it will often assign too few sources to that class. A common remedy is to up-weight the rare-class examples so that the class is not ignored. For the same reason, it is also common to train on restricted data in which the balance of source types is closer to equal. Here we show that these practices can bias the model toward over-assigning sources to the rare class. We also explore how to detect when training data bias has had a statistically significant impact on the trained model's predictions, and how to reduce that impact. The magnitude of the effect of the techniques developed here will vary with the details of the application, but in most cases it should be modest. The techniques are, however, applicable whenever a machine learning classification model is used, which makes them analogous to Bessel's correction to the sample variance.
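The over-assignment described above can be illustrated with the textbook prior-shift adjustment from Bayes' rule: if a classifier is trained where the rare class makes up one fraction of the data and then applied where it makes up another, its output probability can be rescaled by the ratio of the two priors. The sketch below is a generic illustration under that assumption, not necessarily the technique the paper develops; the function name and its parameters are hypothetical.

```python
def adjust_for_class_balance(p_rare, train_rare_frac, deploy_rare_frac):
    """Rescale P(rare | x) from the training class balance to the deployment one.

    Assumes the per-class feature distributions are the same in both settings
    and only the class fractions differ (a label-shift assumption).
    """
    if not (0.0 < train_rare_frac < 1.0 and 0.0 < deploy_rare_frac < 1.0):
        raise ValueError("class fractions must lie strictly between 0 and 1")
    # Weight each class's probability by (deployment prior / training prior).
    rare_term = p_rare * deploy_rare_frac / train_rare_frac
    common_term = (1.0 - p_rare) * (1.0 - deploy_rare_frac) / (1.0 - train_rare_frac)
    # Renormalise so the two adjusted probabilities sum to one.
    return rare_term / (rare_term + common_term)


# A model trained on balanced data (50% rare) reports 0.9 for some source.
# If the rare class is really 1% of the population, the adjusted probability
# is much lower, so taking 0.9 at face value would over-assign the rare class.
adjusted = adjust_for_class_balance(0.9, train_rare_frac=0.5, deploy_rare_frac=0.01)
```

When the training and deployment fractions match, the function returns its input unchanged, which is a quick sanity check on any such correction.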
