Systematic Bias in Sample Inference and its Effect on Machine Learning

07/03/2023
by Owen O'Neill, et al.

A commonly observed pattern in machine learning models is underprediction of the target feature: the model's predicted target rate for members of a given category is typically lower than the actual target rate for members of that category in the training set. This underprediction is usually larger for members of minority groups. In the 'adult' dataset, for example, income level is underpredicted for both men and women, but the degree of underprediction is significantly higher for women (a minority in that dataset). We propose that this pattern of underprediction for minorities arises as a predictable consequence of statistical inference on small samples. When presented with a new individual for classification, an ML model performs inference not on the entire training set but on a subset that is in some way similar to the new individual. The sizes of these subsets typically follow a power-law distribution, so most are small, and they are necessarily smaller for the minority group. We show that such inference on small samples is subject to systematic and directional statistical bias, and that this bias produces the patterns of underprediction observed in ML models. Analysing a standard sklearn decision tree model's predictions on a set of over 70 subsets of the 'adult' and COMPAS datasets, we found that a bias prediction measure based on small-sample inference had significant positive correlations (0.56 and 0.85) with the observed underprediction rate for these subsets.
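One way to build intuition for a directional effect in small samples is the sketch below. It is an illustration only and is not the bias prediction measure used in the paper: it assumes that a "similar subset" can be modelled as n independent Bernoulli draws with true target rate p, and the function names are hypothetical. For a low target rate, the sample rate is correct on average, yet a small sample falls below the true rate more often than above it, and this asymmetry shrinks as the sample grows.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k positives in n independent Bernoulli(p) draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def under_over_probability(n, p):
    """Return (P(sample rate < p), P(sample rate > p)) for a sample of size n.

    Assumption for illustration: the subset behaves like n independent
    draws with true target rate p. This is not the paper's measure.
    """
    under = sum(binom_pmf(k, n, p) for k in range(n + 1) if k / n < p)
    over = sum(binom_pmf(k, n, p) for k in range(n + 1) if k / n > p)
    return under, over


if __name__ == "__main__":
    true_rate = 0.25  # hypothetical target rate, chosen for illustration
    for n in (5, 20, 100):
        under, over = under_over_probability(n, true_rate)
        print(f"n={n:3d}  P(rate below p)={under:.3f}  P(rate above p)={over:.3f}")
```

With a true rate of 0.25 and a sample of 5, the sample rate falls below 0.25 whenever the sample contains zero or one positive, which is the more likely outcome; larger samples narrow the gap between the two probabilities.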
