Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data

11/17/2021
by Rune D. Kjærsgaard, et al.
Data imbalance is common in production data: controlled production settings require data to fall within a narrow range of variation, and data are collected with quality assessment in mind rather than analytical insight. This imbalance negatively impacts the predictive performance of models on underrepresented observations. We propose sampling to adjust for this imbalance, with the goal of improving the performance of models trained on historical production data. We investigate three sampling approaches, each of which downsamples the covariates in the training data before a regression model is fitted, and we compare the predictive power of models trained on the sampled data with that of models trained on the original data. We apply our methods to a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production and find that fitting a model to the sampled data gives a small reduction in overall predictive performance but systematically better performance on underrepresented observations. In addition, the results emphasize the need for alternative, fair, and balanced model evaluations.
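The abstract does not name the three sampling approaches, so the following is only a minimal sketch of the general idea (downsample densely populated regions of the covariate space, then fit a regression model), not the authors' method. It assumes a single covariate, equal-width bins, and a per-bin cap; the function names `downsample_by_bin` and `fit_line`, the parameters `n_bins` and `cap`, and the synthetic data are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def downsample_by_bin(x, n_bins=10, cap=20, seed=0):
    """Return sorted indices keeping at most `cap` observations per
    equal-width bin of the covariate x (an assumed, illustrative scheme)."""
    rng = random.Random(seed)
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant covariate
    bins = defaultdict(list)
    for i, v in enumerate(x):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[b].append(i)
    keep = []
    for members in bins.values():
        if len(members) > cap:
            members = rng.sample(members, cap)  # thin out overrepresented bins
        keep.extend(members)  # sparse bins are kept in full
    return sorted(keep)


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


# Synthetic imbalanced data: most observations in a narrow range, a few far outside it.
rng = random.Random(1)
x = [rng.gauss(0.0, 0.1) for _ in range(1000)] + [rng.uniform(1.0, 3.0) for _ in range(30)]
y = [v ** 2 + rng.gauss(0.0, 0.05) for v in x]

idx = downsample_by_bin(x, n_bins=10, cap=20)
model_full = fit_line(x, y)
model_sampled = fit_line([x[i] for i in idx], [y[i] for i in idx])
```

Comparing `model_full` and `model_sampled` on the observations outside the narrow range, separately from the overall error, mirrors the kind of evaluation the abstract calls for; no particular outcome is claimed for this toy data.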

research
04/06/2021

Survey of Imbalanced Data Methodologies

Imbalanced data set is a problem often found and well-studied in financi...
research
05/23/2021

A Study imbalance handling by various data sampling methods in binary classification

The purpose of this research report is to present our learning curve...
research
10/17/2019

KDE sampling for imbalanced class distribution

Imbalanced response variable distribution is not an uncommon occurrence ...
research
07/24/2017

Big Data Regression Using Tree Based Segmentation

Scaling regression to large datasets is a common problem in many applica...
research
04/29/2022

How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

A multilingual tokenizer is a fundamental component of multilingual neur...
research
08/21/2023

An engine to simulate insurance fraud network data

Traditionally, the detection of fraudulent insurance claims relies on bu...
research
03/26/2021

Predictive and explanatory models might miss informative features in educational data

We encounter variables with little variation often in educational data m...
