Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling

01/15/2020
by Gilles Vandewiele, et al.

Information extracted from electrohysterography recordings could prove to be a valuable additional source of information for estimating the risk of preterm birth. Recently, a large number of studies have reported near-perfect results in distinguishing between recordings of patients who will deliver at term and those who will deliver preterm, using a public resource called the Term/Preterm Electrohysterogram database. However, we argue that these results are overly optimistic due to a methodological flaw. In this work, we focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets. We show how this biases the results using two artificial datasets, and we reproduce the results of studies in which this flaw was identified. Moreover, we evaluate the actual impact of over-sampling on predictive performance, when applied prior to data partitioning, using the same methodologies as related studies, to provide a realistic view of these methodologies' generalization capabilities. We make our research reproducible by providing all the code under an open license.
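The flaw named in the abstract, over-sampling before partitioning, can be illustrated with a small self-contained sketch. This is not the authors' code or data. It assumes pure-noise features, random over-sampling by duplication (swapped in for whatever over-sampling technique a given study uses), and a 1-nearest-neighbour classifier; all function names are hypothetical. Because the features carry no signal, any apparent skill on the minority class in the first pipeline comes from copies of the same sample landing in both the training and the test set.

```python
import random

def make_noise_data(rng, n_major=180, n_minor=20, dim=5):
    """Features are pure noise, so labels cannot truly be predicted from them."""
    data = [([rng.gauss(0, 1) for _ in range(dim)], 0) for _ in range(n_major)]
    data += [([rng.gauss(0, 1) for _ in range(dim)], 1) for _ in range(n_minor)]
    return data

def oversample(rng, data):
    """Random over-sampling: duplicate minority (label 1) samples until balanced."""
    major = [s for s in data if s[1] == 0]
    minor = [s for s in data if s[1] == 1]
    extra = [rng.choice(minor) for _ in range(len(major) - len(minor))]
    return data + extra

def split(rng, data, test_frac=0.3):
    """Stratified shuffle split into (train, test)."""
    train, test = [], []
    for label in (0, 1):
        group = [s for s in data if s[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * test_frac)
        test += group[:cut]
        train += group[cut:]
    return train, test

def predict_1nn(train, x):
    """Label of the nearest training point (squared Euclidean distance)."""
    nearest = min(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))
    return nearest[1]

def minority_recall(train, test):
    """Fraction of minority test samples that are predicted as minority."""
    minority = [x for x, y in test if y == 1]
    hits = sum(predict_1nn(train, x) == 1 for x in minority)
    return hits / len(minority)

rng = random.Random(0)
data = make_noise_data(rng)

# Flawed: over-sample first, so copies of one sample land in both sets.
train, test = split(rng, oversample(rng, data))
flawed = minority_recall(train, test)

# Sound: split first, then over-sample the training set only.
train, test = split(rng, data)
sound = minority_recall(oversample(rng, train), test)

print(f"minority recall, over-sampling before split: {flawed:.2f}")
print(f"minority recall, over-sampling after split:  {sound:.2f}")
```

On noise data one would expect the first figure to be far higher than the second, even though neither pipeline has anything real to learn. The gap is a rough picture of the optimistic bias the abstract describes.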

