Using Undersampling with Ensemble Learning to Identify Factors Contributing to Preterm Birth

09/23/2020
by Shi Dong, et al.

In this paper, we propose ensemble learning models to identify factors contributing to preterm birth. Our work leverages a rich dataset collected by an NIEHS P42 Center that is working to identify the dominant factors responsible for the high rate of premature births in northern Puerto Rico. We investigate analytical models that address two major challenges in this dataset: a significant amount of incomplete data and class imbalance. First, to increase the completeness of the dataset, we compare two types of missing-data imputation methods: mean-based and similarity-based. Second, to address class imbalance, we propose a feature selection and evaluation model that combines undersampling with ensemble learning. We compare multiple ensemble feature selection methods, including Complete Linear Aggregation (CLA), Weighted Mean Aggregation (WMA), Feature Occurrence Frequency (OFA), and Classification Accuracy Based Aggregation (CAA). To further account for the missing data present in each feature, we propose two novel methods: Missing Data Rate and Accuracy Based Aggregation (MAA) and Entropy and Accuracy Based Aggregation (EAA). Both proposed methods balance the degree of data variance introduced by missing-data handling during feature selection while maintaining model performance. Our results show a 42% improvement in sensitivity versus fallout over previous state-of-the-art methods.
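The general pipeline the abstract describes (impute missing values, draw balanced subsets by undersampling the majority class, score features on each subset, then aggregate the scores) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy data, the class-mean-difference scorer, and the plain averaging step are hypothetical stand-ins, not the paper's CLA, WMA, OFA, CAA, MAA, or EAA definitions, which the abstract does not specify.

```python
import random
from statistics import mean


def mean_impute(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def undersample(labels, rng):
    """Return row indices of a balanced subset: every minority row plus an
    equally sized random draw (without replacement) from the majority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def feature_scores(rows, labels, idx):
    """Hypothetical scorer: absolute difference between the class means of
    each feature, computed only on the rows listed in idx."""
    scores = []
    for j in range(len(rows[0])):
        a = [rows[i][j] for i in idx if labels[i] == 1]
        b = [rows[i][j] for i in idx if labels[i] == 0]
        scores.append(abs(mean(a) - mean(b)))
    return scores


def ensemble_rank(rows, labels, n_subsets=25, seed=0):
    """Average the per-subset feature scores (a simple mean aggregation,
    assumed here) and return features ranked from most to least important."""
    rng = random.Random(seed)
    filled = mean_impute(rows)
    totals = [0.0] * len(rows[0])
    for _ in range(n_subsets):
        idx = undersample(labels, rng)
        for j, s in enumerate(feature_scores(filled, labels, idx)):
            totals[j] += s / n_subsets
    ranking = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return ranking, totals


# Toy data (synthetic, for illustration only): 200 negatives, 20 positives.
# Feature 0 shifts with the label; feature 1 is pure noise.
data_rng = random.Random(42)
labels = [0] * 200 + [1] * 20
rows = []
for y in labels:
    row = [3.0 * y + data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
    # Knock out roughly 10% of the values to mimic incomplete records.
    rows.append([None if data_rng.random() < 0.1 else v for v in row])

ranking, totals = ensemble_rank(rows, labels)
print("feature ranking:", ranking)
```

Averaging over many balanced subsets lets every majority-class record contribute to the ranking over time, while each individual scoring pass sees equal class sizes. A similarity-based imputer or an accuracy-weighted aggregation could be swapped in at `mean_impute` or inside `ensemble_rank` without changing the rest of the sketch.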

