Benchmarking AutoML Frameworks for Disease Prediction Using Medical Claims

07/22/2021
by Roland Albert A. Romero, et al.

We ascertain and compare the performances of AutoML tools on large, highly imbalanced healthcare datasets. We generated a large dataset using historical administrative claims, including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performance on several metrics. The AutoML tools showed improvement over the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications. Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available feature types. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance. Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.
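The abstract does not say which criterion the authors used to balance true and false positive rates. As an illustration only, the sketch below assumes one common choice, Youden's J statistic (J = TPR - FPR), and scans each distinct predicted score as a candidate threshold. The function names and the toy labels and scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (TPR, FPR) when a case is called positive if score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def best_threshold_youden(
    y_true: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (threshold, TPR, FPR) maximizing Youden's J = TPR - FPR.

    Assumption: Youden's J is used purely for illustration; a real
    application may weight false positives and false negatives differently.
    """
    best = None
    for candidate in sorted(set(scores)):
        tpr, fpr = rates_at_threshold(y_true, scores, candidate)
        j = tpr - fpr
        if best is None or j > best[0]:
            best = (j, candidate, tpr, fpr)
    _, threshold, tpr, fpr = best
    return threshold, tpr, fpr


# Hypothetical imbalanced toy data: 2 positives among 10 cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

threshold, tpr, fpr = best_threshold_youden(y_true, scores)
print(f"threshold={threshold}, TPR={tpr}, FPR={fpr}")
```

On this toy data the default cutoff of 0.5 would catch only one of the two positives, while the selected threshold catches both at the cost of one false positive. In a clinical setting the relative costs of missed cases and false alarms would drive the criterion, which is consistent with the abstract's point that the threshold should follow the practical application.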

