Ensemble Synthetic EHR Generation for Increasing Subpopulation Model's Performance

05/25/2023
by Oriel Perets, et al.

Electronic health records (EHR) often represent certain subpopulations (SPs) at unequal rates. Factors such as patient demographics, clinical condition prevalence, and medical center type contribute to the underrepresentation of some SPs. Consequently, machine learning models trained on such datasets generalize poorly and perform worse on underrepresented SPs. To address this issue, we propose a novel ensemble framework that utilizes generative models. Specifically, we train a GAN-based synthetic data generator for each SP and add its synthetic samples to that SP's training set. We then train SP-specific prediction models on the augmented sets. To evaluate this method, we design an evaluation pipeline with two real-world use-case datasets queried from the MIMIC database. Our approach shows increased model performance on underrepresented SPs. Our code and models are provided as supplementary material and will be made available in a public repository.
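The per-SP workflow described in the abstract (one generator per SP, synthetic samples added to that SP's training set, one prediction model per SP) can be illustrated with a minimal sketch. This is not the authors' implementation. The GAN is replaced here by a simple per-class Gaussian sampler so that the example runs with the standard library only, the predictor is a toy nearest-centroid classifier, and all names (`GaussianGenerator`, `NearestCentroid`, `train_sp_ensemble`, `n_synthetic`) are hypothetical. Routing each patient to the model of their own SP is likewise an assumption about how the ensemble is used at prediction time.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


class GaussianGenerator:
    """Stand-in for the per-SP GAN (assumption): per-class, per-feature Gaussians."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            # One (mean, std) pair per feature column for this class.
            self.stats[label] = [(mean(c), pstdev(c)) for c in zip(*rows)]
        return self

    def sample(self, n, rng):
        labels = sorted(self.stats)
        X, y = [], []
        for _ in range(n):
            label = rng.choice(labels)  # classes drawn uniformly (assumption)
            X.append([rng.gauss(m, s) for m, s in self.stats[label]])
            y.append(label)
        return X, y


class NearestCentroid:
    """Toy SP-specific predictor; any classifier could take its place."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [mean(c) for c in zip(*rows)]
        return self

    def predict(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def train_sp_ensemble(records, n_synthetic, seed=0):
    """records: iterable of (sp_id, features, label). Returns {sp_id: model}."""
    rng = random.Random(seed)
    by_sp = defaultdict(lambda: ([], []))
    for sp, x, label in records:
        by_sp[sp][0].append(x)
        by_sp[sp][1].append(label)

    models = {}
    for sp, (X, y) in by_sp.items():
        generator = GaussianGenerator().fit(X, y)       # one generator per SP
        X_syn, y_syn = generator.sample(n_synthetic, rng)
        # Augment the real SP training set with synthetic samples, then fit.
        models[sp] = NearestCentroid().fit(X + X_syn, y + y_syn)
    return models


def predict(models, sp, x):
    """Route the record to the model trained for its own SP."""
    return models[sp].predict(x)
```

In this sketch, `train_sp_ensemble(records, n_synthetic=20)` returns one model per SP identifier, and `predict(models, sp, x)` selects the model by the patient's SP. How many synthetic samples to add per SP is left as a free parameter because the abstract does not specify it.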


research
12/22/2021

Generating Synthetic Mixed-type Longitudinal Electronic Health Records for Artificial Intelligent Applications

The recent availability of electronic health records (EHRs) have provide...
research
01/14/2022

Synthesising Electronic Health Records: Cystic Fibrosis Patient Group

Class imbalance can often degrade predictive performance of supervised l...
research
10/11/2022

Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning

Consider making a prediction over new test data without any opportunity ...
research
09/06/2021

Generation of Synthetic Electronic Health Records Using a Federated GAN

Sensitive medical data is often subject to strict usage constraints. In ...
research
08/08/2023

From Fake to Real (FFR): A two-stage training pipeline for mitigating spurious correlations with synthetic data

Visual recognition models are prone to learning spurious correlations in...
research
08/02/2019

Feature Robustness in Non-stationary Health Records: Caveats to Deployable Model Performance in Common Clinical Machine Learning Tasks

When training clinical prediction models from electronic health records ...
research
03/22/2023

Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models

This paper presents a novel approach to simulating electronic health rec...
