Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

Tabular data is hard to acquire and is subject to missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks as function approximators, we instead utilize XGBoost, a popular Gradient-Boosted Tree (GBT) method. Our method is elegant, and we show empirically on various datasets that it i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse plausible data imputations. Our method often outperforms deep-learning generation methods and can be trained in parallel using CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library on PyPI and an R package on CRAN.


research
03/23/2023

Une comparaison des algorithmes d'apprentissage pour la survie avec données manquantes

Survival analysis is an essential tool for the study of health data. An ...
research
07/12/2020

Multiple Imputation and Synthetic Data Generation with the R package NPBayesImputeCat

In many contexts, missing data and disclosure control are ubiquitous and...
research
07/02/2023

MissDiff: Training Diffusion Models on Tabular Data with Missing Values

The diffusion model has shown remarkable performance in modeling data di...
research
08/03/2021

Categorical EHR Imputation with Generative Adversarial Nets

Electronic Health Records often suffer from missing data, which poses a ...
research
11/05/2022

Towards a methodology for addressing missingness in datasets, with an application to demographic health datasets

Missing data is a common concern in health datasets, and its impact on g...
research
06/24/2021

MIxBN: library for learning Bayesian networks from mixed data

This paper describes a new library for learning Bayesian networks from d...
research
07/21/2021

Interpreting diffusion score matching using normalizing flow

Score matching (SM), and its related counterpart, Stein discrepancy (S...
