Diffusion models for missing value imputation in tabular data

10/31/2022
by   Shuhan Zheng, et al.

Missing value imputation in machine learning is the task of accurately estimating the missing values in a dataset using the available information. Several deep generative modeling methods have been proposed for this task and have demonstrated their usefulness, e.g., generative adversarial imputation networks. Recently, diffusion models have gained popularity because of their effectiveness in generative modeling of images, text, audio, etc. To our knowledge, comparatively little attention has been paid to the effectiveness of diffusion models for missing value imputation in tabular data. Building on recent developments in diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (CSDI_T). To handle categorical and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization. Experimental results on benchmark datasets demonstrate the effectiveness of CSDI_T compared with well-known existing methods and highlight the importance of the categorical embedding techniques.
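
To make the difference between two of the categorical encodings named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes categories are already mapped to integer indices, and the function names (one_hot_encode, analog_bits_encode, analog_bits_decode) are hypothetical. One-hot encoding uses one column per category, whereas analog bits encoding writes the index in binary and maps each bit to a real value in {-1, +1}, so a continuous diffusion model can treat the bits as ordinary numerical columns.

```python
import numpy as np


def one_hot_encode(indices, num_categories):
    """Return an (n, num_categories) array with a single 1.0 per row."""
    indices = np.asarray(indices, dtype=np.int64)
    out = np.zeros((indices.shape[0], num_categories), dtype=np.float32)
    out[np.arange(indices.shape[0]), indices] = 1.0
    return out


def analog_bits_encode(indices, num_categories):
    """Write each index in binary (most significant bit first) and map
    bit 0 -> -1.0 and bit 1 -> +1.0."""
    indices = np.asarray(indices, dtype=np.int64)
    # Number of bits needed to represent indices 0 .. num_categories - 1.
    num_bits = max(1, (num_categories - 1).bit_length())
    shifts = np.arange(num_bits - 1, -1, -1)
    bits = (indices[:, None] >> shifts) & 1
    return bits.astype(np.float32) * 2.0 - 1.0


def analog_bits_decode(values):
    """Threshold real-valued bits at zero and convert back to integer indices."""
    bits = (np.asarray(values) > 0).astype(np.int64)
    weights = 2 ** np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ weights


if __name__ == "__main__":
    categories = [0, 3, 4]  # hypothetical column with 5 possible categories
    print(one_hot_encode(categories, 5).shape)    # 5 columns per value
    analog = analog_bits_encode(categories, 5)
    print(analog.shape)                           # 3 columns per value
    # Imputed values are real-valued; thresholding recovers a category.
    noisy = analog + np.array([[0.2, -0.1, 0.3]] * 3, dtype=np.float32)
    print(analog_bits_decode(noisy))
```

With five categories, one-hot encoding needs five columns while analog bits needs three, and the gap widens as cardinality grows. Note that thresholding can decode to an index outside the valid range when the number of categories is not a power of two; how such cases are handled is not stated in the abstract. Feature tokenization, the third technique, maps each feature to a learned embedding and is not shown here.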


