MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime

07/16/2023
by Tejumade Afonja, et al.

Realistic and useful synthetic data has significant potential. However, current evaluation methods for synthetic tabular data generation focus predominantly on downstream task usefulness and often neglect statistical properties. This oversight matters most in low sample scenarios, where these statistical measures deteriorate swiftly. In this paper, we address this issue by evaluating three state-of-the-art synthetic tabular data generators on marginal distribution, column-pair correlation, joint distribution, and downstream task utility across high to low sample regimes. The popular CTGAN model shows strong utility, but this utility drops in low sample settings. To overcome this limitation, we propose MargCTGAN, which adds feature matching of de-correlated marginals and yields a consistent improvement in both downstream utility and the statistical properties of the synthetic data.
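To make "feature matching of de-correlated marginals" concrete, here is a minimal sketch of one way such a penalty could be computed. It is an illustration under stated assumptions, not the authors' implementation: it assumes PCA as the de-correlating transform and assumes the matched features are the per-component mean and standard deviation. The function names `fit_decorrelation` and `marginal_matching_loss` are hypothetical, and in a GAN this term would presumably be added to the generator loss rather than evaluated on fixed arrays.

```python
import numpy as np

def fit_decorrelation(real):
    """Estimate a PCA basis from real data (assumed de-correlating transform)."""
    mean = real.mean(axis=0)
    # Rows of vt are the principal axes of the centered real data.
    _, _, vt = np.linalg.svd(real - mean, full_matrices=False)
    return mean, vt.T

def marginal_matching_loss(real, synth, mean, components):
    """Sum of absolute gaps in per-component mean and std after projection."""
    real_proj = (real - mean) @ components
    synth_proj = (synth - mean) @ components
    mean_gap = np.abs(real_proj.mean(axis=0) - synth_proj.mean(axis=0)).sum()
    std_gap = np.abs(real_proj.std(axis=0) - synth_proj.std(axis=0)).sum()
    return mean_gap + std_gap

# Toy data: a correlated "real" table and two candidate synthetic tables.
rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
real = rng.multivariate_normal([0.0, 0.0], cov, size=500)
good = rng.multivariate_normal([0.0, 0.0], cov, size=500)  # same distribution
bad = rng.normal(size=(500, 2)) * 3.0 + 2.0                # shifted and rescaled

mean, components = fit_decorrelation(real)
loss_self = marginal_matching_loss(real, real, mean, components)
loss_good = marginal_matching_loss(real, good, mean, components)
loss_bad = marginal_matching_loss(real, bad, mean, components)
```

In this toy setup the penalty is zero when the synthetic table equals the real one and grows as the projected marginals drift apart, which is the behavior a generator regularizer of this kind would rely on.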

