Permutation-Invariant Tabular Data Synthesis

11/17/2022
by Yujin Zhu, et al.

Tabular data synthesis is an emerging approach to circumvent strict regulations on data privacy while discovering knowledge through big data. Although state-of-the-art AI-based tabular data synthesizers, e.g., table-GAN, CTGAN, TVAE, and CTAB-GAN, are effective at generating synthetic tabular data, their training is sensitive to column permutations of the input data. In this paper, we first conduct an extensive empirical study that exposes this lack of permutation invariance, together with an in-depth analysis of the existing synthesizers. We show that changing the input column order worsens the statistical difference between real and synthetic data by up to 38.67%, owing to the encoding of tabular data and the network architectures. To fully unleash the potential of big synthetic tabular data, we propose two solutions: (i) AE-GAN, a synthesizer that uses an autoencoder network to represent the tabular data and GAN networks to synthesize the latent representation, and (ii) a feature sorting algorithm that finds a suitable column order of the input data for CNN-based synthesizers. We evaluate the proposed solutions on five datasets in terms of sensitivity to column permutation, the quality of the synthetic data, and the utility in downstream analyses. Our results show that we enhance permutation invariance when training synthesizers and further improve the quality and utility of the synthetic data by up to 22% compared to the existing synthesizers.
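
As a rough illustration of why column order can matter for CNN-based synthesizers, the sketch below assumes a table-GAN-style encoding in which each row is laid out row-major in a square grid before convolution. Under that assumption, a convolution kernel mostly mixes features that sit next to each other in the grid, so permuting the columns changes which features are seen together. The column names and the helpers `square_side` and `grid_neighbors` are hypothetical and are not taken from the paper; this is not the paper's feature sorting algorithm, only a way to see the effect it targets.

```python
import math


def square_side(n_columns):
    """Smallest side length whose square grid can hold n_columns cells."""
    if n_columns <= 0:
        return 0
    return math.isqrt(n_columns - 1) + 1


def grid_neighbors(columns):
    """Unordered pairs of columns that are horizontally or vertically
    adjacent when `columns` is laid out row-major in a square grid.

    Assumption: unused trailing cells are padding and have no neighbors.
    """
    side = square_side(len(columns))
    pairs = set()
    for idx, name in enumerate(columns):
        col_in_grid = idx % side
        # Neighbor to the right, if it stays on the same grid row.
        if col_in_grid + 1 < side and idx + 1 < len(columns):
            pairs.add(frozenset((name, columns[idx + 1])))
        # Neighbor directly below.
        if idx + side < len(columns):
            pairs.add(frozenset((name, columns[idx + side])))
    return pairs


# Hypothetical column names, used only for illustration.
original = ["age", "income", "education", "hours", "gender", "occupation"]
permuted = ["income", "hours", "age", "occupation", "education", "gender"]

before = grid_neighbors(original)
after = grid_neighbors(permuted)
shared = before & after

print(f"adjacent pairs before: {len(before)}, after: {len(after)}")
print(f"pairs that stay adjacent after the permutation: {len(shared)}")
```

The same table therefore presents different local feature neighborhoods to a convolutional network depending on column order, which is consistent with the motivation for choosing a suitable order before training.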
