Programmable Synthetic Tabular Data Generation

by Mark Vero, et al.

Large amounts of tabular data remain underutilized due to privacy, data quality, and data sharing limitations. While training a generative model producing synthetic data resembling the original distribution addresses some of these issues, most applications impose additional constraints on the generated data. Existing synthetic data approaches are limited as they typically only handle specific constraints, e.g., differential privacy (DP) or increased fairness, and lack an accessible interface for declaring general specifications. In this work, we introduce ProgSyn, the first programmable synthetic tabular data generation algorithm that allows for comprehensive customization over the generated data. To ensure high data quality while adhering to custom specifications, ProgSyn pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss automatically derived from the provided specifications. These can be programmatically declared using statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). We conduct an extensive experimental evaluation of ProgSyn on a number of constraints, achieving a new state-of-the-art on some, while remaining general. For instance, at the same fairness level we achieve 2.3% higher downstream accuracy than the state-of-the-art in fair synthetic data generation on the Adult dataset. Overall, ProgSyn provides a versatile and accessible framework for generating constrained synthetic tabular data, allowing for specifications that generalize beyond the capabilities of prior work.
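To make the idea of a loss derived from a specification more concrete, the following is a minimal sketch of how one statistical requirement (a demographic parity constraint) could be relaxed into a smooth penalty over probabilistic samples and added to a fidelity objective. It is not ProgSyn's implementation or interface: the function names, the choice of a squared-gap penalty, and the weighting scheme are all assumptions made for illustration, and NumPy is used only to show the computation (an actual fine-tuning loop would need an automatic differentiation framework).

```python
import numpy as np

def softmax(logits, axis=-1):
    """Turn generator logits into per-category probabilities (soft samples)."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def demographic_parity_penalty(p_group, p_positive, eps=1e-12):
    """Hypothetical smooth penalty: squared gap between positive rates.

    p_group:    (n,) probability that each generated row is in the protected group
    p_positive: (n,) probability that each generated row has the positive label
    Every operation is a sum, product, or division, so the penalty is smooth
    in the underlying probabilities.
    """
    rate_in = (p_group * p_positive).sum() / (p_group.sum() + eps)
    rate_out = ((1.0 - p_group) * p_positive).sum() / ((1.0 - p_group).sum() + eps)
    return (rate_in - rate_out) ** 2

def combined_loss(fidelity_loss, penalty, weight=1.0):
    """Assumed combination: data-fidelity term plus a weighted specification term."""
    return fidelity_loss + weight * penalty

# Illustrative use with made-up logits for a group column and a label column.
rng = np.random.default_rng(0)
group_probs = softmax(rng.normal(size=(8, 2)))[:, 1]
label_probs = softmax(rng.normal(size=(8, 2)))[:, 1]
penalty = demographic_parity_penalty(group_probs, label_probs)
loss = combined_loss(fidelity_loss=0.5, penalty=penalty, weight=10.0)
```

The penalty is zero when both groups receive the positive label at the same rate and grows with the gap, so minimizing the combined loss would trade data fidelity against the specification according to the assumed weight.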




PreFair: Privately Generating Justifiably Fair Synthetic Data

When a database is protected by Differential Privacy (DP), its usability...

Robin Hood and Matthew Effects – Differential Privacy Has Disparate Impact on Synthetic Data

Generative models trained using Differential Privacy (DP) are increasing...

Evaluating the Fairness Impact of Differentially Private Synthetic Data

Differentially private (DP) synthetic data is a promising approach to ma...

DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks

Machine learning models have been criticized for reflecting unfair biase...

Differentially Private Mixed-Type Data Generation For Unsupervised Learning

In this work we introduce the DP-auto-GAN framework for synthetic data g...

Neural Circuit Synthesis from Specification Patterns

We train hierarchical Transformers on the task of synthesizing hardware ...

Transitioning from Real to Synthetic data: Quantifying the bias in model

With the advent of generative modeling techniques, synthetic data and it...
