Generating tabular datasets under differential privacy

08/28/2023
by Gianluca Truda, et al.

Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible, high-quality training data. Some of the most valuable datasets reside in biomedical and financial domains in the form of spreadsheets and relational databases, yet this tabular data is often sensitive in nature. Synthetic data generation offers the potential to unlock sensitive data, but generative models tend to memorise and regurgitate their training data, undermining the privacy goal. To remedy this, researchers have incorporated the mathematical framework of Differential Privacy (DP) into the training of deep neural networks, which introduces a trade-off between the quality and the privacy of the resulting data. Generative Adversarial Networks (GANs) are the dominant paradigm for synthesising tabular data under DP, but they suffer from unstable adversarial training and mode collapse, both of which are exacerbated by the privacy constraints and the challenging tabular data modality. This work optimises the quality-privacy trade-off of generative models, producing higher-quality tabular datasets with the same privacy guarantees. We implement novel end-to-end models that leverage attention mechanisms to learn reversible tabular representations. We also introduce TableDiffusion, the first differentially private diffusion model for tabular data synthesis. Our experiments show that TableDiffusion produces higher-fidelity synthetic datasets, avoids the mode collapse problem, and achieves state-of-the-art performance on privatised tabular data synthesis. By training TableDiffusion to predict the added noise rather than reconstruct the data directly, we enable it to sidestep the difficulties of reconstructing mixed-type tabular data. Overall, the diffusion paradigm proves far more data- and privacy-efficient than the adversarial paradigm, owing to the augmented re-use of each data batch and a smoother iterative training process.
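The two core ideas the abstract combines, training a model to predict the added diffusion noise and privatising the gradients in DP-SGD style (per-sample clipping plus calibrated Gaussian noise), can be sketched as follows. This is a minimal illustration only: the linear "denoiser" `W`, the helper names, and all hyperparameters are assumptions for the sake of a self-contained example, not the paper's actual architecture or training setup, which would use a neural network and a full DP accounting framework.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar_t, rng):
    """Closed-form forward process: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def dp_noise_prediction_step(W, x0, alpha_bar_t, clip_norm, noise_mult, lr, rng):
    """One DP training step for a linear 'denoiser' W that predicts eps from x_t."""
    x_t, eps = forward_diffuse(x0, alpha_bar_t, rng)
    residual = x_t @ W - eps  # noise-prediction error, one row per sample
    # Per-sample gradients of the MSE noise-prediction loss w.r.t. W
    grads = [2.0 * np.outer(x_t[i], residual[i]) for i in range(len(x0))]
    # DP-SGD style: clip each per-sample gradient, sum, add calibrated Gaussian noise
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12)) for g in grads]
    noisy_grad = (np.sum(clipped, axis=0)
                  + noise_mult * clip_norm * rng.standard_normal(W.shape)) / len(x0)
    return W - lr * noisy_grad

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 4))  # a batch of (already numerically encoded) tabular rows
W = np.zeros((4, 4))
for _ in range(10):
    W = dp_noise_prediction_step(W, x0, alpha_bar_t=0.5,
                                 clip_norm=1.0, noise_mult=1.1, lr=0.1, rng=rng)
```

Note that the loss compares predicted noise against the actual `eps`, never against the raw mixed-type record, which is why the noise-prediction objective avoids the reconstruction difficulties the abstract mentions.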

Related research

DP-CGAN: Differentially Private Synthetic Data and Label Generation (01/27/2020)
Generative Adversarial Networks (GANs) are one of the well-known models ...

Don't Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence (11/01/2021)
Although machine learning models trained on massive data have led to bre...

Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees (03/02/2020)
This paper considers the problem of enhancing user privacy in common mac...

Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs (06/28/2022)
Despite the remarkable success of Generative Adversarial Networks (GANs)...

Differentially Private Mixed-Type Data Generation For Unsupervised Learning (12/06/2019)
In this work we introduce the DP-auto-GAN framework for synthetic data g...

EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models (03/10/2023)
Electronic health records (EHR) contain vast biomedical knowledge and ar...

FinDiff: Diffusion Models for Financial Tabular Data Generation (09/04/2023)
The sharing of microdata, such as fund holdings and derivative instrumen...
