Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data

04/01/2021
by Michael Platzer, et al.

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. However, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured via statistical distances between lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distance to the closest record with respect to the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and compare these to more traditional statistical disclosure techniques. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
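
The two measurements described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that are not stated in the abstract: every column has already been discretized into non-negative integer codes, record distance is the Hamming distance (number of mismatching columns), fidelity is summarized by the total variation distance of a single univariate marginal, training and holdout sets have equal size, and ties count as half. The function names `marginal_tvd`, `distance_to_closest_record`, and `share_closer_to_training` are hypothetical.

```python
import numpy as np


def marginal_tvd(x, y, n_levels):
    """Total variation distance between the empirical marginals of two
    integer-coded columns (assumed fidelity measure for this sketch)."""
    p = np.bincount(x, minlength=n_levels) / len(x)
    q = np.bincount(y, minlength=n_levels) / len(y)
    return 0.5 * np.abs(p - q).sum()


def distance_to_closest_record(queries, reference):
    """For each query row, the smallest Hamming distance to any reference row.

    Both arrays are assumed to hold integer codes with identical columns.
    Builds an (n_queries, n_reference) matrix, so only suited to small data.
    """
    mismatches = (queries[:, None, :] != reference[None, :, :]).sum(axis=2)
    return mismatches.min(axis=1)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic rows that are closer to training than to holdout.

    Ties count as one half (an assumption of this sketch). With equally
    sized training and holdout sets, a value near 0.5 is what one would
    expect if the synthesizer does not depend on individual training rows.
    """
    d_train = distance_to_closest_record(synthetic, training)
    d_hold = distance_to_closest_record(synthetic, holdout)
    closer = (d_train < d_hold).sum()
    ties = (d_train == d_hold).sum()
    return (closer + 0.5 * ties) / len(synthetic)


# Toy usage with random integer-coded data (illustrative only).
rng = np.random.default_rng(0)
training = rng.integers(0, 4, size=(200, 5))
holdout = rng.integers(0, 4, size=(200, 5))
synthetic = rng.integers(0, 4, size=(200, 5))

print("TVD of first column:", marginal_tvd(synthetic[:, 0], holdout[:, 0], 4))
print("Share closer to training:", share_closer_to_training(synthetic, training, holdout))
```

A share well above 0.5 would suggest that synthetic records sit systematically closer to the training data than to unseen holdout data, which is the kind of signal the holdout comparison is meant to surface.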

