Utility Theory of Synthetic Data Generation

05/17/2023
by Shirong Xu, et al.

Evaluating the utility of synthetic data is critical for measuring the effectiveness and efficiency of synthetic data algorithms. Existing results focus on empirical evaluations of the utility of synthetic data, whereas the theoretical understanding of how utility is affected by synthetic data algorithms remains largely unexplored. This paper establishes utility theory from a statistical perspective, aiming to quantitatively assess the utility of synthetic data algorithms based on a general metric. The metric is defined as the absolute difference in generalization between models trained on synthetic and original datasets. We establish analytical bounds for this utility metric to identify the critical conditions under which it converges. An intriguing result is that the synthetic feature distribution need not be identical to the original one for the utility metric to converge, as long as the model specification in the downstream learning task is correct. Another important utility metric is model comparison based on synthetic data. Specifically, we establish sufficient conditions on synthetic data algorithms under which the ranking of generalization performance of models trained on the synthetic data is consistent with the ranking obtained from the original data. Finally, we conduct extensive experiments using non-parametric models and deep neural networks to validate our theoretical findings.
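
The two notions of utility described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: the linear data-generating model, the coefficients in `TRUE_W`, the Gaussian and uniform feature samplers, the least-squares learner, and the use of held-out mean squared error as a stand-in for generalization error are all hypothetical choices made for illustration. The "synthetic" data deliberately uses a different feature distribution but draws responses from the same conditional model, so the downstream linear model is correctly specified.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_W = np.array([1.5, -2.0])  # hypothetical true coefficients (assumption)


def sample(n, feature_sampler):
    """Draw features, then responses from a linear model with Gaussian noise."""
    X = feature_sampler(n)
    y = 0.5 + X @ TRUE_W + rng.normal(scale=0.1, size=n)
    return X, y


def design(X, k):
    """Design matrix with an intercept and the first k feature columns."""
    return np.column_stack([np.ones(len(X)), X[:, :k]])


def heldout_risk(train, holdout, k):
    """Fit least squares on `train` with k features; return MSE on `holdout`."""
    X_tr, y_tr = train
    X_ho, y_ho = holdout
    coef, *_ = np.linalg.lstsq(design(X_tr, k), y_tr, rcond=None)
    return float(np.mean((design(X_ho, k) @ coef - y_ho) ** 2))


def utility_gap(train_orig, train_syn, holdout, k):
    """Absolute difference in estimated generalization error between the
    model trained on synthetic data and the one trained on original data."""
    return abs(heldout_risk(train_syn, holdout, k)
               - heldout_risk(train_orig, holdout, k))


def ranking(train, holdout, ks=(0, 1, 2)):
    """Candidate models (indexed by number of features), best first."""
    return sorted(ks, key=lambda k: heldout_risk(train, holdout, k))


def normal(n):
    return rng.normal(size=(n, 2))


def uniform(n):
    return rng.uniform(-2.0, 2.0, size=(n, 2))


train_orig = sample(2000, normal)    # "original" training data
train_syn = sample(2000, uniform)    # "synthetic": different feature distribution
holdout = sample(5000, normal)       # held-out data from the original distribution

gap = utility_gap(train_orig, train_syn, holdout, k=2)
rank_orig = ranking(train_orig, holdout)
rank_syn = ranking(train_syn, holdout)

print(f"utility gap (full linear model): {gap:.2e}")
print("model ranking from original data: ", rank_orig)
print("model ranking from synthetic data:", rank_syn)
```

In this toy setting the gap for the full linear model should be small even though the feature distributions differ, and the three candidate models should be ranked in the same order under both training sets. This only illustrates the kind of behavior the abstract describes; it does not verify the paper's bounds or conditions.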

research
05/24/2023

Post-processing Private Synthetic Data for Improving Utility on Selected Measures

Existing private synthetic data generation algorithms are agnostic to do...
research
07/16/2023

MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime

The potential of realistic and useful synthetic data is significant. How...
research
09/26/2021

Assessing, visualizing and improving the utility of synthetic data

The synthpop package for R https://www.synthpop.org.uk provides tools to...
research
11/26/2022

A new PCA-based utility measure for synthetic data evaluation

Data synthesis is a privacy enhancing technology aiming to produce reali...
research
06/04/2018

Composite Marginal Likelihood Methods for Random Utility Models

We propose a novel and flexible rank-breaking-then-composite-marginal-li...
research
11/23/2022

Utility Assessment of Synthetic Data Generation Methods

Big data analysis poses the dual problem of privacy preservation and uti...
research
12/12/2017

Guidelines for Producing Useful Synthetic Data

We report on our experiences of helping staff of the Scottish Longitudin...
