Assessing, visualizing and improving the utility of synthetic data

by Gillian M. Raab, et al.

The synthpop package for R provides tools to allow data custodians to create synthetic versions of confidential microdata that can be distributed with fewer restrictions than the original. The synthesis can be customized to ensure that relationships evident in the real data are reproduced in the synthetic data. A number of measures have been proposed to assess this aspect, commonly known as the utility of the synthetic data. We show that all these measures, including those calculated from tabulations, can be derived from a propensity score model. We review and compare the measures and illustrate the relations between them. All the measures compared are highly correlated and some are shown to be identical. The method used to define the propensity score model is more important than the choice of measure. These measures and methods are incorporated into utility modules in the synthpop package that include methods to visualize the results and thus provide immediate feedback to allow the person creating the synthetic data to improve its quality. The utility functions were originally designed to be used for synthetic data objects of class synds, created by the synthpop function syn() or syn.strata(), but they can now be used to compare one or more synthesised data sets with the original records, where the records are R data frames or lists of data frames.
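To make the link between tabulations and a propensity score model concrete, the following is a minimal sketch in Python, not the synthpop implementation. It assumes the propensity score mean squared error (pMSE) as the utility measure and a saturated propensity model, in which the fitted propensity for every record in a cross-tabulation cell is simply the proportion of synthetic records in that cell. The function name `pmse_from_table` and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def pmse_from_table(original, synthetic):
    """pMSE under a saturated propensity model (illustrative sketch).

    Each record is a hashable tuple of categorical values, so a record
    doubles as its cross-tabulation cell. The propensity score of a cell
    is the share of its records that are synthetic.
    """
    n_orig, n_syn = len(original), len(synthetic)
    total = n_orig + n_syn
    c = n_syn / total  # overall proportion of synthetic records

    orig_counts = Counter(original)
    syn_counts = Counter(synthetic)

    squared_error = 0.0
    for cell in set(orig_counts) | set(syn_counts):
        n_cell = orig_counts[cell] + syn_counts[cell]
        p_cell = syn_counts[cell] / n_cell  # fitted propensity in this cell
        squared_error += n_cell * (p_cell - c) ** 2
    return squared_error / total


# Hypothetical toy data: two categorical variables per record.
original = (
    [("F", "yes")] * 30 + [("F", "no")] * 20
    + [("M", "yes")] * 25 + [("M", "no")] * 25
)
good_synthetic = list(original)  # reproduces every cell count exactly
poor_synthetic = [("F", "yes")] * 50 + [("M", "no")] * 50  # drops two cells

pmse_good = pmse_from_table(original, good_synthetic)
pmse_poor = pmse_from_table(original, poor_synthetic)
```

When the synthetic data reproduce the cell counts of the original, every cell's propensity equals the overall synthetic share and the measure is zero; the further the two tables diverge, the larger it becomes. In practice the propensity model would be chosen with more care than a single saturated table, which is the point the abstract makes about the model mattering more than the measure.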


