Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data

06/03/2022
by   Gillian M. Raab, et al.

This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross-tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted probabilities as above. This provided usable synthetic data for most of these data sets at values of the differential privacy parameter ϵ as low as 0.5. The relationship between the disclosure risk and ϵ is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.
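The two methods described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the synthpop implementation (which is in R). The abstract does not say which noise mechanism is used, so Laplace noise with scale 1/ϵ, clipping of negative counts, and an equal split of ϵ across the margins are all assumptions made here. The function names (`noisy_probs`, `synthesise_crosstab`, `ipf`, `synthesise_margins`) are hypothetical.

```python
import numpy as np


def noisy_probs(counts, epsilon, rng):
    """Add noise to a table of counts and turn it into probabilities.

    Assumption: Laplace noise with scale 1/epsilon; negatives clipped to 0.
    """
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    if total <= 0:
        # Degenerate case: fall back to a uniform table.
        return np.full(counts.shape, 1.0 / counts.size)
    return noisy / total


def synthesise_crosstab(counts, epsilon, n_synth, rng):
    """Method 1: noise the full cross-tabulation, then multinomial sample."""
    probs = noisy_probs(counts, epsilon, rng)
    return rng.multinomial(n_synth, probs.ravel()).reshape(counts.shape)


def ipf(margins, shape, n_iter=50):
    """Iterative proportional fitting of a table to target margins.

    margins maps a sorted tuple of axes to a probability array over them.
    """
    all_axes = tuple(range(len(shape)))
    fitted = np.full(shape, 1.0 / np.prod(shape))
    for _ in range(n_iter):
        for axes, target in margins.items():
            other = tuple(a for a in all_axes if a not in axes)
            current = fitted.sum(axis=other, keepdims=True)
            # Reshape the target so it broadcasts against the full table.
            bshape = [shape[a] if a in axes else 1 for a in all_axes]
            ratio = np.divide(target.reshape(bshape), current,
                              out=np.zeros_like(current), where=current > 0)
            fitted = fitted * ratio
    total = fitted.sum()
    if total <= 0:
        return np.full(shape, 1.0 / np.prod(shape))
    return fitted / total


def synthesise_margins(counts, margin_axes, epsilon, n_synth, rng):
    """Method 2: noisy margins reconciled by IPF, then multinomial sample.

    Assumption: the privacy budget is split equally across the margins.
    """
    all_axes = tuple(range(counts.ndim))
    eps_each = epsilon / len(margin_axes)
    margins = {}
    for axes in margin_axes:
        other = tuple(a for a in all_axes if a not in axes)
        margins[axes] = noisy_probs(counts.sum(axis=other), eps_each, rng)
    fitted = ipf(margins, counts.shape)
    return rng.multinomial(n_synth, fitted.ravel()).reshape(counts.shape)
```

Both functions return a table of synthetic counts with the same shape as the input cross-tabulation. For a three-variable table, a call such as `synthesise_margins(counts, [(0, 1), (1, 2)], 0.5, n, np.random.default_rng())` would use two noisy two-way margins. Which margins to include is a design choice the abstract does not specify.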


research
11/28/2019

Comparative Study of Differentially Private Synthetic Data Algorithms and Evaluation Standards

Differentially private synthetic data generation is becoming a popular s...
research
05/12/2022

On integrating the number of synthetic data sets m into the 'a priori' synthesis approach

Until recently, multiple synthetic data sets were always released to ana...
research
09/26/2021

Assessing, visualizing and improving the utility of synthetic data

The synthpop package for R https://www.synthpop.org.uk provides tools to...
research
12/12/2017

Guidelines for Producing Useful Synthetic Data

We report on our experiences of helping staff of the Scottish Longitudin...
research
09/15/2022

Private Synthetic Data for Multitask Learning and Marginal Queries

We provide a differentially private algorithm for producing synthetic da...
research
01/25/2022

A Latent Class Modeling Approach for Generating Synthetic Data and Making Posterior Inferences from Differentially Private Counts

Several algorithms exist for creating differentially private counts from...
