On integrating the number of synthetic data sets m into the 'a priori' synthesis approach

05/12/2022
by James Edward Jackson, et al.

Until recently, multiple synthetic data sets were always released to analysts so that valid inferences could be obtained. However, under certain conditions, including when saturated count models are used to synthesize categorical data, single imputation (m=1) is sufficient. Even so, increasing m improves utility at the expense of higher risk, an instance of the risk-utility trade-off. The question, therefore, is which value of m is optimal with respect to that trade-off. The paper also considers two ways of analysing multiple categorical data sets: because such data sets have a contingency table representation, they can be averaged before being analysed, as opposed to the usual approach of analysing each data set separately and then averaging the results. In addition, the paper introduces a pair of metrics, τ_3(k,d) and τ_4(k,d), that are suited to assessing disclosure risk in multiple categorical synthetic data sets. Finally, the synthesis methods are demonstrated empirically.
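The two ways of analysing multiple categorical data sets mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three 2x2 tables hold made-up counts, the odds ratio is chosen only because it is a simple nonlinear statistic, and the helper names `odds_ratio` and `average_tables` are hypothetical.

```python
from statistics import mean


def odds_ratio(table):
    """Odds ratio of a 2x2 contingency table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def average_tables(tables):
    """Cell-wise mean of a list of equally shaped contingency tables."""
    m = len(tables)
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    return [
        [sum(t[i][j] for t in tables) / m for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical m = 3 synthetic data sets, each already tabulated as a
# 2x2 contingency table. The counts are invented for illustration only.
synthetic_tables = [
    [[10, 5], [4, 8]],
    [[12, 3], [6, 9]],
    [[8, 6], [5, 10]],
]

# Option 1: average the m tables first, then analyse the averaged table.
estimate_average_first = odds_ratio(average_tables(synthetic_tables))

# Option 2: analyse each table separately, then average the m estimates.
estimate_average_after = mean(odds_ratio(t) for t in synthetic_tables)

print("average tables, then analyse:", estimate_average_first)
print("analyse each, then average:  ", estimate_average_after)
```

For a statistic that is linear in the cell counts, such as a single cell mean, the two orderings give the same point estimate. For a nonlinear statistic such as the odds ratio used here, they generally differ. The sketch covers point estimates only and says nothing about variance estimation.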
