Perturbed Gibbs Samplers for Synthetic Data Release

12/18/2013
by   Yubin Park, et al.

We propose a categorical data synthesizer with a quantifiable disclosure risk. Our algorithm, named Perturbed Gibbs Sampler, can handle high-dimensional categorical data that are often intractable to represent as contingency tables. The algorithm extends a multiple imputation strategy for fully synthetic data by utilizing feature hashing and non-parametric distribution approximations. California Patient Discharge data are used to demonstrate the statistical properties of the proposed synthesizing methodology. Marginal and conditional distributions, as well as the coefficients of regression models built on the synthesized data, are compared to those obtained from the original data. Intruder scenarios are simulated to evaluate the disclosure risks of the synthesized data from multiple angles. Limitations and extensions of the proposed algorithm are also discussed.
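To make the ingredients named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It combines feature hashing of the conditioning columns, count-based (non-parametric) conditional estimates, and Gibbs sweeps over the columns. The perturbation shown here is plain additive smoothing with a hypothetical parameter `alpha`, which is an assumption; the abstract does not specify the perturbation mechanism. The function names, the CRC32 hash, the random initialization, and the default parameter values are likewise hypothetical.

```python
import random
import zlib
from collections import defaultdict


def bucket(values, n_buckets):
    """Feature hashing: map a sequence of categorical values to a bucket id."""
    key = "\x1f".join(map(str, values)).encode("utf-8")
    return zlib.crc32(key) % n_buckets


def fit_conditionals(rows, n_buckets):
    """For each column j, count values of x_j per hashed bucket of the other columns."""
    n_cols = len(rows[0])
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_cols)]
    for row in rows:
        row = tuple(row)
        for j in range(n_cols):
            rest = row[:j] + row[j + 1:]
            counts[j][bucket(rest, n_buckets)][row[j]] += 1
    return counts


def perturbed_gibbs(rows, n_samples, n_buckets=64, alpha=1.0, sweeps=5, seed=0):
    """Draw synthetic rows by Gibbs sweeps over smoothed, hashed conditionals.

    alpha is an ASSUMED additive-smoothing perturbation: larger values pull
    every conditional toward uniform, so samples depend less on the data.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    domains = [sorted({r[j] for r in rows}) for j in range(n_cols)]
    counts = fit_conditionals(rows, n_buckets)
    synthetic = []
    for _ in range(n_samples):
        # Assumed initialization: a uniformly random value for each column.
        x = [rng.choice(domains[j]) for j in range(n_cols)]
        for _ in range(sweeps):
            for j in range(n_cols):
                rest = x[:j] + x[j + 1:]
                seen = counts[j].get(bucket(rest, n_buckets), {})
                weights = [seen.get(v, 0) + alpha for v in domains[j]]
                x[j] = rng.choices(domains[j], weights=weights)[0]
        synthetic.append(tuple(x))
    return synthetic


if __name__ == "__main__":
    toy = [("A", "x", "1"), ("A", "y", "1"), ("B", "x", "2"), ("B", "y", "2")] * 5
    for record in perturbed_gibbs(toy, n_samples=3):
        print(record)
```

Hashing the conditioning columns keeps the number of stored conditional tables bounded by `n_buckets` per column, which is the reason a full contingency table is never materialized in this sketch. Collisions merge unrelated contexts, so the bucket count trades memory against fidelity.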


research
06/11/2019

Likelihood-free approximate Gibbs sampling

Likelihood-free methods such as approximate Bayesian computation (ABC) h...
research
03/30/2022

Generation and Simulation of Synthetic Datasets with Copulas

This paper proposes a new method to generate synthetic data sets based o...
research
10/25/2019

A Gibbs sampler for a class of random convex polytopes

We present a Gibbs sampler to implement the Dempster-Shafer (DS) theory ...
research
05/12/2022

On integrating the number of synthetic data sets m into the 'a priori' synthesis approach

Until recently, multiple synthetic data sets were always released to ana...
research
09/15/2022

Private Synthetic Data for Multitask Learning and Marginal Queries

We provide a differentially private algorithm for producing synthetic da...
research
09/17/2023

Fully Synthetic Data for Complex Surveys

When seeking to release public use files for confidential data, statisti...
research
04/14/2021

A hybrid Gibbs sampler for edge-preserving tomographic reconstruction with uncertain view angles

In computed tomography, data consist of measurements of the attenuation ...
