Moving towards practical user-friendly synthesis: Scalable synthetic data methods for large confidential administrative databases using saturated count models

07/16/2021
by   James Jackson, et al.

Over the past three decades, synthetic data methods for statistical disclosure control have continually developed; methods have adapted to account for different data types, but mainly within the domain of survey data sets. Certain characteristics of administrative databases - sometimes just the sheer volume of records of which they are comprised - present challenges from a synthesis perspective and thus require special attention. This paper, through the fitting of saturated models, presents a way in which administrative databases can not only be synthesized quickly, but in which risk and utility can also be formalised in a manner inherently infeasible with other techniques. The paper explores how the flexibility afforded by two-parameter count models (the negative binomial and Poisson-inverse Gaussian) can be utilised to protect respondents' - especially uniques' - privacy in synthetic data. Finally, an empirical example is carried out through the synthesis of a database which can be viewed as a good representative of the English School Census.
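
To make the idea concrete, the following is a minimal sketch of what synthesis from a saturated count model might look like. It is not the paper's implementation. It assumes the database has been tabulated into cell counts, that the saturated fit gives each cell a mean equal to its observed count, and that synthetic counts are drawn from a negative binomial with variance mu + sigma * mu^2. The function name `synthesize_counts`, the parameter name `sigma`, and the choice to leave empty cells at zero are all assumptions made for illustration.

```python
import numpy as np


def synthesize_counts(observed, sigma, seed=None):
    """Draw one synthetic count per cell, centred on the observed count.

    Assumed parameterisation: mean mu, variance mu + sigma * mu**2.
    sigma = 0 reduces to the Poisson; larger sigma adds more noise.
    """
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    synthetic = np.zeros(observed.shape, dtype=np.int64)

    # Assumption for this sketch: empty cells stay empty.
    positive = observed > 0
    mu = observed[positive]

    if sigma == 0:
        synthetic[positive] = rng.poisson(mu)
        return synthetic

    # Negative binomial as a gamma-Poisson mixture:
    # the gamma rate has mean mu and variance sigma * mu**2.
    rate = rng.gamma(shape=1.0 / sigma, scale=sigma * mu)
    synthetic[positive] = rng.poisson(rate)
    return synthetic


# Hypothetical cell counts, including two uniques (cells with count 1).
cell_counts = np.array([0, 1, 1, 5, 120])
synthetic_counts = synthesize_counts(cell_counts, sigma=0.5, seed=0)
```

In this sketch a larger `sigma` spreads each synthetic count further from its observed value, which is one way to read the abstract's point that a second parameter gives extra flexibility to protect uniques, at some cost in utility.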


