Risk-Efficient Bayesian Data Synthesis for Privacy Protection

08/20/2019
by Jingchen Hu, et al.

High-utility, low-risk synthetic data facilitate microdata dissemination by statistical agencies. In previous work, we induced privacy protection in any Bayesian data synthesis model by employing a pseudo posterior likelihood that exponentiates each likelihood contribution by a record-indexed weight in [0, 1], defined to be inversely proportional to the marginal identification risk for that record. Relatively risky records, those with high marginal probabilities of identification, tend to be isolated from other records. Downweighting their likelihood contributions tends to shrink the synthetic data values for those high-risk records, which in turn often increases the isolation of other moderate-risk records. As a result, the identification risk actually increases for some moderate-risk records after risk-weighted pseudo posterior synthesis, compared to an unweighted synthesis, a phenomenon we label "whack-a-mole". This paper constructs a weight for each record from a collection of pairwise identification risk probabilities with other records, where each pairwise probability measures the joint probability of re-identification of the pair of records. The by-record weights constructed from the pairwise identification risk probabilities tie together the identification risk probabilities across the data records and compress the distribution of by-record risks, which mitigates the whack-a-mole effect and produces a more efficient set of synthetic data with lower risk and higher utility. We illustrate our method with an application to the Consumer Expenditure Surveys of the U.S. Bureau of Labor Statistics. We provide general guidelines for statistical agencies to achieve their desired utility-risk trade-off when disseminating public use microdata files through synthetic data.
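
To make the weighting idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's construction: the abstract does not give the formulas, so the map from risk to weight (`1 - risk`, clipped to [0, 1]), the averaging of pairwise risks into a by-record risk, and the normal likelihood are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def marginal_weights(risks):
    """Map marginal risk probabilities in [0, 1] to weights in [0, 1].

    Assumption: weight = 1 - risk, one simple decreasing map.
    """
    return [min(1.0, max(0.0, 1.0 - r)) for r in risks]


def pairwise_weights(pair_risks):
    """Build one weight per record from a symmetric matrix of pairwise risks.

    pair_risks[i][j] is a hypothetical joint re-identification probability
    for records i and j. Assumption: record i's risk is the mean of its
    pairwise risks with all other records, so each weight depends on the
    other records rather than on record i alone.
    """
    n = len(pair_risks)
    weights = []
    for i in range(n):
        others = [pair_risks[i][j] for j in range(n) if j != i]
        mean_risk = sum(others) / len(others)
        weights.append(min(1.0, max(0.0, 1.0 - mean_risk)))
    return weights


def weighted_log_likelihood(y, weights, mu, sigma):
    """Pseudo log-likelihood: sum_i weight_i * log N(y_i | mu, sigma^2).

    Exponentiating a likelihood contribution by weight_i is the same as
    multiplying its log by weight_i. The normal model is an assumption.
    """
    total = 0.0
    for y_i, w_i in zip(y, weights):
        log_density = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                       - (y_i - mu) ** 2 / (2.0 * sigma ** 2))
        total += w_i * log_density
    return total
```

With all weights equal to 1 this reduces to the ordinary log-likelihood, and a weight of 0 removes a record's contribution entirely. In a full synthesizer the weighted log-likelihood would be combined with a log prior and sampled, which this sketch leaves out.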

research
01/19/2019

Bayesian Pseudo Posterior Synthesis for Data Privacy Protection

Statistical agencies utilize models to synthesize respondent-level data ...
research
06/01/2020

Identification Risk Evaluation of Continuous Synthesized Variables

We propose a general approach to evaluating identification risk of conti...
research
03/31/2022

Assessing the risk of re-identification arising from an attack on anonymised data

Objective: The use of routinely-acquired medical data for research purpo...
research
09/26/2018

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

The release of synthetic data generated from a model estimated on the da...
research
11/24/2022

Identifying discreditable firms in a large-scale ownership network

Violations of laws and regulations about food safety, production safety,...
research
06/01/2020

Re-weighting of Vector-weighted Mechanisms for Utility Maximization under Differential Privacy

We implement a pseudo posterior synthesizer for microdata dissemination ...
research
05/12/2023

The Progression of Disparities within the Criminal Justice System: Differential Enforcement and Risk Assessment Instruments

Algorithmic risk assessment instruments (RAIs) increasingly inform decis...
