Risk-Efficient Bayesian Data Synthesis for Privacy Protection
High-utility, low-risk synthetic data facilitate microdata dissemination by statistical agencies. In previous work, we induced privacy protection into any Bayesian data synthesis model by employing a pseudo posterior likelihood that exponentiates each likelihood contribution by a record-indexed weight in [0, 1], defined to be inversely proportional to the marginal identification risk for that record. Relatively risky records, those with high marginal probabilities of identification, tend to be isolated from other records. Downweighting their likelihood contributions tends to shrink the synthetic data values for those high-risk records, which in turn often increases the isolation of other moderate-risk records. As a result, the identification risk actually increases for some moderate-risk records after risk-weighted pseudo posterior synthesis, compared to an unweighted synthesis, a phenomenon we label "whack-a-mole". This paper constructs a weight for each record from a collection of pairwise identification risk probabilities with other records, where each pairwise probability measures the joint probability of re-identification of the pair of records. The by-record weights constructed from the pairwise identification risk probabilities tie together the identification risk probabilities across the data records and compress the distribution of by-record risks, which mitigates the whack-a-mole phenomenon and produces a more efficient set of synthetic data with lower risk and higher utility. We illustrate our method with an application to the Consumer Expenditure Surveys of the U.S. Bureau of Labor Statistics. We provide general guidelines to help statistical agencies achieve their desired utility-risk trade-off when disseminating public use microdata files through synthetic data.
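The risk-weighted pseudo likelihood described above can be illustrated with a minimal sketch. It is not the paper's implementation: the normal data model, the linear mapping from risk to weight, and the function names `risk_to_weight`, `normal_log_density`, and `pseudo_log_likelihood` are all assumptions made for illustration. The only property taken from the abstract is that each record's likelihood contribution is raised to a weight in [0, 1] that decreases with that record's identification risk. How the weights are built from pairwise risk probabilities is not specified here and is not shown.

```python
import math


def risk_to_weight(risk):
    """Map an identification risk probability to a weight in [0, 1].

    Assumption: a simple linear form, weight = 1 - risk, clipped to [0, 1].
    The abstract only states that the weight decreases as risk increases.
    """
    return min(1.0, max(0.0, 1.0 - risk))


def normal_log_density(y, mu, sigma):
    """Log density of a normal distribution (a hypothetical synthesis model)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((y - mu) / sigma) ** 2


def pseudo_log_likelihood(ys, weights, mu, sigma):
    """Weighted log-likelihood: sum_i w_i * log p(y_i | mu, sigma).

    Raising each likelihood contribution to the power w_i is the same as
    multiplying its log by w_i, so high-risk records (small w_i) have less
    influence on the fitted model.
    """
    return sum(w * normal_log_density(y, mu, sigma) for y, w in zip(ys, weights))


# Illustrative values only: the third record is isolated and treated as high risk.
ys = [1.0, 2.0, 50.0]
risks = [0.1, 0.2, 1.0]
weights = [risk_to_weight(r) for r in risks]
weighted = pseudo_log_likelihood(ys, weights, mu=2.0, sigma=1.0)
unweighted = pseudo_log_likelihood(ys, [1.0, 1.0, 1.0], mu=2.0, sigma=1.0)
```

With all weights equal to 1 the expression reduces to the ordinary log-likelihood, which corresponds to the unweighted synthesis used as the comparison in the abstract.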