Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

The release of synthetic data generated from a model estimated on the data helps statistical agencies disseminate respondent-level data with high utility and privacy protection. Motivated by the challenge of disseminating sensitive variables containing geographic information in the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics, we propose two non-parametric Bayesian models as data synthesizers for the county identifier of each data record: a Bayesian latent class model and a Bayesian areal model. Both data synthesizers use Dirichlet Process priors to cluster observations with similar characteristics and allow borrowing of information across observations. We develop new disclosure risk measures to quantify the inherent risks in the original CE data and how those risks are ameliorated by our proposed synthesizers. By creating a lower bound and an upper bound of disclosure risks under a minimum and a maximum disclosure risk scenario, respectively, our proposed inherent risk measures provide a range of acceptable disclosure risks for evaluating the risk level in the synthetic datasets.


1 Introduction

The U.S. Bureau of Labor Statistics (BLS) utilizes various survey programs to collect individual-level and business establishment-level data. The Consumer Expenditure Surveys (CE) at the BLS is a survey program that focuses on collecting and publishing information about expenditures, income, and characteristics of consumers in the United States. The CE publishes summary, domain-level statistics used for both policy-making and research. These include the most widely used measure of inflation, the Consumer Price Index (CPI); measures of poverty that determine thresholds for the U.S. Government's Supplemental Poverty Measure; estimates of the cost of raising a child, used in making policies on foster care and child support; and estimates of American spending on health care, to name a few.

The CE consists of two surveys: i) the Quarterly Interview Survey, which aims to capture large purchases (such as rent, utilities, and vehicles), contains approximately 7,000 interviews, and is conducted every quarter; and ii) the Diary Survey, which is administered on an annual basis, focuses on capturing small purchases (such as food, beverages, and tobacco), and contains approximately 14,000 interviews of households. In this project, we focus on the CE public-use microdata (PUMD) that result from these instruments. Unlike published CE tables, which release information from the CE data in aggregated forms, the CE PUMD releases the CE data at the individual, respondent level, which potentially enables CE data users to conduct research tailored to their interests. Directly releasing individual-level data, however, poses privacy concerns. Under Title 13 and Title 26 of the U.S. Code, released versions of public-use data are subject to privacy and confidentiality protection. Values for sensitive variables deemed at high risk of disclosure are often suppressed and not reported.

A class of approaches for encoding privacy protection into sensitive variables to permit their public release constructs a Bayesian model for the respondent-level variable(s), estimates it on the data, and then simulates or “synthesizes” new data from the estimated model (Rubin, 1993; Little, 1993; Reiter, 2005b, c; Drechsler and Reiter, 2009; Caiola and Reiter, 2010; Kinney et al., 2011; Wang and Reiter, 2012; Paiva et al., 2014; Kinney et al., 2014; Quick et al., 2015; Wei and Reiter, 2016; Hu et al., 2018). The new data, commonly called “synthetic data”, are then proposed for release to the public. Synthetic data generated from flexible models, called synthesizers, should maintain a high level of usefulness (commonly called utility), while the smoothing of the data distribution induced by the model often achieves a high level of privacy and confidentiality protection.

The current CE PUMD of the Interview Survey contains more than 300 variables about characteristics of the consumer units (CU, i.e. households) and CU members, and detailed tax, income, and expenditure information about the CUs and their members. While the PUMD is rich and useful, a set of important variables about the detailed location of the CUs is not currently released due to privacy concerns and other considerations. In this paper, we construct synthesizers for the county label variable of the CUs, along with new disclosure risk measures to ensure adequate privacy protection for synthetically generated county labels, while at the same time ensuring that the synthetic data are useful to CE data users for their research purposes. Tailored to the categorical variables present in the CE data, we propose two non-parametric Bayesian models as data synthesizers. The first synthesizer employs a Dirichlet Process mixture of products of multinomials (DPMPM), which directly models the county label variable as a categorical variable where each county in the data receives a unique code. The second synthesizer constructs a new, nonparametric version of areal models with Dirichlet Process priors (DP-areal), which models counts of county labels of observations sharing similar characteristics, such as gender, income, and age categories.

On the utility of the synthetic datasets, we demonstrate and compare the effectiveness of the proposed DPMPM and DP-areal synthesizers in preserving important global and local distributional characteristics of the CE data. On disclosure risk evaluation (i.e. evaluating the risks of disclosure from releasing the synthetic data, where a higher level of disclosure risk means a lower level of privacy protection), we propose new disclosure risk measures that capture the inherent risk in the original, confidential data, to help set context for the reduction in disclosure risks offered by our two candidate synthesizers. Specifically, we consider the inherent minimum and maximum disclosure risks, which give us a lower bound and an upper bound of acceptable disclosure risks. The resulting range of acceptable disclosure risks enables data disseminating agencies to compare the disclosure risks of the original data and the synthetic data, which facilitates decision making about synthetic data release.

Section 1.1 introduces details of the CE sample data in the application, and discusses its features that motivate the development of our synthesizers. Section 1.2 provides a literature review of synthetic data generation, synthesis of geographic locations, and review of existing methods of disclosure risks evaluation.

1.1 The CE data

The CE data sample in our application comes from the 2017 1st quarter Interview Survey. There are n = 6,208 consumer units (CU) in this sample. We focus on 4 variables: gender, income, age, and county. Gender is categorical in nature, with 2 levels. Income and age are discretized, with 4 levels and 5 levels respectively. These three variables are non-geographic variables. See Table 1 for details of the variables.

Variable Description
Gender Gender of the reference person; 2 categories.
Income Imputed and reported income of the CU; 4 categories (based on 4 quartiles).
Age Age of the reference person; 5 categories (<20, 20-40, 40-60, 60-80, >80).
County County label of the CU; 133 categories.
Table 1: Variables in the CE data sample.

The county variable represents the county labels of the CUs. As a variable containing detailed geographic information, it is currently not released in the CE PUMD for privacy protection. In the 2017 1st quarter CE data sample, there are 133 counties observed, i.e. 133 counties are sampled. These 133 counties are only a small subset of the total number of counties and county-equivalents in the US (3,142 counties and county-equivalents in 2018). The observed 133 counties are scattered across the nation. Their geographic sparsity results in the county labels carrying little geographic information. Therefore, we consider county as a categorical variable with 133 levels. We define a “pattern” as a unique composition of non-geographic variables, i.e., a pattern is determined by the intersection of categories for the three non-geographic variables {Gender, Income, Age}. The cross tabulation of these three non-geographic variables creates 40 different patterns in total (2 × 4 × 5 = 40).

Index Pattern Observations Index Pattern Observations
1 {1, 1, 1} 27 2 {1, 1, 2} 170
3 {1, 1, 3} 168 4 {1, 1, 4} 194
5 {1, 1, 5} 48 6 {1, 2, 1} 3
7 {1, 2, 2} 193 8 {1, 2, 3} 183
9 {1, 2, 4} 242 10 {1, 2, 5} 61
11 {1, 3, 1} 3 12 {1, 3, 2} 291
13 {1, 3, 3} 275 14 {1, 3, 4} 199
15 {1, 3, 5} 19 16 {1, 4, 1} 4
17 {1, 4, 2} 239 18 {1, 4, 3} 454
19 {1, 4, 4} 169 20 {1, 4, 5} 4
21 {2, 1, 1} 33 22 {2, 1, 2} 229
23 {2, 1, 3} 222 24 {2, 1, 4} 333
25 {2, 1, 5} 128 26 {2, 2, 1} 9
27 {2, 2, 2} 250 28 {2, 2, 3} 254
29 {2, 2, 4} 308 30 {2, 2, 5} 53
31 {2, 3, 1} 8 32 {2, 3, 2} 244
33 {2, 3, 3} 312 34 {2, 3, 4} 184
35 {2, 3, 5} 18 36 {2, 4, 1} 3
37 {2, 4, 2} 198 38 {2, 4, 3} 344
39 {2, 4, 4} 122 40 {2, 4, 5} 10
Table 2: List of 40 patterns and the number of observations in each pattern. A pattern is presented as {Gender, Income, Age}.

As evident in Table 2, the number of observations in every pattern varies greatly, from the maximum 454 observations in Pattern 18 to the minimum 3 observations in Patterns 6, 11, and 36. The presence of patterns with a small number of observations motivates us to develop data synthesizers that allow borrowing information across patterns to strengthen estimation for patterns with small numbers of observations.
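The pattern construction can be illustrated with a short sketch. It enumerates the 40 patterns as the cross product of the category counts in Table 1 and attaches the observation counts transcribed from Table 2. The variable names (`LEVELS`, `patterns`, `counts`) are hypothetical and only serve this illustration.

```python
from itertools import product

# Number of categories per non-geographic variable, from Table 1.
LEVELS = {"gender": 2, "income": 4, "age": 5}

# Enumerate patterns {Gender, Income, Age} in the same order as Table 2.
patterns = list(product(*(range(1, d + 1) for d in LEVELS.values())))
pattern_index = {pat: i for i, pat in enumerate(patterns, start=1)}

# Observation counts transcribed from Table 2, ordered by pattern index 1..40.
counts = [
    27, 170, 168, 194, 48, 3, 193, 183, 242, 61,
    3, 291, 275, 199, 19, 4, 239, 454, 169, 4,
    33, 229, 222, 333, 128, 9, 250, 254, 308, 53,
    8, 244, 312, 184, 18, 3, 198, 344, 122, 10,
]

largest = max(range(1, 41), key=lambda i: counts[i - 1])
smallest = [i for i in range(1, 41) if counts[i - 1] == min(counts)]
print(len(patterns), sum(counts), largest, smallest)
```

The counts sum to the sample size n = 6,208, and the spread between the largest and smallest patterns is what motivates borrowing information across patterns.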

1.2 Literature review

1.2.1 Synthetic data

Depending on the sensitivity of the variables in a study, agencies can either generate fully synthetic datasets, where all variables are deemed sensitive and are therefore synthesized (Rubin, 1993), or generate partially synthetic datasets, where only some variables are deemed sensitive and synthesized, and the other variables are un-synthesized (Little, 1993). In the CE data sample, since only the county label is deemed sensitive, we aim to generate partially synthetic data where only the county is synthesized. Gender, income, and age are un-synthesized. The record label is maintained in partially synthetic data, though the synthesized variable (which is the county label in our CE application) is generated from the posterior predictive distribution of the synthesizer.

1.2.2 Synthesis of locations

Variables containing geographic information are generally deemed sensitive; however, these variables are extremely helpful for data analysts conducting research related to locations. Therefore, a number of researchers have proposed synthesizers to generate synthetic geographic data.

One stream of work has treated the geographic location as variable(s) carrying little geographic information, so the proposed synthesizers do not incorporate spatial modeling. Wang and Reiter (2012) and Drechsler and Hu (2018+) developed CART models (Reiter, 2005c) to synthesize continuous longitude and latitude. Drechsler and Hu (2018+) also combined the continuous longitude and latitude variables into a single categorical geographic variable and used versions of categorical CART models for its synthesis. In addition, they used the DPMPM synthesizer (Hu et al., 2014) to generate synthetic categorical locations.

Another stream of work has explicitly incorporated spatial modeling in synthesizers for densely-observed geographic variables. Paiva et al. (2014) aggregated counts of geographic locations to a pre-defined grid level, modeled the counts through an areal-level spatial model, which included spatial random effects with Conditional Autoregressive (CAR) priors (Besag et al., 1991), then synthesized counts of locations from the estimated model for release. Quick et al. (2015) developed synthesizers based on Bayesian marked point processes (Liang et al., 2009). Zhou et al. (2010) and Quick et al. (2018) developed differential smoothing-based synthesizers based on spatially-indexed distances.

Considering the county labels in the CE data sample, especially the fact that the observed 133 county labels are only a small subset (a little over 4%) of the total number of 3,142 counties and county-equivalents in the US, we believe the county labels themselves carry little geographic information. Therefore, we work with synthesizers that do not incorporate spatial modeling. Specifically, we choose the DPMPM synthesizer, and we develop a new, non-spatial version of the count-based areal synthesizer in Paiva et al. (2014) with non-spatial, nonparametric priors, labeled the DP-areal synthesizer.

1.2.3 Disclosure risks

Developing appropriate synthesizers allows agencies to generate useful synthetic datasets; however, before the release of synthetic data to the public, the data disseminating agencies have to evaluate the level of privacy protection (or lack thereof) provided by the synthetic data. Synthetic data release takes place only if the synthetic data provide a sufficient level of privacy protection. Typically, privacy protection in the synthetic data is determined by the evaluation of its disclosure risks. The higher the disclosure risks, the lower the privacy protection, and vice versa.

There are two general categories of disclosure risks in synthetic datasets: i) identification disclosure risks, and ii) attribute disclosure risks. For partially synthetic data, such as only synthesizing the county labels and keeping other non-geographic variables un-synthesized in the CE data sample, both identification disclosure risks and attribute disclosure risks exist (Hu, 2018+).

Identification disclosure risks exist when an intruder has access to information about all of the variables for a target record through external files, and tries to match those values with available information in the released data to identify the name associated with that record. Widely used measures for this type of risk are based on Bayesian probabilistic matching (Duncan and Lambert, 1986, 1989; Lambert, 1993; Reiter, 2005a; Reiter and Mitra, 2009). For example, in the CE data sample, suppose an intruder knows the particular age, gender, and income pattern, as well as the county label, of a person named “Betty”. The intruder wants to identify which record in the CE synthetic data belongs to Betty. Intuitively, the intruder will go through the synthetic datasets by matching the known pattern and the synthesized county label. Suppose record $i$ belongs to Betty. Let $c_i$ denote the number of records in the sample sharing the same pattern and the true county label (in the original data) for record $i$. Then $1/c_i$ gives the probability of the intruder randomly and correctly guessing the record attached to Betty based on matching attribute values. In general, the larger the value of $c_i$, the lower the identification disclosure risk for record $i$. If the county label in the released synthetic data for record $i$ is different from the true label in the real data, then record $i$ will not be among the $c_i$ records, which means Betty is not among those records. Let $T_i$ be a binary indicator of whether the true match is among the $c_i$ records. If the county label in the synthetic data for Betty’s record, $i$, is the same as in the real data, then $T_i = 1$; otherwise, if the county label in the synthetic data for record $i$ is different from that in the real data, then $T_i = 0$. If $T_i = 1$, the intruder has a probability $1/c_i$ of finding the record belonging to Betty. If $T_i = 0$, however, the intruder has a probability $0$ of finding the record belonging to Betty because her record is not among the $c_i$ records. Therefore, the ratio $T_i / c_i$ (either $1/c_i$ or $0$) provides a measure of the expected identification match risk for record $i$.
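The expected match risk described above can be sketched in a few lines. The function below is a minimal illustration, not the authors' implementation: for each record it counts the synthetic records that share the record's pattern and carry its true county label, checks whether the record's own synthetic label is unchanged, and returns the indicator divided by the count. The function name and the toy data are hypothetical.

```python
from collections import Counter

def expected_match_risks(patterns, true_county, synth_county):
    """Per-record expected identification match risk in one synthetic dataset.

    patterns     : list of hashable patterns, e.g. (gender, income, age)
    true_county  : county labels in the original (confidential) data
    synth_county : county labels in the synthetic data, same record order
    """
    # Number of synthetic records for each (pattern, synthesized county) pair.
    counts = Counter(zip(patterns, synth_county))
    risks = []
    for pat, true_c, syn_c in zip(patterns, true_county, synth_county):
        matched = counts[(pat, true_c)]      # records the intruder would match
        is_among = 1 if syn_c == true_c else 0  # true record among the matches?
        # When is_among == 1 the record itself is counted, so matched >= 1.
        risks.append(is_among / matched if is_among else 0.0)
    return risks

# Toy example with four records and three counties.
toy_patterns = [(1, 1, 2), (1, 1, 2), (1, 1, 2), (2, 3, 4)]
toy_true = ["A", "A", "B", "C"]
toy_synth = ["A", "B", "A", "C"]
print(expected_match_risks(toy_patterns, toy_true, toy_synth))
```

In the toy data, the first record keeps its county and shares it with one other synthetic record of the same pattern, so its risk is one half; the last record is unique and unchanged, so its risk is one.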

In addition to the expected match risks, measures such as the true match rate (the percentage of true unique matches among target records) and the false match rate (the percentage of false matches among unique matches) are also useful (Reiter and Mitra, 2009; Drechsler and Reiter, 2010; Hu and Hoshino, 2018; Hu, 2018+; Drechsler and Hu, 2018+).

Attribute disclosure risk measures how likely an intruder is to discover the true value of the synthesized variables; in our case, the county label. Reiter (2012), Wang and Reiter (2012), and Reiter et al. (2014) proposed a Bayesian approach to compute a posterior probability of identifying the true attribute for each record under the synthesizer (Hu et al., 2014; Paiva et al., 2014; Wei and Reiter, 2016; Hu et al., 2018; Manrique-Vallier and Hu, 2018). This general framework has the advantage of providing interpretable probability statements about the attribute disclosure risks (Hu, 2018+). Yet, to be computationally tractable, the procedure requires that the intruder knows the true attribute values of the synthesized variables for every record except the target record. The approach does not scale to multiple categorical synthesized variables or to a synthesized variable with many categories, such as our county label variable (with 133 categories). In our synthetic CE data application, we construct an attribute disclosure risk measure from our identification risk formulation. Our approach counts the number of records in the file where the synthesized value matches the true data value, providing an overall file-level summary that contrasts with the record-level statistic in Hu et al. (2014).
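A minimal sketch of the file-level summary described above follows. It counts the records whose synthesized county equals the true county. Averaging the resulting match rate across several synthetic datasets is an assumption made for this illustration; the function names are hypothetical.

```python
def attribute_match_count(true_county, synth_county):
    """Number of records whose synthesized county equals the true county."""
    if len(true_county) != len(synth_county):
        raise ValueError("datasets must contain the same records")
    return sum(t == s for t, s in zip(true_county, synth_county))

def average_match_rate(true_county, synthetic_counties):
    """Average share of matching records over several synthetic datasets.

    Averaging over datasets is an assumption of this sketch.
    """
    n = len(true_county)
    rates = [attribute_match_count(true_county, s) / n for s in synthetic_counties]
    return sum(rates) / len(rates)

toy_true = ["A", "A", "B", "C"]
toy_synthetic = [["A", "B", "A", "C"], ["A", "A", "B", "C"]]
print(average_match_rate(toy_true, toy_synthetic))
```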

Moreover, in the sequel we extend these identification and attribute disclosure risk measures in a novel way to capture the disclosure risks inherent in the real data, independent of the synthesizers. We create risk measures under a minimum scenario and a maximum scenario and obtain a lower bound and an upper bound, respectively. We argue that disclosure risks in the synthetic datasets may be judged based on how they fit within these bounds.

The remainder of the paper is organized as follows: In Section 2, we describe the DPMPM synthesizer, the DP-areal synthesizer, and the computation details of their implementation. Section 3 presents the utility measures and results of the synthetic CE; Section 4 presents the proposed disclosure risks measures for the original CE data, and the disclosure risks results of the synthetic CE data. The paper concludes with discussion in Section 5.

2 Synthesizers

Our goal is to generate partially synthetic data for the CE data sample of n = 6,208 observations and p = 4 categorical variables. The variables gender, income, and age are considered insensitive and un-synthesized, whereas the county label variable is considered sensitive and synthesized. As discussed in Section 1.1, the county labels carry little geographic information due to the small number of observed counties in the CE data sample. Furthermore, although not all 133 counties are observed within every pattern formed by {Gender, Income, Age}, it is reasonable to believe that those unobserved counties among the 133 counties can be sampled in another quarterly CE data sample. Therefore, the synthesizers for the CE data sample will allow all 133 counties for any pattern, even though some counties are not observed for some patterns in the sample. In other words, the synthesizers will allow non-zero probabilities for unobserved combinations of county labels and patterns, by design. We proceed to describe two models for synthesizing the county label attribute.

2.1 Dirichlet Process mixtures of product of multinomials (DPMPM)

The DPMPM is a Bayesian version of latent class models for unordered categorical data. The Dirichlet Process prior allows an infinite number of mixture components, lets the data determine the effective number of components, and provides support on all distributions of multivariate categorical variables (Dunson and Xing, 2009). Si and Reiter (2013) used the DPMPM as a missing data imputation engine, and Hu et al. (2014) first used it as a synthesizer for a sample of the American Community Survey (ACS). Drechsler and Hu (2018+) also used the DPMPM synthesizer for simulating geolocations of a German census.

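The description of the DPMPM synthesizer follows that in Hu and Hoshino (2018). Consider a sample $\mathbf{X}$ that consists of $n$ records, where each record has $p$ categorical variables. For the CE data, $p = 4$, comprising the three non-geographic variables (gender, income, and age) and the geographic variable, the county label. The basic assumption of the DPMPM is that every record, $\mathbf{X}_i = (X_{i1}, \dots, X_{ip})$, belongs to one of $K$ underlying latent classes, which are unobserved, thus latent. Given the latent class assignment $z_i$ of record $i$, as in Equation (2), the value for record $i$ and attribute $j$, $X_{ij}$, independently follows its own Multinomial distribution, as in Equation (1). Here $d_j$ denotes the number of categories of variable $j$, $\theta^{(j)}_{k c}$ is the probability that variable $j$ takes category $c$ in class $k$, and $\pi_k$ is the probability of class $k$.

$$X_{ij} \mid z_i, \theta \sim \text{Multinomial}\left(\theta^{(j)}_{z_i 1}, \dots, \theta^{(j)}_{z_i d_j}; 1\right) \quad \text{for all } i, j \tag{1}$$
$$z_i \mid \pi \sim \text{Multinomial}\left(\pi_1, \dots, \pi_K; 1\right) \quad \text{for all } i \tag{2}$$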

The DPMPM clusters records with similar characteristics based on all attributes. Relationships among these categorical attributes are induced by integrating out the latent class assignments. To empower the DPMPM to pick the effective number of occupied latent classes, the truncated stick-breaking representation (Sethuraman, 1994) is used as in Equation (3) through Equation (6),

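$$\pi_k = V_k \prod_{l < k} (1 - V_l) \quad \text{for } k = 1, \dots, K \tag{3}$$
$$V_k \overset{iid}{\sim} \text{Beta}(1, \alpha) \quad \text{for } k = 1, \dots, K - 1, \qquad V_K = 1 \tag{4}$$
$$\alpha \sim \text{Gamma}(a_\alpha, b_\alpha) \tag{5}$$
$$\theta^{(j)}_k = \left(\theta^{(j)}_{k1}, \dots, \theta^{(j)}_{k d_j}\right) \sim \text{Dirichlet}\left(a^{(j)}_1, \dots, a^{(j)}_{d_j}\right) \tag{6}$$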

and a blocked Gibbs sampler is implemented for the Markov chain Monte Carlo sampling procedure (Ishwaran and James, 2001; Si and Reiter, 2013; Hu et al., 2014; Drechsler and Hu, 2018+).
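To make the generative structure concrete, the sketch below simulates data from a truncated stick-breaking mixture of products of multinomials. It is a forward simulation under assumed hyperparameters (a fixed concentration value and flat Dirichlet priors), not the blocked Gibbs sampler used for estimation. All function and argument names are hypothetical.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick-breaking weights for K classes."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    # remaining[k] = product of (1 - v[l]) over l < k
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def simulate_dpmpm(n, levels, alpha, K, rng):
    """Simulate n records with len(levels) categorical variables.

    levels[j] is the number of categories of variable j. Flat Dirichlet
    priors on the class-specific probabilities are an assumption here.
    """
    pi = stick_breaking_weights(alpha, K, rng)
    # theta[j] has shape (K, levels[j]): one probability vector per class.
    theta = [rng.dirichlet(np.ones(d), size=K) for d in levels]
    z = rng.choice(K, size=n, p=pi)  # latent class of each record
    X = np.column_stack([
        np.array([rng.choice(d, p=theta[j][k]) for k in z])
        for j, d in enumerate(levels)
    ])
    return pi, z, X

rng = np.random.default_rng(0)
# Category counts for gender, income, age, and county in the CE sample.
pi, z, X = simulate_dpmpm(n=100, levels=[2, 4, 5, 133], alpha=1.0, K=40, rng=rng)
print(X[:3])
```

Because the last stick is set to one, the weights sum to one for any truncation level, which is what allows a finite number of classes to stand in for the infinite mixture.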

To generate the synthetic county label of each record using the DPMPM synthesizer, we first draw sample values of the model parameters from their posterior distribution at a given MCMC iteration, including the multinomial probabilities of the county label variable. We then generate the vector of latent class assignments through a Multinomial draw with the sampled class probabilities, as in Equation (2), and next generate each synthetic county label through a Multinomial draw with the sampled county label probabilities of the assigned class, as in Equation (1). Together with the un-synthesized variables, the synthetic county labels form a partially synthetic dataset at that MCMC iteration. We repeat the process $m$ times, creating $m$ independent partially synthetic datasets.
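The synthesis step can be sketched as follows, assuming that posterior draws of the latent class assignments and of the class-specific county probabilities have already been saved from the sampler. The array layouts and function names are hypothetical; the sketch only shows how a partially synthetic dataset is assembled from one such draw.

```python
import numpy as np

def synthesize_county(z, theta_county, rng):
    """Draw one synthetic county per record.

    z            : length-n array of latent class labels (0-based)
    theta_county : (K, G) array; row k holds county probabilities for class k
    """
    n_counties = theta_county.shape[1]
    return np.array([rng.choice(n_counties, p=theta_county[k]) for k in z])

def make_synthetic_datasets(unsynthesized, z_draws, theta_draws, rng):
    """Build one partially synthetic dataset per saved posterior draw.

    unsynthesized : (n, 3) array of gender, income, age (kept as observed)
    z_draws, theta_draws : sequences of saved draws, one pair per dataset
    """
    datasets = []
    for z, theta_county in zip(z_draws, theta_draws):
        county = synthesize_county(z, theta_county, rng)
        datasets.append(np.column_stack([unsynthesized, county]))
    return datasets

rng = np.random.default_rng(1)
unsynth = np.array([[1, 1, 2], [1, 4, 3], [2, 2, 4], [2, 3, 5]])
z_draws = [rng.integers(0, 3, size=4) for _ in range(2)]
theta_draws = [rng.dirichlet(np.ones(5), size=3) for _ in range(2)]
synthetic = make_synthetic_datasets(unsynth, z_draws, theta_draws, rng)
print(synthetic[0])
```

Only the last column changes between datasets; the un-synthesized columns are copied unchanged, which is what makes the output partially synthetic.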

2.2 Areal models with Dirichlet Process prior on random effect (DP-areal)

The DP-areal synthesizer is built upon areal-level spatial models, also known as disease mapping models (Clayton and Kaldor, 1987; Besag et al., 1991; Clayton and Bernardinelli, 1992). Paiva et al. (2014) developed extensions of the areal-level spatial models as engines to generate simulated locations. Specifically, they i) created pre-defined areal units based on non-geographic variables, ii) aggregated counts of geographic locations to the pre-defined areal units, iii) estimated areal-level spatial models that predict observed, areal-level counts with spatial random effects using a Conditional Autoregressive (CAR) prior, and iv) simulated new locations for each individual from the estimated models. Crucial to the setup in Paiva et al. (2014) are the pre-defined patterns formed by the intersection of non-areal attributes. Recall that a pattern for the CE sample is determined by the composition of {Gender, Income, Age}, and there are 40 patterns in the CE data.

However, as discussed before, due to the little geographic information carried in county labels in the CE data as a result of the geographic sparsity of county labels, the use of spatial random effects and CAR priors on them in Paiva et al. (2014) is not appropriate. Instead, we include non-spatial random effects from other sources and specify Dirichlet Process (DP) priors for them in our application. We now turn to the description of our DP-areal synthesizer.

Let denote a unique pattern of non-geographic variables, and , where is the total number of unique patterns ( in the CE data). Let be the count of observations in county within pattern . When there is no observation of a particular county for pattern , zeros are inserted so that . For clarity, we reserve the word “combination” for non-geographic variables and the geographic attribute county label, and the word “pattern” for only the non-geographic variables.

(7)
(8)
(9)

In the regression in Equation (8), is the overall intercept for , the logarithm of Poisson rate for county and pattern . This set-up specifies a Poisson-lognormal model where precision parameter, , allows for over-dispersion. Note that we let , where , represents the total sum of the number of categories of all non-geographic categorical variables. Subsequently, is an vector comprising ones at positions (the attribute values at positions for all non-geographic attributes in pattern ) and zeros elsewhere.
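The indicator vector described above can be illustrated with a small sketch. With 2 gender, 4 income, and 5 age categories, the vector has 2 + 4 + 5 = 11 entries, with a one at the position of each attribute value of the pattern. The ordering of the blocks (gender, then income, then age) and the function name are assumptions of this illustration.

```python
import numpy as np

# Number of categories of gender, income, and age (Table 1).
LEVELS = (2, 4, 5)

def pattern_indicator(pattern, levels=LEVELS):
    """Stacked one-hot encoding of a pattern such as (1, 4, 3).

    Category values are 1-based, as in Table 2.
    """
    x = np.zeros(sum(levels), dtype=int)
    offset = 0
    for value, n_categories in zip(pattern, levels):
        x[offset + value - 1] = 1  # mark this variable's observed category
        offset += n_categories     # move to the next variable's block
    return x

print(pattern_indicator((1, 4, 3)))
```

Each pattern therefore maps to a vector with exactly three ones, one per non-geographic variable.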

Two types of random effects are considered: i) combination-specific random effect, denoted by , and ii) county-specific and variable-specific random effect, denoted by . To adequately model these random effects, the truncated DP priors are specified on ’s and ’s to allow flexible clustering counties of similar characteristics. Here, , denotes the cluster indicator for each combination, . Then, represents the combination-specific random effect, where all combinations in the same mixture component (i.e. when ) share the same unique random effect value or “location”, . Similarly, is a county-specific and value-specific random effect, where all counties in the same mixture component (i.e., when ) share the same random effect, . The cluster assignment, , for each combination is generated from a Multinomial draw with cluster probabilities, in Equation (9) with cluster-indexed coefficients or locations, given by . Moreover, the total number of mixture components or clusters is truncated at . The current truncated mixture model becomes arbitrarily close to a Dirichlet process mixture as in Equation (9).

We use the truncated stick-breaking representation for the prior distribution of (Sethuraman, 1994). We specify i.i.d. normal priors for ’s and multivariate normal priors for locations, ’s, and a univariate normal prior for the overall mean .

(10)
(11)
(12)

To generate the synthetic county label of each record, we follow the general approach of Paiva et al. (2014). At MCMC iteration , we, firstly, gather all records with the same pattern. Secondly, we collect all the ’s from the unique combination of pattern and the county . Thirdly, we compute

(13)

where is the number of all county labels within pattern (e.g. G = 133 in the CE data). Finally, we take a Multinomial draw with sample size 1,

(14)

where

is the random variable representing the county label of record

, . We repeat this process for all records in the sample, creating a partially synthetic dataset . Then the entire process is repeated for times, creating independent partially synthetic datasets, .
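The final synthesis steps can be sketched as below. The sketch assumes that the quantity computed in Equation (13) is the set of county probabilities obtained by normalizing the sampled Poisson rates within a pattern, in the spirit of Paiva et al. (2014). That reading is an assumption, and the function names and toy rates are hypothetical.

```python
import numpy as np

def county_probabilities(rates):
    """Normalize within-pattern Poisson rates into county probabilities."""
    rates = np.asarray(rates, dtype=float)
    return rates / rates.sum()

def synthesize_pattern(rates, n_records, rng):
    """Draw a synthetic county for every record in one pattern.

    Each record receives an independent Multinomial draw of size 1,
    returned here as a 0-based county index.
    """
    probs = county_probabilities(rates)
    return rng.choice(len(probs), size=n_records, p=probs)

rng = np.random.default_rng(2)
toy_rates = [0.5, 1.5, 2.0]  # sampled rates for three counties in one pattern
print(county_probabilities(toy_rates))
print(synthesize_pattern(toy_rates, n_records=6, rng=rng))
```

Because every county has a positive rate under the model, every county receives a non-zero probability within every pattern, including combinations not observed in the sample.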

2.3 Computation

Computation of the DPMPM synthesizer is done with the NPBayesImpute R package. We run the DPMPM synthesizer on the CE sample for 10,000 iterations with 5,000 burn-in. We follow the recommendations of Dunson and Xing (2009), Si and Reiter (2013), Hu et al. (2014), and Drechsler and Hu (2018+) in setting the hyperparameters, and place uninformative priors on the multinomial probabilities of every variable. We set K = 40 and track the number of occupied latent classes, whose 95% interval is (28, 36), indicating that K = 40 is sufficiently large. We generate m = 20 synthetic datasets using parameter draws from iterations that are far apart from each other, so that the datasets can be treated as independent. We refer to these as the DPMPM synthetic datasets.

Computation of the DP-areal synthesizer is done using the Stan programming language (Stan Development Team, 2016). We ran the DP-areal synthesizer on the CE sample for 4,000 iterations with 2,000 burn-in. Because Stan employs a Hamiltonian Monte Carlo sampler that suppresses the usual random walk behavior of the Metropolis-Hastings sampler, posterior sampling iterations are far less correlated than under a Gibbs sampler, which permits the use of far fewer iterations. We set , and specify prior distributions for , and . For the covariance matrix in the multivariate normal prior distribution for the locations, we decompose it into a vector of variances, which are placed on the diagonal of a diagonal matrix, and a correlation matrix. We select a truncated prior distribution for the components of the variance vector. We choose the LKJ prior distribution for the correlation matrix, which has a single hyperparameter, $\eta$, that controls how tightly the prior distribution is centered on the identity matrix (meaning independence). We select $\eta = 1$, which denotes a uniform distribution over the space of correlation matrices and is the most weakly informative prior possible, so that we let the data learn the values. We set K = 50 and generate m = 20 synthetic datasets. We refer to these as the DP-areal synthetic datasets.

3 Utility measure

When synthetic microdata is released to the public, data analysts would use it to conduct inferences based on their research interests. Useful synthetic data would provide high quality inference results, as if the data analysts had access to the original and confidential data. The level of quality depends on the level of closeness between the inference results from the original data and from the synthetic data, respectively.

Measures of this closeness, commonly referred to as utility measures, are therefore needed to evaluate the usefulness of the synthetic data. We focus on measuring how well the distributional characteristics of the original data are preserved in the synthetic datasets. Since only the county labels are synthesized in the synthetic CE data, we examine the preservation of the distributional characteristics of the county label and of its relationships with the other, un-synthesized variables. We do so by conducting the same analysis on the original dataset, on the synthetic datasets generated from the DPMPM synthesizer, and on the synthetic datasets generated from the DP-areal synthesizer, and comparing the synthetic results to those of the original data. Since we have defined patterns in the CE data, we consider two categories of utility measures: i) a global utility measure, which focuses on the distributional characteristics of the county label and its relationships with other variables at the file level; and ii) a within-pattern utility measure, which focuses on the distributional characteristics of the county label at the pre-defined pattern level.

3.1 Global utility measure

Table DPMPM DP-areal
one-way 9.087 17.105
two-way 47.908 64.165
three-way 80.826 88.843
Table 3: Sum of absolute deviations of the one-way, two-way, and three-way tables of the synthetic datasets from those of the original dataset. Results are averages over m = 20 partially synthetic datasets, divided by 100 for readability.

We proceed to formulate measures of utility for our two synthesizers based on a typical manner in which data analysts use the CE data. In the CE data, the three non-geographic variables (gender, income, and age) and the one geographic variable (county label) are all constructed to be categorical. Furthermore, only the county label is synthesized. It is therefore useful to calculate one-way, two-way, and three-way tables of counts of observations for the entire file, and to compare the counts computed from the original data to those from the synthetic datasets. Comparing the accuracy of the synthetic data in reproducing table counts provides a measure of the deviation of the synthetic datasets from the original dataset. Since these tables are constructed for the entire file consisting of all records, this utility measure is regarded as a global utility measure.

Without loss of generality, consider the synthetic datasets generated from one synthesizer and the original CE data (which is kept confidential and not released to the public).

For one-way tables, we compute the counts of observations in the 133 categories of county label in the original dataset, as well as the counts of observations in the 133 categories of county label in a synthetic dataset. We next calculate the differences in the counts between the original and synthetic datasets, and report the sum of the absolute differences to avoid cancellation of positive and negative differences. This process is repeated for every synthetic dataset. For the two-way tables, we compute the counts of observations in the contingency tables formed by county label and another non-geographic variable and follow the same procedure as for the one-way tables. Similarly, for the three-way tables, counts of observations in the contingency tables formed by county label and two other non-geographic variables are computed in the original dataset and in each synthetic dataset.

Table 3 gives a summary of the absolute deviance of the synthetic data from the original data. Results are averages of m = 20 partially synthetic datasets, with results for the DPMPM synthetic datasets in column DPMPM and results for the DP-areal synthetic datasets in column DP-areal. These summaries show that the DPMPM synthesizer produces smaller absolute deviance than the DP-areal synthesizer, especially in the one-way and two-way tables.
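The table-count comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout (dictionaries with keys `county`, `gender`, `income`, `age`), the function names, and the choice to add up the deviations over all tables of a given dimension are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def table_counts(records, keys):
    """Count records in each cell of the contingency table formed by `keys`."""
    return Counter(tuple(rec[k] for k in keys) for rec in records)


def sum_abs_deviation(original, synthetic, keys):
    """Sum of absolute differences in cell counts between two files."""
    orig = table_counts(original, keys)
    synth = table_counts(synthetic, keys)
    cells = set(orig) | set(synth)  # a Counter returns 0 for absent cells
    return sum(abs(synth[c] - orig[c]) for c in cells)


def global_utility(original, synthetic_datasets,
                   other_vars=("gender", "income", "age")):
    """Average one-, two-, and three-way deviations over synthetic datasets.

    Assumption: every table contains the synthesized county label plus
    (way - 1) un-synthesized variables, and deviations are summed over all
    tables of the same dimension.
    """
    results = {}
    for way in (1, 2, 3):
        key_sets = [("county",) + combo
                    for combo in combinations(other_vars, way - 1)]
        totals = [
            sum(sum_abs_deviation(original, synth, keys) for keys in key_sets)
            for synth in synthetic_datasets
        ]
        results[way] = sum(totals) / len(totals)
    return results
```

Because only the county label is synthesized, any table that omits the county label is identical in the original and synthetic files, which is why the sketch restricts attention to tables that include it.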

Figure 1: One-way table of deviation, DPMPM vs DP-areal.

Figure 1, Figure 2(a), and Figure 2(b) visualize the one-way, two-way and three-way deviations in all synthetic datasets through density plots. The actual deviations, not the absolute deviations, are plotted. As is especially evident in Figure 1, the one-way deviation in the DPMPM synthetic datasets is more concentrated around 0 than that in the DP-areal synthetic datasets, indicating less overall deviation of the synthetic datasets generated by the DPMPM synthesizer from the original dataset. The findings are in accordance with Table 3, showing a higher level of preservation of the distributional characteristics of county label (i.e. the utility of the synthetic data) by the DPMPM synthesizer than by the DP-areal synthesizer. The DP-areal synthesizer, nevertheless, produces good utility. The density plots of actual deviations in the two-way and three-way tables in Figure 2(a) and Figure 2(b) show better utility performance of the DPMPM synthesizer overall, although the difference in performance is smaller than for the one-way tables. Overall, the global utility evaluation shows higher utility for synthetic data generated by the DPMPM synthesizer than by the DP-areal synthesizer, though the utility of both synthesizers is good.

(a) Two-way table.
(b) Three-way table.
Figure 2: Two-way and three-way tables of deviation, DPMPM vs DP-areal.

3.2 Within pattern utility measure

Another approach for evaluating the synthetic data utility is to compare the induced distributions over the county labels within each pattern to the distribution in the original data. Recall that we define a pattern as a unique composition of non-geographic variables, as {Gender, Income, Age}, and there are 40 patterns in the CE data sample. Since the partially synthetic CE datasets have only the county label synthesized, evaluating the preservation of distributional characteristics of county label within each pattern is of particular interest. This utility measure is within pattern, or local.

Figure 3: Counties in Pattern 1 to Pattern 4.

To visualize the distribution of county label within each pattern, and the preservation of this distribution by the DPMPM synthesizer and the DP-areal synthesizer, we draw density plots of the three distributions on the same graph. In Figure 3, the red curve represents the original data distribution of observed county labels, the green curve represents the distribution of synthetic county labels from the DPMPM synthesizer, and the blue curve represents the distribution of synthetic county labels from the DP-areal synthesizer. For readability, the green and blue curves are each based on one synthetic dataset generated by the respective synthesizer. Plots based on the remaining synthetic datasets show similar results. For brevity, only the density plots for Pattern 1 to Pattern 4 are included in the main text. The remaining 36 density plots are presented in a Supplement.

Overall, in most of the patterns, the distribution of county label in synthetic datasets generated by the DPMPM synthesizer (green curve) is closer to that of the original data (red curve) than is the distribution generated by the DP-areal synthesizer (blue curve). The DPMPM better reproduces peaked behavior in the original data distribution, while the DP-areal model induces more smoothing. The DPMPM also better reproduces local features in the county label distribution that are smoothed over by the DP-areal. Both synthesizers, however, induce an equally high degree of smoothing in patterns with a small number of observations/records.

4 Disclosure risks measure

As we have seen in Section 3, the DPMPM and the DP-areal synthesizers induce smoothing in the distribution of the county label as compared to the original data, both globally and within each pattern. The induced smoothing attempts to maintain the utility of the synthetic data, making it useful to data analysts for their analysis interests. At the same time, the induced smoothing provides privacy protection in the synthetic data. With county labels synthesized in the CE data, an intruder can no longer know the true category of the synthesized county label of any record. Moreover, she can no longer know the identification of any record with 100% certainty, even though she could have access to the un-synthesized pattern (gender, age, and income) of that record. Nevertheless, an intruder could still make guesses about the true category of the synthesized county label, and make guesses about the identity of any record, using un-synthesized variables that might be available to her. The first type of risk, where the intruder seeks the value of the county label for a record, is commonly known as attribute disclosure. The second type of risk, where the intruder seeks the identity of a record, is commonly referred to as identification disclosure. We proceed to construct measures for attribute and identification disclosure risks.

Typically, both types of disclosure risks are measured for the simulated synthetic datasets, indicating the level of protection (or lack thereof) provided by the release of synthetic data. If multiple synthesizers have been proposed, as in our current application for synthetic CE data, evaluations of disclosure risks measures and their comparisons could inform the CE program about which synthesizer provides higher privacy protection. Ultimately, data disseminating agencies, such as the CE program, are able to decide among the synthesizers by evaluating their relative usefulness (i.e. utility measures) and the level of privacy protection that they encode (i.e. the disclosure risk measures). If disclosure risk measures can be developed for the original data, agencies will have much more information when deciding among synthesizers, based not only on comparison of their disclosure risk measures to each other, but also on comparison to those in the original data. We next describe our general approach to measuring disclosure risks in the original data.

4.1 Inherent disclosure risks in the CE data

A synthesizer replaces the county label for every record in the original data with a synthesized value. In the limit as a synthesizer becomes more accurate, the best synthesizer may be imagined as one that generates the synthesized values of the county label for all records by using the original data distribution. We label this type of synthesizer that uses the original data distribution as “perfect”. Generating synthetic values under the original data distribution is both independent of any synthesizing model and may produce synthesized values that differ from the original data. A perfect synthesizer provides the highest possible utility, because there is no deviation from the original data distribution; however, it increases disclosure risks at the same time. Although the original data is not released to the public, as data disseminators, we can use it to construct maximum disclosure risks measures and create an upper bound of the acceptable disclosure risks.

To mimic the behavior of an intruder with the highest amount of information (i.e. the exact distribution of county label), we can sample a new draw of the county label for every record based on its original distribution within each pattern. We approximate the distribution of county labels in the original data with the empirical distribution, which we sample under a weighted re-sampling scheme. Given this set of newly sampled county labels, we can then calculate the identification disclosure risks and attribute disclosure risks. We repeat this process a large number of times, and obtain sampling distributions of the two types of disclosure risks, capturing the variability in the sampling process. Disclosure risks computed based on this repeated sampling procedure provide an upper bound of the acceptable disclosure risks, because this procedure uses the maximum amount of information that may be published: the original data distribution. Therefore, we label this scenario as the maximum disclosure risks scenario.

By contrast, we may establish a lower bound on the risk that may be achieved by a synthesizer. This requires us to go to the other extreme, where an intruder has the least amount of information about the distribution of the county label in the original data. We employ a uniform distribution over all possible county labels within each pattern as the minimally-informative scenario; i.e., we sample a new draw of the county label of every record from a uniform distribution over the 133 observed county labels within each pattern in the CE data sample. Given this set of newly sampled county labels, identification and attribute disclosure risks can be calculated, and this process is repeated a large number of times to obtain the sampling distributions of the two types of disclosure risks. Because this repeated sampling procedure uses the minimum amount of information, disclosure risks computed based on this procedure provide a lower bound of the acceptable disclosure risks, and we label this scenario as the minimum disclosure risks scenario. Similar to the maximum disclosure risks scenario, the minimum scenario is a type of risk that is inherent in the original dataset and independent of the synthesizer.
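The two bounding scenarios can be sketched as a single re-sampling routine. The following is a minimal illustration under stated assumptions: records are dictionaries with keys `gender`, `income`, `age`, and `county`, and the function name `resample_county` and its arguments are hypothetical. Drawing uniformly from the list of labels observed in a pattern is equivalent to a weighted draw from that pattern's empirical distribution.

```python
import random
from collections import defaultdict


def resample_county(records, scenario, counties, rng):
    """Return a copy of `records` with county labels redrawn.

    scenario="max": draw from the empirical distribution of the original
    county labels among records sharing the same pattern.
    scenario="min": draw uniformly from `counties` (e.g. the 133 labels).
    """
    if scenario not in ("max", "min"):
        raise ValueError("scenario must be 'max' or 'min'")

    # Group the original county labels by pattern (gender, income, age).
    by_pattern = defaultdict(list)
    for rec in records:
        by_pattern[(rec["gender"], rec["income"], rec["age"])].append(rec["county"])

    resampled = []
    for rec in records:
        if scenario == "max":
            pool = by_pattern[(rec["gender"], rec["income"], rec["age"])]
        else:
            pool = counties
        # Copy the record so the confidential original is left untouched.
        resampled.append({**rec, "county": rng.choice(pool)})
    return resampled
```

Calling this routine repeatedly with a seeded generator such as `random.Random(1)` and computing a risk summary on each returned file would yield the sampling distributions described above.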

4.2 Identification disclosure risks

4.2.1 Three summaries of identification disclosure probabilities

Identification disclosure risks measure how likely it is for an intruder to correctly identify a record (e.g., by name) by matching with available information from external files. In our current application, the released synthetic datasets contain three un-synthesized variables (gender, age, and income), and one synthesized variable (county label). Suppose an intruder has access to an external file that includes gender, age and county label of every record, as well as their identities. The attribute values, but not the identities of the records, also appear in the released synthetic datasets. With access to such external information, the intruder may attempt to identify a record by performing matches within each pattern. The matching is performed within pattern because the intruder knows that the pattern attribute values are not synthesized.

Without loss of generality, assume that the intruder has external information about every record’s gender, age and county label. Let $c_i$ be the number of records with the highest match probability for record $i$ (i.e. the number of records having the exact same values of gender, age, and county label as record $i$ in the original data); let $T_i = 1$ if the true match is among the $c_i$ units and $T_i = 0$ otherwise. We recall that $T_i$ will be set to $1$ if the synthesized value of the county label for record $i$ is the same as that in the original data. Let $K_i = 1$ when $c_i T_i = 1$ and $K_i = 0$ otherwise (i.e., $K_i = 1$ indicates that a true unique match exists), and let $N$ denote the total number of target records (i.e. every record in the CE data). Finally, let $F_i = 1$ when $c_i (1 - T_i) = 1$ and $F_i = 0$ otherwise (i.e., $F_i = 1$ indicates a false unique match), and let $s$ equal the number of records with $c_i = 1$ (i.e. the number of records that are uniquely matched among the target records). There are three widely used file-level summaries of identification disclosure probabilities using the notations and definitions given above (Reiter and Mitra, 2009; Drechsler and Reiter, 2010; Hu and Hoshino, 2018; Hu, 2018+; Drechsler and Hu, 2018+).

  1. The expected match risk:

    $\sum_{i=1}^{N} \frac{T_i}{c_i}$    (15)

    When $T_i = 1$ and $c_i > 1$, the contribution of unit $i$ to the expected match risk reflects the intruder randomly guessing at the correct match from the $c_i$ candidates, where the intruder's probability of a correct guess is $1/c_i$. In general, the higher the expected match risk, the higher the identification disclosure risks.

  2. The true match rate:

    $\frac{1}{N} \sum_{i=1}^{N} K_i$    (16)

    which is the percentage of true unique matches among the target records. In general, the higher the true match rate, the higher the identification disclosure risks.

  3. The false match rate:

    $\frac{1}{s} \sum_{i=1}^{N} F_i$    (17)

    which is the percentage of false matches among unique matches. In general, the lower the false match rate, the higher the identification disclosure risks.
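The three summaries can be computed in a single pass over the records. The sketch below is illustrative only and rests on assumptions not stated in the text: records are dictionaries, the original and synthetic files are aligned by position, the function name `identification_risk` is hypothetical, and the intruder matches each target's true known variables and county label exactly against the synthetic file.

```python
from collections import Counter


def identification_risk(original, synthetic, known=("gender", "age", "income")):
    """Return (expected match risk, true match rate, false match rate).

    For each target record, the match count is the number of synthetic
    records that share the target's true known variables and county label.
    """
    def key(rec):
        return tuple(rec[k] for k in known) + (rec["county"],)

    synth_counts = Counter(key(rec) for rec in synthetic)

    expected = 0.0
    unique = true_unique = false_unique = 0
    for orig_rec, synth_rec in zip(original, synthetic):
        c = synth_counts[key(orig_rec)]            # number of matched records
        t = int(key(synth_rec) == key(orig_rec))   # 1 if the true match is among them
        if c > 0:
            expected += t / c
        if c == 1:                                 # a unique match
            unique += 1
            true_unique += t                       # true unique match
            false_unique += 1 - t                  # false unique match

    n = len(original)
    # With no unique matches, the false match rate is undefined (NaN).
    false_rate = false_unique / unique if unique else float("nan")
    return expected, true_unique / n, false_rate
```

Passing a shorter tuple for `known`, for example `("gender",)`, mimics an intruder who knows fewer un-synthesized variables.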

When evaluating the identification disclosure risks under the maximum disclosure risks scenario, we repeatedly conduct a weighted draw from the 133 counties within each pattern, given the original distribution under an empirical distribution approximation. First, for a record $i$, we gather all records sharing the same pattern as record $i$, i.e., all records with the same (gender, age, income) composition. Second, we generate a new county label for record $i$ from the original distribution of the county labels of all records with the same pattern. This process is done for every record to obtain a new “synthetic” dataset with county labels updated according to the original distribution. Third, we compute the three summaries of the identification risk probabilities for this “synthetic” dataset. Finally, we repeat Steps 1 through 3 a large number of times and obtain sampling distributions of the expected risk, true match rate, and false match rate. For the minimum disclosure risks scenario, we keep Steps 1 and 3 and change Step 2 to generate a new county label for record $i$ from a uniform distribution from 1 to 133.

4.2.2 Results

When performing matching with external files, there are different assumptions about the intruder’s knowledge of the un-synthesized variables. We consider three cases of the intruder’s knowledge, encoded in column Known in Table 4: i) only gender; ii) gender and age; iii) gender, age, and income. We collapse across patterns in case iii) to obtain case ii) and case i). We would expect identification risks to be generally lower as we collapse across patterns, since the number of records matched to each target would be expected to increase. Such is not always the case, however, as we observe for the DP-areal results presented below.

For each of the three cases, we report summaries of expected risk, true match rate, and false match rate of identification disclosure risks. The column DPMPM reports average summaries over the synthetic datasets generated by the DPMPM synthesizer. The column DP-areal reports average summaries over the synthetic datasets generated by the DP-areal synthesizer. The column Max reports average summaries over repeated sampling under the maximum identification disclosure risks scenario. We exclude the column Min for brevity; it has 0 for expected risk and true match rate in all cases, and NaN for false match rate in all cases (because the number of unique matches in the denominator is 0).

A subtle point about the Max for the false match rate is that, because of the definition of the false match rate (the percentage of false matches among unique matches, so that a higher false match rate means higher privacy protection), the value in the Max column is actually the lower bound of the acceptable range of the false match rate; the Max for the expected risk and the true match rate, however, serve as upper bounds, as discussed before.

| Known | Summary | DPMPM | DP-areal | Max |
|---|---|---|---|---|
| gender | expected risk | 2.497 | 0.952 | 4.708 |
| gender | true match rate | 0.000 | 0 | 0.000 |
| gender | false match rate | 0.997 | 1* | 0.989 |
| gender and age | expected risk | 10.474 | 0.851 | 20.073 |
| gender and age | true match rate | 0.000 | 0.000 | 0.001 |
| gender and age | false match rate | 0.991 | 1.000 | 0.978 |
| gender, age, and income | expected risk | 29.466 | 0.755 | 38.295 |
| gender, age, and income | true match rate | 0.001 | 0.000 | 0.002 |
| gender, age, and income | false match rate | 0.991 | 1.000 | 0.984 |

Table 4: Expected risk, true match rate, and false match rate of identification disclosure risks of the synthetic datasets. Results are averages of partially synthetic datasets for the columns DPMPM and DP-areal, and averages of repeated sampling iterations for the Max column. (*computed based on non-NaN cases)

As is evident in Table 4, for every case of the intruder’s knowledge, the summaries of identification disclosure risks indicate substantially lower expected risk in synthetic data generated by the DP-areal synthesizer. As the intruder’s knowledge increases, the expected risk in the DPMPM synthetic data increases and approaches the Max, while the expected risk in the DP-areal synthetic data slightly decreases. On average, expected risk in the DPMPM synthetic data and the DP-areal synthetic data is bounded below by Min and bounded above by Max. The results for the true match rate and the false match rate show much smaller differences between the two synthesizers, and between each synthesizer and the Max. Overall, the DPMPM synthesizer and the DP-areal synthesizer have a true match rate of 0 or close to 0, and a false match rate of 1 or close to 1, suggesting a high level of privacy protection.

To take a closer look at the expected risk, consider case iii), where exact matching is done assuming the intruder’s knowledge of gender, age, and income. For the DPMPM synthetic datasets, an average expected risk of 29.466 corresponds to a record-level average expected risk of 0.00475 when dividing by the number of observations in the CE sample. The corresponding record-level average expected risk from the DP-areal synthetic datasets is 0.00014, and that from the repeated sampling under the maximum risk scenario is 0.00606. Overall, the expected risk is very low with both synthesizers, and it is low even under the maximum risk scenario, which suggests low inherent identification disclosure risks in the original CE data.

To visualize the variability among the summaries of the synthetic datasets from each synthesizer, Figure 4 presents a histogram of the expected risk over repeated samples under the maximum disclosure risks scenario. In addition, the minimum, mean, and maximum expected risk among the DPMPM synthetic datasets (dashed and blue) and among the DP-areal synthetic datasets (dotted and red) are co-plotted. Hidden in the average summaries of expected risk in Table 4 is that, although the average expected risk among the DPMPM synthetic datasets is smaller than the average expected risk among repeated samples under the maximum disclosure risks scenario, the expected risk computed for the DPMPM synthetic datasets (dashed and blue) shows substantial overlap with the expected risk computed for the repeatedly sampled “synthetic” datasets under the maximum disclosure risks scenario (filled histogram). The maximum expected risk among the DPMPM synthetic datasets appears to be larger than the expected risk for almost half of the “synthetic” datasets under the maximum disclosure risks scenario, which may be cause for concern about the DPMPM synthetic datasets. By contrast, the variability in expected risk among the DP-areal synthetic datasets (dotted and red) is much smaller, and the expected risk for the DP-areal synthetic datasets is overall much smaller than that computed under the maximum disclosure risks scenario, as shown in Table 4, suggesting acceptable identification disclosure risks in the DP-areal synthetic datasets. Similar plots for cases i) gender and ii) gender and age suggest overall acceptable identification disclosure risks for both the DPMPM and the DP-areal synthesizers, in agreement with the average summaries in Table 4. These plots are placed in a Supplement for brevity.

Figure 4: Histogram of expected risks under the maximum risk scenario. Vertical lines include the min, mean, and max among the synthetic datasets.

4.3 Attribute disclosure risks

4.3.1 The summary of exact attribute disclosure risks

Attribute disclosure risks measure how likely it is for an intruder to correctly infer the true value of a synthesized variable or attribute in the original data from the synthetic datasets. Such attacks usually make use of all un-synthesized attributes; we therefore only consider the case where the intruder uses the available pattern of each record (i.e. gender, age, and income) to infer the true county label in the original dataset. That is, we assume gender, age and income are available when mimicking the intruder’s behavior to conduct attribute attacks. For each record $i$, define an indicator that equals 1 if the synthesized county label category is the same as the original county label for record $i$, and 0 otherwise (i.e., a value of 1 indicates an exact attribute disclosure). The number of exact attribute disclosures is the sum of this indicator over all records:

$\sum_{i} \mathbb{1}(\text{synthesized county label of record } i = \text{original county label of record } i)$    (18)
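As a minimal sketch of Equation (18), the count and the corresponding percentage can be computed as follows. The record layout (position-aligned lists of dictionaries with a `county` key) and the function name are assumptions made for illustration.

```python
def exact_attribute_disclosures(original, synthetic):
    """Return the number and the share of exact attribute disclosures.

    A record counts as an exact disclosure when its synthesized county label
    equals its original county label. Records are aligned by position.
    """
    count = sum(o["county"] == s["county"] for o, s in zip(original, synthetic))
    return count, count / len(original)
```

Averaging the first returned value over the synthetic datasets of one synthesizer gives a file-level number of the kind reported as the number of exact attribute disclosures.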

We note that some attribute disclosure risks for variables containing geographic information proposed in previous works focus on the distance between the synthesized location and the true location. For every record, Wang and Reiter (2012) reported a Euclidean distance between the intruder’s guess of the longitude and latitude and the actual longitude and latitude, and then reported the number of actual cases in a circle centered at the actual longitude and latitude with radius equal to that distance. Paiva et al. (2014) reported a Euclidean distance measure between the true location of each record and the guess with the maximum posterior probability. Because the county label in the CE data sample is treated as categorical, and moreover because this variable carries little geographic information, as discussed in Section 1.1, measuring a Euclidean distance between the true county label and the synthesized county label for a record is not feasible. We therefore only consider the number of exact attribute disclosures in Equation (18). Our measure of attribute risk is the sum of exact attribute disclosures over all records, where an exact attribute disclosure for a record is declared when the synthesized county label category is the same as the county label category in the original data. We construct our definition of attribute risks to be consistent with that of identification risks: both produce file-level measures of risk designed to help reporting agencies assess the overall risk associated with the potential release of the synthetic data. The same procedure is used to generate minimum and maximum attribute disclosure risks as for the identification disclosure risks.

4.3.2 Results

The first row of Table 5 presents the average numbers of exact attribute disclosures among the DPMPM synthetic datasets and among the DP-areal synthetic datasets, in the DPMPM and DP-areal columns, respectively. It also reports the average numbers of exact attribute disclosures among repeated sampling iterations under the minimum disclosure risks scenario, and among repeated sampling iterations under the maximum disclosure risks scenario, in the Min and Max columns respectively. The second row presents the corresponding percentages of exact attribute disclosures, obtained by dividing the values in the first row by the number of records in the CE sample. These results show that, on average, the numbers of exact attribute disclosures in both the DPMPM synthetic data and the DP-areal synthetic data are well below the maximum scenario, with the latter lower than the former, consistent with the results for identification disclosure risks. The inherent risks in the original data are not high when given the maximum amount of information, and they are not zero when given the minimum amount of information.

Summary Min DPMPM DP-areal Max
Number of exact attribute disclosures 47.25 115.25 88.15 175.33
Percentage of exact attribute disclosures 0.76% 1.86% 1.42% 2.82%
Table 5: The average numbers and percentages of exact attribute disclosures. Results are averages of partially synthetic datasets for the columns DPMPM and DP-areal, and averages of repeated sampling iterations for the Min and Max columns.
Figure 5: Histogram of the number of exact attribute disclosures under the maximum risk scenario. Vertical lines include the min, mean, and max among the synthetic datasets.

Figure 5 plots one histogram of exact attribute disclosures based on iterations of repeated sampling under the minimum disclosure risks scenario (green), and another histogram under the maximum disclosure risks scenario (red). Additionally, vertical lines for the results from the DPMPM synthetic datasets (dashed and blue) and from the DP-areal synthetic datasets (dotted and red) are plotted for comparison. Each set of three vertical lines corresponds to the minimum, the mean, and the maximum of the number of exact attribute disclosures among the synthetic datasets. Figure 5 agrees with the results in Table 5, showing that, when the variability in sampling is considered, the DP-areal synthetic datasets have generally lower attribute disclosure risks compared to the DPMPM synthetic datasets, and both are well below the maximum disclosure risks scenario.

5 Conclusion

We devised an end-to-end process for data synthesis, including formulating data synthesizers and measuring and comparing their utilities and disclosure risks in a fashion that promotes ease-of-interpretation to facilitate decision making by statistical agencies who may consider the release of respondent-level synthetic data. Our data synthesizers are constructed for the challenging case of generating geographic location labels. Our formulations replace spatial priors with more general nonparametric prior formulations due to geographic sparsity, with the result that they may broadly apply to the synthesis of any polytomous variable characterized by multiple levels. We leveraged the patterns formed from the intersection of known categorical variables to define a new local utility measure based on the distribution of the synthesized county label that makes intuitive the comparisons of usefulness among the synthesizers. We designed new minimum and maximum risk measures that characterize the data and are independent of the choice of synthesizer. These minimum and maximum risk bounds set context for evaluating the relative improvements in privacy protection provided by the synthesizers in a way that aids the agency decision-maker to evaluate whether synthetic data is sufficiently privacy protected for release.

An opportunity for future work is to construct a simple mechanism that regulates the amount of smoothing produced by a synthesizer to allow exploration of the utility-risk trade-offs.

6 Supplementary material A: Within Pattern Density Plots of County Labels among the Synthesizers

Figure 6 to Figure 14 are within pattern distribution plots of the county label synthesized from the DPMPM, DP-Areal synthesizers and the original data distribution, from Pattern 5 to Pattern 40.

Figure 6: Counties in Pattern 5 to Pattern 8.
Figure 7: Counties in Pattern 9 to Pattern 12.
Figure 8: Counties in Pattern 13 to Pattern 16.
Figure 9: Counties in Pattern 17 to Pattern 20.
Figure 10: Counties in Pattern 21 to Pattern 24.
Figure 11: Counties in Pattern 25 to Pattern 28.
Figure 12: Counties in Pattern 29 to Pattern 32.
Figure 13: Counties in Pattern 33 to Pattern 36.
Figure 14: Counties in Pattern 37 to Pattern 40.

7 Supplementary material B: Identification Disclosure Risk Comparisons under Partially Observed Patterns

Figure 15 and Figure 16 are histograms of identification disclosure risks for the DPMPM, DP-Areal and Maximum (from the original data) for the cases where only a subset of variables in the pattern are published.

Figure 15: Known variables: gender. Histogram of expected risks under the maximum risk scenario. Vertical lines include the min, mean, and max among the synthetic datasets; dashed for DPMPM, and dotted for DP-areal.
Figure 16: Known variables: gender and age. Histogram of expected risks under the maximum risk scenario. Vertical lines include the min, mean, and max among the synthetic datasets; dashed for DPMPM, and dotted for DP-areal.

References

  • Besag et al. (1991) Besag, J., York, J., and Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43, 1–20.
  • Caiola and Reiter (2010) Caiola, G. and Reiter, J. P. (2010). Random forests for generating partially synthetic, categorical data. Transactions on Data Privacy 3, 27–42.
  • Clayton and Bernardinelli (1992) Clayton, D. G. and Bernardinelli, L. (1992). Bayesian methods for mapping disease risk. In P. Elliott, J. Cuzick, D. English, and R. Stern, eds., Geographical and Environmental Epidemiology: Methods for Small Area Studies, 205–220. Oxford University Press.
  • Clayton and Kaldor (1987) Clayton, D. G. and Kaldor, J. (1987). Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics 43, 671–681.
  • Drechsler and Hu (2018+) Drechsler, J. and Hu, J. (2018+). Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data. arXiv:1803.05874.
  • Drechsler and Reiter (2009) Drechsler, J. and Reiter, J. P. (2009). Disclosure risk and data utility for partially synthetic data: An empirical study using the German IAB establishment survey. Journal of Official Statistics 25, 589–603.
  • Drechsler and Reiter (2010) Drechsler, J. and Reiter, J. P. (2010). Sampling with synthesis: A new approach to releasing public use microdata samples of census data. Journal of the American Statistical Association 105, 1347–1357.
  • Duncan and Lambert (1986) Duncan, G. T. and Lambert, D. (1986). Disclosure-limited data dissemination. Journal of the American Statistical Association 81, 10–28.
  • Duncan and Lambert (1989) Duncan, G. T. and Lambert, D. (1989). The risk of disclosure for microdata. Journal of Business and Economic Statistics 7, 207–217.
  • Dunson and Xing (2009) Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association 104, 1042–1051.
  • Hu (2018+) Hu, J. (2018+). Bayesian estimation of attribute and identification disclosure risks in synthetic data. arXiv:1804.02784.
  • Hu and Hoshino (2018) Hu, J. and Hoshino, N. (2018). The Quasi-Multinomial synthesizer for categorical data. In J. Domingo-Ferrer and F. Montes, eds., Privacy in Statistical Databases, vol. 11126 of Lecture Notes in Computer Science, 75–91. Springer.
  • Hu et al. (2014) Hu, J., Reiter, J. P., and Wang, Q. (2014). Disclosure risk evaluation for fully synthetic categorical data. In J. Domingo-Ferrer, ed., Privacy in Statistical Databases, vol. 8744 of Lecture Notes in Computer Science, 185–199. Springer.
  • Hu et al. (2018) Hu, J., Reiter, J. P., and Wang, Q. (2018). Dirichlet process mixture models for modeling and generating synthetic versions of nested categorical data. Bayesian Analysis 13, 183–200.
  • Ishwaran and James (2001) Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161–173.
  • Kinney et al. (2014) Kinney, S. K., Reiter, J. P., and Miranda, J. (2014). SynLBD 2.0: Improving the synthetic longitudinal business database. Statistical Journal of the International Association for Official Statistics 30, 129–135.
  • Kinney et al. (2011) Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011). Towards unrestricted public use business microdata: The synthetic longitudinal business database. International Statistical Review 79, 363–384.
  • Lambert (1993) Lambert, D. (1993). Measures of disclosure risk and harm. Journal of Official Statistics 9, 313–331.
  • Liang et al. (2009) Liang, S., Carlin, B. P., and Gelfand, A. E. (2009). Analysis of Minnesota colon and rectum cancer point patterns with spatial and nonspatial covariate information. Annals of Applied Statistics 3, 943–962.
  • Little (1993) Little, R. J. A. (1993). Statistical analysis of masked data. Journal of Official Statistics 9, 407–426.
  • Manrique-Vallier and Hu (2018) Manrique-Vallier, D. and Hu, J. (2018). Bayesian non-parametric generation of fully synthetic multivariate categorical data in the presence of structural zeros. Journal of the Royal Statistical Society, Series A 181, 635–647.
  • Paiva et al. (2014) Paiva, T., Chakraborty, A., Reiter, J. P., and Gelfand, A. E. (2014). Imputation of confidential data sets with spatial locations using disease mapping models. Statistics in Medicine 33, 1928–1945.
  • Quick et al. (2018) Quick, H., Holan, S. H., and Wikle, C. K. (2018). Generating partially synthetic geocoded public use data with decreased disclosure risk using differential smoothing. Journal of the Royal Statistical Society, Series A 181, 649–661.
  • Quick et al. (2015) Quick, H., Holan, S. H., Wikle, C. K., and Reiter, J. P. (2015). Bayesian marked point process modeling for generating fully synthetic public use data with point-referenced geography. Spatial Statistics 14, 439–451.
  • Reiter (2005a) Reiter, J. P. (2005a). Estimating risks of identification disclosure in microdata. Journal of the American Statistical Association 100, 1103–1112.
  • Reiter (2005b) Reiter, J. P. (2005b). Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, Series A 168, 185–205.
  • Reiter (2005c) Reiter, J. P. (2005c). Using CART to generate partially synthetic public use microdata. Journal of Official Statistics 21, 441–462.
  • Reiter (2012) Reiter, J. P. (2012). Discussion: Bayesian perspectives and disclosure risk assessment. International Statistical Review 80, 373–375.
  • Reiter and Mitra (2009) Reiter, J. P. and Mitra, R. (2009). Estimating risks of identification disclosure in partially synthetic data. Journal of Privacy and Confidentiality 1, 99–110.
  • Reiter et al. (2014) Reiter, J. P., Wang, Q., and Zhang, B. (2014). Bayesian estimation of disclosure risks in multiply imputed, synthetic data. Journal of Privacy and Confidentiality 6, Article 2.
  • Rubin (1993) Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics 9, 461–468.
  • Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.
  • Si and Reiter (2013) Si, Y. and Reiter, J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics 38, 499–521.
  • Stan Development Team (2016) Stan Development Team (2016). RStan: the R interface to Stan. R package version 2.14.1.
  • Wang and Reiter (2012) Wang, H. and Reiter, J. P. (2012). Multiple imputation for sharing precise geographies in public use data. Annals of Applied Statistics 6, 229–252.
  • Wei and Reiter (2016) Wei, L. and Reiter, J. P. (2016). Releasing synthetic magnitude microdata constrained to fixed marginal totals. Statistical Journal of the International Association for Official Statistics 32, 93–108.
  • Zhou et al. (2010) Zhou, Y., Dominici, F., and Louis, T. A. (2010). A smoothing approach for masking spatial data. Annals of Applied Statistics 4, 1451–1475.