1 Introduction
Statistical agencies collect microdata from respondents (individuals or business establishments) through various censuses and surveys. The agencies then make some versions of the collected microdata publicly available, subject to privacy and confidentiality protection (e.g. Title 13 and Title 26, U.S. Code). The agencies need to protect the identity of the respondents, as well as any original attribute information of the respondents that is deemed sensitive. These correspond to identification disclosure and attribute disclosure, respectively.
To provide such protection, at the very least, unique identifiers such as the Social Security Number (SSN) for individuals and the Employer Identification Number (EIN) for business establishments cannot be released in the publicly available microdata. Moreover, other seemingly safe attributes cannot all be released at the same time either, because a combination of a small number of attributes can greatly increase the chance of respondent identification. Sweeney (2000) demonstrated, using 1990 U.S. Census summary data, that 87% (216 million out of 248 million) of the U.S. population had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}, and that about half (53%) were likely to be uniquely identified by only {place, gender, date of birth}.
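To make this uniqueness phenomenon concrete, the following minimal sketch (using entirely hypothetical toy records, not Census data) counts the fraction of records that are unique on a set of quasi-identifiers:

```python
from collections import Counter

# Hypothetical toy microdata: each tuple is (5-digit ZIP, gender, date of birth).
records = [
    ("27708", "F", "1960-01-02"),
    ("27708", "M", "1960-01-02"),
    ("27708", "F", "1960-01-02"),  # duplicates the first combination
    ("27514", "M", "1985-07-30"),
    ("27514", "F", "1991-11-11"),
]

counts = Counter(records)
# Fraction of records whose quasi-identifier combination appears exactly once
unique_fraction = sum(1 for r in records if counts[r] == 1) / len(records)
print(unique_fraction)  # -> 0.6
```

Any record that is unique on these three attributes in the population can be re-identified by anyone holding an external database with the same attributes.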
The statistical agencies thus need to mask the microdata before public release. These masking techniques are called Statistical Disclosure Limitation (SDL) techniques, and include i) data swapping, ii) adding random noise, and iii) microaggregation, among others. Hundepool et al. (2012) provides a comprehensive review of SDL techniques for microdata. Though these methods, or combinations of them, can provide some level of privacy protection, the utility of the masked data (e.g. results from a regression analysis using the masked data should be close to those using the original confidential data) is compromised (Raghunathan et al., 2003). Moreover, for large and complex surveys, such SDL techniques need to be applied at a high intensity, which is time-consuming and further hurts the utility of the final masked data. One alternative to the SDL techniques is synthetic data. Based on the theory and applications of the multiple imputation methodology for missing data problems
(Rubin, 1987), multiply-imputed synthetic data can be generated from statistical models estimated on the original confidential data. Carefully designed statistical models can produce high-utility, low-risk public microdata. Multiple synthetic datasets should be generated, and appropriate combining rules have been developed to provide accurate point estimates and variance estimates of parameters of interest. Refer to
Reiter and Raghunathan (2007) and Drechsler (2011b) for details of the combining rules. Synthetic data comes in two flavors: i) partially synthetic, where only sensitive attributes of all or some of the records are synthesized (Little, 1993), and ii) fully synthetic, where all attributes of every record are synthesized (Rubin, 1993). Since these proposals, a great deal of research has been done on developing synthesizers and on evaluating the utility and risks of synthetic data. Refer to Drechsler (2011b) for a comprehensive review of partially and fully synthetic data and their applications.
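As a small illustration of how such combining rules operate, the sketch below computes a combined point estimate and total variance across m partially synthetic datasets, in the partial-synthesis form reviewed in Reiter and Raghunathan (2007) and Drechsler (2011b); the per-dataset estimates q and variances u are hypothetical numbers, not output from a real analysis.

```python
# Hypothetical per-dataset results from analyzing m = 3 synthetic datasets
q = [10.2, 9.8, 10.5]   # point estimate of the quantity of interest, per dataset
u = [0.40, 0.38, 0.45]  # estimated variance of that estimate, per dataset
m = len(q)

q_bar = sum(q) / m                                  # combined point estimate
u_bar = sum(u) / m                                  # average within-dataset variance
b_m = sum((ql - q_bar) ** 2 for ql in q) / (m - 1)  # between-dataset variance
T_p = u_bar + b_m / m                               # total variance (partial synthesis)
print(q_bar, T_p)
```

The analyst reports q_bar with standard error sqrt(T_p); the between-dataset component b_m/m accounts for the extra uncertainty introduced by synthesis.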
It is worth noting that the U.S. Census Bureau has been involved in providing access to microdata through synthetic data products. Examples of its synthetic data products include i) OnTheMap (based on Machanavajjhala et al. (2008)), ii) the synthetic Longitudinal Business Database, SynLBD (based on Kinney et al. (2011, 2014)), and iii) the synthetic Survey of Income and Program Participation, SIPP Synthetic Beta (based on Benedetto et al. (2013)), among others. Germany has also implemented synthetic versions of the German IAB Establishment Panel (based on Drechsler et al. (2008a, b)). More and more statistical agencies have started experimenting with synthetic data for their microdata releases.
Among the published work on synthetic data, much of it focuses on developing synthesizers and proposing utility measures (both global utility measures (Karr et al., 2006) for synthetic data in general, and outcome-specific utility measures for particular applications). The risk measures, while important, have received less attention overall. This is understandable for at least three reasons: i) for attribute disclosure risks (i.e. measuring the probability of an intruder correctly guessing the original values of the synthesized attributes of a respondent), which exist in both partially synthetic data and fully synthetic data, the principled evaluation procedure emerged only recently and is not straightforward to implement; ii) for identification disclosure risks (i.e. measuring the probability of correctly identifying a respondent by matching with information available from elsewhere), which usually only exist in partially synthetic data, the evaluation procedure can be followed in a straightforward manner, encouraging no further development of the measures themselves; and iii) unlike utility measures, which can vary a lot across applications, the evaluation procedure for risk measures largely depends on the type of risk measure.
In this paper, we present easy-to-follow constructions of Bayesian estimation methods for evaluating attribute disclosure risks and identification disclosure risks. These methods use Bayesian thinking in computing probabilities, which is generally natural and easy to understand. Bayesian probabilities are subjective, and Bayesian methods are useful for modeling the beliefs of data intruders. Moreover, in the estimation process, different assumptions about the intruder's knowledge and behavior can be incorporated at various stages. While flexible and intuitive, the actual implementation of the Bayesian estimation process can be complicated and difficult to execute. Therefore, we aim at highlighting key steps in the estimation process, complemented with real applications, with a focus on the risk evaluation aspect of each application. Readers will also see exciting synthetic data projects for various types of data and protection purposes, and the application-specific disclosure risk measures considered in each application. Discussions of challenges and future directions appear throughout the paper, as well as in a concluding summary.
The remainder of the paper is organized as follows. In Section 2 we give an overview of risk evaluation for synthetic data and lay out the two types of disclosure risks we consider, namely attribute disclosure risks and identification disclosure risks. Section 3 presents the Bayesian estimation methods for attribute disclosure risks, from notation and setup, to key estimating steps, then selected examples, and finally discussion and comments. Section 4 follows a similar structure, presenting the Bayesian estimation methods for identification disclosure risks. Finally, in Section 5, we give a summary.
2 Overview of synthetic data risks evaluation
This paper considers two types of disclosure risks: i) attribute disclosure risks (i.e. the intruder correctly inferring the original values of the synthesized attributes), and ii) identification disclosure risks (i.e. the intruder correctly identifying a respondent in the sample by matching with other available information). Depending on the synthesis flavor (partial versus full), one or both of these two types of disclosure risks potentially exist. In this section, we briefly discuss why for partially synthetic data both attribute disclosure and identification disclosure risks potentially exist, whereas for fully synthetic data only attribute disclosure risks are considered and evaluated.
We note that these two types of disclosure risks are generic, i.e. for any synthetic data product, one or both types should be considered and evaluated. Attribute disclosure risks, in particular, come in various forms depending on the nature of the synthesized attributes and the purposes of the privacy protection. For some applications, such as Hu et al. (2014) where fully synthetic individual records were generated, the attribute disclosure risks take the form of correctly guessing the attributes of a record. In other applications, for example when generating synthetic geolocations, researchers have created attribute disclosure risk measures based on the distance between synthesized geolocations and actual geolocations (Wang and Reiter, 2012; Paiva et al., 2014; Quick et al., 2015, 2018). Moreover, for synthetic business establishment data applications, researchers have created attribute disclosure risk measures based on percentages of closest match and their variations (Domingo-Ferrer et al., 2001; Kim et al., 2015), and other measures based on the relative difference between the true largest value and the intruder's estimate (Kim et al., 2018). We note that not all of these application-specific disclosure risk measures undergo estimation procedures similar to the ones we present in this paper.
2.1 Disclosure risks of releasing partially synthetic data
In partially synthetic data, only sensitive attributes of all or some of the records are synthesized, and some other attributes are left unsynthesized. When some of the unsynthesized attributes are available to the intruder via external databases, a matching mechanism based on the common available attributes may allow the intruder to identify records in the released dataset, thus resulting in identification disclosure risks. For example, suppose a partially synthetic dataset contains 1000 individual records and 6 attributes. Among them, 3 are synthesized sensitive attributes {age, date of birth, annual income} and 3 are unsynthesized attributes {gender, marital status, county}. Suppose now an intruder knows that person i is in the sample, and the intruder also knows the gender and the county of person i. Since both gender and county are unsynthesized, the intruder has a reasonable chance of identifying person i in the sample. Such disclosure risks are identification disclosure risks.
In addition to potential identification disclosure risks, attribute disclosure risks exist in partially synthetic data. We can easily imagine an intruder trying to infer the true values of the synthesized attributes given the released synthetic data, the unsynthesized attributes, and other information. The availability of the unsynthesized attributes may greatly increase the chance of accurate inference of the synthesized attributes. Continuing the example of person i from earlier, with a possible identification, the intruder can now move on to finding out the original values of the synthesized age, date of birth and annual income of person i. Such disclosure risks are attribute disclosure risks.
2.2 Disclosure risks of releasing fully synthetic data
Several authors have claimed that in fully synthetic data, identification disclosure risks are not applicable, since there is no unique mapping of the records in the synthetic data to the records in the original data (Hu et al., 2014; Wei and Reiter, 2016). This is generally true because all attributes of all records are synthesized in fully synthetic data.
Although identification disclosure risks are treated as non-existent in fully synthetic data, attribute disclosure risks potentially exist, as an intruder can try to use the released synthetic data and any other information to infer an entire record. For example, suppose a fully synthetic dataset contains 1000 individual records and 6 attributes, {age, date of birth, annual income, gender, marital status, county}, all of which are synthesized. Suppose that person i is in the published synthetic data and the intruder knows this information. If, in addition, the intruder knows the real attribute values of the other 999 individuals, then the intruder might be able to find out the attributes of person i with a reasonable chance. Such disclosure risks are attribute disclosure risks.
2.3 Disclosure probabilities and their summaries
The Bayesian estimation methods we present in this paper focus on calculating the probabilities of attribute disclosure and identification disclosure. For example, for attribute disclosure risks, we present how to estimate the probability that the synthesized vector of attributes of record i originally took a guessed value, given the synthetic data, the unsynthesized attributes, and any other information. This estimated probability is for record i specifically, i.e. it is at the record level. In addition, synthetic data disseminators can provide file-level summaries of these record-level probabilities. These file-level summaries can differ across applications. For example, when synthesizing fully categorical data as in Hu et al. (2014, 2018) in Section 3.3.1, ranks and renormalized probabilities are reported as file-level summaries because of the nature of the synthesis (full) and the nature of the attributes (all categorical). When synthesizing partially continuous data as in Wang and Reiter (2012) in Section 3.3.2, Euclidean distances between the intruder's guess of the longitude and latitude and the actual longitude and latitude, and subsequent counts within a radius, are reported as file-level summaries because of the nature of the synthesis (partial) and the nature of the attributes (continuous; in particular, the longitude and latitude of a record).
For attribute disclosure, we focus on calculating the record-level disclosure probabilities in the key estimating steps (Section 3.2). Then in the application section (Section 3.3), we present file-level disclosure probability summaries for each application, though we also briefly touch on the assumptions behind the record-level disclosure probability calculations. For identification disclosure, a standard set of three file-level summaries is reported in all selected examples; therefore, we present both the methods to calculate the record-level disclosure probabilities and the methods to calculate the file-level disclosure probability summaries in the key estimating steps (Section 4.2). We want to make readers aware of the distinction between record-level probabilities and file-level summaries of disclosure risks.
3 Bayesian estimation of attribute disclosure risks
As discussed previously with the examples in Section 2.1 and Section 2.2, attribute disclosure risks potentially exist for both fully synthetic data and partially synthetic data.
Evaluating the attribute disclosure risks in fully synthetic data had long seemed an impossible task. Empirical matching or misclassification-based approaches (Shlomo and Skinner, 2010) cannot be used, since there is no correspondence between the original and the synthetic datasets. Skinner (2012) called for further research on the existing Bayesian approaches to disclosure risk assessment, especially to emphasize Bayesian thinking rather than simply using the Bayesian machinery in the assessment process. In response to this call, Reiter (2012) proposed a principled Bayesian estimation procedure for attribute disclosure risks in fully synthetic data. The general framework laid out in Reiter (2012) was extended and further developed by Reiter et al. (2014) for both fully and partially synthetic data. This general framework gives interpretable probability statements of the attribute disclosure risks, and provides flexible incorporation of different assumptions about the intruder's knowledge and behavior.
In this section, we present the framework of Reiter et al. (2014) for Bayesian estimation of attribute disclosure risks. We use similar notation, highlight the key steps, and illustrate with selected examples. We have chosen examples that build upon the framework but are tailored to the specific purposes and needs of each application. To be as comprehensive as possible, we focus on the following: i) fully synthetic categorical data (Hu et al., 2014, 2018), ii) partially synthetic continuous data (Wang and Reiter, 2012), and iii) fully synthetic count data (Wei and Reiter, 2016). In the end, we discuss the challenges and future directions of this framework.
3.1 Notations and setup
Let Y_i be the vector response of observation i in the original confidential dataset Y, where direct identifiers (such as name or SSN) are removed. When needed, we use j as the variable index, j = 1, ..., p. Among the p variables, i) some are synthesized, denoted by Y^s; and ii) others are unsynthesized, denoted by Y^us. We have Y_i = (Y^s_i, Y^us_i) for the ith observation with its original values, and Y = (Y^s, Y^us) for the entire dataset containing n observations with their original values. We note that when full synthesis is carried out, Y^us is empty, and therefore Y = Y^s. Without loss of generality, we use the partially synthetic notation (Y^s, Y^us) when introducing the notation, setup and key estimating steps.
On the agency side, m synthetic datasets are released, denoted by Z = (Z^(1), ..., Z^(m)). Each synthetic dataset is denoted by Z^(l), where l = 1, ..., m. See Drechsler (2011b) for a review of synthetic data and the references therein.
On the intruder side, suppose the intruder intends to learn the original value of Y^s_i for some record i in Z. Several pieces of information can be available to the intruder: i) Y^us, the unsynthesized values of all observations; ii) A, any auxiliary information known by the intruder about records in Z; and iii) S, any information known by the intruder about the process of generating Z. We will discuss each piece in detail in Section 3.2.
We treat Y^s_i as a random variable representing the intruder's uncertain knowledge of record i's synthesized values. The intruder seeks the distribution p(Y^s_i | Z, Y^us, A, S). If Y^s_i is a vector of categorical variables, then probabilities of accurately inferring the confidential values can be calculated through p(Y^s_i = y | Z, Y^us, A, S), where y is one plausible combination of categorical responses of those variables. The examples (Hu et al., 2014, 2018) in Section 3.3.1 on fully synthetic categorical data are illustrations for Y^s_i being a vector of categorical variables. If Y^s_i consists of one or multiple continuous or count variables, context-specific file-level attribute disclosure probability summaries should be developed to summarize the attribute disclosure risks. The examples of Wang and Reiter (2012) on partially synthetic continuous data in Section 3.3.2, and Wei and Reiter (2016) on fully synthetic count data in Section 3.3.3, provide illustrations for these types of Y^s_i. For the agency, it is paramount to model different assumptions about the intruder's knowledge and behavior, i.e. assumptions on the level of knowledge of Y^us, A, and S. The framework in Reiter et al. (2014) allows the incorporation of these different assumptions at multiple stages of the estimating process, thus giving extensive flexibility to parties trying to evaluate attribute disclosure risks.
3.2 Key estimating steps
The intruder seeks the distribution of p(Y^s_i = y | Z, Y^us, A, S), where y is one possible guess of Y^s_i by the intruder. Recall that Z, Y^us, A, and S are available to the intruder. According to Bayes' rule,

p(Y^s_i = y | Z, Y^us, A, S) ∝ p(Z | Y^s_i = y, Y^us, A, S) p(Y^s_i = y | Y^us, A, S),   (1)

where p(Z | Y^s_i = y, Y^us, A, S) is the synthetic data distribution given what the intruder knows, and p(Y^s_i = y | Y^us, A, S) represents the intruder's prior on Y^s_i given Y^us, A, and S.
The estimation procedure for Equation (1) varies by the variable type of Y^s_i and by the assumptions on the level of knowledge of Y^us, A, and S, among other things. Here we go through each of these quantities and their implications for the estimating process, and highlight several common practices that have been adopted, before illustrating with a selection of attribute disclosure risk assessments from real synthetic data applications in Section 3.3.
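To fix ideas, here is a toy numeric sketch of Equation (1): for a small set of candidate guesses y, an assumed synthetic-data likelihood is combined with a uniform prior and normalized. The likelihood values below are hypothetical placeholders standing in for the application-specific synthetic data distribution.

```python
# Candidate guesses y for the synthesized attributes of one record
candidates = ["y1", "y2", "y3"]
# Assumed (hypothetical) values of the synthetic-data likelihood for each guess
likelihood = {"y1": 0.02, "y2": 0.01, "y3": 0.07}
# Uniform prior over the candidate guesses
prior = {y: 1.0 / len(candidates) for y in candidates}

# Bayes' rule: posterior is proportional to likelihood times prior
unnorm = {y: likelihood[y] * prior[y] for y in candidates}
total = sum(unnorm.values())
posterior = {y: v / total for y, v in unnorm.items()}
print(posterior)  # "y3" dominates, receiving probability 0.07 / 0.10 = 0.7
```

With a uniform prior, the posterior ordering of the guesses is driven entirely by how plausible the synthetic data is under each guess.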
3.2.1 Knowledge of Y^us
Recall that Y^us is the set of unsynthesized values of all observations. As mentioned before, when Z is partially synthetic, since the intruder has access to Z, Y^us can be determined and is thus available. When Z is fully synthetic, Y^us is empty; therefore we can drop this term and further simplify the expression for fully synthetic Z as

p(Y^s_i = y | Z, A, S) ∝ p(Z | Y^s_i = y, A, S) p(Y^s_i = y | A, S).   (2)
3.2.2 Assumptions about A
We use A to denote auxiliary information known by the intruder about records in Z. As Y^us is either known (partial synthesis) or dropped (full synthesis), A specifically refers to information about Y^s, the synthesized values. When the intruder seeks p(Y^s_i | Z, Y^us, A, S), the distribution of record i's synthesized variables, there are numerous possible scenarios of what the intruder knows about the synthesized values of every other record, denoted by the matrix Y^s_{-i}. First proposed in Reiter (2012), a "worst case" scenario in which the intruder knows the original values of the synthesized variables of all records except record i has been a common practice, i.e. A = Y^s_{-i}. This practice has been recognized as assuming strong intruder knowledge and as conservative, since in many contexts it is impossible for the intruder to know Y^s_{-i}. However, it has also been argued that if disclosure risks under such a conservative assumption are acceptable, disclosure risks should be acceptable for weaker A (Reiter, 2012; Reiter et al., 2014). As Reiter (2012) and Hu et al. (2014) noted, assuming the intruder knows all records but one is related to, but quite distinct from, the assumptions used in differential privacy (Dwork, 2006). McClure and Reiter (2012) designed simulation studies to compare the two paradigms.
3.2.3 Assumptions about S
S denotes any information known by the intruder about the process of generating Z. Examples include code for the synthesizer and descriptions of the synthesis model. Such information can sometimes be publicly available in great detail. For example, the Census Bureau's Survey of Income and Program Participation (SIPP) Synthetic Beta product has an accompanying document (Benedetto et al., 2013) describing the synthesizing process. From the document, we gather that they implemented a Sequential Regression Multivariate Imputation (SRMI) framework, with three main models (linear regression, logistic regression, and the Bayesian bootstrap (Rubin, 1981)) for missing data imputation and synthetic data generation. Such publicly available detailed information should be assumed known by the intruder.

3.2.4 Choosing the prior
Determining the intruder's prior beliefs about Y^s_i is another nearly impossible task. Skinner (2012) challenged the use of prior distributions as a merely technical device for the Bayesian machinery to function, and advocated for prior distributions that are defensible from the agency's perspective. A common practice is to specify a uniform prior distribution of Y^s_i over all possible guesses y, as proposed in Reiter (2012), but also to consider a variety of prior distributions when possible, especially if a more informative prior is available (Wei and Reiter, 2016).
3.2.5 The estimation of p(Z | Y^s_i = y, Y^us, A, S)
We now go through the estimation of p(Z | Y^s_i = y, Y^us, A, S). The previously discussed assumptions about A and S are relevant in this part of the estimating process. Importance sampling techniques coupled with Monte Carlo simulation are commonly adopted practices, and we will present and discuss why and how they work.
Typically, by the independence of the different synthetic datasets, we work with each synthetic dataset separately; therefore we consider p(Z^(l) | Y^s_i = y, Y^us, A, S) for our discussion. To ultimately obtain p(Z | Y^s_i = y, Y^us, A, S), we have

p(Z | Y^s_i = y, Y^us, A, S) = ∏_{l=1}^{m} p(Z^(l) | Y^s_i = y, Y^us, A, S).   (3)
For p(Z^(l) | Y^s_i = y, Y^us, A, S), under the "worst case" scenario where the intruder knows the original values of the synthesized variables of all records except record i, i.e. A = Y^s_{-i}, we come to

p(Z^(l) | Y^s_i = y, Y^us, Y^s_{-i}, S),   (4)
which is very close to the distribution from which the synthetic data is generated, as in

p(Z^(l) | Y^s_i, Y^us, Y^s_{-i}, S),   (5)

where Y^s_i here takes the value of the true record i in the original confidential dataset Y. As we can see, the only difference in the conditioned quantities between Equation (4) and Equation (5) is the difference between y (the intruder's guess) and Y^s_i (the true record).
In fact, we can utilize the small difference between y and Y^s_i when estimating p(Z^(l) | Y^s_i = y, Y^us, Y^s_{-i}, S). If we use θ to denote the parameters in the synthesis model, we can incorporate posterior draws of θ in our estimation through a Monte Carlo step, as in

p(Z^(l) | Y^s_i = y, Y^us, Y^s_{-i}, S) ≈ (1/H) ∑_{h=1}^{H} p(Z^(l) | θ^(h)),   (6)

where θ^(h), h = 1, ..., H, are draws from p(θ | Y^s_i = y, Y^us, Y^s_{-i}, S).
The Monte Carlo step requires re-estimation of the synthesis model for each y, which could be computationally prohibitive if many possible guesses of Y^s_i need to be evaluated. To avoid re-estimating the model to draw samples of θ, a common procedure via importance sampling is adopted. In particular, available draws of θ from p(θ | Y), the posterior used for generating the synthetic dataset Z^(l), act as proposals for the importance sampling algorithm. Readers are referred to Paiva et al. (2014) and Hu et al. (2014) for a review of importance sampling and its usage in the applications therein.
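The following sketch illustrates the self-normalized importance sampling idea under the worst-case scenario: draws of θ from the synthesis posterior serve as proposals, and each draw is reweighted by the ratio of the density of the guess y to the density of the true record under θ (the only place the target and proposal posteriors differ). All densities here are hypothetical one-dimensional stand-ins for the application-specific synthesis model, not the authors' actual models.

```python
import math
import random

random.seed(42)

def p_Z_given_theta(theta):
    # Hypothetical likelihood of one synthetic dataset Z^(l) under theta
    return 1.0 / (1.0 + theta * theta)

def record_density(value, theta):
    # Hypothetical per-record density f(value | theta)
    return math.exp(-abs(value - theta))

# Proposal draws: theta_h from the synthesis posterior p(theta | Y)
thetas = [random.gauss(0.0, 1.0) for _ in range(2000)]
y_true, y_guess = 0.0, 0.5  # true synthesized value vs. the intruder's guess

# Importance weight: ratio of the target posterior (record i replaced by the
# guess) to the proposal posterior, which reduces to a ratio of record densities
weights = [record_density(y_guess, t) / record_density(y_true, t) for t in thetas]

# Self-normalized importance sampling estimate of p(Z^(l) | Y^s_i = y_guess, ...)
estimate = sum(p_Z_given_theta(t) * w for t, w in zip(thetas, weights)) / sum(weights)
print(estimate)
```

Because the same posterior draws are reused for every guess y, only the weights need recomputing, which is what makes evaluating many candidate guesses feasible.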
3.3 Selected examples
In this selected examples section, we want to show readers, across a few different applications, i) what Y, Y^s, Y^us, Z, A, and S are; ii) what the risk scenarios are (i.e. what assumptions are made about A and S), and their implications; and iii) what the specific file-level attribute disclosure probability summaries are in each application. For each application, we give a brief overview of the dataset(s) and research questions to provide background. We also mention the synthesizers, but the details of the synthesizers and the evaluation of the utility of the synthetic data are omitted, as we focus on the estimation of probabilities of attribute disclosure in this paper. Interested readers should refer to the cited papers for further information.
3.3.1 Fully synthetic categorical data
Hu et al. (2014) aimed at generating fully synthetic categorical data for a subset of individuals from the 2012 American Community Survey (ACS) public use microdata sample for the state of North Carolina. They considered the unordered categorical variables listed in Table 1. We include the variables, the number of categories of each variable, and whether a variable is synthesized in this table.
Variable  Number of categories  Synthesized 

Sex  2  Yes 
Age  4  Yes 
Race  6  Yes 
Education level  4  Yes 
Marital status  5  Yes 
Language  2  Yes 
Birth place  7  Yes 
Military  3  Yes 
Work  3  Yes 
Disability  2  Yes 
Health insurance coverage  2  Yes 
Migration  3  Yes 
School  3  Yes 
Hispanic  2  Yes 
While the authors attempted fully synthetic data generation (Rubin, 1993), they followed the partially synthetic approach (Little, 1993) to replace all values in the data with synthetic values (Drechsler, 2011a). Unlike the approach of Rubin (1993), where there is no correspondence between a fully synthetic record and a real-world record, their approach maintains such a correspondence, which is a key assumption in their proposed attribute disclosure evaluation methods. For more discussion of the two approaches to generating fully synthetic data, refer to Drechsler (2018).
The authors used the Dirichlet Process mixture of products of multinomials (DPMPM) synthesizer. The DPMPM consists of a set of flexible Bayesian latent class models that have been developed to capture complex relationships among multivariate unordered categorical variables (Dunson and Xing, 2009). In recent years, the DPMPM has been proposed as a multiple imputation engine for missing data problems, and as a synthesizer for statistical disclosure control. Si and Reiter (2013) implemented the DPMPM as a missing data imputation engine and demonstrated its superior performance compared to traditional sequential imputation models such as multiple imputation with chained equations (MICE; Buuren and Groothuis-Oudshoorn (2011)). In addition to Hu et al. (2014), Drechsler and Hu (2018+) and Hu and Savitsky (2018+) used the DPMPM synthesizer for generating partially synthetic data with geocoding information. Variations of the DPMPM include versions dealing with structural zeros (Manrique-Vallier and Reiter, 2014; Manrique-Vallier and Hu, 2018), with extensions of the multinomial synthesizer (Hu and Hoshino, 2018), and with individuals nested within households (Hu et al., 2018; Akande et al., 2018+b, 2018+a).
The DPMPM synthesizer assigns an underlying latent class to each record. Conditional on the latent class assignment, each attribute independently follows its own distribution; for an unordered categorical variable, this is usually a multinomial distribution. To generate the synthetic vector of attributes of one record, we first sample the latent class assignment; then, for each attribute, we generate the value from its independent multinomial distribution with probabilities sampled from the DPMPM.
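The generative step just described can be sketched as follows; the mixture weights and per-class multinomial probabilities are hypothetical numbers rather than fitted DPMPM posterior draws.

```python
import random

random.seed(7)

# Hypothetical DPMPM-style parameters: 2 latent classes,
# one multinomial per attribute within each class
pi = [0.6, 0.4]  # mixture weights over the latent classes
phi = {
    "sex":  [[0.5, 0.5], [0.8, 0.2]],            # 2 categories per class
    "race": [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],  # 3 categories per class
}

def draw_record():
    # Step 1: sample the latent class assignment
    z = random.choices([0, 1], weights=pi)[0]
    # Step 2: sample each attribute from its class-specific multinomial
    return {attr: random.choices(range(len(probs[z])), weights=probs[z])[0]
            for attr, probs in phi.items()}

synthetic_records = [draw_record() for _ in range(5)]
print(synthetic_records)
```

Although the attributes are conditionally independent within a class, mixing over latent classes induces dependence among the attributes in the marginal synthetic distribution.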
We use Y_i to represent the random attribute vector of record i (the superscript s is dropped because this is full synthesis, i.e. Y^us is empty and Y = Y^s), Y to represent the random attribute vectors of all records, Z^(l) to represent each fully synthetic dataset, where l = 1, ..., m, and Z to represent all m fully synthetic datasets.
Following the general setup and notation introduced earlier, we use A to represent the intruder's information on persons' attributes in the sample (i.e. auxiliary information), and S to represent any metadata released by the agency about the synthesis model. The goal is to estimate p(Y_i = y | Z, A, S) for one or more target records in the sample. Specifically, in this case, because all attributes are unordered categorical, we are able to enumerate all possible combinations of the categorical attributes.
Then the expression for the probability of attribute disclosure of an entire record becomes Equation (7),

p(Y_i = y | Z, A, S) ∝ p(Z | Y_i = y, A, S) p(Y_i = y | A, S),   (7)

where y is a guess by the intruder.
Hu et al. (2014) set A = Y_{-i}, which corresponds to the "worst case" scenario where the intruder knows the actual data for all records except record i.
To estimate p(Y_i = y | Z, A, S), the authors proposed to dramatically reduce the set of possible combinations that y could take. Specifically, they considered the neighborhood near Y_i (the true record), which contains only feasible candidates y that differ from Y_i in one variable. In their illustrative application, this subset contains far fewer combinations than the full contingency table of all the variables. The authors commented that if the risks of y being the true Y_i are acceptable in this reduced set, then the risks would be even lower when considering the full set; the risks in the reduced set are an upper bound. Importance sampling techniques were applied to avoid re-evaluating the probability of obtaining the synthetic datasets given different combinations of y, as in Equation (4). For the prior on Y_i, Hu et al. (2014) assumed a uniform prior, which sums to 1 over the combinations in the reduced set. The prior probabilities cancel out in the computation process.
To summarize the calculated risks of attribute disclosure of all records, Hu et al. (2014) created two file-level attribute disclosure probability summaries: i) the rank of the probability of the true record being disclosed, among the subset of combinations; and ii) the renormalized probability of the true record being disclosed, among the subset of combinations.
In general, the higher the rank and the renormalized probability, the higher the attribute disclosure risks.
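For a single record, the two summaries can be computed as in the sketch below, where the posterior probabilities over the reduced candidate set are hypothetical numbers:

```python
# Hypothetical posterior probabilities over the reduced candidate set for one
# record; "true" marks the candidate equal to the true confidential record
candidate_probs = {"guess_a": 0.10, "guess_b": 0.25, "true": 0.40, "guess_c": 0.05}

# Renormalized probability of the true record within the candidate set
renormalized = candidate_probs["true"] / sum(candidate_probs.values())

# Rank of the true record (rank 1 = highest posterior probability)
rank = 1 + sum(1 for p in candidate_probs.values() if p > candidate_probs["true"])

print(rank, renormalized)  # here the true record ranks first
```

Repeating this for every record and tabulating the ranks and renormalized probabilities yields the file-level summaries.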
Additionally, Hu et al. (2014) investigated scenarios where the intruder might know a subset of the values in Y_i, for example, the demographic variables. The authors then wrote each Y_i as (Y_{i,k}, Y_{i,u}), where the additional subscript k denotes the variables known by the intruder and u denotes the variables unknown by the intruder. Subsequently, to evaluate risks for intruders seeking to estimate the distribution of Y_{i,u}, the authors defined A = (Y_{-i}, Y_{i,k}), and Equation (7) becomes

p(Y_{i,u} = y | Z, A, S) ∝ p(Z | Y_{i,u} = y, A, S) p(Y_{i,u} = y | A, S).   (9)
The estimation procedure works in a similar way, and we refer interested readers to Hu et al. (2014) for details and discussion.
3.3.2 Partially synthetic continuous data
Wang and Reiter (2012) aimed at generating partially synthetic data for sharing precise geographies. The precise geographies were the exact longitude and latitude of each death in a sample of North Carolina mortality records from 2002. Only the exact longitude and latitude of each record were synthesized; all nongeographic variables were kept unchanged. We include the variables, their descriptions, and whether a variable is synthesized in Table 2.
Variable  Description  Synthesized 

Longitude  Recoded (1-100)  Yes 
Latitude  Recoded (1-100)  Yes 
Sex  Male, female  No 
Race  White, black  No 
Age (years)  16-99  No 
Autopsy performed  Yes, no, missing  No 
Autopsy findings  Yes, no, missing  No 
Marital status  5 categories  No 
Attendant  3 categories  No 
Hispanic  7 categories  No 
Education (years)  0–17  No 
Hospital type  8 categories  No 
Cause of death  Binary  No 
The authors used classification and regression tree (CART; refer to Reiter (2005b) for details of using CART to generate partially synthetic data) synthesizers for generating longitudes and latitudes. In particular, they first fit a regression tree of longitude on all non-geographic attributes, and generated synthetic longitudes using the Bayesian bootstrap. After obtaining the synthetic longitudes, they fit another regression tree of latitude on all non-geographic attributes and the true longitude, and generated synthetic latitudes using the Bayesian bootstrap. In this way, the partially synthetic precise geographies were simulated.
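The Bayesian bootstrap step of such a CART synthesizer can be sketched as follows. This is a minimal Python sketch of the leaf-level draw only, with the tree-fitting step omitted; the function name and the use of normalized Gamma(1, 1) draws to obtain Dirichlet weights are our own illustration, not the authors' implementation.

```python
import random

def bayesian_bootstrap_draw(donor_values, rng=random):
    """One Bayesian-bootstrap draw: sample Dirichlet(1, ..., 1) weights
    over the donor values (via normalized Gamma(1, 1) draws), then take a
    single weighted sample. In the CART synthesizer, `donor_values` would
    be the confidential longitudes (or latitudes) in the leaf that the
    record being synthesized falls into."""
    gammas = [rng.gammavariate(1.0, 1.0) for _ in donor_values]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    return rng.choices(donor_values, weights=weights, k=1)[0]
```

A synthetic longitude for a record would then be one such draw from the donor pool in its leaf, and the latitude step repeats the idea with the longitude added as a predictor.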
We use to represent the random longitude and latitude of record , to represent the random longitudes and latitudes of all records, to represent the unsynthesized value of all records, to represent each partially synthetic dataset, where , and to represent all partially synthetic datasets.
Following the general setup and notations introduced earlier, we use to represent the intruder’s information on person’s attributes in the sample (i.e. auxiliary information), and to represent any metadata released by the agency about the synthesis model. The goal is to estimate for one or more target records in the sample. We now express the probability of attribute disclosure of estimating the longitude and latitude of record , which is the same as in Equation (1) and restated below in Equation (10). Recall that is a possible original value of by the intruder.
(10) 
Wang and Reiter (2012) evaluated two scenarios regarding the choice of and .

Scenario #1 (high-risk): The intruder knows everything except for one target’s , i.e., , and includes everything about the CART except the individual geographies in the nodes. Note that is the same as the “worst case” scenario discussed in Section 3.2.

Scenario #2 (low-risk): The intruder does not know any records’ geographies, i.e., , and includes everything about the CART except the individual geographies in the nodes.
For the high-risk scenario, to estimate , importance sampling techniques were applied to avoid re-evaluating the probability of obtaining the synthetic datasets under the different combinations of , as in Equation (4). For the intruder’s prior on , Wang and Reiter (2012)
assumed a uniform distribution on a grid over a small area containing the target’s true longitude and latitude.
The authors noted that this uniform prior for represents strong intruder prior information, because the prior was placed on a grid over a small area. They also noted that the value of the risk measure would change under other prior specifications, though they did not consider other specifications in their attribute disclosure risk evaluation.
Specifically, Wang and Reiter (2012) developed two geography-specific attribute disclosure risk measures:

A Euclidean distance between the intruder’s guess of the longitude and latitude and the actual longitude and latitude;

The count recording the number of actual cases in a circle centered at the actual longitude and latitude with radius .
In general, larger values of and correspond to smaller attribute disclosure risks.
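A minimal Python sketch of the two measures follows; the function names are our own choosing, and locations are treated as plain coordinate pairs.

```python
import math

def distance_measure(guess, actual):
    """The Euclidean distance between the intruder's guess and the actual
    longitude/latitude; larger distances mean lower disclosure risk."""
    return math.dist(guess, actual)

def circle_count(actual, all_locations, radius):
    """The number of actual cases falling in the circle centered at the
    actual location with the given radius; larger counts mean the target
    is better hidden among neighbours, hence lower disclosure risk."""
    return sum(1 for loc in all_locations
               if math.dist(loc, actual) <= radius)
```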
We also want to note that Paiva et al. (2014) used a similar dataset with the goal of partially synthesizing geographies. Though their synthesis model was different from the one in Wang and Reiter (2012), they proposed three other file-level attribute disclosure probability summaries for synthetic data applications involving geographic locations.

A file-level risk measure of the percentage of records for which the true location attains the maximum posterior probability for record ;

A file-level risk measure of the percentage of records for which the true location attains the maximum posterior probability for record , and record has unique patterns;

A Euclidean distance measure between the true location and the guess with the maximum posterior probability of record .
In general, smaller values of measures (i) and (ii) and larger values of measure (iii) correspond to smaller attribute disclosure risks.
3.3.3 Fully synthetic continuous data
Wei and Reiter (2016) aimed at generating fully synthetic data for sharing magnitude microdata from business establishments. The magnitude variables were the number of skilled laborers, the number of unskilled laborers, wages of skilled laborers, and wages of unskilled laborers of a sample of food manufacturing establishments in Colombia in 1977. All four magnitude variables were synthesized, making it a fully synthetic microdata endeavor. We include the variables, their descriptions, and whether a variable is synthesized in Table 3.
Variable  Description  Synthesized 

Number of skilled laborers  Integer  Yes 
Number of unskilled laborers  Integer  Yes 
Wages of skilled laborers  Integer  Yes 
Wages of unskilled laborers  Integer  Yes 
The authors used three synthesizers based on finite mixtures of Poisson (MP) distributions. The class of finite mixtures of Poissons can i) capture complex multivariate associations among the variables; and ii) model count variables. In addition to the basic MP synthesizer, Wei and Reiter (2016)
proposed the mixture of Multinomials (MM) synthesizer, which ensures the synthetic values sum to the marginal totals in the confidential data. The marginal total constraints are satisfied by performing another layer of Multinomial draws of counts within each occupied Poisson mixture component. Specifically, the totals (e.g. of the number of skilled laborers) and the number of cases in each occupied Poisson mixture component are computed and stored (in running the MP, at each MCMC iteration, each record is assigned to a component). Based on the totals and the number of cases, a Multinomial sample is generated, distributing the totals across the cases. Within each occupied Poisson mixture component, the marginal totals of the synthetic and confidential data match, therefore the overall marginal totals also match. Furthermore, the authors proposed the tail-collapsed mixture of Multinomials (TCMM) synthesizer, which effectively performs a model-based variation of microaggregation plus noise by collapsing the tails of individual variables (i.e. risky values). For the TCMM, one needs to specify a parameter associated with the quantile, namely , which acts as a threshold to control the amount of collapsing.

We use to represent the random vector of the numbers of skilled and unskilled laborers and their corresponding wages of record , to represent these four random magnitude variables of all records, to represent each fully synthetic dataset, where , and to represent all fully synthetic datasets. Note that we dropped the superscript in and use and directly, because every variable is synthesized.
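The Multinomial redistribution step of the MM synthesizer can be sketched as follows. This is a minimal Python sketch; the assumption that within-component cell probabilities are proportional to the fitted Poisson rates is our own illustration, and Wei and Reiter (2016) should be consulted for the exact construction.

```python
import random
from collections import Counter

def redistribute_total(total, rates, rng=random):
    """Multinomial step of the MM synthesizer (sketch): within one
    occupied Poisson mixture component, redistribute the component's
    stored confidential total across its cases, with probabilities here
    taken proportional to the fitted Poisson rates, so that the synthetic
    counts sum exactly to the stored total."""
    draws = Counter(rng.choices(range(len(rates)), weights=rates, k=total))
    return [draws.get(i, 0) for i in range(len(rates))]
```

Because each component's total is preserved exactly, the overall marginal totals of the synthetic and confidential data match by construction.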
Specific to the business establishment survey data, we need to define a few other quantities before we can describe the risk scenarios and filelevel attribute disclosure probability summaries. We use and to represent the largest and second largest values of variable in , the original confidential dataset. When the intruder does not know these values, we use and as the random variables representing the intruder’s uncertain knowledge about them. Furthermore, we let be the total of variable in , and use and to represent the values of two entire records.
Wei and Reiter (2016) considered a variety of risk scenarios. As an illustration of evaluating attribute disclosure for fully synthetic continuous data, we present the scenario where the intruder, who has the second largest value of a certain variable, , attempts to use the released synthetic data to learn about the individual with the largest value of that variable, the random quantity . Such a scenario is commonly considered by official statistics agencies with business establishment data (Kim et al., 2015, 2018).
Recall that we use to represent the intruder’s information on person’s attributes in the sample (i.e. auxiliary information), and to represent any metadata released by the agency about the synthesis model. Translating the scenario above into choices of and , we come to Equation (LABEL:eq:AttriBayesRulerep6), which represents the attribute disclosure risk probability of guessing when is available.
where is a possible original value of by the intruder.
To estimate , techniques of using to approximate the set of records in the same component occupied by the target record were applied to simplify the computation. We refer the readers to Wei and Reiter (2016) for the details regarding the MM and TCMM synthesizers.
For the intruder’s prior on , Wei and Reiter (2016) discussed the choice of a non-uniform prior distribution, which could provide more accurate prior guesses and is worth noting here. In their empirical illustration of synthesizing fully synthetic magnitude data from the Colombia food manufacturing establishments dataset, the authors estimated the chance that the largest value of the number of skilled laborers falls into an interval, with the lower bound being the second largest value and the upper bound predefined. Among the three synthesizers, the attribute disclosure risks under the MP and the MM synthesizers are extremely high, whereas the risks under the TCMM synthesizer are overall much lower. Furthermore, the risks decrease as the threshold parameter decreases, which is expected because the amount of tail collapsing increases as decreases. Interested readers are encouraged to consult Wei and Reiter (2016) for their explanations.
We also want to point out that Wei and Reiter (2016) evaluated two other sets of scenarios, both of which assume the intruder seeks to guess the values of variable of two records, and . In the first set of scenarios, the intruder knows all but one or two values in , which means the intruder seeks to estimate the probability of given , and different combinations of and . In the second set of scenarios, the intruder knows all data values except for one or two records, which means the intruder seeks to estimate the probability of given , and different combinations of and .
3.4 Discussion and comments
There are a few common practices for evaluating attribute disclosure risks in the selected examples, as well as in other synthetic data applications. The first concerns the assumptions about , the auxiliary information known by the intruder about records in . The “worst case” scenario of letting , i.e. the intruder knows all the original values of the synthesized variables of all records except for record , though it provides an upper bound of the disclosure risks, is a very strong and probably unrealistic assumption. The scenario greatly simplifies the estimation of as in Section 3.2.5 and Equation (3), by setting . If the assumption is weaker, for example, the intruder only knows the synthesized values of the next record , then , which means the approximation in Equations (4) to (6) will involve extra steps of imputing all the other synthesized values (see Paiva et al. (2014) for a potential approximation). Such weaker assumptions are much more realistic, but almost computationally infeasible with the current setup. McClure and Reiter (2016) examined the effect on attribute disclosure risks in fully synthetic data of decreasing the number of observations the intruder knows (i.e. weakening the assumption of ). Future research on designing faster algorithms to estimate under weaker is desired.
The common practice of setting the prior as a uniform distribution has been adopted in various applications. Because of the cancellation in Bayes’ rule, using a uniform prior for essentially simplifies the estimation, as we only need to estimate in Equation (2). We should recognize not only its computational convenience, but also its constraints. A uniform prior can be uninformative in some cases, but strongly informative in others, as in Wang and Reiter (2012), which might not be realistic. Even when a uniform prior is plausible, it might need to be adjusted to reflect more realistic prior beliefs. For example, it is possible to argue that the 35 combinations in the reduced subset in Hu et al. (2014) should not really be treated as equally likely (i.e. a uniform prior); rather, some combinations might be more plausible than others, thus carrying higher prior probability. The general advice is to consider a wide range of prior distributions for if possible, and not to choose the uniform prior only for its simplicity. Choosing a more realistic prior distribution provides a more reasonable attribute disclosure risk measure (Wei and Reiter, 2016).
When estimating , importance sampling techniques are widely used to avoid re-estimating the synthesis model for each . First, we should recognize that if is not as strong as , even the importance sampling techniques will not help much; see the discussion in the first paragraph of this section. Second, the set of guesses of , , is typically reduced to a much smaller set than the full set containing all possible combinations. Even though the reduction provides an upper bound of the attribute disclosure risks (Hu et al., 2014, 2018), it is really for computational feasibility that such a reduction is applied. Further research paths include faster algorithms to expand the small reduced set, and new algorithms to efficiently search for the that gives a high estimated probability of , thereby enabling the data disseminator to check against the actual truth and determine the attribute disclosure risk level. Third, to use the Monte Carlo approximation coupled with importance sampling techniques in Equation (6), draws of
are necessary, which means the final synthetic data generation process involves parametric models. Among the selected examples,
Hu et al. (2014) and Wei and Reiter (2016) had parametric models for the outcome (multinomial and Poisson, respectively). Even though Wang and Reiter (2012) used non-parametric CART synthesizers, their ultimate synthetic data generation process involves Bayesian bootstrap sampling with mixtures of normal distributions. It is unclear how to estimate the attribute disclosure risks for truly non-parametric synthesizers, which can be a fruitful research path.
There are additional possible difficulties in implementing the Bayesian estimation procedure for attribute disclosure risk evaluation. As noted in Manrique-Vallier and Hu (2018), their proposed synthesizers for categorical variables with structural zeros had serious stability issues in the estimation of , as its values varied by several thousands on the log scale from one sample of to another, resulting in enormous mean-squared errors. The authors then developed an indirect bootstrap hypothesis testing framework to approximate the ranking of in the reduced set. We refer the readers to Manrique-Vallier and Hu (2018) for details.
One final comment concerns the work of McClure and Reiter (2012), where the authors compared the disclosure risk criterion of differential privacy with a criterion based on the attribute disclosure risk probabilities. The conclusion from their simulation studies was that the two paradigms are not easily reconciled. Moreover, sometimes attribute disclosure risks can be small even when is large. The authors proposed an alternative disclosure risk assessment approach, one that integrates both paradigms, though great computational challenges were foreseeable. Further research on risk assessment integrating the two paradigms is desired.
4 Bayesian estimation of identification disclosure risks
As discussed previously, we only consider identification disclosure risks for partially synthetic data.
Researchers have worked on Bayesian probabilistic matching to estimate the probabilities of identification of sampled units. Duncan and Lambert (1986, 1989); Lambert (1993) developed Bayesian approaches to i) model the behavior of intruders, and ii) quantify sources of uncertainty about those estimated probabilities. Their work was followed by Fienberg et al. (1997), who estimated probabilities of identification for continuous microdata that had undergone SDL by adding random noise.
Observing the lack of illustrative applications on genuine data, Reiter (2005a) extended the Duncan-Lambert framework using data from the Current Population Survey (CPS). Common SDL techniques (recoding, top-coding, swapping, adding random noise, and combinations of these) were applied to genuine microdata in their illustrations. They also considered different assumptions about intruders’ knowledge and behavior and incorporated such information into the estimation of the identification probabilities.
The step-by-step probability estimation procedure in Reiter (2005a) has been standard practice for Bayesian probabilistic matching ever since, especially after the synthetic data approach gained momentum. Reiter and Mitra (2009) in particular first set up the framework of Bayesian probabilistic matching for partially synthetic data.
We now turn to the framework in Reiter and Mitra (2009) for identification disclosure risk estimation for synthetic data, which was built on the more general framework for identification disclosure risk estimation under common SDL techniques in Reiter (2005a). We use similar notations, highlight the key steps, and illustrate with selected examples. We have chosen examples that build upon the framework but are tailored for specific purposes and needs. To be as comprehensive as possible, we present two partially synthetic categorical data applications, i) Reiter and Mitra (2009) and ii) Drechsler and Hu (2018+), as well as iii) a partially synthetic categorical and continuous data application (Drechsler and Reiter, 2010). In the end, we will discuss the challenges and future directions of this framework.
4.1 Notations and setup
In the sample S of units and variables, the notation refers to the th variable of the th unit, where and . The column contains some unique identifiers (such as name or Social Security Number), which are never released. Among the recorded variables, i) some are available to users from external databases, denoted by , and ii) others are unavailable to users except in the released data, denoted by . We therefore have the vector response of the th unit, . We also have the matrix representing the original values of all units.
On the agency side, suppose it releases all units of the sample S. Similar to the split of , we have . Among the available variables, we further split them into i) the synthesized variables, and ii) the unsynthesized variables. We therefore have , and we let be the matrix of all released data. We also let be all units’ original values of the synthesized variables. We note that in some cases, the agency might only release units of the sample (Reiter, 2005a).
On the intruder side, let be the vector of information that the intruder has. may or may not be in , but we assume for some unit in the population. This vector only contains unsynthesized and synthesized variables (no unavailable variables as in and ), thus we have . The intruder’s goal is to match record in to the target when . Additionally, two other pieces of information can be available to the intruder. Let represent the metadata released about the simulation models used to generate the synthetic data, and let represent the metadata released about the reason why records were selected for synthesis. Either or could be empty.
There are released units in . Let be the random variable that equals when for , and equals when for some . The intruder intends to calculate for . The intruder is particularly interested in learning whether any of the calculated identification probabilities for are large enough to declare an identification.
For the agency, it is paramount to model different intruders’ knowledge and behavior when estimating identification risks from releasing synthetic datasets. The framework in Reiter and Mitra (2009) allows the incorporation of these different assumptions at multiple stages of the estimation process, giving extensive flexibility to parties trying to evaluate identification disclosure risks.
4.2 Key estimating steps
The intruder intends to calculate for . Based on the split of , we rewrite the probability as
(14) 
In fact, the intruder does not know the actual values in , all units’ original values of the synthesized variables. Therefore for the intruder, integrating over its possible values when computing the match probabilities is necessary, as in
(15)  
The estimation procedure of Equation (15) varies by the variable(s) in (e.g. whether in or in ), the variable types, assumptions on the level of knowledge of being in or not, and of and , among other things. Here we go through each of these aspects/quantities and their implications for the estimation process, and highlight several common practices that have been adopted, before we illustrate with a selection of identification disclosure risk assessment demonstrations from real synthetic data applications in Section 4.3.
4.2.1 The variable(s) in
An immediate simplification of in Equation (15) is
(16) 
This is true because when is given, and are conditionally independent. That is, the intruder would use without the synthetic data , the unavailable variables , , or to attempt reidentification. Equation (16) will be used in Sections 4.2.2 and 4.2.3 as well.
Consider any variable in . Since it is an unsynthesized variable, for any unit in where the released value of , .
4.2.2 The variable(s) in
For categorical variables in the synthesized set , the intruder matches directly on . For numerical or continuous variables in , while exact matching could be pursued, the nature of numerical/continuous variables will result in zero matching probabilities for most if not all of the records. Therefore, it is advisable to match the numerical components of within some acceptable distance (e.g. Euclidean or Mahalanobis) from the corresponding .
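A minimal Python sketch of distance-based matching for continuous synthesized variables follows; the function name and tolerance semantics are our own illustration, with records treated as coordinate tuples of the continuous components.

```python
import math

def near_matches(intruder_values, released_records, tol):
    """Indices of released records whose synthesized continuous values
    fall within Euclidean distance `tol` of the intruder's known values;
    a stand-in for exact matching, which would fail with probability one
    for almost every continuous record."""
    return [i for i, rec in enumerate(released_records)
            if math.dist(rec, intruder_values) <= tol]
```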
4.2.3 Whether is in or not
The overall assumption we have is that the vector of information that the intruder has, for some unit in the population, but not necessarily in . When is in , then the quantity in Equation (16) for is 0, i.e. . This simplifies calculating for . For example,
(17) 
where is the number of units in with consistent with .
When is not in , then . If we let be the number of units in the population that have consistent with which are also included in , then
(18) 
Determining can be done from census totals, or it can be estimated from available sources. Reiter and Mitra (2009) discussed possible ways of estimating it using survey weights. Model-based approaches to estimating can be applied too, for example Elamir and Skinner (2006), among others. Additional approaches to accounting for intruder uncertainty due to sampling were proposed in Drechsler and Reiter (2008).
It is important to recognize that setting results in conservative measures of identification disclosure risks.
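The uniform allocation of match probability in Equations (17) and (18) can be sketched as follows. This is a minimal Python sketch; the handling of the case where the target may be outside the sample (replacing the in-sample count with the estimated population count F) reflects our own reading and should be checked against Reiter and Mitra (2009).

```python
def uniform_match_probs(consistent, f_t=None):
    """Uniform allocation of match probability across released records
    consistent with the target. When the target is known to be in the
    sample, each consistent record gets 1 / (number of consistent
    records); when the target may be outside the sample, the denominator
    becomes the (estimated) number of consistent population units
    included in the sample, passed here as `f_t`."""
    n_consistent = sum(consistent)
    denom = f_t if f_t is not None else n_consistent
    return [1.0 / denom if c else 0.0 for c in consistent]
```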
4.2.4 Assumptions about and
Previously, we let represent the metadata released about the simulation models used to generate the synthetic data, and represent the metadata released about the reason why records were selected for synthesis. We note that in practice, is usually dropped, because the reasons why records were selected for synthesis are difficult to come by. However, can be available in many cases. For example, as mentioned in Section 3.2, information about the synthesis models of the SIPP Synthetic Beta is available online (Benedetto et al., 2013), and should be assumed known to the intruder. Beyond the SIPP, information about the synthesis process of the SynLBD is publicly available in Kinney et al. (2011, 2014).
4.2.5 Estimating through Monte Carlo
This description follows the one given in Drechsler and Hu (2018+). The construction in Equation (15) suggests a Monte Carlo approach to estimating each (note that is used in place of ; is dropped, assuming it is unavailable), and we rewrite it as
(19) 
For the Monte Carlo approach, perform the following twostep process.

Sample a value of from , and let represent one set of simulated values.

Compute using exact matching, treating as the collected values.
This twostep process is iterated times, where ideally is large, and Equation (19) is estimated as
(20) 
where indicates one iteration of the twostep process.
When has no information, the intruder treats the simulated values as plausible draws of .
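The two-step Monte Carlo process can be sketched as follows. In this minimal Python sketch, `draw_y` and `match_probs` are caller-supplied stand-ins (our own names) for, respectively, the posterior draw of the original synthesized values and the exact-matching computation of step 2.

```python
def monte_carlo_match_probs(draw_y, match_probs, M=1000):
    """Monte Carlo estimate of the match probabilities as in
    Equation (20): (1) sample a plausible set of original synthesized
    values via `draw_y`, (2) compute exact-match probabilities treating
    the draw as the collected values via `match_probs`, and (3) average
    over the M iterations of the two-step process."""
    totals = None
    for _ in range(M):
        probs = match_probs(draw_y())
        if totals is None:
            totals = [0.0] * len(probs)
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / M for t in totals]
```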
4.2.6 Three summaries of identification disclosure probabilities
For the attribute disclosure risk measures in Section 3.3, summaries of attribute disclosure probabilities vary by variable type and context. For example, for fully synthetic categorical data, the summaries are i) the ranking, and ii) the renormalized probability of the true record being disclosed, as in Hu et al. (2014) in Section 3.3.1. For partially synthetic continuous data, specifically in Wang and Reiter (2012) where synthetic precise geographies are released, the reported summaries are i) a Euclidean distance between the intruder’s guess of the geographies and the actual geographies , and ii) the count of actual cases in a circle centered at the actual geographies within the radius in i).
Unlike the summaries of attribute disclosure probabilities, summaries of identification disclosure probabilities are more generally applicable, regardless of the variable types and contexts. There are three summaries of identification disclosure probabilities, which now we describe, following Drechsler and Hu (2018+).
We need the following notations and definitions before we present the three summaries. Let be the number of records with the highest match probability for the target ; let if the true match is among the units and otherwise. Let when and otherwise, and let denote the total number of target records. Finally, let when and otherwise, and let equal the number of records with .
Now we can present the three widely used file-level summaries of identification disclosure probabilities using the notations and definitions given above.

The expected match risk:
(21) When and , the contribution of unit to the expected match risk reflects the intruder randomly guessing at the correct match from the candidates. In general, the higher the expected match risk, the higher the identification disclosure risks.

The true match rate:
(22) which is the percentage of true unique matches among the target records. In general, the higher the true match rate, the higher the identification disclosure risks.

The false match rate:
(23) which is the percentage of false matches among unique matches. In general, the lower the false match rate, the higher the identification disclosure risks.
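The three summaries above can be computed together. Below is a minimal Python sketch with illustrative function and variable names of our own; for each target, the candidate set holds the released-record ids tied for the highest match probability.

```python
def identification_risk_summaries(match_sets, true_ids):
    """The three file-level summaries of identification disclosure risk.

    match_sets: for each target t, the list of the c_t released-record
                ids sharing the highest match probability.
    true_ids:   for each target t, the id of the target's true record.
    """
    T = len(match_sets)
    # expected match risk: 1/c_t whenever the true record is among the ties
    expected = sum(1.0 / len(cands)
                   for cands, truth in zip(match_sets, true_ids)
                   if truth in cands)
    # unique matches: targets with exactly one top candidate
    unique = [(cands, truth) for cands, truth in zip(match_sets, true_ids)
              if len(cands) == 1]
    true_unique = sum(1 for cands, truth in unique if cands[0] == truth)
    true_match_rate = true_unique / T
    false_match_rate = ((len(unique) - true_unique) / len(unique)
                        if unique else 0.0)
    return expected, true_match_rate, false_match_rate
```

For example, three targets whose candidate sets are a correct unique match, a correct two-way tie, and an incorrect unique match yield an expected match risk of 1.5, a true match rate of 1/3, and a false match rate of 1/2.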
4.3 Selected examples
In this selected examples section, we want to show the readers a few different applications of partially synthetic data. We will illustrate the variables in for each application. All applications follow a similar estimation procedure, and report the same three summaries as presented in Section 4.2.6: i) the expected match risk, ii) the true match rate, and iii) the false match rate.
For each application, we give a brief overview of the dataset(s) and research questions to provide the background. We also mention the synthesizers, but the details of the synthesizers and the evaluation of the utility of the synthetic data are omitted. Interested readers should refer to the cited papers for further information.
4.3.1 Partially synthetic categorical data 1
Reiter and Mitra (2009) aimed at partially synthesizing a sample of of the 1987 Survey of Youth in Custody. There are 23 variables on the file, and the authors illustrated partially synthesizing two categorical variables, facility and race. Table 4 gives a partial list of the variables with their descriptions, synthesis information, and whether they are known by the intruder. The other 20 unlisted variables are assumed unknown to the intruder during the identification disclosure risk evaluation.
Variable  Description  Synthesized  Known by intruder 

Facility  Categorical, 46 levels  Yes  Yes 
Race  Categorical, 5 levels  Yes  Yes 
Ethnicity  Categorical, 2 levels  No  Yes 
To synthesize the facility and race variables, the authors first used multinomial regressions to synthesize facility. All other variables, except race and some variables causing multicollinearity, were included in the multinomial regressions as predictors. Once all values of the facility variable were synthesized, the authors then synthesized race using multinomial regressions. The predictors in these multinomial regressions included all other variables plus indicator variables for facilities, except those causing multicollinearity. Reiter and Mitra (2009) noted that the new values of race were simulated conditional on the values of the synthetic facility indicators.
For the identification disclosure risk evaluation, the authors considered facility and race in , and ethnicity in . They also assumed that all targets are in the sample, i.e. .
4.3.2 Partially synthetic categorical data 2
Drechsler and Hu (2018+) aimed at comparing a few existing synthesizers on a large German administrative database, the Integrated Employment Biographies (IEB), to provide access to detailed geocoding information. There are approximately 22 million records in the IEB. The authors considered 11 variables, as listed in Table 5. We include the variables, their descriptions, whether a variable is synthesized, and whether the variable is known by the intruder in the identification disclosure risk estimation (i.e. whether it is in ) in this table. The authors in fact experimented with different numbers of variables to be synthesized in order to provide higher protection. However, Table 5 considers the main synthesis approach, on which the authors presented most of the utility and risk results.
Variable  Description  Synthesized  Known by intruder 

Exact geocoding info  Longitude and latitude  Yes  Yes 
Sex  Male, female  No  Yes 
Foreign  Yes, no  No  Yes 
Age  6 categories  No  Yes 
Education  6 categories  No  No 
Occupation level  7 categories  No  No 
Occupation  12 categories  No  Yes 
Industry of the employer  15 categories  No  Yes 
Wage  10 categories (quantiles)  No  No 
Distance to work  5 categories  No  No 
ZIP code  2,063 ZIP code levels  No  No 
The authors considered three synthesizers. The first is the DPMPM synthesizer used in Hu et al. (2014), where the exact geocoding information was discretized into one unordered categorical variable, and the Dirichlet Process mixture model on the joint distribution of the 11 unordered categorical variables was estimated and used to generate synthetic data. The second is the CART synthesizer used in Wang and Reiter (2012), where the exact geocoding information (the latitude and longitude) was treated as continuous and synthesized sequentially. We call this synthesizer the CART continuous. The third is also a CART synthesizer, but, similar to the DPMPM synthesizer, the exact geocoding information was discretized into one unordered categorical variable. We call this synthesizer the CART categorical. All three synthesizers were applied to generate the partially synthetic IEB, where only the geocoding information was synthesized (either as categorical or as continuous).
For the identification disclosure risk evaluation, the authors considered the exact geocoding information in , and sex, foreign, age, occupation, and industry of the employer in . Because the IEB is a census, the authors also assumed that all targets are in , i.e. . They reported the expected match risk, the true match rate, and the false match rate for the different synthesizers. While the CART categorical synthesizer produced synthetic data with the highest utility, its identification disclosure risks may be deemed too high; therefore the authors recommended two approaches for increasing the level of protection: i) aggregate the geocoding information to a higher level, and ii) synthesize additional variables in the dataset. Drechsler and Hu (2018+) preferred ii) over i), and interested readers are referred to the paper for their discussion and general recommendations.
4.3.3 Partially synthetic categorical and continuous data
Drechsler and Reiter (2010) aimed at partially synthesizing a sample of of the March 2000 U.S. CPS. The authors in fact treated the sample as a census to illustrate their sampling with synthesis methodology, but for our illustration purposes, we will ignore the difference. There are 10 variables on the file, and the authors illustrated partially synthesizing three variables (two categorical and one continuous). Table 6 gives the list of the variables with their descriptions, synthesis information, and whether they are known by the intruder.
Variable  Description  Synthesized  Known by intruder 
