Bayesian Estimation of Attribute and Identification Disclosure Risks in Synthetic Data

04/09/2018
by   Jingchen Hu, et al.
Vassar College

The synthetic data approach to data confidentiality has been actively researched, and over the past decade or so a good deal of high-quality work has been published on developing innovative synthesizers and creating appropriate utility and risk measures, among other topics. Compared to the large volume of work on synthesizer development and utility measures, measuring risks has overall received less attention. This paper focuses on a detailed re-construction of some Bayesian methods proposed for estimating disclosure risks in synthetic data. In presenting evaluation methods for attribute and identification disclosure risks, we highlight key steps, emphasize Bayesian thinking, illustrate with real application examples, and discuss challenges and future research directions. We hope to give readers a comprehensive view of the Bayesian estimation procedures, to enable synthetic data researchers and producers to use these procedures to evaluate disclosure risks, and to encourage more researchers to work in this important and growing field.


1 Introduction

Statistical agencies collect microdata of respondents (individuals or business establishments) through various censuses and surveys. The agencies then make some versions of the collected microdata publicly available, subject to privacy and confidentiality protection (e.g. Title 13 and Title 26, U. S. Code). The agencies need to protect the identity of the respondents, as well as the respondents' original attribute information that is deemed sensitive. These correspond to identification disclosure and attribute disclosure, respectively.

To provide such protection, at the very least, unique identifiers such as the Social Security Number (SSN) for individuals and the Employer Identification Number (EIN) for business establishments cannot be released in the publicly available microdata. Moreover, other seemingly safe attributes cannot all be released at the same time either, because a combination of a small number of attributes can greatly increase the chance of respondent identification. Sweeney (2000) demonstrated that, using 1990 U. S. Census summary data, 87% (216 million out of 248 million) of the US population had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}, and about half (53%) were likely to be uniquely identified by only {place, gender, date of birth}.

The statistical agencies thus need to mask the microdata before public release. These masking techniques are called Statistical Disclosure Limitation (SDL) techniques, and include i) data swapping, ii) adding random noise, and iii) micro-aggregation, among others. Hundepool et al. (2012) provide a comprehensive review of SDL techniques for microdata. Though these methods, or combinations of them, can provide some level of privacy protection, the utility of the masked data is compromised (Raghunathan et al., 2003): for example, results from a regression analysis using the masked data may differ substantially from those using the original confidential data. Moreover, for large and complex surveys, such SDL techniques need to be applied at high intensity, which is time-consuming and further hurts the utility of the final masked data.
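As a toy illustration of one SDL technique, the sketch below masks a simulated income attribute with additive random noise. The data, noise scale, and calibration are made up for illustration; real agencies use far more careful noise design than this.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy confidential attribute: 1,000 simulated annual incomes.
income = rng.lognormal(mean=10.5, sigma=0.6, size=1000)

def add_noise(x, scale=0.2, rng=rng):
    # Additive-noise masking: perturb each value with Gaussian noise whose
    # standard deviation is a fraction of the attribute's own spread.
    return x + rng.normal(0.0, scale * x.std(), size=x.shape)

masked = add_noise(income)

# Aggregate utility (e.g. the mean) is roughly preserved, but record-level
# values are distorted, and so are analyses that depend on them.
rel_error = abs(masked.mean() - income.mean()) / income.mean()
```

The tension the text describes is visible here: larger `scale` gives more protection but moves `rel_error` (and regression results) further from the confidential data.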

One alternative to the SDL techniques is synthetic data. Building on the theory and applications of the multiple imputation methodology for missing data problems (Rubin, 1987), multiply-imputed synthetic data can be generated from statistical models estimated on the original confidential data. Carefully designed statistical models can produce high-utility, low-risk public microdata. Multiple synthetic datasets should be generated, and appropriate combining rules have been developed to provide accurate point estimates and variance estimates of parameters of interest. Refer to Reiter and Raghunathan (2007) and Drechsler (2011b) for details of the combining rules.
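As a minimal sketch of how such combining rules work, the function below implements the rules commonly cited for partially synthetic data (point estimate as the average across datasets, total variance as the average within-dataset variance plus the between-dataset variance divided by m); the input estimates are made up, and readers should consult Reiter and Raghunathan (2007) for the exact rules appropriate to each synthesis flavor.

```python
import numpy as np

def combine_partial(qs, us):
    """Combine estimates from m partially synthetic datasets.

    qs[l] is the point estimate (e.g. a regression coefficient) and
    us[l] its estimated variance, both computed on the l-th synthetic
    dataset."""
    qs, us = np.asarray(qs, float), np.asarray(us, float)
    m = len(qs)
    q_bar = qs.mean()           # combined point estimate
    b_m = qs.var(ddof=1)        # between-dataset variance
    u_bar = us.mean()           # average within-dataset variance
    T_p = u_bar + b_m / m       # total variance for partial synthesis
    return q_bar, T_p

# Hypothetical coefficient estimates and variances from m = 5 datasets.
q_bar, T_p = combine_partial([1.02, 0.97, 1.05, 0.99, 1.01],
                             [0.04, 0.05, 0.04, 0.05, 0.04])
```

The between-dataset term `b_m / m` accounts for the extra uncertainty introduced by synthesis on top of ordinary sampling variability.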

Synthetic data comes in two flavors: i) partially synthetic, where only sensitive attributes of all or some of the records are synthesized (Little, 1993), and ii) fully synthetic, where all attributes of every record are synthesized (Rubin, 1993). Since these proposals, a great deal of research has been done on developing synthesizers and on evaluating the utility and risks of synthetic data. Refer to Drechsler (2011b) for a comprehensive review of partially and fully synthetic data and their applications.

It is worth noting that the U. S. Census Bureau has been involved in providing access to microdata through synthetic data products. Examples of their synthetic data products include i) OnTheMap (based on Machanavajjhala et al. (2008)), ii) the synthetic Longitudinal Business Database - SynLBD (based on Kinney et al. (2011, 2014)), and iii) the synthetic Survey of Income and Program Participation - SIPP Synthetic Beta (based on Benedetto et al. (2013)), among others. Germany has also implemented synthetic versions of the German IAB Establishment Panel (based on Drechsler et al. (2008a, b)). More and more statistical agencies have started experimenting with synthetic data for their microdata releases.

Among the published work on synthetic data, many papers focus on developing synthesizers and proposing utility measures (both global utility measures (Karr et al., 2006) for synthetic data in general, and outcome-specific utility measures for particular applications). The risk measures, while important, have overall received less attention. This is understandable for at least three reasons: i) for attribute disclosure risks (i.e. measuring the probability of an intruder correctly guessing the original values of the synthesized attributes of a respondent), which exist both in partially synthetic data and fully synthetic data, the principled evaluation procedure came out only recently and is not straightforward to implement; ii) for identification disclosure risks (i.e. measuring the probability of correctly identifying a respondent by matching with information available elsewhere), which usually exist only in partially synthetic data, the evaluation procedure can be followed in a straightforward manner, encouraging no further development of the measures themselves; and iii) unlike utility measures, which can vary widely across applications, the evaluation procedure for risk measures depends largely on the type of risk measure.

In this paper, we present an easy-to-follow construction of some Bayesian estimation methods for evaluating attribute disclosure risks and identification disclosure risks. These methods use Bayesian thinking in computing probabilities, which is generally natural and easy to understand. Bayesian probabilities are subjective, and Bayesian methods are useful for modeling the beliefs of data intruders. Moreover, in the estimation process, different assumptions about the intruder's knowledge and behavior can be incorporated at various stages. While flexible and intuitive, the actual implementation of the Bayesian estimation process can be complicated and difficult to execute. Therefore, we aim at highlighting key steps in the estimation process, complemented with real applications, with a focus on the risk evaluation aspect of each application. The readers will also see exciting synthetic data projects for various types of data and protection purposes, and the application-specific disclosure risk measures considered in each application. Discussions of challenges and future directions appear throughout the paper, as well as in a summary at the end.

The remainder of the paper is organized as follows. In Section 2 we give an overview of risk evaluation for synthetic data and lay out the two types of disclosure risks we consider, namely attribute disclosure risks and identification disclosure risks. Section 3 presents the Bayesian estimation methods for attribute disclosure risks, moving from notations and setup, to key estimating steps, then selected examples, and finally discussion and comments. Section 4 follows a similar structure, presenting the Bayesian estimation methods for identification disclosure risks. Finally, in Section 5, we give a summary.

2 Overview of synthetic data risks evaluation

This paper considers two types of disclosure risks: i) attribute disclosure risks (i.e. the intruder correctly inferring the original values of the synthesized attributes), and ii) identification disclosure risks (i.e. the intruder correctly identifying a respondent in the sample by matching with other available information). Depending on the synthesis flavor (partial versus full), one or both of these two types of disclosure risks potentially exist. In this section, we briefly discuss why for partially synthetic data both attribute disclosure and identification disclosure risks potentially exist, whereas for fully synthetic data only attribute disclosure risks are considered and evaluated.

We note that these two types of disclosure risks are generic, i.e. for any synthetic data product, one or both types should be considered and evaluated. Attribute disclosure risks in particular come in various forms, depending on the nature of the synthesized attributes and the purposes of the privacy protection. For some applications, such as Hu et al. (2014) where fully synthetic individual records were generated, the attribute disclosure risks take the form of correctly guessing the attributes of a record. In other applications, for example those generating synthetic geolocations, researchers have created attribute disclosure risk measures based on the distance between synthesized geolocations and actual geolocations (Wang and Reiter, 2012; Paiva et al., 2014; Quick et al., 2015, 2018). Moreover, for synthetic business establishment data applications, researchers have created attribute disclosure risk measures based on percentages of closest match and their variations (Domingo-Ferrer et al., 2001; Kim et al., 2015), and other measures based on the relative difference between the true largest value and the intruder's estimate (Kim et al., 2018). We note that not all of these application-specific disclosure risk measures undergo estimation procedures similar to the ones we present in this paper.

2.1 Disclosure risks of releasing partially synthetic data

In partially synthetic data, only sensitive attributes of all or some of the records are synthesized, and the other attributes are left un-synthesized. When some of the un-synthesized attributes are available to the intruder via external databases, a matching mechanism based on the common available attributes may allow the intruder to identify records in the released dataset, thus resulting in identification disclosure risks. For example, suppose a partially synthetic dataset contains 1000 individual records and 6 attributes. Among them, 3 are synthesized sensitive attributes {age, date of birth, annual income} and 3 are un-synthesized attributes {gender, marital status, county}. Suppose now an intruder knows that person $i$ is in the sample, and the intruder also knows the gender and the county of person $i$. Since both gender and county are un-synthesized, the intruder will have a reasonable chance of identifying person $i$ in the sample. Such disclosure risks are identification disclosure risks.

In addition to potential identification disclosure risks, attribute disclosure risks exist in partially synthetic data. We can easily imagine an intruder trying to infer the true values of the synthesized attributes given the released synthetic data, the un-synthesized attributes, and other information. The availability of the un-synthesized attributes may greatly increase the chance of accurate inference of the synthesized attributes. Continuing the example of person $i$ from earlier, with a possible identification, the intruder can now move on to finding out the original values of the synthesized age, date of birth, and annual income of person $i$. Such disclosure risks are attribute disclosure risks.
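The matching mechanism described above can be sketched in a few lines. The release, the attributes, and the intruder's external knowledge below are all hypothetical, and the uniform-guess probability is only one simple way to summarize the match.

```python
# Toy partially synthetic release: gender and county are un-synthesized;
# the sensitive attributes (age, income, ...) were synthesized and are
# omitted here since matching uses only the un-synthesized ones.
release = [
    {"gender": "F", "county": "Dutchess"},
    {"gender": "M", "county": "Ulster"},
    {"gender": "F", "county": "Dutchess"},
    {"gender": "M", "county": "Dutchess"},
    {"gender": "F", "county": "Ulster"},
]

# The intruder knows the target is in the sample and knows these
# un-synthesized attributes from an external database.
target = {"gender": "F", "county": "Dutchess"}

# Match on the shared un-synthesized attributes: the fewer matching
# records, the higher the target's identification risk.
matches = [r for r in release
           if all(r[k] == v for k, v in target.items())]
prob_identification = 1 / len(matches)   # uniform guess among matches
```

A unique match (`prob_identification == 1`) would correspond to the highest identification risk for this target.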

2.2 Disclosure risks of releasing fully synthetic data

Several authors have argued that in fully synthetic data, identification disclosure risks are not applicable, since there is no unique mapping of the records in the synthetic data to the records in the original data (Hu et al., 2014; Wei and Reiter, 2016). This is generally true because all attributes of all records are synthesized in fully synthetic data.

Although identification disclosure risks are treated as non-existent in fully synthetic data, attribute disclosure risks potentially exist, as an intruder can try to use the released synthetic data and any other information to infer an entire record. For example, suppose a fully synthetic dataset contains 1000 individual records and 6 attributes, {age, date of birth, annual income, gender, marital status, county}, all of which are synthesized. Suppose that person $i$ is in the published synthetic data and the intruder knows this. If, in addition, the intruder knows the real attribute values of the other 999 individuals, then the intruder might be able to find out the attributes of person $i$ with a reasonable chance. Such disclosure risks are attribute disclosure risks.

2.3 Disclosure probabilities and their summaries

The Bayesian estimation methods we present in this paper focus on calculating the probabilities of attribute disclosure and identification disclosure. For example, for attribute disclosure risks, we present how to estimate the probability of guessing the original values of the synthesized vector of attributes of record $i$ to be $\mathbf{y}^*$, given the synthetic data, the un-synthesized attributes, and any other information. This estimated probability is for record $i$ specifically, i.e. it is at the record level.

In addition, synthetic data disseminators can provide file-level summaries of these record-level probabilities. These file-level summaries can differ depending on the application. For example, when fully synthesizing categorical data as in Hu et al. (2014, 2018) in Section 3.3.1, ranks and re-normalized probabilities are reported as file-level summaries because of the nature of the synthesis (full) and the nature of the attributes (all categorical). When partially synthesizing continuous data as in Wang and Reiter (2012) in Section 3.3.2, Euclidean distances between the intruder's guesses of the longitude and latitude and the actual longitude and latitude, and subsequent counts within a radius, are reported as file-level summaries because of the nature of the synthesis (partial) and the nature of the attributes (continuous; in particular, the longitude and latitude of a record).

For attribute disclosure, we focus on calculating the record-level disclosure probabilities in the key estimating steps (Section 3.2). Then in the applications (Section 3.3), we present the file-level disclosure probability summaries for each application, though we briefly touch on the assumptions behind the record-level disclosure probability calculations. For identification disclosure, a standard set of three file-level summaries is reported in all selected examples; therefore, we present both the methods to calculate the record-level disclosure probabilities and the methods to calculate the file-level disclosure probability summaries in the key estimating steps (Section 4.2). We want to make readers aware of the distinction between record-level probabilities and file-level summaries of disclosure risks.

3 Bayesian estimation of attribute disclosure risks

As discussed previously with the examples in Section 2.1 and Section 2.2, attribute disclosure risks potentially exist for fully synthetic data and partially synthetic data.

Evaluating the attribute disclosure risks in fully synthetic data was long seen as a seemingly impossible task. Empirical matching or misclassification-based approaches (Shlomo and Skinner, 2010) cannot be used since there is no correspondence between the original and the synthetic datasets. Skinner (2012) called for further research on the existing Bayesian approaches to disclosure risk assessment, especially to emphasize Bayesian thinking rather than simply using the Bayesian machinery in the assessment process. In response to the call of Skinner (2012), Reiter (2012) proposed a principled Bayesian estimation procedure for attribute disclosure risks in fully synthetic data. The general framework laid out in Reiter (2012) was extended and further developed by Reiter et al. (2014) for both fully and partially synthetic data. This general framework gives interpretable probability statements of the attribute disclosure risks, and provides flexible incorporation of different assumptions about the intruder's knowledge and behavior.

In this section, we present the framework of Reiter et al. (2014) for Bayesian estimation of attribute disclosure risks. We use similar notations, highlight the key steps, and illustrate with selected examples. We have chosen examples that build upon the framework but are tailored to the specific purposes and needs of their applications. To be as comprehensive as possible, we focus on the following: i) fully synthetic categorical data (Hu et al., 2014, 2018), ii) partially synthetic continuous data (Wang and Reiter, 2012), and iii) fully synthetic count data (Wei and Reiter, 2016). At the end, we discuss the challenges and future directions of this framework.

3.1 Notations and setup

Let $\mathbf{y}_i$ be the response vector of observation $i$ in the original confidential dataset, where direct identifiers (such as name or SSN) are removed. When needed, we use $j$ as the variable index, $j = 1, \ldots, p$. Among the $p$ variables, i) some are synthesized, denoted by $\mathbf{y}_i^s$; and ii) others are un-synthesized, denoted by $\mathbf{y}_i^{us}$. We have $\mathbf{y}_i = (\mathbf{y}_i^s, \mathbf{y}_i^{us})$ for the $i$-th observation with its original values, and $\mathbf{Y} = (\mathbf{Y}^s, \mathbf{Y}^{us})$ for the entire dataset containing $n$ observations with their original values. We note that when full synthesis is carried out, $\mathbf{Y}^{us} = \emptyset$, therefore $\mathbf{Y} = \mathbf{Y}^s$. Without loss of generality, we use $\mathbf{Y} = (\mathbf{Y}^s, \mathbf{Y}^{us})$ when introducing the notations, setup and key estimating steps.

On the agency side, $m$ synthetic datasets are released, denoted by $\mathbf{Z} = (\mathbf{Z}^{(1)}, \ldots, \mathbf{Z}^{(m)})$. Each synthetic dataset is denoted by $\mathbf{Z}^{(l)}$, where $l = 1, \ldots, m$. See Drechsler (2011b) for a review of synthetic data and the references therein.

On the intruder side, suppose the intruder intends to learn the original value of $\mathbf{y}_i^s$ for some record $i$ in $\mathbf{Z}$. Several pieces of information can be available to the intruder: i) $\mathbf{Y}^{us}$, the un-synthesized values of all observations; ii) $A$, any auxiliary information known by the intruder about records in $\mathbf{Z}$; and iii) $S$, any information known by the intruder about the process of generating $\mathbf{Z}$. We will discuss each piece in detail in Section 3.2.

Let $\mathbf{Y}_i^s$ be the random variable representing the intruder's uncertain knowledge of $\mathbf{y}_i^s$. The intruder seeks the distribution $p(\mathbf{Y}_i^s \mid \mathbf{Z}, \mathbf{Y}^{us}, A, S)$. If $\mathbf{y}_i^s$ is a vector of categorical variables, then probabilities of accurately inferring the confidential values can be calculated through $p(\mathbf{Y}_i^s = \mathbf{y}^* \mid \mathbf{Z}, \mathbf{Y}^{us}, A, S)$, where $\mathbf{y}^*$ is one plausible combination of categorical responses of those variables. The examples (Hu et al., 2014, 2018) in Section 3.3.1 on fully synthetic categorical data are illustrations of $\mathbf{y}_i^s$ being a vector of categorical variables. If $\mathbf{y}_i^s$ is one or multiple continuous or count variables, context-specific file-level attribute disclosure probability summaries should be developed to summarize the attribute disclosure risks. The examples of Wang and Reiter (2012) on partially synthetic continuous data in Section 3.3.2, and Wei and Reiter (2016) on fully synthetic count data in Section 3.3.3, provide illustrations of these types of $\mathbf{y}_i^s$.

For the agency, it is paramount to model different intruders' knowledge and behavior, i.e. assumptions on the level of knowledge of $\mathbf{Y}^{us}$, $A$, and $S$. The framework in Reiter et al. (2014) allows the incorporation of these different assumptions at multiple stages in the estimating process, thus giving extensive flexibility to parties trying to evaluate attribute disclosure risks.

3.2 Key estimating steps

The intruder seeks the distribution $p(\mathbf{Y}_i^s = \mathbf{y}^* \mid \mathbf{Z}, \mathbf{Y}^{us}, A, S)$, where $\mathbf{y}^*$ is one possible guess of $\mathbf{y}_i^s$ by the intruder. Recall that $\mathbf{Y}^{us}$, $A$, and $S$ are available to the intruder. According to Bayes' rule,

$$p(\mathbf{Y}_i^s = \mathbf{y}^* \mid \mathbf{Z}, \mathbf{Y}^{us}, A, S) \propto p(\mathbf{Z} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S) \, p(\mathbf{Y}_i^s = \mathbf{y}^* \mid \mathbf{Y}^{us}, A, S), \quad (1)$$

where $p(\mathbf{Z} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S)$ is the synthetic data distribution given what the intruder knows, and $p(\mathbf{Y}_i^s = \mathbf{y}^* \mid \mathbf{Y}^{us}, A, S)$ represents the intruder's prior on $\mathbf{Y}_i^s$ given $\mathbf{Y}^{us}$, $A$, and $S$.

The estimation procedure for Equation (1) varies by the variable type of $\mathbf{y}_i^s$ and by the assumptions on the level of knowledge of $\mathbf{Y}^{us}$, $A$, and $S$, among other things. Here we go through each of these quantities and their implications in the estimating process, and highlight several common practices that have been adopted, before illustrating with a selection of attribute disclosure risk assessment demonstrations on real synthetic data applications in Section 3.3.

3.2.1 Knowledge of $\mathbf{Y}^{us}$

Recall $\mathbf{Y}^{us}$ is the set of un-synthesized values of all observations. As mentioned before, when $\mathbf{Z}$ is partially synthetic, since the intruder has access to $\mathbf{Z}$, $\mathbf{Y}^{us}$ can be determined and is thus available. When $\mathbf{Z}$ is fully synthetic, $\mathbf{Y}^{us} = \emptyset$, therefore we can drop this term and further simplify the expression for fully synthetic data as

$$p(\mathbf{Y}_i = \mathbf{y}^* \mid \mathbf{Z}, A, S) \propto p(\mathbf{Z} \mid \mathbf{Y}_i = \mathbf{y}^*, A, S) \, p(\mathbf{Y}_i = \mathbf{y}^* \mid A, S). \quad (2)$$

Oftentimes $p(\mathbf{Y}_i = \mathbf{y}^* \mid A, S)$ is further simplified to $p(\mathbf{Y}_i = \mathbf{y}^*)$, as in Hu et al. (2014, 2018). However, without loss of generality, we keep $\mathbf{Y}^{us}$ in the following discussion. Readers should keep in mind that with fully synthetic data, $\mathbf{Y}^{us} = \emptyset$; while with partially synthetic data, $\mathbf{Y}^{us}$ is available.

3.2.2 Assumptions about $A$

We use $A$ to denote auxiliary information known by the intruder about records in $\mathbf{Z}$. As $\mathbf{Y}^{us}$ is either known in partial synthesis or dropped in full synthesis, $A$ specifically refers to information about $\mathbf{Y}^s$, the synthesized values. When the intruder seeks $p(\mathbf{Y}_i^s = \mathbf{y}^* \mid \mathbf{Z}, \mathbf{Y}^{us}, A, S)$, the distribution of record $i$'s synthesized variables, there are numerous possible scenarios of what the intruder knows about the synthesized values of every other record, denoted by the matrix $\mathbf{Y}_{-i}^s$. First proposed in Reiter (2012), a "worst case" scenario where the intruder knows the original values of the synthesized variables of all records except for record $i$ has been a common practice, i.e. $A = \mathbf{Y}_{-i}^s$. This practice has been recognized as assuming strong intruder knowledge and being conservative, as in many contexts it is impossible for the intruder to know $\mathbf{Y}_{-i}^s$. However, it has also been argued that if disclosure risks under such a conservative assumption are acceptable, disclosure risks should be acceptable for weaker $A$ (Reiter, 2012; Reiter et al., 2014). As Reiter (2012) and Hu et al. (2014) noted, assuming the intruder knows all records but one is related to, but quite distinct from, the assumptions used in differential privacy (Dwork, 2006). McClure and Reiter (2012) designed simulation studies to compare the two paradigms.

3.2.3 Assumptions about $S$

$S$ denotes any information known by the intruder about the process of generating $\mathbf{Z}$. Examples include code for the synthesizer and descriptions of the synthesis model. Such information can sometimes be publicly available in great detail. For example, the Census Bureau's Survey of Income and Program Participation (SIPP) Synthetic Beta product has an accompanying document (Benedetto et al., 2013) describing the synthesizing process. From the document, we gather that they implemented a Sequential Regression Multivariate Imputation (SRMI) framework, with three main models (linear regression, logistic regression, and the Bayesian bootstrap (Rubin, 1981)) for missing data imputation and synthetic data generation. Such publicly available detailed information should be assumed known by the intruder.

3.2.4 Choosing the prior on $\mathbf{Y}_i^s$

Determining the intruder's prior beliefs about $\mathbf{Y}_i^s$ is another nearly impossible task. Skinner (2012) challenged the use of prior distributions chosen merely as a technical device for the Bayesian machinery to function, and advocated for prior distributions that are defensible from the agency's perspective. A common practice is to specify a uniform prior distribution over all possible guesses $\mathbf{y}^*$, as proposed in Reiter (2012), but also to consider a variety of prior distributions when possible, especially if a more informative prior is available (Wei and Reiter, 2016).

3.2.5 The estimation of $p(\mathbf{Z} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S)$

We now go through the estimation of $p(\mathbf{Z} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S)$. The previously discussed assumptions about $A$ and $S$ are relevant in this part of the estimating process. Importance sampling techniques coupled with Monte Carlo simulation are commonly adopted practices, and we will present and discuss why and how they work.

By the independence of the different synthetic datasets, we typically work with each synthetic dataset separately, and therefore consider $p(\mathbf{Z}^{(l)} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S)$ in our discussion. To ultimately obtain $p(\mathbf{Z} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S)$, we have

$$p(\mathbf{Z} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S) = \prod_{l=1}^{m} p(\mathbf{Z}^{(l)} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S). \quad (3)$$

For $p(\mathbf{Z}^{(l)} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}^{us}, A, S)$, under the "worst case" scenario where the intruder knows the original values of the synthesized variables of all records except for record $i$, i.e. $A = \mathbf{Y}_{-i}^s$, we come to

$$p(\mathbf{Z}^{(l)} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}_{-i}^s, \mathbf{Y}^{us}, S), \quad (4)$$

which is very close to the distribution from which the synthetic data is generated, namely

$$p(\mathbf{Z}^{(l)} \mid \mathbf{Y}_i^s = \mathbf{y}_i^s, \mathbf{Y}_{-i}^s, \mathbf{Y}^{us}, S), \quad (5)$$

where $\mathbf{y}_i^s$ is the true record in the original confidential dataset. As we can see, the only difference in the conditioned quantities between Equation (4) and Equation (5) is the difference between $\mathbf{y}^*$ (the random guess) and $\mathbf{y}_i^s$ (the true record).

In fact, we can utilize this small difference between Equation (4) and Equation (5) when estimating $p(\mathbf{Z}^{(l)} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}_{-i}^s, \mathbf{Y}^{us}, S)$. If we use $\theta$ to denote the parameters in the synthesis model, we can easily incorporate posterior draws of $\theta$ in our estimation through a Monte Carlo step, as in

$$p(\mathbf{Z}^{(l)} \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}_{-i}^s, \mathbf{Y}^{us}, S) \approx \frac{1}{H} \sum_{h=1}^{H} p(\mathbf{Z}^{(l)} \mid \theta^{(h)}), \quad \theta^{(h)} \sim p(\theta \mid \mathbf{Y}_i^s = \mathbf{y}^*, \mathbf{Y}_{-i}^s, \mathbf{Y}^{us}, S). \quad (6)$$

The Monte Carlo step requires re-estimation of the synthesis model for each guess $\mathbf{y}^*$, which could be computationally prohibitive if many possible guesses need to be evaluated. To avoid re-estimating the model to draw samples of $\theta$, a common procedure via importance sampling is adopted. In particular, available draws of $\theta$ from the model used for generating the synthetic dataset $\mathbf{Z}^{(l)}$ act as proposals for the importance sampling algorithm. Readers are referred to Paiva et al. (2014) and Hu et al. (2014) for a review of importance sampling and its usage in the applications therein.
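As a rough illustration of the importance sampling idea (a generic self-normalized sketch, not the exact estimator of any cited paper), the function below reuses $H$ posterior draws obtained once from the synthesis model fitted to the original data and re-weights them toward the guess $\mathbf{y}^*$. The function name and inputs are hypothetical.

```python
import numpy as np

def is_estimate(log_f_Z, log_p_star, log_p_true):
    """Self-normalized importance-sampling estimate of p(Z^(l) | Y_i = y*, ...).

    All inputs are arrays over H posterior draws theta^(1..H) obtained once
    from the synthesis model fitted to the original data (no refit per guess):
      log_f_Z[h]    = log p(Z^(l) | theta^(h))
      log_p_star[h] = log p(y* | theta^(h))   -- the intruder's guess
      log_p_true[h] = log p(y_i | theta^(h))  -- the true record
    The weight ratio p(y* | theta) / p(y_i | theta) re-weights draws from
    the posterior given the true data toward the posterior given y*.
    """
    log_w = log_p_star - log_p_true      # log importance weights
    log_w = log_w - log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    return float(np.sum(w * np.exp(log_f_Z)) / np.sum(w))

# Sanity check: when y* = y_i the weights are uniform and the estimate
# reduces to the plain Monte Carlo average of p(Z^(l) | theta^(h)).
est = is_estimate(np.log(np.array([0.2, 0.25, 0.15])),
                  np.array([-1.0, -2.0, -1.5]),
                  np.array([-1.0, -2.0, -1.5]))
```

In practice all quantities should be kept on the log scale for as long as possible, since $p(\mathbf{Z}^{(l)} \mid \theta)$ is typically tiny.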

3.3 Selected examples

In this selected examples section, we want to show the readers, across a few different applications: i) what $\mathbf{y}_i^s$, $\mathbf{Z}$, $\mathbf{Y}^{us}$, $A$, and $S$ are; ii) what the risk scenarios are (i.e. what assumptions are made about $A$ and $S$), and their implications; and iii) what the specific file-level attribute disclosure probability summaries are in each application. For each application, we give a brief overview of the dataset(s) and research questions to provide background. We also mention the synthesizers, but the details of the synthesizers and the evaluation of the utility of the synthetic data are omitted, as we focus on the estimation of probabilities of attribute disclosure in this paper. Interested readers should refer to the cited papers for further information.

3.3.1 Fully synthetic categorical data

Hu et al. (2014) aimed at generating fully synthetic categorical data for a subset of individuals from the 2012 American Community Survey (ACS) public use microdata sample for the state of North Carolina. They considered 14 unordered categorical variables, as listed in Table 1. We include the variables, the number of categories of each variable, and whether a variable is synthesized in this table.

Variable Number of categories Synthesized
Sex 2 Yes
Age 4 Yes
Race 6 Yes
Education level 4 Yes
Marital status 5 Yes
Language 2 Yes
Birth place 7 Yes
Military 3 Yes
Work 3 Yes
Disability 2 Yes
Health insurance coverage 2 Yes
Migration 3 Yes
School 3 Yes
Hispanic 2 Yes
Table 1: Variables used in Hu et al. (2014). Data taken from the 2012 ACS public use microdata samples.

While the authors attempted fully synthetic data generation (Rubin, 1993), they followed the partially synthetic approach (Little, 1993) of replacing all values in the data with synthetic values (Drechsler, 2011a). Unlike the approach of Rubin (1993), where there is no correspondence between a fully synthetic record and a real-world record $i$, their approach maintains such correspondence, which is a key assumption in their proposed attribute disclosure evaluation methods. For more discussion on the two approaches to generating fully synthetic data, refer to Drechsler (2018).

The authors used the Dirichlet Process mixture of products of multinomials (DPMPM) synthesizer. The DPMPM consists of a set of flexible Bayesian latent class models developed to capture complex relationships among multivariate unordered categorical variables (Dunson and Xing, 2009). In recent years, the DPMPM has been proposed as a multiple imputation engine for missing data problems, and as a synthesizer for statistical disclosure control. Si and Reiter (2013) implemented the DPMPM as a missing data imputation engine and demonstrated its superior performance compared to traditional sequential imputation models such as multiple imputation with chained equations (MICE; Buuren and Groothuis-Oudshoorn (2011)). In addition to Hu et al. (2014), Drechsler and Hu (2018+) and Hu and Savitsky (2018+) used the DPMPM synthesizer for generating partially synthetic data with geocoding information. Variations of the DPMPM include versions dealing with structural zeros (Manrique-Vallier and Reiter, 2014; Manrique-Vallier and Hu, 2018), extensions of the multinomial synthesizer (Hu and Hoshino, 2018), and versions dealing with individuals nested within households (Hu et al., 2018; Akande et al., 2018+b, 2018+a).

The DPMPM synthesizer assigns an underlying latent class to each record. Conditional on the latent class assignment, each attribute independently follows its own distribution; for an unordered categorical variable, this is usually a multinomial distribution. To generate the synthetic vector of attributes of one record, we first sample the latent class assignment; then, for each attribute, we generate the value from its independent multinomial distribution with probabilities sampled from the DPMPM.
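The two-step generation described above can be sketched as follows. The parameter values ($K = 3$ latent classes, two attributes, and all probabilities) are made up for illustration; in the actual application they would be posterior draws from the fitted DPMPM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior-draw parameters: pi is the latent-class
# membership distribution; phi[j][k] is attribute j's multinomial
# distribution within latent class k (here sex: 2 levels, education: 4).
pi = np.array([0.5, 0.3, 0.2])
phi = [
    np.array([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]),    # sex
    np.array([[0.10, 0.20, 0.30, 0.40],
              [0.40, 0.30, 0.20, 0.10],
              [0.25, 0.25, 0.25, 0.25]]),              # education
]

def synthesize_record(pi, phi, rng):
    # Step 1: sample the record's latent class assignment.
    k = rng.choice(len(pi), p=pi)
    # Step 2: given the class, sample each attribute independently from
    # its own multinomial (categorical) distribution.
    return [rng.choice(p.shape[1], p=p[k]) for p in phi]

synthetic = [synthesize_record(pi, phi, rng) for _ in range(5)]
```

Mixing over latent classes is what lets this simple conditional-independence structure capture complex dependence among the attributes.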

We use $\mathbf{Y}_i$ to represent the random attribute vector of record $i$ (the superscript $s$ is dropped because this is fully synthetic, i.e. $\mathbf{Y}^{us} = \emptyset$ and $\mathbf{Y} = \mathbf{Y}^s$), $\mathbf{Y}$ to represent the random attribute vectors of all records, $\mathbf{Z}^{(l)}$ to represent each fully synthetic dataset, where $l = 1, \ldots, m$, and $\mathbf{Z}$ to represent all fully synthetic datasets.

Following the general setup and notations introduced earlier, we use $A$ to represent the intruder's information on persons' attributes in the sample (i.e. auxiliary information), and $S$ to represent any meta-data released by the agency about the synthesis model. The goal is to estimate $p(\mathbf{Y}_i = \mathbf{y}^* \mid \mathbf{Z}, A, S)$ for one or more target records in the sample. Specifically in this case, because all attributes are unordered categorical, we are able to enumerate all possible combinations of the categorical attributes.

Then the expression of the probability of attribute disclosure of an entire record becomes Equation (7),

(7)

where is a guess by the intruder.

Hu et al. (2014) set , which corresponds to the “worst case” scenario where the intruder knows the actual data for all records except for the record .

To estimate , the authors proposed to dramatically reduce the set of possible combinations that could take. Specifically, they considered the neighborhood near (the true record), which contains only feasible candidates that differ from in one variable. In their illustrative application, the subset contains only combinations, reduced from cells in the full contingency table. The authors noted that if the risks of being the true are acceptable in this reduced set, then the risks would be even lower when considering the full set; the risks in the reduced set thus serve as an upper bound. Importance sampling techniques were applied to avoid re-evaluating the probability of obtaining the synthetic datasets given different combinations of , as in Equation (4).
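The reduced candidate set can be enumerated mechanically. The sketch below, with hypothetical names, lists all candidates that differ from the true record in exactly one categorical variable, assuming records are coded as tuples of level indices.

```python
def one_variable_neighbors(record, n_levels):
    """Enumerate candidate records differing from `record` in exactly
    one categorical variable; together with the record itself, this
    forms a reduced feasible set for the intruder's guesses.

    record   : tuple of level indices, one per variable
    n_levels : number of levels of each categorical variable
    """
    neighbors = []
    for j, d in enumerate(n_levels):
        for v in range(d):
            if v != record[j]:
                cand = list(record)
                cand[j] = v  # swap exactly one variable's level
                neighbors.append(tuple(cand))
    return neighbors

# a record with three categorical variables of 2, 3 and 4 levels:
# (2-1) + (3-1) + (4-1) = 6 one-variable neighbors
cands = one_variable_neighbors((0, 1, 2), [2, 3, 4])
```

The size of this set grows only linearly in the number of variables and levels, in contrast to the full contingency table, whose size is the product of the numbers of levels.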

For the prior on , Hu et al. (2014) assumed a uniform prior, which assigns equal probability to each of the combinations in the reduced set (e.g. in their illustration). The prior probabilities cancel out in the computation.

To summarize the calculated risks of attribute disclosure of all records, Hu et al. (2014) created two file-level attribute disclosure probability summaries:

  1. The ranking of the probability of the true record being disclosed, among the subset of combinations;

  2. The re-normalized probability of the true record being disclosed, among the subset of combinations.

In general, the higher the ranking and the re-normalized probability, the higher the attribute disclosure risks.
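Both file-level summaries are straightforward to compute once the posterior probabilities over the reduced candidate set are available. A minimal sketch with a hypothetical function name, taking possibly unnormalized probabilities as input:

```python
import numpy as np

def attribute_risk_summaries(probs, true_index):
    """probs      : (possibly unnormalized) posterior probabilities over
                    the reduced candidate set
    true_index : position of the true record in that set
    Returns the rank of the true record (1 = most probable) and its
    re-normalized probability within the candidate set."""
    probs = np.asarray(probs, dtype=float)
    rank = int(np.sum(probs > probs[true_index])) + 1
    renormalized = float(probs[true_index] / probs.sum())
    return rank, renormalized

# the true record (index 1) is the most probable of three candidates
rank, renorm = attribute_risk_summaries([2, 5, 3], true_index=1)
# rank = 1, renorm = 5 / (2 + 5 + 3) = 0.5
```

Averaging or tabulating these two quantities over all records then yields the file-level summaries.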

Additionally, Hu et al. (2014) investigated scenarios where the intruder might know a subset of values in , for example, demographic variables. The authors then defined each as , where the additional subscript denotes the variables known by the intruder and denotes the variables unknown by the intruder. Subsequently, to evaluate risks for intruders seeking to estimate the distribution of , the authors defined , and the attribute disclosure probability becomes Equation (9),

(9)

The estimation procedure works in a similar way, and we refer interested readers to Hu et al. (2014) for details and discussion.

3.3.2 Partially synthetic continuous data

Wang and Reiter (2012) aimed at generating partially synthetic data for sharing precise geographies, namely the exact longitude and latitude of each death in a sample of 2002 North Carolina mortality records. Only the exact longitude and latitude of each record were synthesized; all non-geographic variables were kept unchanged. We include the variables, their descriptions, and whether each variable is synthesized in Table 2.

Variable Description Synthesized
Longitude Recoded (1 - 100) Yes
Latitude Recoded (1 - 100) Yes
Sex Male, female No
Race White, black No
Age (years) 16-99 No
Autopsy performed Yes, no, missing No
Autopsy findings Yes, no, missing No
Marital status 5 categories No
Attendant 3 categories No
Hispanic 7 categories No
Education (years) 0-17 No
Hospital type 8 categories No
Cause of death Binary No
Table 2: Variables used in Wang and Reiter (2012). Data taken from the 2002 North Carolina mortality records.

The authors used classification and regression tree (CART; see Reiter (2005b) for details of using CART to generate partially synthetic data) synthesizers for generating longitudes and latitudes. In particular, they first fit a regression tree of longitude on all non-geographic attributes and generated synthetic longitudes using the Bayesian bootstrap. After obtaining synthetic longitudes, they fit another regression tree of latitude on all non-geographic attributes and longitude, and generated synthetic latitudes using the Bayesian bootstrap. In the end, the partially synthetic precise geographies were simulated.
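The sequential CART synthesis with a Bayesian bootstrap can be sketched as follows. This is a simplified illustration, not the authors' code: it uses scikit-learn's `DecisionTreeRegressor` to form the leaves, approximates the Bayesian bootstrap by Dirichlet(1, ..., 1) weights over each leaf's donor values, and conditions the latitude tree on the synthetic (rather than confidential) longitudes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_bb_synthesize(X, y, rng, min_leaf=5):
    """Fit a regression tree of y on X, then replace each record's y
    with a Bayesian-bootstrap draw from the donors in its leaf."""
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(X, y)
    leaves = tree.apply(X)  # leaf index of each record
    y_syn = np.empty_like(y, dtype=float)
    for leaf in np.unique(leaves):
        idx = np.where(leaves == leaf)[0]
        donors = y[idx]
        # Bayesian bootstrap: Dirichlet(1, ..., 1) weights over donors
        weights = rng.dirichlet(np.ones(len(donors)))
        y_syn[idx] = rng.choice(donors, size=len(idx), p=weights)
    return y_syn

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # non-geographic attributes
lon = X[:, 0] + rng.normal(scale=0.1, size=200)
lat = 0.5 * lon + rng.normal(scale=0.1, size=200)

lon_syn = cart_bb_synthesize(X, lon, rng)                              # step 1
lat_syn = cart_bb_synthesize(np.column_stack([X, lon_syn]), lat, rng)  # step 2
```

Because each synthetic value is resampled from confidential values within the same leaf, the synthetic geographies stay plausible given the non-geographic attributes while breaking the record-level link to the true locations.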

We use to represent the random longitude and latitude of record , to represent the random longitudes and latitudes of all records, to represent the un-synthesized values of all records, to represent each partially synthetic dataset, where , and to represent all partially synthetic datasets.

Following the general setup and notations introduced earlier, we use to represent the intruder’s information on person’s attributes in the sample (i.e. auxiliary information), and to represent any meta-data released by the agency about the synthesis model. The goal is to estimate for one or more target records in the sample. We now express the probability of attribute disclosure when estimating the longitude and latitude of record , which is the same as in Equation (1) and restated below in Equation (10). Recall that is a possible original value of by the intruder.

(10)

Wang and Reiter (2012) evaluated two scenarios regarding the choice of and .

  1. Scenario #1 (high-risk): The intruder knows everything except for one target’s , i.e., , and includes everything about the CART except the individual geographies in the nodes. Note that is the same as the “worst case” scenario discussed in Section 3.2.

  2. Scenario #2 (low-risk): The intruder does not know any records’ geographies, i.e., , and includes everything about the CART except the individual geographies in the nodes.

For the high-risk scenario, to estimate , importance sampling techniques were applied to avoid re-evaluating the probability of obtaining the synthetic datasets given different combinations of , as in Equation (4). For the intruder’s prior on , Wang and Reiter (2012) assumed a uniform distribution on a grid over a small area containing the target’s true longitude and latitude.

The authors noted that the uniform prior for represents strong intruder prior information, because the prior was given on a grid over a small area. They also noted that the value of the risk measure would change if other prior specifications were given, though they did not consider other specifications in their attribute risk disclosure evaluation.

Specifically, Wang and Reiter (2012) developed two geographies-specific attribute disclosure risk measures:

  1. A Euclidean distance between the intruder’s guess of the longitude and latitude and the actual longitude and latitude;

  2. The count recording the number of actual cases in a circle centered at the actual longitude and latitude with radius .

In general, larger values of and correspond to smaller attribute disclosure risks.
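Both geographies-specific measures are simple to compute for a single target. A sketch with hypothetical names, where locations are (longitude, latitude) pairs:

```python
import numpy as np

def geo_attribute_risks(guess, truth, locations, radius):
    """D: Euclidean distance between the intruder's guess and the actual
    location; R: number of actual cases within `radius` of the actual
    location. Larger D and R correspond to lower attribute disclosure risk."""
    guess = np.asarray(guess, dtype=float)
    truth = np.asarray(truth, dtype=float)
    D = float(np.linalg.norm(guess - truth))
    dists = np.linalg.norm(np.asarray(locations, dtype=float) - truth, axis=1)
    R = int(np.sum(dists <= radius))
    return D, R

D, R = geo_attribute_risks(
    guess=(3.0, 4.0), truth=(0.0, 0.0),
    locations=[(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)],
    radius=2.5)
# D = 5.0; three records lie within radius 2.5 of the truth, so R = 3
```

Note that real longitude/latitude pairs would normally require a geodesic rather than Euclidean distance; the measures here follow the recoded-coordinate setting of the paper.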

We also want to note that Paiva et al. (2014) used a similar dataset with the goal of partially synthesizing geographies. Though their synthesis model was different from the one in Wang and Reiter (2012), they proposed three other file-level attribute disclosure probability summaries for synthetic data applications involving geographic locations.

  1. A file-level risk measure of the percentage of records with the true location being the maximum posterior probability of record ;

  2. A file-level risk measure of the percentage of records with the true location being the maximum posterior probability of record , and record has unique patterns;

  3. A Euclidean distance measure between the true location and the guess with the maximum posterior probability of record .

In general, smaller values of measures (i) and (ii) and larger values of measure (iii) correspond to smaller attribute disclosure risks.

3.3.3 Fully synthetic continuous data

Wei and Reiter (2016) aimed at generating fully synthetic data for sharing magnitude microdata from business establishments. The magnitude variables were the number of skilled laborers, the number of unskilled laborers, the wages of skilled laborers, and the wages of unskilled laborers in a sample of food manufacturing establishments in Colombia in 1977. All four magnitude variables were synthesized, making the released microdata fully synthetic. We include the variables, their descriptions, and whether each variable is synthesized in Table 3.

Variable Description Synthesized
Number of skilled laborers Integer Yes
Number of unskilled laborers Integer Yes
Wages of skilled laborers Integer Yes
Wages of unskilled laborers Integer Yes
Table 3: Variables used in Wei and Reiter (2016). Data taken from the 1977 Colombia food manufacturing establishments sample.

The authors used three synthesizers based on finite mixtures of Poisson (MP) distributions. The class of finite mixtures of Poissons can i) capture complex multivariate associations among the variables, and ii) model count variables. In addition to the basic MP synthesizer, Wei and Reiter (2016) proposed the mixture of multinomials (MM) synthesizer, which ensures the synthetic values sum to the marginal totals in the confidential data. The marginal total constraints are satisfied by performing another layer of multinomial draws of counts within each occupied Poisson mixture component. Specifically, the totals (e.g. the number of skilled laborers) and the number of cases (in running the MP, at each MCMC iteration, each record is assigned to a component) in each occupied Poisson mixture component are computed and stored. Based on the totals and the number of cases, a multinomial sample is generated, distributing the totals across all levels. Within each occupied Poisson mixture component, the marginal totals match between the synthetic and confidential data, and therefore the overall marginal totals also match. Furthermore, the authors proposed the tail-collapsed mixture of multinomials (TCMM) synthesizer, which effectively performs a model-based variation of microaggregation plus noise, by collapsing the tails of individual variables (i.e. risky values). For the TCMM, one needs to specify a parameter associated with the quantile, namely , which acts as a threshold to control the amount of collapsing.
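The core of the MM adjustment, redistributing a stored component total across the records assigned to that component, can be sketched as a single multinomial draw. The uniform cell probabilities below are a simplifying assumption for illustration, not necessarily the authors' choice:

```python
import numpy as np

def mm_component_counts(total, n_cases, rng):
    """Within one occupied Poisson mixture component, redistribute the
    stored variable total across its n_cases records via a multinomial
    draw, so the synthetic counts sum exactly to the confidential total.
    (Uniform cell probabilities are a simplifying assumption here.)"""
    return rng.multinomial(total, np.full(n_cases, 1.0 / n_cases))

rng = np.random.default_rng(0)
# e.g. 120 skilled laborers stored for a component with 8 assigned records
counts = mm_component_counts(120, 8, rng)
```

Because the multinomial draw conserves its total by construction, summing the synthetic counts within each component, and hence over the whole file, reproduces the confidential marginal totals exactly.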

We use to represent the random vector of the numbers of skilled and unskilled laborers and their corresponding wages for record , to represent these four random magnitude variables of all records, to represent each fully synthetic dataset, where , and to represent all fully synthetic datasets. Note that we drop the superscript in and use and directly because every variable is synthesized.

Specific to the business establishment survey data, we need to define a few other quantities before we can describe the risk scenarios and file-level attribute disclosure probability summaries. We use and to represent the largest and second largest values of variable in , the original confidential dataset. When the intruder does not know these values, we use and as the random variables representing the intruder’s uncertain knowledge about them. Furthermore, we let be the total of variable in , and use and to represent the values of two entire records.

Wei and Reiter (2016) considered a variety of risk scenarios. As an illustration of evaluating attribute disclosure for fully synthetic continuous data, we present the scenario where the intruder, who has the second largest value of a certain variable, , attempts to use the released synthetic data to learn about the individual with the largest value of that variable, the random quantity . Such a scenario is commonly considered by official statistics agencies with business establishment data (Kim et al., 2015, 2018).

Recall that we use to represent the intruder’s information on person’s attributes in the sample (i.e. auxiliary information), and to represent any meta-data released by the agency about the synthesis model. Translating the scenario above into choices of and , we come to the expression below, which represents the attribute disclosure risk probability of guessing when is available,

where is a possible original value of by the intruder.

To estimate , the authors applied techniques that use to approximate the set of records in the same component occupied by the target record, simplifying the computation. We refer the readers to Wei and Reiter (2016) for the details regarding the MM and TCMM synthesizers.

For the intruder’s prior on , Wei and Reiter (2016) discussed the choice of a non-uniform prior distribution, which could provide more accurate prior guesses and is worth noting here. In their empirical illustration of fully synthesizing magnitude data from the Colombia food manufacturing establishments dataset, the authors estimated the chance that the largest value of the number of skilled laborers falls into an interval, with the lower bound being the second largest value and the upper bound being pre-defined. Among the three synthesizers, the attribute disclosure risks under the MP and the MM synthesizers are extremely high, whereas the risks under the TCMM synthesizer are overall much lower. Furthermore, the risks decrease as the threshold parameter decreases, which is expected because the amount of tail collapsing increases as decreases. Interested readers are encouraged to consult Wei and Reiter (2016) for their explanations.

We also want to point out that Wei and Reiter (2016) evaluated two other sets of scenarios, both of which assume the intruder seeks to guess the values of variable of two records, and . In the first set of scenarios, the intruder knows all but one or two values in , which means the intruder seeks to estimate the probability of given , and different combinations of and . In the second set of scenarios, the intruder knows all data values except for one or two records, which means the intruder seeks to estimate the probability of given , and different combinations of and .

3.4 Discussion and comments

There are a few common practices for evaluating attribute disclosure risks in the selected examples, as well as in other synthetic data applications. The first concerns the assumptions about , the auxiliary information the intruder knows about records in . The “worst case” scenario of letting , i.e. the intruder knows all the original values of the synthesized variables of all records except for record , though it provides an upper bound of the risks, is a very strong and probably unrealistic assumption. The scenario greatly simplifies the estimation of as in Section 3.2.5 and Equation (3), by setting . If the assumption is weaker, for example, the intruder only knows the synthesized values of another record , then , which means the approximation in Equations (4) to (6) will involve extra steps of imputing all the other synthesized values (see Paiva et al. (2014) for a potential approximation). Such weaker assumptions are much more realistic, but almost computationally infeasible with the current setup. McClure and Reiter (2016) examined the effect on attribute disclosure risks in fully synthetic data of decreasing the number of observations the intruder knows (i.e. weakening the assumption on ). Future research on designing faster algorithms to estimate under weaker is desired.

The common practice of setting the prior on as a uniform distribution has been adopted in various applications. Because of the cancellation in Bayes’ rule, using a uniform prior for essentially simplifies the estimation, as we only need to estimate in Equation (2). We should recognize not only its convenience in computation, but also its constraints. A uniform prior can be un-informative in some cases, but strongly informative in others, as in Wang and Reiter (2012), which might not be realistic. Even where a uniform prior is plausible, it might need to be adjusted to reflect more realistic prior beliefs. For example, it is possible to argue that the 35 combinations in the reduced subset in Hu et al. (2014) should not really be treated as equally likely (i.e. a uniform prior); rather, some combinations might be more plausible than others, thus carrying higher prior probability. The general advice is to consider a wide range of prior distributions for if possible, and not to choose the uniform prior only for its simplicity. Choosing a more realistic prior distribution provides a more reasonable attribute disclosure risk measure (Wei and Reiter, 2016).

When estimating , importance sampling techniques are widely used to avoid re-estimating the synthesis model for each . First, we should recognize that if is not as strong as , even importance sampling techniques will not help much; see the discussion in the first paragraph of this section. Second, the set of guesses of , , is typically reduced to a much smaller set than the full set containing all possible combinations. Even though the reduction provides an upper bound of the attribute disclosure risks (Hu et al., 2014, 2018), such reduction is really applied for computational feasibility. Further research paths include faster algorithms to expand the small reduced set, and new algorithms to efficiently search for the that gives a high probability estimate of , thereby enabling the data disseminator to check against the actual truth and determine the attribute disclosure risk level. Third, to use the Monte Carlo approximation coupled with importance sampling in Equation (6), draws of are necessary, which means the final synthetic data generation process involves parametric models. Among the selected examples, Hu et al. (2014) and Wei and Reiter (2016) had parametric models for the outcome (multinomial and Poisson, respectively). Even though Wang and Reiter (2012) used non-parametric CART synthesizers, their ultimate synthetic data generation process involves Bayesian bootstrap sampling with mixtures of normal distributions. It is unclear how to estimate attribute disclosure risks for truly non-parametric synthesizers, which can be a fruitful research path.

There are additional possible difficulties in implementing the Bayesian estimation procedure for attribute disclosure risk evaluation. As noted in Manrique-Vallier and Hu (2018), their proposed synthesizers for categorical variables with structural zeros had serious stability issues with the estimation of , as its values varied by several thousands on the log scale from one sample of to another, resulting in enormous mean-squared errors. The authors then developed an indirect bootstrap hypothesis testing framework to approximate the ranking of in the reduced set. We refer the readers to Manrique-Vallier and Hu (2018) for details.

One final comment concerns the work of McClure and Reiter (2012), where the authors compared the disclosure risk criterion of ε-differential privacy with a criterion based on attribute disclosure risk probabilities. The evaluation from their simulation studies was that the two paradigms are not easily reconciled; moreover, attribute disclosure risks can sometimes be small even when ε is large. The authors proposed an alternative disclosure risk assessment approach, one that integrates both paradigms, though great computational challenges were foreseeable. Further research on risk assessment integrating the two paradigms is desired.

4 Bayesian estimation of identification disclosure risks

As discussed previously, we only consider identification disclosure risks for partially synthetic data.

Researchers have worked on Bayesian probabilistic matching to estimate the probabilities of identification of sampled units. Duncan and Lambert (1986, 1989) and Lambert (1993) developed Bayesian approaches to i) model the behavior of intruders, and ii) quantify sources of uncertainty about those estimated probabilities. Their work was followed by Fienberg et al. (1997), who estimated probabilities of identification for continuous microdata that had undergone SDL through added random noise.

Observing the lack of illustrative applications on genuine data, Reiter (2005a) extended the Duncan-Lambert framework using data from the Current Population Survey (CPS). Common SDL techniques (recoding, topcoding, swapping, adding random noise, and combinations of these techniques) were applied to genuine microdata in their illustrations. They also considered different assumptions of intruders’ knowledge and behavior and incorporated such information into the estimation of the identification probabilities.

The step-by-step probability estimation procedure in Reiter (2005a) has been standard practice for Bayesian probabilistic matching ever since, especially after the synthetic data approach gained momentum. Reiter and Mitra (2009) in particular first set up the framework of Bayesian probabilistic matching for partially synthetic data.

We now turn to the framework in Reiter and Mitra (2009) for identification disclosure risk estimation for synthetic data, which was built on the more general framework for identification disclosure risk estimation under common SDL techniques in Reiter (2005a). We use similar notations, highlight the key steps, and illustrate with selected examples. We have chosen examples that build upon the framework but are tailored for specific purposes and needs. To be as comprehensive as possible, we present two partially synthetic categorical data applications, i) Reiter and Mitra (2009) and ii) Drechsler and Hu (2018+), as well as iii) a partially synthetic categorical and continuous data application (Drechsler and Reiter, 2010). In the end, we discuss the challenges and future directions of this framework.

4.1 Notations and setup

In the sample S of units and variables, the notation refers to the -th variable of the -th unit, where and . The column contains some unique identifiers (such as name or Social Security Number), which are never released. Among the recorded variables, i) some are available to users from external databases, denoted by , and ii) others are unavailable to users except in the released data, denoted by . We therefore have the vector response of the -th unit, . We also have the matrix representing the original values of all units.

On the agency side, suppose it releases all units of the sample S. Similar to the split of , we have . Among the available variables, we further split them into i) the synthesized variables, and ii) the un-synthesized variables. We therefore have , and we let be the matrix of all released data. We also let be all units’ original values of the synthesized variables. We note that in some cases, the agency might only release units of the sample (Reiter, 2005a).

On the intruder side, let be the vector of information that the intruder has. may or may not be in , but we assume for some unit in the population. This vector only contains un-synthesized and synthesized variables (no unavailable variables as in and ), thus we have . The intruder’s goal is to match record in to the target when . Additionally, two other pieces of information can be available to the intruder. Let represent the meta-data released about the simulation models used to generate the synthetic data, and let represent the meta-data released about the reason why records were selected for synthesis. Either or could be empty.

There are released units in . Let be the random variable that equals when for , and equals when for some . The intruder intends to calculate for . The intruder is particularly interested in learning whether any of the calculated identification probabilities for are large enough to declare an identification.

For the agency, it is paramount to model different intruders’ knowledge and behavior when estimating identification risks from releasing synthetic datasets. The framework in Reiter and Mitra (2009) allows the incorporation of these different assumptions at multiple stages of the estimation process, thus giving extensive flexibility to parties evaluating identification disclosure risks.

4.2 Key estimating steps

The intruder intends to calculate for . Based on the split of , we re-write the probability as

(14)

In fact, the intruder does not know the actual values in , all units’ original values of the synthesized variables. Therefore for the intruder, integrating over its possible values when computing the match probabilities is necessary, as in

(15)

The estimation procedure for Equation (15) varies by the variable(s) in (e.g. whether in or in ), the variable types, and the assumptions on the level of knowledge of being in or not, and of and , among other things. Here we go through each of these aspects and their implications for the estimation process, and highlight several common practices that have been adopted, before illustrating with a selection of identification disclosure risk assessments from real synthetic data applications in Section 4.3.

4.2.1 The variable(s) in

An immediate simplification of in Equation (15) is

(16)

This is true because when is given, and are conditionally independent. That is, the intruder would use without the synthetic data , the unavailable variables , , or to attempt re-identification. Equation (16) will be used in Sections 4.2.2 and 4.2.3 as well.

Consider any variable in . Since it is an un-synthesized variable, for any unit in where the released value of , .

4.2.2 The variable(s) in

For categorical variables in the synthesized set , the intruder matches directly on . For numerical or continuous variables in , while exact matching could be pursued, the nature of numerical/continuous variables would result in zero match probabilities for most if not all of the records. Therefore, it is advisable to match numerical components of within some acceptable distance (e.g. Euclidean or Mahalanobis) from the corresponding .
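For continuous synthesized variables, distance-based matching can be sketched as follows (hypothetical names; Euclidean distance with a user-chosen tolerance `eps`):

```python
import numpy as np

def within_distance_matches(released_vals, target_vals, eps):
    """Indices of released records whose continuous values lie within
    Euclidean distance eps of the target's values; requiring exact
    equality would almost surely yield zero matches for continuous data."""
    d = np.linalg.norm(np.asarray(released_vals, dtype=float)
                       - np.asarray(target_vals, dtype=float), axis=1)
    return np.where(d <= eps)[0]

# only the first two records fall within 0.1 of the target
idx = within_distance_matches([[1.0, 1.0], [1.05, 0.98], [3.0, 3.0]],
                              [1.0, 1.0], eps=0.1)
```

The choice of `eps` (and of the distance metric) encodes an assumption about how precisely the intruder can match, and should be varied in a sensitivity analysis.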

4.2.3 Whether is in or not

The overall assumption we have is that the vector of information that the intruder has, for some unit in the population, but not necessarily in . When is in , then the quantity in Equation (16) for is 0, i.e. . This simplifies calculating for . For example,

(17)

where is the number of units in with consistent with .

When is not in , then . If we let be the number of units in the population that have consistent with which are also included in , then

(18)

Determining can be done from census totals, or through estimation from available sources. Reiter and Mitra (2009) discussed possible ways of estimating it using survey weights. Model-based approaches to estimating can be applied too, for example Elamir and Skinner (2006), among others. Additional approaches to accounting for intruder uncertainty due to sampling were proposed in Drechsler and Reiter (2008).

It is important to recognize that setting results in conservative measures of identification disclosure risks.
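The two cases can be combined into one small routine. This is an illustrative sketch with hypothetical names: matching records receive probability 1/c when the target is known to be in the released data, and 1/F_t otherwise, where F_t is the (estimated) number of consistent units in the population:

```python
def match_probabilities(released_keys, target_key, in_sample, F_t=None):
    """Match probabilities over released records that agree with the
    target on the matching variables.

    released_keys : matching-variable values of the released records
    target_key    : the target's values on the same variables
    in_sample     : True if the target is known to be in the released data
    F_t           : population count of consistent units (used otherwise)
    """
    matches = [j for j, key in enumerate(released_keys) if key == target_key]
    c = len(matches)
    denom = c if in_sample else F_t
    return {j: 1.0 / denom for j in matches}

keys = [("M", "white"), ("F", "black"), ("M", "white")]
p_in = match_probabilities(keys, ("M", "white"), in_sample=True)
p_out = match_probabilities(keys, ("M", "white"), in_sample=False, F_t=10)
# p_in assigns 1/2 to records 0 and 2; p_out assigns 1/10 to each
```

Setting the denominator to c rather than F_t (i.e. assuming the target is in the sample) inflates each match probability, which is exactly why that assumption yields conservative risk measures.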

4.2.4 Assumptions about and

Previously, we let represent the meta-data released about the simulation models used to generate the synthetic data, and represent the meta-data released about the reason why records were selected for synthesis. We note that in practice, is usually dropped because the reasons why records were selected for synthesis are difficult to come by. However, can be available in many cases. For example, as discussed in Section 3.2, information about the synthesis models of the SIPP Synthetic Beta is available online (Benedetto et al., 2013), and the intruder can be assumed to know it. Similarly, information about the synthesis process of the SynLBD is publicly available in Kinney et al. (2011, 2014).

4.2.5 Estimating through Monte Carlo

This description follows the one given in Drechsler and Hu (2018+). The construction in Equation (15) suggests a Monte Carlo approach to estimating each (note that is used in place of ; is dropped, assuming it is unavailable), and we re-write it as

(19)

For the Monte Carlo approach, perform the following two-step process.

  1. Sample a value of from , and let represent one set of simulated values.

  2. Compute using exact matching, treating as the actual collected values.

This two-step process is iterated times, where ideally is large, and Equation (19) is estimated as

(20)

where indicates one iteration of the two-step process.

When has no information, the intruder treats the simulated values as plausible draws of .
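The two-step process can be sketched on a toy example. Here the posterior-predictive draw of the target's unknown synthesized value is a stand-in (a fixed categorical distribution), not a real synthesis model:

```python
import numpy as np

rng = np.random.default_rng(2)
released_keys = ["A", "B", "A", "C"]  # synthesized matching variable in D

def draw_target_key():
    """Step 1 stand-in: a posterior-predictive draw of the target's
    unknown original value of the synthesized variable."""
    return rng.choice(["A", "B", "C"], p=[0.5, 0.3, 0.2])

m = 2000  # number of Monte Carlo iterations (ideally large)
probs = np.zeros(len(released_keys))
for _ in range(m):
    key = draw_target_key()                               # step 1
    matches = [j for j, k in enumerate(released_keys) if k == key]
    for j in matches:                                     # step 2: exact match
        probs[j] += 1.0 / len(matches)
probs /= m
# probs approximates (0.25, 0.30, 0.25, 0.20): "A" has probability 0.5
# split between two matching records, "B" 0.3, "C" 0.2
```

Averaging the exact-match probabilities over the simulated draws integrates out the intruder's uncertainty about the true synthesized values, exactly as Equation (19) prescribes.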

4.2.6 Three summaries of identification disclosure probabilities

For the attribute disclosure risk measures in Section 3.3, summaries of attribute disclosure probabilities vary by variable types and contexts. For example, for fully synthetic categorical data, as in Hu et al. (2014) in Section 3.3.1, the summaries are i) the ranking, and ii) the re-normalized probability of the true record being disclosed. For partially synthetic continuous data, specifically in Wang and Reiter (2012) where synthetic precise geographies are released, the reported summaries are i) a Euclidean distance between the intruder’s guess of the geographies and the actual geographies, and ii) the count of actual cases in a circle centered at the actual geographies with radius .

Unlike the summaries of attribute disclosure probabilities, summaries of identification disclosure probabilities are more generally applicable, regardless of variable types and contexts. There are three such summaries, which we now describe, following Drechsler and Hu (2018+).

We need the following notations and definitions before we present the three summaries. Let be the number of records with the highest match probability for the target ; let if the true match is among the units and otherwise. Let when and otherwise, and let denote the total number of target records. Finally, let when and otherwise, and let equal the number of records with .

Now we can present the three widely used file-level summaries of identification disclosure probabilities, using the notations and definitions given above.

  1. The expected match risk:

    (21)

    When and , the contribution of unit to the expected match risk reflects the intruder randomly guessing at the correct match from the candidates. In general, the higher the expected match risk, the higher the identification disclosure risks.

  2. The true match rate:

    (22)

    which is the percentage of true unique matches among the target records. In general, the higher the true match rate, the higher the identification disclosure risks.

  3. The false match rate:

    (23)

    which is the percentage of false matches among unique matches. In general, the lower the false match rate, the higher the identification disclosure risks.
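The three summaries can be computed from two vectors per target: the number of records tied at the highest match probability, and an indicator of whether the true match is among them. A minimal sketch with hypothetical names:

```python
import numpy as np

def identification_risk_summaries(c, T):
    """c[i]: number of records tied for the highest match probability
    for target i; T[i]: 1 if the true match is among those records.
    Returns (expected match risk, true match rate, false match rate)."""
    c = np.asarray(c, dtype=float)
    T = np.asarray(T, dtype=float)
    expected_risk = float(np.sum(T / c))                       # Eq. (21)
    unique = c == 1                                            # unique matches
    true_rate = float(np.sum(unique & (T == 1)) / len(c))      # Eq. (22)
    n_unique = int(np.sum(unique))
    false_rate = (float(np.sum(unique & (T == 0)) / n_unique)  # Eq. (23)
                  if n_unique else 0.0)
    return expected_risk, true_rate, false_rate

# three targets: a unique true match, a true match tied with one other
# record, and a unique but false match
risk, true_rate, false_rate = identification_risk_summaries(
    c=[1, 2, 1], T=[1, 1, 0])
# expected match risk 1.5, true match rate 1/3, false match rate 1/2
```

Higher expected match risk and true match rate, and a lower false match rate, all indicate higher identification disclosure risks, as stated above.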

4.3 Selected examples

In this section, we show the readers a few different applications of partially synthetic data. We illustrate the variables in for each application. All applications follow a similar estimation procedure and report the same three summaries presented in Section 4.2.6: i) the expected match risk, ii) the true match rate, and iii) the false match rate.

For each application, we give a brief overview of the dataset(s) and research questions to provide the background. We also mention the synthesizers, but the details of the synthesizers and the evaluation of the utility of the synthetic data are omitted. Interested readers should refer to the cited papers for further information.

4.3.1 Partially synthetic categorical data 1

Reiter and Mitra (2009) aimed at partially synthesizing a sample of the 1987 Survey of Youth in Custody. There are 23 variables on the file, and the authors illustrated partially synthesizing two categorical variables, facility and race. Table 4 gives a partial list of the variables with their descriptions, synthesis information, and whether they are known by the intruder. The other 20 unlisted variables are assumed unknown to the intruder in the identification disclosure risk evaluation.

Variable Description Synthesized Known by intruder
Facility Categorical, 46 levels Yes Yes
Race Categorical, 5 levels Yes Yes
Ethnicity Categorical, 2 levels No Yes
Table 4: Selected variables used in Reiter and Mitra (2009). Data taken from the 1987 Survey of Youth in Custody.

To synthesize the facility and race variables, the authors first used multinomial regressions to synthesize facility. All other variables, except race and some variables causing multi-collinearity, were included as predictors in the multinomial regressions. Once all values of the facility variable were synthesized, the authors then synthesized race using multinomial regressions. The predictors in these multinomial regressions included all other variables plus indicator variables for facilities, except those causing multi-collinearity. Reiter and Mitra (2009) noted that the new values of race are simulated conditional on the values of the synthetic facility indicators.

For the identification disclosure risks evaluation, the authors treated facility and race as synthesized variables known to the intruder, and ethnicity as an unsynthesized variable known to the intruder. They also assumed that all targets are in the sample.

4.3.2 Partially synthetic categorical data 2

Drechsler and Hu (2018+) aimed at comparing a few existing synthesizers on a large German administrative database, the Integrated Employment Biographies (IEB), in order to provide access to detailed geocoding information. There are approximately 22 million records in the IEB. The authors considered the 11 variables listed in Table 5, which gives each variable, its description, whether it is synthesized, and whether it is known by the intruder in the identification disclosure risks estimation. The authors in fact experimented with different numbers of variables to synthesize in order to provide higher protection; Table 5, however, describes the main synthesis approach, for which the authors presented most of the utility and risk results.

Variable Description Synthesized Known by intruder
Exact geocoding info Longitude and latitude Yes Yes
Sex Male, female No Yes
Foreign Yes, no No Yes
Age 6 categories No Yes
Education 6 categories No No
Occupation level 7 categories No No
Occupation 12 categories No Yes
Industry of the employer 15 categories No Yes
Wage 10 categories (quantiles) No No
Distance to work 5 categories No No
ZIP code 2,063 ZIP code levels No No
Table 5: Variables used in Drechsler and Hu (2018+). Data taken from the IEB database in Germany. Note that the exact geocoding information is recorded as the distance in meters from the point at 52 degrees northern latitude and 10 degrees eastern longitude. It is converted to categorical for two of the three synthesizers.

The authors considered three synthesizers. The first is the DPMPM synthesizer used in Hu et al. (2014): the exact geocoding information was discretized into one unordered categorical variable, and a Dirichlet Process mixture model for the joint distribution of the 11 unordered categorical variables was estimated and used to generate synthetic data. The second is the CART synthesizer used in Wang and Reiter (2012), where the exact geocoding information (the latitude and longitude) was treated as continuous and synthesized sequentially; we call this synthesizer CART continuous. The third is also a CART synthesizer, but, as with the DPMPM synthesizer, the exact geocoding information was discretized into one unordered categorical variable; we call this synthesizer CART categorical. All three synthesizers were applied to generate partially synthetic IEB data, where only the geocoding information was synthesized (either as categorical or as continuous).
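The core CART-synthesis step, common to both CART variants, can be sketched as follows: fit a tree of the variable to be synthesized on the unsynthesized predictors, then replace each record's value with one resampled from the observed values in that record's leaf. The data below are a toy stand-in, not the IEB, and resampling leaf values directly is a simplification; a Bayesian bootstrap within each leaf would additionally propagate parameter uncertainty.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Toy stand-ins: X = unsynthesized predictors, y = the discretized
# geocoding variable (4 hypothetical cells, not real IEB categories).
n = 400
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int)

# Fit the CART model and record each record's leaf.
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)
leaf = tree.apply(X)

# Synthesize y by resampling observed values within each record's leaf.
syn_y = np.empty(n, dtype=int)
for lf in np.unique(leaf):
    idx = np.where(leaf == lf)[0]
    syn_y[idx] = rng.choice(y[idx], size=idx.size, replace=True)
```

For the CART continuous variant, the same leaf-resampling idea is applied with a regression tree to latitude and then, conditionally, to longitude.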

For the identification disclosure risks evaluation, the authors treated the exact geocoding information as a synthesized variable known to the intruder, and sex, foreign, age, occupation, and industry of the employer as unsynthesized variables known to the intruder. Because the IEB is a census, the authors also assumed that all targets are in the data. They reported the expected match risk, the true match rate, and the false match rate for the different synthesizers. While the CART categorical synthesizer produced synthetic data with the highest utility, its identification disclosure risks may be deemed too high; the authors therefore recommended two approaches for increasing the level of protection: i) aggregate the geocoding information to a higher level, and ii) synthesize additional variables in the dataset. Drechsler and Hu (2018+) preferred ii) over i), and interested readers are referred to the paper for their discussion and general recommendations.

4.3.3 Partially synthetic categorical and continuous data

Drechsler and Reiter (2010) aimed at partially synthesizing a sample from the March 2000 U.S. CPS. The authors in fact treated the sample as a census to illustrate their sampling-with-synthesis methodology, but for our illustration purposes we will ignore this difference. There are 10 variables on the file, and the authors illustrated partially synthesizing three of them (two categorical and one continuous). Table 6 gives the list of the variables with their descriptions, synthesis information, and whether each is known by the intruder.

Variable Description Synthesized Known by intruder