Strategies to facilitate access to detailed geocoding information using synthetic data

03/15/2018
by Joerg Drechsler, et al.
Manfred Antoni
Vassar College

In this paper we investigate whether generating synthetic data can be a viable strategy for providing access to detailed geocoding information for external researchers without compromising the confidentiality of the units included in the database. This research was motivated by a recent project at the Institute for Employment Research (IAB) in Germany that linked exact geocodes to the Integrated Employment Biographies, a large administrative database containing several million records. Based on these data we evaluate the performance of several synthesizers in terms of addressing the trade-off between preserving analytical validity and limiting the risk of disclosure. We propose strategies for making the synthesizers scalable for such large files, present analytical validity measures for the generated data, and provide general recommendations for statistical agencies considering the synthetic data approach for disseminating detailed geographical information. We also illustrate that the commonly used disclosure avoidance strategy of providing geographical information only at an aggregated level will not offer substantial improvements in disclosure protection if coupled with synthesis. As we show in the online supplement accompanying this manuscript, synthesizing additional variables should be preferred if the level of protection from synthesizing only the geocodes is not considered sufficient.

1 Introduction

In recent years, more and more statistical agencies have started collecting detailed geocoding information for some of their surveys or administrative databases. Using this information, researchers no longer depend on pre-specified administrative geographical levels such as municipalities or counties when analyzing spatial effects. Instead, they can define their own geographical areas of interest by aggregating over the individual geocodes. Furthermore, the detailed geocoding information can also be used to facilitate the linkage of data from different sources. However, these additional research opportunities come at a price: the detailed geographical information makes it very easy to identify individuals in the database. For this reason external researchers usually cannot get access to the detailed geocodes.

In this paper we evaluate different strategies to generate synthetic geocodes that could be disseminated to the public without violating confidentiality guarantees. With the synthetic data approach sensitive records or records that have a high risk of disclosure are replaced with draws from a model fitted to the original data. If the synthesis models are carefully developed, important features of the data are still maintained while risks of disclosure can be substantially reduced (Drechsler, 2011).

Several strategies have been proposed in the literature for synthesizing data containing detailed geographical information (Machanavajjhala et al., 2008; Sakshaug and Raghunathan, 2010; Wang and Reiter, 2012; Paiva et al., 2014; Quick et al., 2015, 2018). We discuss in Section 2 why most of these strategies are not suitable for our application. Instead we propose two alternative approaches: The first approach concatenates the information on latitude and longitude for each record, treating the resulting variable as an unordered categorical variable, and uses Dirichlet process mixtures of products of multinomials (DPMPM) for the synthesis. The usefulness of the DPMPM approach in the context of imputation for nonresponse was illustrated in Si and Reiter (2013). They found that the DPMPM outperforms MICE – a very popular imputation tool based on the sequential regression approach of Raghunathan et al. (2001) – for multiple imputation of missing categorical variables in large-scale assessment surveys. Hu et al. (2014) first applied the DPMPM approach in the synthetic data context to generate synthetic data from a subset of the 2012 American Community Survey with decent utility and risk results. The second approach generates synthetic values by using CART models (Reiter, 2005) for the categorical geocoding variable. These two approaches are compared to an approach previously proposed by Wang and Reiter (2012), which is also based on CART models, but treats the information on the latitude and longitude as two separate continuous variables. We evaluate the three different approaches in terms of their disclosure protection as well as their ability to preserve the analytical validity of the data. Assuming the data disseminating agency is not satisfied with the level of protection offered by synthesizing the geocodes, we evaluate two strategies to further improve the level of protection: The first strategy follows traditional approaches for disclosure limitation by aggregating the geocoding information before the synthesis. The second approach synthesizes additional variables in the dataset. We summarize the findings from these additional studies in the main text. Detailed results can be found in the online supplement accompanying this paper.

Our research was motivated by a recent project at the Institute of Employment Research (IAB) in Germany that added detailed geographical information to the Integrated Employment Biographies (IEB), a rich administrative data source at the IAB. The institute is currently investigating strategies for providing access to these data for external researchers without compromising the confidentiality of the individuals included in the database (whether the datasets will be disseminated to the scientific community or access will be granted onsite only still needs to be decided at this stage). The results presented in this paper are part of this endeavor and all evaluations of the synthesizers are conducted using a subset of variables from these data.

The remainder of the paper is organized as follows: In Section 2 we review alternative synthetic data approaches for protecting data containing detailed geographical information and argue why these approaches are not suitable for our application. Section 3 introduces the IEB. In Section 4 we discuss the two different synthesizers that we deem suitable for our context: the DPMPM synthesizer and the CART synthesizer. Section 5 provides some details on the implementation of the different synthesizers for the IEB synthesis. In Section 6 the analytical validity and the disclosure risk of the generated synthetic datasets are evaluated under various assumptions regarding the synthesis strategy. The paper concludes with a general discussion of the three synthesis approaches, implications for the dissemination project at the IAB, and some ideas for future research.

2 Related approaches

The idea of synthesizing the geographical information to protect the individuals in the data while still allowing analysis on a detailed geographical level is not new. In fact, several papers appeared in recent years proposing different strategies for synthesizing the geographical information. In this section we present a brief review of the different proposals and discuss the limitations of the proposals for our context.

The first successful implementation of geographical synthesis was discussed in Machanavajjhala et al. (2008). The authors propose a strategy for synthesizing the place of living for all individuals working in the U.S. The synthesizer is used to generate the underlying data for an application called OnTheMap provided by the U.S. Census Bureau. This application graphically visualizes commuting patterns on a detailed geographical level. The authors used a Dirichlet/Multinomial model for synthesis and adjusted the Dirichlet priors such that they were able to prove that their synthesizer guaranteed some formal level of privacy called differential privacy (see Machanavajjhala et al. (2008) for details). While the formal privacy guarantees are a very attractive feature of this approach, this synthesis strategy would not be suitable for our application. For OnTheMap only two variables needed to be considered: the place of living and the place of work. But even in this setting the authors had to come up with special techniques for dealing with the sparsity of the matrix when cross-classifying the two variables. We are considering 11 variables in our application and a multinomial model would provide poor results when fitted to such a high dimensional sparse matrix. Furthermore, the DPMPM can be seen as an extension of the multinomial model (it consists of a mixture of products of multinomial models), and thus is generally expected to give better results in terms of analytical validity than the classical multinomial model.

Another synthesis strategy proposed by Wang and Reiter (2012) is to treat the detailed geocoding information as a continuous variable and use CART models to synthesize these geocodes. The synthesizer proposed by the authors is one of the synthesizers that we evaluate in this paper. However, we show that improved results can be achieved if a different synthesis strategy is selected.

Sakshaug and Raghunathan (2010) propose using mixed effects modeling strategies for preserving the geographical information in the synthesized data. Mixed effects synthesis models are a natural way to preserve the geographical clustering effect. However, the aim of this modeling strategy is to preserve the geographical information when synthesizing other variables in the data. It cannot be used to synthesize the geographical information itself which is the goal for our application.

Paiva et al. (2014) use areal level spatial models (often called disease mapping models in the literature) to synthesize the geographical information. Although they start with exact geographies, their methods require defining fine grids over the spatial domain, then using the conditional autoregressive (CAR) model of Besag et al. (1991) to model the distribution of grid-counts. When synthesizing exact geographies, they recommend first to synthesize grid cells for each individual, and second to randomly assign each individual a location within the grid cells. Furthermore, Paiva et al. (2014) requires a small number of categorical variables with small number of levels. We find their approach too computationally intensive for the IEB data. Moreover, their models can at most preserve counts at the grid level, and randomly sampling exact geographies within a grid is ad hoc. Therefore, we did not include this approach in our evaluations.

Perhaps the most sophisticated synthesis strategy specifically tailored to preserving spatial statistics is presented in Quick et al. (2015). The authors use marked point process models for synthesis, which are especially useful if exact geocoding information is available. Specifically, they propose to model the data in three steps: (i) specify multinomial models for the categorical variables in the data, (ii) use a log-Gaussian Cox process to model the geographical location within each cell specified by cross classifying all categorical variables, and (iii) specify a normal regression for continuous variables given the categorical variables and location. The authors point out that estimating this model can be computationally intractable and suggest several steps and simplifying assumptions to reduce the computational burden for their real data application comprising five attributes on 6,294 records. We did not include this approach in our evaluations because implementing it would be too computationally intensive for the IEB data consisting of 10 attributes for more than 3.5 million records.

Yet another synthesis strategy is described in Quick et al. (2018), who used a differential smoothing synthesizer for home sale locations in San Francisco. Their approach is a two-step process. First, they model the log-transformed home sale prices using an unrestricted hierarchical model. Second, they identify spatial outliers based on the distances to their nearest neighbors, then fit a restricted hierarchical model to provide additional smoothing for higher protection. Similar to Sakshaug and Raghunathan (2010), the goal in this paper is to preserve the spatial structure of the variables contained in the dataset and not to synthesize the geocodes directly, which is the goal in our application.

3 The IEB data

The IEB integrates five different sources of information collected by the Federal Employment Agency through different administrative procedures: the Employment History, the Benefit Recipient History, the Participants-in-Measures History, the Unemployment Benefit II Recipient History, and the Jobseeker History. We refer to Jacobebinghaus and Seth (2010) for a detailed description of the different data sources and of the IEB. A schematic overview of the data sources of the IEB is given in Figure 1.

Figure 1: Sources of the IEB

Information available for individuals included in the IEB includes, among other things, the beginning and ending date of every employment, date of birth, gender, nationality, education, health status, employment status, monthly wages, and the place of work and place of residence at the ZIP code level. Establishment level information is provided in the IEB by aggregating individual level information to the establishment level.

Recently, exact geocoding information has been added for individuals and establishments included in the IEB at the reference date June 30, 2009. The geocoding information was obtained from a georeferenced address database for Germany provided by the Federal Agency for Cartography and Geodesy which contains approximately 22 million addresses of German buildings and their corresponding geographic coordinates. Based on exact matches regarding the address information it was possible to obtain geocoding information for 94.6% of the 36.2 million individuals and 93.2% of the 2.5 million establishments contained in the IEB on the reference date (see Scholz et al. (2012) for more details regarding the matching of the two data sources). Given the large number of variables and the sensitivity of the information contained, it is obvious that the linked data cannot be disseminated to the public using traditional statistical disclosure limitation techniques such as top coding, noise infusion, or swapping since the geocoding information would make it very easy to identify individuals in the database. In fact, the data currently available for external researchers only contain a 2% sample from the IEB with a limited set of non-sensitive variables and county level information as the lowest level of geographical detail. Counties with less than 100,000 observations in the full IEB are collapsed. Several additional steps such as aggregating some variables or dropping employment information outside a given age range are taken to ensure a sufficient level of confidentiality protection (see Section 3.4 in Ganzer et al. (2017) for further details regarding the anonymisation measures).

To be able to provide access to the detailed geocoding information for external researchers, the IAB is looking for innovative ways to generate a sufficiently protected version of the linked data that still contains useful information at least on some of the variables. It was decided to start with a limited set of variables initially and extend the set in the future if reasonable results, both in terms of disclosure risk and data utility, could be achieved in the first round. For simplicity, the variables included in this preliminary dataset are chosen to avoid structural zeros; that is, all cells in the implied contingency table have non-zero probability. We note that structural zeros can be incorporated in the DPMPM model using the methodology presented in Manrique-Vallier and Reiter (2014). Structural zeros are automatically maintained if the CART approach is used. The selected variables are listed in Table 1.

variable                                  characteristics
exact geocoding information of the        recorded as distance in meters from the point
place of living                           52° northern latitude, 10° eastern longitude
sex                                       male/female
foreign                                   yes/no
age                                       6 categories (<20, 20–30, 30–40, 40–50, 50–60, >60 years)
education                                 6 categories
occupation level                          7 categories
occupation                                12 categories
industry of the employer                  15 categories
wage                                      10 categories defined by quantiles
distance to work                          5 categories (<1, 1–5, 5–10, 10–20, >20 km)
ZIP code                                  2,063 ZIP code levels
Table 1: Variables included in the data set used for the evaluations

4 Modeling assumptions for the different synthesizers

In this section we briefly describe the DPMPM and the CART approaches for generating synthetic data that we considered as potential candidates for protecting the geocoding information in the data. More details on both synthesizers can be found in the appendix. The DPMPM is a Bayesian semi-parametric procedure. It uses a Dirichlet process mixture of products of multinomial distributions, which is a Bayesian version of a latent class model. The basic idea is to assume that each observation belongs to one of a potentially infinite number of latent classes and conditional on class assignment the variables contained in the data can be considered independent. It can be shown that the DPMPM provides full support on the space of distributions for multiple unordered categorical variables (Dunson and Xing, 2009). See Appendix A for more details on how the DPMPM can be turned into an engine for data synthesis.
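
For concreteness, the mixture representation underlying the DPMPM can be written compactly as follows (standard notation as in Dunson and Xing (2009), not copied from the appendix): given the latent class $z_i$ of record $i$,

$$\Pr(x_{i1}=c_1,\dots,x_{ip}=c_p \mid z_i=k) \;=\; \prod_{j=1}^{p} \psi^{(j)}_{k c_j}, \qquad \Pr(z_i=k) \;=\; \pi_k,$$

where the class probabilities follow the stick-breaking construction $\pi_k = V_k \prod_{l<k}(1-V_l)$ with $V_k \sim \mathrm{Beta}(1,\alpha)$ (Sethuraman, 1994; Ishwaran and James, 2001), so that the marginal distribution of $(x_{i1},\dots,x_{ip})$ is a mixture of products of multinomials.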

The CART approach is nonparametric and is based on classification and regression trees from the machine learning literature. As outlined in Drechsler and Reiter (2011), the approach seeks to approximate the conditional distribution of a univariate outcome from multiple predictors. The CART algorithm partitions the predictor space so that subsets of units formed by the partitions have relatively homogeneous outcomes. The partitions are found by recursive binary splits of the predictors. The series of splits can be effectively represented by a tree structure, with leaves corresponding to the subsets of units. The values in each leaf represent the conditional distribution of the outcome for units in the data with predictors that satisfy the partitioning criteria that define the leaf. CART has been adapted for generating partially synthetic data (Reiter, 2005) and we refer to Appendix B for more details on how this can be achieved. In our evaluations, we use a continuous CART synthesizer and a categorical CART synthesizer.
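
To make the mechanics concrete, the following R sketch shows one way a categorical CART synthesizer could be set up with rpart. The data frame dat, its column names, and the simple within-leaf resampling (instead of the Bayesian bootstrap used by Reiter (2005)) are illustrative assumptions, not the actual IAB implementation.

    # Hypothetical sketch of a categorical CART synthesizer using rpart.
    # Assumes a data frame `dat` whose column `geo` is the concatenated
    # latitude/longitude treated as an unordered factor; predictor names are made up.
    library(rpart)

    synthesize_geo <- function(dat, cp = 1e-8) {
      fit <- rpart(geo ~ sex + foreign + age + education + wage,
                   data = dat, method = "class",
                   control = rpart.control(minsplit = 20, minbucket = 7, cp = cp))
      leaf <- fit$where                 # terminal leaf of each original record
      syn  <- dat$geo
      for (l in unique(leaf)) {         # draw synthetic geocodes within each leaf
        idx <- which(leaf == l)
        syn[idx] <- sample(dat$geo[idx], length(idx), replace = TRUE)
      }
      syn
    }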

In this article we assume that the aim is to generate partially synthetic data, i.e., only some of the variables in the released data will be synthesized. We note that both approaches can also be used to synthesize all variables, i.e., to generate fully synthetic data. See Drechsler (2011) for a detailed discussion of the different approaches to generating synthetic data.

5 Synthesis of the IEB

For our evaluations, we selected individuals living in Bavaria and deleted all observations with missing information in any of the variables. Since most of the variables such as wage or distance to work are only observed if the individual is employed at the reference date, the final dataset consisting of 3,333,998 records represents the working population in Bavaria. We initially only synthesized the geocoding information for the place of living. If the disclosure risks based on this strategy are deemed too high it is straightforward to use the synthesis models described above to synthesize additional variables in the dataset. We explore the impacts of additional synthesis on risk and utility in Section 6.3.

Running the synthesis models on the entire dataset would be prohibitive due to the size of the data, so we clustered all the observations based on their geographic locations into 222 similar-sized clusters, containing 15,000 records each (except for the last cluster which contains 18,998 records) and ran the synthesis models separately for each cluster. We achieved the clustering using the MDAV (maximum distance to average record) algorithm (Domingo-Ferrer and Mateo-Sanz, 2012). Since the synthesis models are independent between the clusters it is possible to run the synthesis for each cluster in parallel. This way the synthesis procedure becomes scalable even if the entire dataset for Germany with more than 36 million records should be synthesized. Of course running the synthesis only within clusters also implies an increased risk of disclosure since the outcome space of the synthetic geocodes is bounded by the geocodes observed within the cluster. We take this into account when evaluating the risks of disclosure in Section 6.2.
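
As a rough illustration of how such a cluster-by-cluster synthesis can be parallelized, the snippet below assumes a cluster identifier dat$cluster produced by MDAV and the synthesize_geo() sketch from Section 4; neither is the actual production code.

    # Run the synthesis independently for each MDAV cluster, in parallel.
    library(parallel)

    syn_clusters <- mclapply(split(dat, dat$cluster), function(d) {
      d$geo <- synthesize_geo(d)        # per-cluster synthesis (sketch from Section 4)
      d
    }, mc.cores = 8)

    syn_dat <- do.call(rbind, syn_clusters)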

We ran the MCMC sampler for the DPMPM synthesizer for 10,000 iterations, treating the first 5,000 iterations as burn-in and storing only every 10th iteration thereafter to reduce the correlation between successive draws. The burn-in draws are discarded to ensure that the sampler has converged before any parameters are used for synthesis. To monitor convergence and autocorrelation, we focused on a parameter that is not subject to label switching. We used the Geweke and the Heidelberger and Welch diagnostics and inspected the autocorrelation function to evaluate the behaviour of the MCMC sampler for this parameter. For those clusters for which our evaluation criteria indicated that the successive draws of the sampler were still correlated we reran the MCMC sampler storing only every 50th iteration. We did not find any problematic cases based on our evaluation criteria after this extra step. We fixed the maximum number of allowed latent classes in advance; the posterior mean of the number of occupied classes is 57.8 with a 95% central interval of (51, 64) and a maximum of 73.
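
The convergence checks described above could be reproduced along the following lines with the coda package, assuming draws is a numeric vector of stored post-burn-in draws of one scalar parameter; this is an illustrative setup, not the exact code used for the IEB.

    # Convergence and autocorrelation diagnostics for a single scalar parameter.
    library(coda)

    ch <- mcmc(draws)
    geweke.diag(ch)     # Geweke test: compares means of early and late parts of the chain
    heidel.diag(ch)     # Heidelberger and Welch stationarity and halfwidth test
    acf(draws)          # autocorrelation of the stored (thinned) draws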

For the CART synthesizer we used the default values implemented in the rpart package in R (Therneau et al., 2015) for two of the three tuning parameters that control the size of the trees that are grown: The default for the minimum number of observations that must exist in a node in order for a split to be attempted is 20 and the default for the minimum number of observations in any terminal leaf is seven. However, we set the third tuning parameter, the complexity parameter, to a very small value (the rpart default is 0.01). Any split that does not decrease the overall lack of fit by at least this factor is not attempted. Since we are interested in preserving the relationships between the geocodes and the other variables in the dataset as closely as possible to obtain high analytical validity of the synthetic data, we chose a very low value for this parameter. We note that this parameter is the most useful of the three tuning parameters for balancing the analytical validity and the disclosure risk of the generated data. If the risks are considered too high the synthesis could be repeated using a larger value.

For the continuous synthesizer we initially evaluated two synthesis sequences: synthesizing the longitude before synthesizing the latitude conditional on the longitude and vice versa. However, we found that the analytical validity and disclosure risks were almost identical and thus we only report the results for the longitude-then-latitude synthesis order below.

We generated synthetic datasets for each cluster for all synthesizers.

6 Analytical validity and disclosure risk

6.1 Evaluation of the analytical validity of the generated data

Ideally, evaluations of the analytical validity should provide a single measure for the utility of the synthetic data. However, global measures that have been suggested in the literature to compare the original and the synthetic data, such as the Kullback-Leibler distance, are often not very informative. On the one hand it is not clear which value of the measure indicates an acceptable validity level. On the other hand, even if the global utility measure indicates high analytical validity, this does not necessarily hold for a specific model a potential analyst is interested in. Thus, outcome specific utility measures such as the confidence interval overlap suggested in Karr et al. (2006) are usually employed. The downside of these measures is that a high level of data utility for one model does not necessarily imply high levels of utility for other models of interest.

In this paper we try to address the limitations of both approaches by first providing some results for model specific utility measures and then moving on to global measures of the utility of the synthetic data. We note that the confidence interval overlap measure proposed by Karr et al. (2006) is not suitable for our application since the data comprise the entire population and thus there is no uncertainty in the results based on the original data.

6.1.1 Outcome specific utility measures

To evaluate the validity for specific outcomes we assume that the analyst is interested in looking at the data at a very detailed geographical level. Figures 2 and 3 provide results for two different outcomes computed at the ZIP code level. We note that the synthetic data contain geocodes on the same level of detail as in the original data. Thus, the analyst would be very flexible in defining the geographical area she is interested in. We chose the ZIP code level for three reasons: First, we need a sensible strategy to cluster all the data for Bavaria. The ZIP code offers a natural way of clustering the data on a very detailed level without the necessity of specifying arbitrary boundaries for setting up the clusters. Second, the number of records varies considerably between ZIP codes, ranging from just a few cases in rural areas to more than 10,000 cases for densely populated areas (the median size is 535 cases). Thus, the ZIP code level evaluation covers statistics which should be relatively easy to preserve (since they are based on a large number of cases) as well as statistics for rural areas which will be difficult to preserve. Finally, choosing the ZIP code level offers the convenience that maps can be generated relatively easily using any GIS software, since the geocodes of the ZIP code boundaries are directly available. We compute the ZIP codes for the synthetic data by identifying the closest geocode in the original data and transferring its ZIP code information. For the synthesizers which treat the geocode as categorical there will always be an exact match and thus the ZIP code will always be correct; this does not hold for the continuous synthesizer, for which the closest observed geocode might actually belong to a different ZIP code area. However, given the detailed grid provided by the original data, we believe these cases are rare and thus do not justify the more labor intensive task of identifying the correct ZIP code directly from the synthetic geocodes.
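
A brute-force version of this ZIP code transfer could look as follows; the object names (syn_xy, orig_xy, orig_zip) are illustrative, and a KD-tree based nearest-neighbour search would be preferable for data of this size in practice.

    # Assign each synthetic geocode the ZIP code of the nearest original geocode.
    # syn_xy, orig_xy: two-column matrices of coordinates in meters; orig_zip: ZIP codes.
    nearest_zip <- function(syn_xy, orig_xy, orig_zip) {
      apply(syn_xy, 1, function(p) {
        d2 <- (orig_xy[, 1] - p[1])^2 + (orig_xy[, 2] - p[2])^2
        orig_zip[which.min(d2)]
      })
    }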

Figure 2 depicts the share of high wage earners in Bavaria, where high wages are defined as wages above the 70th percentile of the wage distribution in Bavaria. Figure 3 depicts the share of foreigners. Both figures show the results based on the original data and the results based on the three different synthesizers.


Figure 2: Share of high wage earners in Bavaria by ZIP code level.

Figure 3: Share of foreigners in Bavaria by ZIP code level.

All synthesizers preserve the wage distribution rather well. The results are more diverse for the foreigners. This is not surprising given that by definition 30% of the records in the data are high wage earners whereas only 6.17% of the individuals are foreigners, i.e., the distribution is far more skewed and thus more difficult to preserve. For both figures it is evident that the DPMPM synthesizer does not preserve the geographical heterogeneity as well as the CART synthesizers. The DPMPM model seems to smooth the distributions so that the shares are overestimated in those regions in which the shares are low in the original data. This is especially evident in Figure 3. The two CART models preserve the wage distribution similarly well. However, the continuous CART model fails to preserve the distribution of foreigners. As with the DPMPM model, the distribution is smoothed out compared to the distribution in the original data, although to a lesser extent than with the DPMPM model. For the categorical CART model we do not find any substantial differences between the original data and the synthetic data, indicating a very high level of analytical validity.

We also consider Ripley's K-function (Ripley, 1976, 1977), a measure of spatial dependence based on the expected number of further events within a given radius of an arbitrary event (Quick et al., 2015; Shirota and Gelfand, 2017). In particular, we consider the multitype K-function $K_{i\bullet}(r)$, which counts the expected number of other points of the process within a given distance $r$ of a point of type $i$ (Lotwick and Silverman, 1982). It can be estimated as

$$\hat{K}_{i\bullet}(r) = \frac{|D|}{n\,n_i} \sum_{k:\, m_k = i}\; \sum_{l \neq k} \mathbf{1}\{d(s_k, s_l) \leq r\},$$

where $|D|$ is the area of the spatial domain, $n$ is the total number of points and $n_i$ is the number of points of type $i$.

To measure spatial dependence, it is common to compute the L-function

$$L_{i\bullet}(r) = \sqrt{K_{i\bullet}(r)/\pi} - r,$$

where positive values of $L_{i\bullet}(r)$ indicate spatial clustering.

We use the R package spatstat (Baddeley et al., 2015) to calculate the L-functions for the distributions of 1) individuals with no vocational training, and 2) individuals aged 60 and older, in Nuremberg, Bavaria. For each synthesizer, we calculate the L-functions for the same range of r values in each of the synthetic datasets and average them across datasets. Figure 4 compares the resulting curves to the curve obtained using the original data.
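
A minimal sketch of this computation for one subgroup is given below, assuming x and y hold the geocodes (in meters) of the records in Nuremberg and no_training is a yes/no factor; the variable names and the rectangular window are illustrative simplifications.

    # Multitype K- and L-function for records without vocational training.
    library(spatstat)

    W  <- owin(range(x), range(y))                       # bounding window
    pp <- ppp(x, y, window = W, marks = factor(no_training))
    K  <- Kdot(pp, i = "yes", correction = "border")     # type "yes" to any other point
    L  <- sqrt(K$border / pi) - K$r                      # L(r) = sqrt(K(r)/pi) - r

    plot(K$r, L, type = "l", xlab = "r (meters)", ylab = "L(r)")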

All synthesizers preserve the spatial information rather well. The differences between the estimated values of L(r) from the original data and the synthetic data are small for both variables for all levels of r considered here. Still, the DPMPM results diverge most from the results based on the original data. The continuous CART synthesizer offers some improvements, while for the categorical CART synthesizer the results are basically indistinguishable from the results based on the original data. The results for Nuremberg are in line with the findings for Bavaria discussed above: In terms of preserving the analytical validity, the categorical CART synthesizer should always be preferred while the DPMPM synthesizer provides the least favorable results.

Figure 4: L-functions for individuals without vocational training and individuals aged 60 or older in the city of Nuremberg.

6.1.2 Global utility measures

Our approach to evaluating the global utility of the protected dataset is to compare relative frequencies for various cross tabulations of the variables contained in the original data and the synthetic data. Again, we choose the ZIP code as the level of geographical detail for the reasons laid out in Section 6.1.1. Specifically, for each ZIP code we first compute the relative frequencies for each cell entry for various cross classifications of all variables, i.e., all marginal distributions, two-way interactions, and three-way interactions. Then we evaluate how much these relative frequencies differ between the original data and the synthetic data.

Figure 5 contains the distribution of the differences across all clusters for different interaction levels. Since the utility measure effectively measures the difference in relative frequencies between the original data and the protected data, the higher the density of the distribution around zero the better the analytical validity. To get a one-number measure for the loss in data utility we also compute the average absolute values of the differences across all cells. These numbers are reported for each synthesizer in the upper left corner of each panel in Figure 5. The smaller the number, the better the utility.
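
The measure can be sketched in a few lines of R. The function below computes the average absolute difference in relative frequencies for one cross-classification within one ZIP code, assuming orig and syn are data frames with identical factor levels; the function and column names are illustrative.

    # Average absolute difference in relative cell frequencies for one cross-tabulation.
    avg_abs_diff <- function(orig, syn, vars) {
      f_orig <- prop.table(table(orig[, vars]))
      f_syn  <- prop.table(table(syn[, vars]))
      mean(abs(as.vector(f_orig) - as.vector(f_syn)))   # requires identical factor levels
    }

    # Example: two-way interaction of age and wage (hypothetical column names).
    # avg_abs_diff(orig_zip, syn_zip, c("age", "wage"))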


Figure 5: Global utility results. Distributions of differences in relative frequencies between the original data and the protected data. The numbers in the upper left corner represent the average absolute values of the differences for the different synthesizers.

The results are similar to the results regarding the outcome specific utility measures: The synthesizer based on the categorical CART model provides the best result followed by the CART model that treats the geocodes as continuous variables. The DPMPM model consistently performs the worst. Looking at the one-number measure the differences are quite dramatic: The values of the DPMPM model are always more than 1.7 times the values for the categorical CART model.

6.2 Evaluation of the disclosure risk for the generated data

The results from the previous section indicate that from a utility perspective the categorical CART model should always be preferred. However, when disseminating confidential data to the public, there is always a trade-off between data utility and disclosure risk. The categorical CART synthesizer will only be useful if the increase in the risks of disclosure that results from the increase in utility is deemed acceptable. Thus, we compare the disclosure risks of the various synthesizers in this section.

To evaluate disclosure risks, we compute probabilities of identification using methods developed in Reiter and Mitra (2009). A detailed description of the methodology is given in Appendix C. Here we only summarize the main ideas. Suppose the intruder has information on some target records that she will use in a record linkage attack to identify the targets in the released data. Similar to the concept of probabilistic record linkage, the idea is to estimate the probability of a match between each target record and each record in the released file. The record with the highest average matching probability across the synthetic datasets is declared the match.

In the evaluation, we use three risk measures that are summaries of these matching probabilities: the expected match risk, which gives the expected number of correctly declared matches; the true match rate, which gives the share of correct single matches among the target records; and the false match rate, which gives the share of erroneously declared single matches among all declared single matches.
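
Under the simplifying assumption that prob is a targets-by-records matrix of matching probabilities (already averaged across the synthetic datasets) and truth gives the column of each target's true record, the three summaries could be computed as follows; the estimation of the probabilities themselves follows Appendix C and is not reproduced here.

    # Expected match risk, true match rate (%), and false match rate (%).
    risk_summaries <- function(prob, truth) {
      per_target <- vapply(seq_len(nrow(prob)), function(t) {
        top <- which(prob[t, ] == max(prob[t, ]))         # records with maximal probability
        c(n_top = length(top), hit = as.integer(truth[t] %in% top))
      }, numeric(2))
      n_top <- per_target["n_top", ]
      hit   <- per_target["hit", ]
      list(expected_match_risk = sum(hit / n_top),
           true_match_rate     = 100 * mean(n_top == 1 & hit == 1),
           false_match_rate    = 100 * sum(n_top == 1 & hit == 0) / max(sum(n_top == 1), 1))
    }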

We note that the proposed risk measures assume that a potential intruder knows that the target records he or she is looking for are included in the database. This is reasonable in our application, since the IEB basically covers the entire population. If the IAB decides that only a sample of the synthetic IEB should be released in the end to further increase the confidentiality protection, the risk measures could easily be adjusted to account for the sampling uncertainty using the extensions proposed in Drechsler and Reiter (2008).

For our evaluations we assume that the intruder knows the exact geocode, sex, age category, industry of the employer, occupation and the information whether the individual is a foreigner or not and uses this information to try to identify the individuals in the database. We sample 100 records from each cluster and assume that these 22,200 records are the target records that the intruder tries to find in the data. For the geocode we assume that the intruder constructs grids of different size and considers all records that fall in the same grid as matches. We evaluate the risks for four different grids: 100x100, 1,000x1,000, 10,000x10,000, and 20,000x20,000 square meter grids. We also compute the risk measures if the intruder would match on the exact geocodes. To account for the increased risk from clustering the data before the synthesis, we block on the cluster when matching the target records. This is a conservative approach since the intruder would not be able to identify the clusters in the released data. Still, we believe most agencies would prefer being conservative instead of underestimating the risks of a data release strategy.
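
For the grid-based matching, the geocodes only need to be coarsened to a common cell identifier before the comparison; a simple illustration (the grid size g in meters and the coordinate vectors x and y are assumed names):

    # Map coordinates (in meters) to the identifier of the g x g grid cell they fall in.
    grid_id <- function(x, y, g) paste(floor(x / g), floor(y / g), sep = "_")

    # e.g. grid_id(x, y, 1000) groups records into 1,000 x 1,000 square meter cells.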

The estimated risks for the original data without the geocodes are as follows: expected risk: 1821.16, true rate: 2.56%, and false rate: 0. These risks serve as a benchmark since under the current regulations external researchers can access a 2% sample of the IEB (without geocodes) through the Research Data Center of the IAB already. So any synthetic data with similar risks should be considered sufficiently protected as long as the data can only be accessed under similar conditions.

                      categorical CART         continuous CART              DPMPM
Grid                 ER      TR     FR        ER      TR      FR        ER      TR     FR
Exact             4665.74  17.33  64.64      1.48    0.00  100.00      25.70   0.10  98.47
100x100           4459.14  16.49  67.07    172.59    0.66   93.97      95.78   0.36  97.98
1,000x1,000       3167.14  11.55  82.58   1361.57    4.81   90.36     769.03   2.46  94.69
10,000x10,000     2226.34   7.32  85.82   2111.10    6.76   86.38    1585.18   5.16  91.20
20,000x20,000     2054.49   5.82  81.18   2009.45    5.42   81.26    1733.98   5.04  86.76
Table 2: Expected match risk (ER), true match rate (TR, in %), and false match rate (FR, in %) for the categorical CART, continuous CART, and DPMPM synthesizers and various grid sizes (in square meters)

Table 2 presents the results for the different scenarios. As expected, the risks increase with increasing analytical validity. The DPMPM synthesizer, which showed the lowest analytical validity, is associated with the lowest disclosure risks for all matching strategies, except for the matching on exact geocodes, where the risks associated with the continuous CART are even lower. The categorical CART synthesizer with its very high analytical validity also leads to the highest risk of disclosure.

For the DPMPM and the continuous CART synthesizer the risks are increasing with increasing grid size (except for a decrease for the continuous CART when going from the 10,000x10,000 grids to the 20,000x20,000 grids). Arguably, for these two synthesizers the risks do not increase substantially compared to the risks for the data without geocodes. The expected risks are slightly larger for the continuous CART if the intruder uses very large grids (10,000x10,000 square meters or larger) for the matching, with a maximum risk of 2111.1 compared to 1821.16 for the data without geocodes. The percentage of correct unique matches (the true match rate) is also larger for both the continuous CART and the DPMPM if the intruder chooses grid sizes of 10,000x10,000 square meters or larger, with a maximum of 6.76% for the continuous CART and 5.16% for the DPMPM (2.56% for the original data without geocodes). However, these increased risks are balanced by the fact that (a) the intruder would not know which grid size to pick and (b) most of the unique matches would actually be false matches (always more than 81% for the continuous CART and more than 86% for the DPMPM synthesizer). For the original data without geocodes the false match rate has to be zero by definition since the intruder can always be sure that he or she identified the correct individual if a unique match is found.

The results are different for the categorical CART synthesizer. Here, the risks decrease monotonically with increasing grid size and both the expected risk measure (with a maximum of 4665.74) and the true match rate (with a maximum of 17.33) indicate a substantial increase in risk compared to releasing the data without geocoding information.

6.3 Strategies to further increase the level of protection

If the agency feels that these risk increases are not counterbalanced by the fact that the intruder will be wrong in at least 64% of the cases when he or she declares a unique match to be the correct match, the agency would have three options: (a) use one of the other synthesizers and accept the loss in analytical validity, (b) repeat the synthesis for the categorical CART synthesizer increasing the complexity parameter to grow smaller trees, or (c) use additional measures to further protect the data.

We evaluated the last option based on two different approaches. The first approach aggregates the geographical information to a higher level. The second approach synthesizes additional variables in the dataset. We only present the main findings for the categorical CART synthesizer here for brevity. Detailed results can be found in the online supplement.

Aggregating the geographical information is a classical data protection measure that is commonly employed by statistical agencies before data dissemination. Thus, we evaluated how the risk-utility profile changes if the detailed geographical information is aggregated to a higher level before the synthesis. Specifically, we looked at the impacts of using 10, 100, and 1,000 square meter grids instead of the exact geocodes. We found that the risks remain constant for the 10 square meter grids and actually increase if 100 square meter grids are used. The risks decrease for the 1,000 square meter grids but are still substantially higher than the risks from the original data without geocodes (for example, the expected risk is 3717.65 compared to 1821.16 for the original data). At the same time the loss in analytical validity, as expressed by the global utility measure, increases when moving from synthesizing the exact geocodes to synthesizing 1,000 square meter grids.

As an alternative the IAB could decide to synthesize more variables, if the level of protection from synthesizing the geocodes is not deemed to be sufficient. We looked at different synthesis strategies starting with synthesizing one additional variable (age), successively adding occupation and foreign status to the list of synthesized variables. In the final setting we synthesized all variables except for wage and sex. As expected both risk and utility will decrease with increasing amounts of synthesis. However, the decreases in risk are more pronounced. While the UL measure only increases by 4.1% to 7.9% for the first three scenarios and 14.8% for the scenario in which almost all variables are synthesized, the expected match risk decreases between 60.9% and 85.6% for the first three scenarios and 99.7% for the final scenario. We also find that the analytical validity of the categorical CART synthesizer when synthesizing almost all variables in the dataset is still higher than if any of the other synthesizers considered in the previous section would be used for synthesizing the geocode alone. Again, we refer to the online supplement for a detailed discussion of the results.
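
One convenient way to implement such sequential synthesis of additional variables is the synthpop package; the snippet below is only an illustration of the idea, as the paper does not state which implementation was used, and the variable selection and m = 5 are arbitrary.

    # Synthesize age, occupation and foreign status with CART; keep sex and wage unchanged.
    library(synthpop)

    vars <- c("sex", "wage", "age", "occupation", "foreign")
    sds  <- syn(dat[, vars],
                method = c("", "", "cart", "cart", "cart"),   # "" = released unchanged
                m = 5)                                        # number of synthetic datasets

    str(sds$syn)   # list of m synthetic data frames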

7 Discussion

Access to detailed geographical information is desirable in many situations. However, providing this access is challenging in practice since detailed geographical information will substantially increase the risks of reidentification for two reasons: First, the geographical information for the target records the intruder is looking for is usually easily available. Second, if the information is sufficiently detailed, only a few individuals will share the same value. In fact, if detailed geocodes are released unaltered, they can be viewed as direct identifiers since they will often uniquely identify individuals in the database. For these reasons, innovative data protection procedures offering a very high level of protection are required if detailed geographical information is to be released. In our view, the synthetic data approach is the only viable solution for this endeavor.

We used three different synthesizers in this paper for synthesizing the geocoding information and evaluated how well they address the common dilemma in data confidentiality: the trade-off between disclosure risk and analytical validity. The continuous CART synthesizer has been suggested before to protect detailed geocoding information (Wang and Reiter, 2012). The DPMPM synthesizer gained a lot of attention recently due to its improved performance compared to other parametric approaches for categorical variables (Si and Reiter, 2013; Hu et al., 2014). However, we found that using CART models and treating the geocodes as categorical variables provided the best results. The DPMPM model generally failed to preserve the analytical validity and while the continuous CART synthesizer performed better than the DPMPM model, the validity was still considerably lower than the validity of the categorical CART synthesizer. The relatively poor performance of the DPMPM might be due to the fact that the DPMPM approach tries to model the full joint distribution of all the variables although we actually only need a good model for the conditional distribution of the geocodes given all the other variables in our synthesis application. This is exactly what the CART-based approaches try to model and it might be easier to capture this conditional distribution instead of the full joint distribution. Comparing the two CART approaches, the continuous CART synthesizer might be inferior in terms of analytical validity because geographical proximity might not be a good measure for setting up the splitting rules (with continuous variables CART will try to minimize the variance in each leaf, which might not be a useful criterion with geocodes).

The continuous CART has the additional disadvantage that it can produce implausible geocodes, such as places of living in the middle of a lake, in a forest, or in industrial areas. This cannot happen with the categorical CART since the geocodes are modeled as categorical and thus only geocodes that were observed in the original data could appear in the synthetic data. This could be problematic in some applications in which the collected data only comprises a sample of the population and the information that an individual participated in the survey is already confidential. Releasing unaltered geocodes would put some individuals at risk even if the synthetic geocodes were attached to different units since the information would be revealed that someone living at a specific geographical location included in the data must have participated. However, since the IEB covers the entire population, this is less problematic in our context. All individuals from the population should be included in this database and the information that someone lives at a specific location is generally not confidential.

Nevertheless, unsurprisingly, the increased validity of the categorical CART approach comes at the price of increased risks of disclosure. However, we found that at least in this application the agency would be better off synthesizing more variables instead of relying on one of the other synthesizers or aggregating the geocodes. We note that our risk evaluations did not consider the fact that samples of the IEB without detailed geographical information have been disseminated previously. Potentially, intruders could use this information when trying to re-identify units in the synthetic data. However, these data are only available to the scientific community and underwent several anonymization procedures. We feel that our assumption that the intruder knows the exact geocode, sex, age category, industry of the employer, occupation, and whether the individual is a foreigner or not is already conservative. The only additional variables available in the previously released data are education and wage, but with the low level of geographical detail in these data it would not be possible for the intruder to obtain record level information for these variables for the target records. We thus expect that the increase in risk would be negligible.

Based on our findings we provide the following general recommendations for statistical agencies considering the synthetic data approach for disseminating detailed geocoding information:

  1. The categorical CART synthesizer should be used in preference to the DPMPM and the continuous CART since it preserves the analytical validity better than the other synthesizers considered in this paper.

  2. If the risks based on the synthetic geocodes are deemed too high, the agency should consider synthesizing more variables, since disclosure risks drop quickly with the amount of synthesis, often with only a small sacrifice in analytical validity.

  3. Aggregating the geographical information before the synthesis is not recommended, as it offers only little protection unless the level of aggregation is large, in which case the loss of information will be substantial.

  4. If the database is large, clustering the data before the synthesis is recommended since the increases in risk are moderate if sufficiently large clusters are selected while huge efficiency gains are possible by parallelizing the synthesis of the clusters.

An alternative approach for increasing the level of protection that we did not evaluate in this paper would be to increase the level of the complexity parameter for the CART models. Increasing the value would result in smaller trees which will increase the uncertainty in the generated data since more records would end up in the same terminal leaf from which the synthetic records would be drawn. However, we expect that the trade-off between risk and validity will be less favorable compared to increasing the amount of synthesis. If smaller trees are grown some relationships in the data necessarily are no longer reflected in the synthetic data. If more variables are synthesized based on accurate synthesis models we expect the information loss to be less substantial. However, evaluating the advantages and disadvantages of the two strategies would be an interesting area of future research.

We also note that we arbitrarily picked the additional variables to be synthesized in Section 6.3. This strategy could certainly be improved depending on the goals of the agency. If the agency wants to maximize the level of protection offered it would be advisable to start with those variables that impose the highest risks of disclosure, for example, variables with large numbers of categories or sparsely populated categories. If the goal is to minimize the negative impacts on validity, the agency should pick the variable for which the CART model provides the best fit. Ideally, the agency would evaluate various synthesis combinations and pick the one that best addresses the risk-utility tradeoff according to the requirements set up by the agency. Only this model would then be used to generate the synthetic datasets which would then be disseminated to the public.

Regarding the actual implementation of the synthetic data approach for the IEB data, the initial results presented in this paper are encouraging enough to consider the approach a realistic strategy for disseminating the data. Still, several problems need to be tackled before the approach can be used in practice: The set of variables will have to be expanded to satisfy user needs (the original data contain more than 50 variables). In this case it might no longer be sufficient to only protect the geographical information. In principle, the synthesis approach could easily be extended by simply synthesizing additional variables as illustrated in Section 6.3. For example, if income information were included on the original scale and not discretized, regression trees could be used to protect the detailed income information (see Reiter (2005) for details). However, modeling some of the variables included in the IEB, such as the beginning and ending dates of each employment spell, is a challenging task. Furthermore, the exact geocodes are not only available for the place of living but also for the place of work. Providing access to both geocodes would allow addressing many additional research questions, for example regarding commuting behaviour. The synthesis methods would have to be adapted in this case to deal with this additional information. Simply fitting classification trees using the place of living as an additional predictor when synthesizing the place of work will not be an option given the amount of detail contained in the former. At the same time releasing both geocodes will substantially increase the risk of disclosure. Addressing all these issues will be an interesting area of future research. Whether the synthesizer should be extended first to address all these problems or whether a substantially reduced subset of variables should be released at this stage still needs to be decided.

References

  • Baddeley et al. (2015) Baddeley, A., Rubak, E., and Turner, R. (2015). Spatial Point Patterns: Methodology and Applications with R. Chapman and Hall/CRC Press, London.
  • Berk (2008) Berk, R. (2008). Statistical Learning from a Regression Perspective. New York: Springer.
  • Besag et al. (1991) Besag, J., York, J., and Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43, 1–20.
  • Domingo-Ferrer and Mateo-Sanz (2012) Domingo-Ferrer, J. and Mateo-Sanz, J. (2012). Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. on Knowl. and Data Eng 14, 1, 189–201.
  • Drechsler (2011) Drechsler, J. (2011). Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation. Lecture Notes in Statistics 201. New York: Springer.
  • Drechsler and Reiter (2010) Drechsler, J. and Reiter, J. (2010). Sampling with synthesis: A new approach for releasing public use census microdata. Journal of the American Statistical Association 105, 1347–1357.
  • Drechsler and Reiter (2011) Drechsler, J. and Reiter, J. (2011). An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Computational Statistics & Data Analysis 55, 3232–3243.
  • Drechsler and Reiter (2008) Drechsler, J. and Reiter, J. P. (2008). Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In J. Domingo-Ferrer and Y. Saygin, eds., Privacy in Statistical Databases, 227–238. New York: Springer.
  • Dunson and Xing (2009) Dunson, D. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association 104, 487, 1042–1051.
  • Ganzer et al. (2017) Ganzer, A., Schmucker, A., vom Berge, P., Wurdack, A., et al. (2017). Sample of Integrated Labour Market Biographies - Regional File 1975-2014 (SIAB-R 7514). Tech. rep., Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg [Institute for Employment Research, Nuremberg, Germany].
  • Hu et al. (2014) Hu, J., Reiter, J. P., and Wang, Q. (2014). Disclosure risk evaluation for fully synthetic categorical data. In J. Domingo-Ferrer, ed., Privacy in Statistical Databases, no. 8744 in Lecture Notes in Computer Science, 185–199. Springer, Heidelberg.
  • Ishwaran and James (2001) Ishwaran, H. and James, L. F. (2001). Gibbs sampling for stick-breaking priors. Journal of the American Statistical Association 96, 161–173.
  • Jacobebinghaus and Seth (2010) Jacobebinghaus, P. and Seth, S. (2010). Linked-Employer-Employee-Daten des IAB: LIAB-Querschnittmodell 2, 1993-2008. Tech. rep., FDZ-Datenreport 5.
  • Karr et al. (2006) Karr, A., Kohnen, C., Oganian, A., Reiter, J., and Sanil, A. (2006). A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician 60, 224–232.
  • Lotwick and Silverman (1982) Lotwick, H. W. and Silverman, B. W. (1982). Methods for analysing spatial processes of several types of points. Journal of the Royal Statistical Society, Series B 44, 406–413.
  • Machanavajjhala et al. (2008) Machanavajjhala, A., Kifer, D., Abowd, J. M., Gehrke, J., and Vilhuber, L. (2008). Privacy: Theory meets practice on the map. In IEEE 24th International Conference on Data Engineering, 277–286.
  • Manrique-Vallier and Reiter (2014) Manrique-Vallier, D. and Reiter, J. P. (2014). Bayesian estimation of discrete multivariate latent structure models with structural zeros. Journal of Computational and Graphical Statistics 23, 4, 1061–1079.
  • Paiva et al. (2014) Paiva, T., Chakraborty, A., Reiter, J., and Gelfand, A. (2014). Imputation of confidential data sets with spatial locations using disease mapping models. Statistics in Medicine 33, 11, 1928–1945.
  • Quick et al. (2018) Quick, H., Holan, S. H., and Wikle, C. K. (2018). Generating partially synthetic geocoded public use data with decreased disclosure risk using differential smoothing. Journal of the Royal Statistical Society, Series A 181, 649–661.
  • Quick et al. (2015) Quick, H., Holan, S. H., Wikle, C. K., and Reiter, J. P. (2015). Bayesian marked point process modeling for generating fully synthetic public use data with point-referenced geography. Spatial Statistics 14, 439–451.
  • Raghunathan et al. (2001) Raghunathan, T. E., Lepkowski, J. M., van Hoewyk, J., and Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a series of regression models. Survey Methodology 27, 85–96.
  • Reiter and Mitra (2009) Reiter, J. and Mitra, R. (2009). Estimating risks of identification disclosure in partially synthetic data. Journal of Privacy and Confidentiality 1, 99–110.
  • Reiter (2005) Reiter, J. P. (2005). Using CART to generate partially synthetic, public use microdata. Journal of Official Statistics 21, 441–462.
  • Ripley (1976) Ripley, B. D. (1976). The second-order analysis of stationary point processes. Journal of Applied Probability 13, 255–266.
  • Ripley (1977) Ripley, B. D. (1977). Modelling spatial patterns. Journal of the Royal Statistical Society, Series B 39, 172–212.
  • Rubin (1981) Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130–134.
  • Sakshaug and Raghunathan (2010) Sakshaug, J. W. and Raghunathan, T. E. (2010). Synthetic data for small area estimation. In J. Domingo-Ferrer and E. Magkos, eds., Privacy in Statistical Databases, 162–173. Springer Berlin Heidelberg.
  • Scholz et al. (2012) Scholz, T., Rauscher, C., Reiher, J., and Bachteler, T. (2012). Geocoding of German administrative data - the case of the Institute for Employment Research. Tech. rep., FDZ-Methodenreport 9.
  • Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.
  • Shirota and Gelfand (2017) Shirota, S. and Gelfand, A. E. (2017). Approximate Bayesian computation and model assessment for repulsive spatial point processes. Journal of Computational and Graphical Statistics 3, 646–657.
  • Si and Reiter (2013) Si, Y. and Reiter, J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics 38, 5, 499–521.
  • Therneau et al. (2015) Therneau, T., Atkinson, B., and Ripley, B. (2015). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-10.
  • Wang and Reiter (2012) Wang, H. and Reiter, J. (2012). Multiple imputation for sharing precise geographies in public use data. Annals of Applied Statistics 6, 229–252.

Appendix A The DPMPM Synthesizer

Following Hu et al. (2014), let the confidential data comprise $n$ individuals measured on $p$ categorical variables. For $i = 1, \dots, n$ and $j = 1, \dots, p$, let $x_{ij}$ denote the value of variable $j$ for individual $i$, and let $\mathbf{x}_i = (x_{i1}, \dots, x_{ip})$. Without loss of generality, assume that each $x_{ij}$ takes on values in $\{1, \dots, d_j\}$, where $d_j$ is the total number of categories for variable $j$. Effectively, the survey variables form a contingency table of $\prod_{j=1}^{p} d_j$ cells defined by cross-classifications of the $p$ variables. Let $X_j$ and $\mathbf{X} = (X_1, \dots, X_p)$ be random variables defined respectively on the sample spaces for $x_{ij}$ and $\mathbf{x}_i$.

We generate synthetic data using a finite number of mixture components in the DPMPM. Paraphrasing from Si and Reiter (2013), the finite DPMPM assumes that each individual belongs to exactly one of $F$ latent classes; see Si and Reiter (2013) for advice on determining $F$. For $i = 1, \dots, n$, let $z_i \in \{1, \dots, F\}$ indicate the class of individual $i$, and let $\pi_f = \Pr(z_i = f)$. We assume that $\boldsymbol{\pi} = (\pi_1, \dots, \pi_F)$ is the same for all individuals. Within any class, each of the $p$ variables independently follows a class-specific multinomial distribution, so that individuals in the same latent class have the same cell probabilities. For any value $c \in \{1, \dots, d_j\}$, let $\phi^{(f)}_{jc} = \Pr(x_{ij} = c \mid z_i = f)$ be the probability of $x_{ij} = c$ given that individual $i$ is in class $f$. Let $\boldsymbol{\phi}$ be the collection of all $\phi^{(f)}_{jc}$. The finite mixture model can be expressed as

$x_{ij} \mid z_i, \boldsymbol{\phi} \sim \text{Multinomial}(\phi^{(z_i)}_{j1}, \dots, \phi^{(z_i)}_{j d_j}) \quad \text{for all } i \text{ and } j,$   (1)
$z_i \mid \boldsymbol{\pi} \sim \text{Multinomial}(\pi_1, \dots, \pi_F) \quad \text{for all } i,$   (2)

where each multinomial distribution has sample size equal to one and the number of levels is implied by the dimension of the corresponding probability vector.

For prior distributions on $\boldsymbol{\phi}$ and $\boldsymbol{\pi}$, we use the truncated stick-breaking representation of Sethuraman (1994). We have

$\pi_f = V_f \prod_{g < f} (1 - V_g) \quad \text{for } f = 1, \dots, F,$   (3)
$V_f \sim \text{Beta}(1, \alpha) \quad \text{for } f = 1, \dots, F - 1, \text{ with } V_F = 1,$   (4)
$\alpha \sim \text{Gamma}(a_\alpha, b_\alpha),$   (5)
$\boldsymbol{\phi}^{(f)}_j = (\phi^{(f)}_{j1}, \dots, \phi^{(f)}_{j d_j}) \sim \text{Dirichlet}(a_{j1}, \dots, a_{j d_j}).$   (6)

We set $a_{j1} = \dots = a_{j d_j} = 1$ for all $j$ to correspond to uniform distributions. Following Dunson and Xing (2009) and Si and Reiter (2013), we set the hyperparameters $(a_\alpha, b_\alpha)$ to small values, which represents a small prior sample size and hence a vague specification for the Gamma distribution. In practice, we find these specifications allow the data to dominate the prior distribution. We estimate the posterior distribution of all parameters using a blocked Gibbs sampler (Ishwaran and James (2001), Si and Reiter (2013)).
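For illustration, the following R snippet draws one realization of $\boldsymbol{\pi}$ from the truncated stick-breaking construction in (3)–(4); the truncation level and concentration parameter are hypothetical values chosen for the example and not the settings used in our application.

# Illustrative draw of pi from the truncated stick-breaking prior (3)-(4);
# F_max and alpha are hypothetical values, not those of the application.
set.seed(1)
F_max <- 30                                 # truncation level F
alpha <- 0.25                               # concentration parameter (drawn from its Gamma prior in the full sampler)
V      <- c(rbeta(F_max - 1, 1, alpha), 1)  # V_f ~ Beta(1, alpha), with V_F = 1
pi_vec <- V * cumprod(c(1, 1 - V[-F_max]))  # pi_f = V_f * prod_{g < f} (1 - V_g)
sum(pi_vec)                                 # equals 1 by construction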

We illustrate how to generate one synthetic dataset assuming that only the $j$th variable $X_j$ should be synthesized. Thus, $\boldsymbol{\phi}_j$ contains all multinomial probabilities associated with $X_j$. To generate one partially synthetic dataset of size $n$, we first sample a value of the parameters $(\boldsymbol{\pi}, \boldsymbol{\phi}_j)$ from their respective posterior distributions. Using the drawn value of $\boldsymbol{\pi}$, we sample values of $z_i$ independently from (2). Using the sampled $z_i$, for each sampled $z_i = f$, where $i = 1, \dots, n$, we then sample the $i$th synthetic record, $\tilde{x}_{ij}$, from a multinomial distribution with probabilities $\phi^{(f)}_{jc}$ for $c = 1, \dots, d_j$. The synthesis can be conveniently implemented inside the blocked Gibbs sampler – after each Gibbs updating step, we simply sample and save draws of $\tilde{x}_{ij}$ for all $n$ records. To create $m$ synthetic datasets, one repeats this process $m$ times, using approximately independent draws of the parameters. Approximately independent draws can be obtained by using iterations that are far apart in the estimated MCMC chain.
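To make the generation step concrete, the following R sketch draws synthetic values of one categorical variable given a single posterior draw of $\boldsymbol{\pi}$ and of the class-specific probabilities for that variable; the function and argument names are hypothetical, and the parameter draws are assumed to be supplied by the blocked Gibbs sampler described above.

# Sketch: synthesize variable j for all n records given one posterior draw of
# pi (length F) and phi_j (an F x d_j matrix of class-specific category
# probabilities). All names are illustrative.
synthesize_dpmpm <- function(n, pi_draw, phi_j) {
  z   <- sample(length(pi_draw), size = n, replace = TRUE, prob = pi_draw)      # latent classes, eq. (2)
  d_j <- ncol(phi_j)
  vapply(z, function(f) sample(d_j, size = 1, prob = phi_j[f, ]), integer(1))   # category draws, eq. (1)
}
# e.g. one synthetic version of the geocode variable:
# x_syn <- synthesize_dpmpm(n = nrow(dat), pi_draw, phi_geo)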

If more than one variable should be synthesized, the DPMPM synthesizer can be implemented by generating each synthetic variable independently at the desired iterations. Suppose there are $q$ ($q \le p$) variables to be synthesized, and let $X_{s_k}$ ($k = 1, \dots, q$) represent the $k$th variable to be synthesized. Assume that the variables in the dataset are ordered so that the $p - q$ variables that remain unaltered appear first. After sampling $z_i = f$, we can sample the $i$th synthetic record, $(\tilde{x}_{i s_1}, \dots, \tilde{x}_{i s_q})$, from the corresponding multinomial distributions with probabilities $\phi^{(f)}_{s_k c}$ for $c = 1, \dots, d_{s_k}$. Note that the synthesis order does not matter, because each variable independently follows a multinomial distribution given the latent class assignment.

Appendix B The CART Synthesizer

The following description of the CART synthesizer borrows heavily from Drechsler and Reiter (2011), and the interested reader is referred to this paper for more details on CART synthesizers and other machine learning approaches for data synthesis. Let $Y$ denote the variable to be synthesized. First, the agency fits the tree of $Y$ conditioning on all other variables in the dataset so that each leaf contains at least $k$ records; call this tree $\mathcal{Y}$. In general, we have found that using the default specification for the minimum leaf size $k$, which varies across software implementations, provides sufficient accuracy and reasonably fast running time. We cease splitting any particular leaf when the “impurity” in that leaf is less than some agency-specified threshold or when we cannot ensure at least $k$ records in each child leaf. The “impurity” basically measures the heterogeneity of the outcome variable in each leaf. For continuous variables the variance in each leaf is commonly used as an impurity measure; for categorical variables the Gini coefficient or the entropy are typically employed. We use the Gini coefficient in our application since it is recommended for categorical variables with more than two categories (Berk, 2008). See, for example, Berk (2008) for a more detailed discussion of impurity measures. For all records in the original data, we trace down the branches of $\mathcal{Y}$ until we find each record’s terminal leaf. Let $L_w$ be the $w$th terminal leaf in $\mathcal{Y}$, and let $Y_{L_w}$ be the values of $Y$ in leaf $L_w$. For all records whose terminal leaf is $L_w$, we generate replacement values of $Y$ by drawing from $Y_{L_w}$ using the Bayesian bootstrap (Rubin, 1981). Repeating the Bayesian bootstrap for each leaf of $\mathcal{Y}$ provides one synthetic dataset. We repeat this process $m$ times to generate $m$ datasets with synthetic values for $Y$.
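A minimal R sketch of this procedure for a single categorical variable, using the rpart package and the Bayesian bootstrap within each terminal leaf, is given below; the data frame, column name, and minimum leaf size are illustrative placeholders rather than the settings used in our application.

# Sketch of the CART synthesizer for one categorical variable Y (a factor) in
# the data frame dat; all names and control settings are illustrative.
library(rpart)

bayes_boot <- function(donors, n_draws) {
  # Bayesian bootstrap (Rubin, 1981): Dirichlet(1, ..., 1) weights on the donors
  w <- rgamma(length(donors), shape = 1)
  sample(donors, size = n_draws, replace = TRUE, prob = w / sum(w))
}

synthesize_cart <- function(dat, min_leaf = 5) {
  tree <- rpart(Y ~ ., data = dat, method = "class",
                control = rpart.control(minbucket = min_leaf, cp = 0))
  leaf  <- tree$where                 # terminal leaf of each original record
  y_syn <- dat$Y
  for (l in unique(leaf)) {
    idx        <- which(leaf == l)
    y_syn[idx] <- bayes_boot(dat$Y[idx], length(idx))  # replacements drawn leaf by leaf
  }
  y_syn
}

Calling synthesize_cart() $m$ times, each time with fresh Bayesian bootstrap draws, yields the $m$ synthetic versions of $Y$.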

If more than one variable should be synthesized, a sequential regression multivariate imputation approach (SRMI, Raghunathan et al. (2001)) can be used. In such cases, for an arbitrary ordering of the variables to be synthesized, the agencies can proceed as follows. Let $Y_k$ represent the $k$th variable in the synthesis order and let $X$ be all variables with no values replaced.

  1. Run the CART algorithm to regress $Y_1$ on $X$ only. Replace $Y_1$ by synthetic values using the corresponding synthesizer for $Y_1$. Let $Y_1^{rep}$ be the replaced values of $Y_1$.

  2. Run the algorithm to regress $Y_2$ on $(X, Y_1)$ only. Replace $Y_2$ with synthetic values using the corresponding synthesizer for $Y_2$. Use the values of $X$ and $Y_1^{rep}$ for predicting new values of $Y_2$. Let $Y_2^{rep}$ be the replaced values of $Y_2$.

  3. For each $Y_k$ with $k = 3, \dots, q$, run the algorithm to regress $Y_k$ on $(X, Y_1, \dots, Y_{k-1})$. Replace each $Y_k$ using the appropriate synthesizer based on the values in $(X, Y_1^{rep}, \dots, Y_{k-1}^{rep})$.

The result is one synthetic dataset. These three steps are repeated for each of the $m$ synthetic datasets, and these $m$ datasets are released to the public.
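As an illustration of step 2, the sketch below fits the CART model on the original values but uses the already replaced values $Y_1^{rep}$ when assigning records to the terminal leaves from which the replacement values of $Y_2$ are drawn; it reuses the bayes_boot() helper from the previous sketch and relies on the partykit package to obtain terminal-node assignments for new data. All object names (dat, X1, X2, Y1_rep) are hypothetical.

# Sketch of step 2: fit on the original data, locate leaves using replaced Y1 values.
library(rpart)
library(partykit)   # terminal-node predictions for new data

tree  <- rpart(Y2 ~ ., data = dat[, c("Y2", "X1", "X2", "Y1")], method = "class",
               control = rpart.control(minbucket = 5, cp = 0))
ptree <- partykit::as.party(tree)

dat_rep    <- dat
dat_rep$Y1 <- Y1_rep                                       # replaced values of Y1
leaf_orig  <- predict(ptree, newdata = dat,     type = "node")
leaf_rep   <- predict(ptree, newdata = dat_rep, type = "node")

Y2_rep <- dat$Y2
for (l in unique(leaf_rep)) {
  donors      <- dat$Y2[leaf_orig == l]                    # original Y2 values in that leaf
  idx         <- which(leaf_rep == l)
  Y2_rep[idx] <- bayes_boot(donors, length(idx))
}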

Appendix C Methodology for Estimating the Risk of Disclosure

The description of the methodology follows the description given in Drechsler and Reiter (2010). Suppose the intruder has a vector of information, $t$, on a particular target unit in the population $P$. Let $t_0$ be the unique identifier of the target, and let $d_{j0}$ be the (not released) unique identifier for record $j$ in $D$, where $D$ denotes the synthetic data and $j = 1, \dots, n$. Let $M$ be any information released about the simulation models.

The intruder’s goal is to match unit $j$ in $D$ to the target when $d_{j0} = t_0$. Let $J$ be a random variable that equals $j$ when $d_{j0} = t_0$ for $j \in D$. The intruder thus seeks to calculate the $\Pr(J = j \mid t, D, M)$ for $j = 1, \dots, n$. Because the intruder does not know the actual values in $Y$, the original values of the synthesized variables, he or she should integrate over its possible values when computing the match probabilities. Hence, for each record we compute

$\Pr(J = j \mid t, D, M) = \int \Pr(J = j \mid t, D, Y, M) \, p(Y \mid t, D, M) \, dY.$   (7)

This construction suggests a Monte Carlo approach to estimating each $\Pr(J = j \mid t, D, M)$. First, sample a value of $Y$ from $p(Y \mid t, D, M)$. Let $Y^{new}$ represent one set of simulated values. Second, compute $\Pr(J = j \mid t, D, Y^{new}, M)$ using exact matching assuming $Y^{new}$ are collected values. This two-step process is iterated $R$ times, where ideally $R$ is large, and (7) is estimated as the average of the resultant $R$ values of $\Pr(J = j \mid t, D, Y^{new}, M)$. When $M$ has no information, the intruder treats the simulated values as plausible draws of $Y$.
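A Monte Carlo sketch of this computation in R could look as follows, where target is a one-row data frame with the intruder’s information, keys names the variables known to the intruder, and draw_Y() stands for a hypothetical routine returning one plausible draw of the unaltered values of the synthesized key variables.

# Sketch: estimate Pr(J = j | t, D, M) for one target by averaging
# exact-matching probabilities over R simulated draws of Y.
estimate_match_probs <- function(target, D, keys, draw_Y, R = 50) {
  probs <- matrix(0, nrow = nrow(D), ncol = R)
  for (r in seq_len(R)) {
    Dr  <- D
    sim <- draw_Y(D)                                   # one plausible draw of the unaltered values
    Dr[names(sim)] <- sim
    match_key <- do.call(paste, Dr[keys])              # exact matching on the key variables
    hits <- match_key == do.call(paste, target[keys])
    if (any(hits)) probs[hits, r] <- 1 / sum(hits)
  }
  rowMeans(probs)                                      # Monte Carlo estimate of (7) for each record
}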

Following Reiter (2005), we quantify disclosure risk with summaries of these identification probabilities. It is reasonable to assume that the intruder selects as a match for $t$ the record $j$ with the highest value of $\Pr(J = j \mid t, D, M)$, if a unique maximum exists. We consider three risk measures: the expected match risk, the true match rate, and the false match rate. Let $c_j$ be the number of records with the highest match probability for the target $t_j$; let $I_j = 1$ if the true match is among these $c_j$ units and $I_j = 0$ otherwise. The expected match risk equals $\sum_j I_j / c_j$. When $I_j = 1$ and $c_j > 1$, the contribution of unit $j$ to the expected match risk reflects the intruder randomly guessing at the correct match from the $c_j$ candidates. Let $K_j = 1$ when $c_j I_j = 1$ and $K_j = 0$ otherwise, and let $N$ denote the total number of target records. The true match rate equals $\sum_j K_j / N$, which is the percentage of true unique matches among the target records. Finally, let $F_j = 1$ when $c_j (1 - I_j) = 1$ and $F_j = 0$ otherwise, and let $s$ equal the number of records with $c_j = 1$. The false match rate equals $\sum_j F_j / s$, which is the percentage of false matches among unique matches.
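The three summaries can be computed as sketched below, assuming match_probs is a list whose $j$th element contains the estimated probabilities for target $j$ over all candidate records and true_idx gives the position of the true match for each target; both inputs are hypothetical.

# Sketch: expected match risk, true match rate, and false match rate computed
# from the estimated identification probabilities.
risk_summaries <- function(match_probs, true_idx) {
  per_target <- mapply(function(p, truth) {
    top <- which(p == max(p))              # records sharing the highest probability
    c_j <- length(top)
    I_j <- as.integer(truth %in% top)
    c(risk   = I_j / c_j,                  # contribution to the expected match risk
      true   = as.integer(c_j == 1 && I_j == 1),
      unique = as.integer(c_j == 1),
      false  = as.integer(c_j == 1 && I_j == 0))
  }, match_probs, true_idx)
  list(expected_match_risk = sum(per_target["risk", ]),
       true_match_rate     = 100 * mean(per_target["true", ]),
       false_match_rate    = 100 * sum(per_target["false", ]) /
                                   max(1, sum(per_target["unique", ])))
}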

Supplementary Material

In this online supplement we present detailed results regarding the impacts on risk and analytical validity if additional measures beyond synthesizing the geocodes are taken to further protect the data. Specifically, we look at two possible strategies: aggregating the geographical information to a higher level or synthesizing more variables. We only present results for the categorical CART synthesizer in this supplement, since the risk levels of the other two synthesizers are arguably already acceptable. However, we ran all simulations for all synthesizers and verified that the general findings regarding the relative performance of the three synthesizers remained the same. Detailed results for the other synthesizers can be obtained from the authors upon request.

Appendix D Aggregating the geocoding information

Aggregation of detailed geographical information is the classical approach that most statistical agencies choose when disseminating data to the public. In this section we evaluate the impact of this strategy if used in combination with data synthesis, i.e., we assume the original data are aggregated first before the aggregated information is synthesized in a second step. We expect two counterbalancing effects from this strategy. On the one hand, disclosure risks will generally decrease: with increasing aggregation, the number of individuals that share the same geographical information increases, and thus it will be more difficult for the intruder to uniquely identify individuals using this information. On the other hand, aggregating the geographical information implies that the number of levels for geography decreases. Since the selected synthesizer treats geography as a categorical variable, we expect that the fit of the CART model will improve and that more terminal nodes of the tree will contain only one geography, which in turn implies that the synthetic geography will match the true geography more often, thus offering less protection.

The impacts on analytical validity are more difficult to measure. The improved fit of the CART model will generally imply higher analytical validity, but obviously analysis on a very detailed level will no longer be possible. Moreover, even on the aggregated level, the boundaries of the geographic area the analyst is interested in will not necessarily coincide with the boundaries based on the aggregation level chosen by the agency.

We evaluate the impacts on validity and risk using three different aggregation levels: aggregation to 10, 100, and 1,000 square meter grids. The aggregated geocodes are obtained by flooring the latitude and longitude values according to the selected aggregation level. For measuring the analytical validity we use the global utility loss (UL) measure defined above. Since the grid levels do not necessarily match the ZIP code levels, we assume the analyst would simply assign the ZIP code that is closest to the released geocode. This strategy could be improved by sampling from all ZIP codes that fall in the same grid. However, we found that even for the 1,000 square meter grids less than 7.5% of the grids contained more than one ZIP code (the share was even smaller for the 100 square meter grids, and only 4 out of more than 1.6 million grids for the 10 square meter grids). Thus, we do not expect that improving the ZIP code assignment would have major impacts on the results.
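A minimal sketch of this aggregation step in R, assuming the geocodes are stored as projected coordinates in meters (the column names are hypothetical):

# Sketch: aggregate exact geocodes to square grids by flooring the coordinates
# to the chosen grid width (10, 100, or 1,000 meters).
aggregate_geocode <- function(coords, grid_m) {
  data.frame(x_grid = floor(coords$x / grid_m) * grid_m,
             y_grid = floor(coords$y / grid_m) * grid_m)
}
# e.g. aggregate_geocode(data.frame(x = 4471234.7, y = 5334567.2), grid_m = 100)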

Results for the different aggregation levels are presented in Figure 6. For the risk measures we only present the results assuming the intruder would pick the level of aggregation that leads to the maximum risk. We also include the utility and risk measures for the exact geocodes from the previous sections for comparison.

Figure 6: Global utility loss (UL), expected match risk, true match rate (in %), and false match rate (in %) for various aggregation levels

Figure 6 indicates that utility and risk remain more or less constant when moving from the exact geocodes to the 10 square meter aggregation. The analytical validity is highest if the geocodes are aggregated to 100 square meter grids, but the improvements in validity come at the price of an increased risk. Only for the 1,000 square meter grids do we see a decrease in the risk measures compared to the exact geocodes (the exception is the false match rate, which is slightly lower and thus indicates a slightly higher risk on this measure). But the risk levels are still considerably higher than those for the original data without geocodes (for these data the expected match risk and true match rate are 1821.16 and 2.56%, respectively). Thus, the agency would have to select even larger aggregation levels to sufficiently protect the data. However, the UL measure indicates that the analytical validity starts decreasing once the data are aggregated beyond the 100 square meter grids. We emphasize again that any analysis based on a finer level of detail than the ZIP code would likely be even more affected by this aggregation step.

Appendix E Synthesizing more variables

As an alternative approach for increasing the protection offered by the categorical CART synthesizer, we evaluate synthesizing more variables in this section. We start by synthesizing the age variable in addition to the geocode variable. Then we successively add occupation and foreign status to the list of synthesized variables. Finally, we evaluate a scenario in which all variables except sex and wage are synthesized. Note that in this scenario some variables are synthesized that are not known by the intruder according to our risk scenario. Thus, synthesizing these variables will only have negative impacts on our utility measures but will not decrease the risks. This setting can be seen as a conservative approach in which the agency expects that additional variables might become available to the intruder in the future even if they are not available now. We note that in all scenarios the sensitive wage information is not synthesized. Not synthesizing wage implies that the synthesis is only conducted to prevent reidentification. Obviously, the synthesis could also be used to directly protect the sensitive information in the data. However, different risk measures would be necessary to evaluate the success of this strategy; our risk measures focus on quantifying the risks of reidentification, and the results for these measures would not change if the wage information were also synthesized. Thus, we refrain from synthesizing the wage information in this evaluation. Results for the different amounts of synthesis are presented in Figure 7.

Figure 7: Global utility loss (UL), expected match risk, true match rate (in %), and false match rate (in %) for various amounts of synthesis

As expected, both analytical validity and disclosure risk decrease with increasing amounts of synthesis. However, the risks decrease more substantially. For example, if age is synthesized in addition to the geocode, the UL measure increases by 4.1%, while the expected match risk and the true match rate decrease by 61% and 63%, respectively, and the false match rate increases by 29%. The loss in validity is generally minor (never more than 8%), except for the last scenario, in which almost all variables are synthesized, leading to negligible risk. But even in this scenario, the UL measure is substantially lower than for the other two synthesizers, even if we assume, as in Section 6.1, that these synthesizers would only be used to synthesize the geocode. We also note that the analytical validity when age, occupation, and foreign status are synthesized is comparable to that obtained when all geocodes are aggregated to 1,000 square meter grids; however, the risks of disclosure are substantially smaller. Finally, the expected match risk for the scenario in which only age and the geocodes are synthesized is comparable to the risk under the current access model, i.e., a dataset without geocodes and with no alteration of any of the other variables (1820.46 versus 1821.16). Although the percentage of correct unique matches among the target records (the true match rate) is larger (6.47% versus 2.56%), most of the declared unique matches (83.58%) would be wrong (the false match rate). Given that, under the current access strategy, the intruder would know that all unique matches are correct matches, i.e., the false match rate would be zero, we consider the overall risks to be lower if age and the geocodes are synthesized. Thus, from a risk perspective it seems justifiable for the IAB to provide access to a 2% sample of the data in which only the geocodes and the age variable are synthesized and the geocoding information is not aggregated. In this scenario the estimated risks are lower than those for a dataset that can already be accessed on the premises of the IAB. Alternatively, the IAB could opt to synthesize more variables and release a larger sample instead.