## 1 Introduction

Statistical agencies collect respondent-level data, also known as microdata, from households and business establishments through survey instruments. Based on the collected microdata, agencies disseminate public use microdata files. Quantitative researchers who require suitable microdata for their research projects can then gain access to the public use microdata files they need. Findings from these projects in turn help policy makers make data-driven decisions, and help citizens better understand their communities. In short, the dissemination of microdata to the public by statistical agencies brings great benefits.

However, when disseminating public use microdata files, statistical agencies are under legal obligation to protect privacy and confidentiality of survey respondents (U.S. Title 13 and Title 26). Therefore, the collected microdata has to undergo masking or model smoothing procedures before being released to the public.

Differential privacy is a formal privacy guarantee that is typically achieved by adding random noise. It has mostly been developed for aggregated data formats, such as tables (Dwork et al., 2006). For microdata dissemination, the synthetic data approach is a promising strategy (Rubin, 1993; Little, 1993; Raghunathan et al., 2003; Reiter and Raghunathan, 2007; Drechsler, 2011). Statistical agencies develop Bayesian statistical models on the original, confidential data. They simulate records (i.e., microdata) from the posterior predictive distribution of the estimated Bayesian models, and release the synthetic microdata to the public. The disseminated synthetic data can preserve high utility while keeping disclosure risks low, as demonstrated in the literature, recently by Hu et al. (2018); Manrique-Vallier and Hu (2018); Drechsler and Hu (2018); Hu and Savitsky (2018). Public use synthetic microdata products include the synthetic Longitudinal Business Database (Kinney et al., 2011, 2014), the Survey of Income and Program Participation (Benedetto et al., 2013), and OnTheMap (Machanavajjhala et al., 2008) by the U.S. Census Bureau, the IAB Establishment Panel (Drechsler et al., 2008a, b) in Germany, and synthetic business microdata disseminated by the Canadian Research Data Centre Network.

There has been a large amount of work on developing synthesis models that achieve a high level of utility of synthetic data, and on developing utility measures of synthetic data (Karr et al., 2006; Snoke et al., 2018). The literature has largely shown that if the synthesis models, also known as synthesizers, are carefully designed and tailored to the confidential microdata, the simulated synthetic data will maintain high utility. However, synthetic data with high utility usually comes with high disclosure risks. When disclosure risks are deemed too high, commonly used strategies include synthesizing more variables, and synthesizing at a more aggregated level of the variables (Drechsler and Hu, 2018).

Hu and Savitsky (2019) proposed a new approach to offering further privacy protection when disclosure risks of the synthetic data are deemed too high. Starting with a synthesizer producing high utility, their method evaluates the identification risk of *each* record given the synthetic data. Let $IR_i$ denote the identification risk for record $i$. The $IR_i$ is a marginal probability of identification risk for record $i$, $IR_i \in [0, 1]$, and the closer the value is to 1, the higher the identification risk (i.e., probability of re-identification) of record $i$. The authors designed a record-indexed weight $\alpha_i$, based on $IR_i$, which is inversely proportional to $IR_i$ and bounded between 0 and 1. The vector of weights $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ is subsequently applied to the likelihood function of all records to form the pseudo posterior,

$$p_{\boldsymbol{\alpha}}(\theta \mid \mathbf{y}, \gamma) \propto \left[\prod_{i=1}^{n} p(y_i \mid \theta)^{\alpha_i}\right] p(\theta \mid \gamma), \tag{1}$$

where $\theta$ denotes the model parameters and $\gamma$ denotes the model hyperparameters. This construction surgically *downweights* the likelihood contributions of records with high identification risks (i.e., a high $IR_i$ produces a low $\alpha_i$), and produces a risk-weighted synthesizer within a pseudo-posterior framework. The proposed risk-weighted pseudo posterior synthesizer successfully produces an overall lower identification risk profile for the entire dataset, compared to the unweighted synthesizer. Such risk reduction comes with a relatively minor utility reduction, which encourages use of a risk-weighted pseudo posterior synthesizer to balance the utility-risk trade-off of synthetic data. Hu and Savitsky (2019) demonstrated the performance of the risk-weighted pseudo posterior synthesizer in a simulation study, and in an application of synthesizing family income in a sample of the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics (BLS).

However, while the risk-weighted pseudo posterior synthesizer provides further privacy protection for records with high identification risks by downweighting their likelihood contribution, this risk reduction is achieved by shrinking the synthetic data value for each high-risk record to the main modes of the distribution, which in turn reduces its relative isolation from other records (and we note that it is easier for a putative intruder to identify data records with relatively unique values). The shrinking of a high-risk record towards the main modes, however, may increase the isolation of a relatively moderate-risk record with the result that the identification risk may, actually, increase after re-estimation under the risk-weighted pseudo posterior synthesizer as compared to the unweighted synthesizer. We term this undesirable phenomenon as “whack-a-mole”, where pseudo posterior unintentionally increases the risk in the synthetic data value of a moderate-risk real data record by increasing its relative isolation. It then becomes difficult to control the overall risk profile of the synthetic data under use of the risk-weighted pseudo posterior synthesizer, since the marginal probability of identification risk of some records increases relative to the unweighted synthesizer, simultaneously with a decrease in the identification risk for previously high-risk data records. 
Figure 1 highlights records (in yellow) in the CE sample, whose marginal probability of identification risk has been increased by 0.25, from the synthetic data drawn under the unweighted synthesizer to the synthetic data drawn under the risk-weighted pseudo posterior synthesizer; that is, the risk-weighted pseudo posterior synthesizer increases records with moderate levels of identification risks under the unweighted synthesizer to high identification risks, even though it successfully provides higher privacy protection for records with high identification risks, and provides overall lower identification risk profile for the entire dataset.

Furthermore, Figure 1 also shows that the surgical downweight method proposed by Hu and Savitsky (2019) does not control the maximum identification risk well. Suppose the statistical agency sets 0.5 as a threshold, such that no synthetic record should possess an identification risk greater than 0.5. As can be seen from the number of records with identification risk exceeding 0.5 on the y-axis, the synthetic data produced by the risk-weighted pseudo posterior synthesizer does not satisfy this requirement and, therefore, the set of synthetic data records cannot be released to the public.

In this work, we focus on mitigating the whack-a-mole phenomenon to achieve more satisfactory risk profiles of the simulated synthetic data. Our main strategy formulates the weight for each record by constructing a collection of joint, pairwise probabilities of identification risk for that record with the other data records. Our resulting risk-based weight for each record used in the risk-weighted pseudo posterior synthesizer is now based on the collection of pairwise identification risk probabilities, rather than the marginal identification risk probability used in Hu and Savitsky (2019). Using marginal identification risk probabilities treats each record independently from all others, which induces the unintentional risk increase for some records in the whack-a-mole phenomenon. The use of pairwise identification risk probabilities for formulating by-record weights ties together the downweighting of records, which, as we show in the sequel, mitigates the whack-a-mole phenomenon. Pairwise downweighting does not eliminate the whack-a-mole phenomenon entirely, but it substantially mitigates it and provides further privacy protection with higher utility for the entire sample as compared to the marginal downweight approach. We demonstrate on our CE sample application that our pairwise downweight approach additionally helps to control the maximum identification risks over the data by “compressing” the distribution of the record identification risks. Our use of pairwise identification risk probabilities may be viewed as an adaptation of Williams and Savitsky (2018) from the survey sampling case (where the weights are based on unit inclusion probabilities into a sample of a finite population) to our risk-weighted pseudo posterior framework.

Having established the propriety of our pairwise downweight strategy, we construct and illustrate practical approaches for scaling and shifting the risk-based weights that allow the statistical agency or data-disseminating organization to achieve a desired utility-risk trade-off in the publicly-released synthetic microdata. We illustrate our practical approaches on the CE sample data.

Section 1.1 introduces the CE sample, and the goal of synthesizing the highly skewed continuous variable, the family income, of each consumer unit (CU) in the sample. Section 1.2 describes the finite mixture synthesizer developed for the purpose of synthesizing the family income variable, which has been demonstrated to produce synthetic data with high utility. Section 1.3 provides a succinct description of how we compute the identification risk probability of each CU given publicly available synthetic datasets, based on assumptions about the intruder’s knowledge. We subsequently use our approach for computing the identification risk probabilities for the records to compare the performance of our pairwise downweight approach to that of the marginal downweight approach.

### 1.1 The CE Sample

The CE, published by the BLS, contain data on expenditures, income, and tax statistics about CUs across the U.S. The CE public-use microdata (PUMD) is publicly available respondent-level data, published by the CE (for information about the CE PUMD, visit https://www.bls.gov/cex/pumd.htm). The CE PUMD has undergone masking procedures to provide privacy protection of survey respondents. Notably, the family income variable has undergone top-coding, a popular Statistical Disclosure Limitation (SDL) procedure which could result in reduced utility and insufficient privacy protection (An and Little, 2007; Hu and Savitsky, 2019).

The CE sample in our application contains $n$ CUs, coming from the 2017 1st quarter CE Interview Survey. It includes the family income variable, which is highly right-skewed and deemed sensitive; refer to Figure 2 for its density plot. The CE sample also contains 10 categorical variables, listed in Table 1. These categorical variables are deemed insensitive, and are used as predictors in building a flexible synthesizer for the synthesis of the sensitive family income variable. We next provide details of this synthesizer.

Table 1: Variables in the CE sample.

| Variable | Description |
|---|---|
| Gender | Gender of the reference person; 2 categories |
| Age | Age of the reference person; 5 categories |
| Education Level | Education level of the reference person; 8 categories |
| Region | Region of the CU; 4 categories |
| Urban | Urban status of the CU; 2 categories |
| Marital Status | Marital status of the reference person; 5 categories |
| Urban Type | Urban area type of the CU; 3 categories |
| CBSA | 2010 core-based statistical area (CBSA) status; 3 categories |
| Family Size | Size of the CU; 11 categories |
| Earner | Earner status of the reference person; 2 categories |
| Family Income | Imputed and reported income before tax of the CU; approximate range: (-7K, 1,800K) |

### 1.2 Finite Mixture Synthesizer

To simulate partially synthetic data for the CE sample, where only the sensitive, continuous family income variable is synthesized, we propose using a flexible, parametric finite mixture synthesizer. As shown in Hu and Savitsky (2019), their truncated Dirichlet process (TDP) mixture synthesizer produces synthetic CE data with high utility, and can be used for partial synthesis of continuous variable(s) utilizing a number of available predictors.

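Equation (2) and Equation (3) present the first two levels of our proposed hierarchical parametric finite mixture synthesizer: $y_i$ is the logarithm of the family income for CU $i$, and $\mathbf{x}_i$ is the $R \times 1$ predictor vector for CU $i$. The TDP mixture utilizes a hyperparameter for the maximum number of mixture components (i.e., clusters), $K$, that is set to be over-determined to permit the flexible clustering of CUs. A subset of CUs that are assigned to cluster $k$ employ the same generating parameters for $y_i$, $(\boldsymbol{\beta}_k, \sigma_k)$, that we term a “location”. Locations, $(\boldsymbol{\beta}_{z_i}, \sigma_{z_i})$, and the vector of cluster indicators, $\mathbf{z} = (z_1, \ldots, z_n)$ with $z_i \in \{1, \ldots, K\}$, are all sampled for each CU, $i \in \{1, \ldots, n\}$.

$$y_i \mid \mathbf{x}_i, z_i, \mathbf{B}, \boldsymbol{\sigma} \sim \text{Normal}\left(\mathbf{x}_i^{\top} \boldsymbol{\beta}_{z_i}, \sigma_{z_i}\right), \tag{2}$$

$$z_i \mid \boldsymbol{\pi} \sim \text{Multinomial}\left(1; \pi_1, \ldots, \pi_K\right), \tag{3}$$

where the $R \times K$ matrix of regression locations, $\mathbf{B} = (\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K)$, denotes cluster-indexed regression coefficients for $R$ predictors. The mixture weights, $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$, are, in turn, assigned a sparsity-inducing Dirichlet distribution. For brevity and to avoid repeating Hu and Savitsky (2019), we include the details of our prior specification in the Supplementary Material.

Here, we describe how to generate partially synthetic data for the CE sample. To implement the TDP synthesizer, we first generate sample values of $(\boldsymbol{\pi}^{(l)}, \mathbf{B}^{(l)}, \boldsymbol{\sigma}^{(l)})$ from the posterior distribution at MCMC iteration $l$. Secondly, for CU $i$, we generate cluster assignments, $z_i^{(l)}$, from its full conditional posterior distribution given in Hu and Savitsky (2019) using the posterior samples of $\boldsymbol{\pi}^{(l)}$. Lastly, we generate synthetic family income for CU $i$, $y_i^{*,(l)}$, from Equation (2) given $\mathbf{x}_i$, and samples of $z_i^{(l)}$ and $(\mathbf{B}^{(l)}, \boldsymbol{\sigma}^{(l)})$. We perform these draws for all $n$ CUs, and obtain a partially synthetic dataset, $\mathbf{Z}^{(l)}$, at MCMC iteration $l$. We repeat this process $L$ times, creating $L$ independent partially synthetic datasets $\mathbf{Z} = (\mathbf{Z}^{(1)}, \ldots, \mathbf{Z}^{(L)})$.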
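The three synthesis steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Stan implementation referred to in this paper: it assumes a normal mixture-of-regressions kernel, takes a single posterior draw of the mixture weights, regression coefficients and scales as given, and draws each cluster label from the usual mixture full conditional. The function names `sample_assignments` and `synthesize_once` and the toy inputs are hypothetical.

```python
import numpy as np

def sample_assignments(y, X, pi, B, sigma, rng):
    """Draw z_i with probability proportional to pi_k * Normal(y_i | x_i' beta_k, sigma_k)."""
    mean = X @ B                                      # (n, K) cluster-specific means
    log_w = np.log(pi) - np.log(sigma) - 0.5 * ((y[:, None] - mean) / sigma) ** 2
    log_w -= log_w.max(axis=1, keepdims=True)         # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    u = rng.random(len(y))
    z = (w.cumsum(axis=1) < u[:, None]).sum(axis=1)   # inverse-CDF draw for each row
    return np.minimum(z, len(pi) - 1)                 # guard against round-off

def synthesize_once(y, X, pi, B, sigma, rng):
    """One partially synthetic outcome vector from one posterior draw (assumed given)."""
    z = sample_assignments(y, X, pi, B, sigma, rng)
    mean = (X @ B)[np.arange(len(y)), z]
    return z, rng.normal(mean, sigma[z])

# Toy inputs (hypothetical): intercept-only design, two well-separated clusters.
rng = np.random.default_rng(0)
X = np.ones((4, 1))
B = np.array([[0.0, 10.0]])          # R x K matrix of regression locations
sigma = np.array([0.1, 0.1])
pi = np.array([0.5, 0.5])
y = np.array([0.0, 0.0, 10.0, 10.0])
z, y_syn = synthesize_once(y, X, pi, B, sigma, rng)
```

Repeating `synthesize_once` over several posterior draws would give the collection of synthetic datasets described above.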

### 1.3 Marginal Probability of Identification Risk

In Section 2, we define and utilize a collection of pairwise identification risk probabilities for each real data record to construct a weight to downweight its likelihood contribution for the risk-weighted pseudo posterior synthesizer estimated on the real data. We subsequently employ the marginal probability of identification risk for each record from Hu and Savitsky (2019) as a risk measure to assess the risk reduction of the synthetic data produced from a pairwise risk-weighted pseudo posterior synthesizer. We select the marginal probability of identification risk as our risk measure because it is based on our assumptions about the behavior of a putative intruder who intends to uncover the identity of a target record.

We provide a succinct description of the marginal probability of identification risk utilized in Hu and Savitsky (2019), to which we refer the reader for further detail. The intruder has access to $L$ publicly available synthetic datasets, $\mathbf{Z} = (\mathbf{Z}^{(1)}, \ldots, \mathbf{Z}^{(L)})$, and suppose the intruder seeks the identities of CUs within synthetic dataset $\mathbf{Z}^{(l)}$. Furthermore, suppose the intruder has access to an external file, which contains a known pattern of the un-synthesized categorical variables, $p$, of CU $i$. In addition, we assume that the intruder knows $y_i$, the true family income of CU $i$.

Let $M_{p,i}^{(l)}$ index the collection of CUs sharing the pattern, $p$, with CU $i$, in the synthetic dataset $\mathbf{Z}^{(l)}$. Let the cardinality, $|M_{p,i}^{(l)}|$, denote the number of CUs in $M_{p,i}^{(l)}$. Define $B(y_i, r)$ as a ball of radius $r$ around the continuous, true family income, $y_i$. The intruder will regard CUs in $M_{p,i}^{(l)}$ whose synthetic family income $y_h^{*,(l)} \in B(y_i, r)$ as candidate records to be identified as CU $i$. Define the identification risk probability for CU $i$ as:

$$IR_i^{(l)} = \frac{\sum_{h \in M_{p,i}^{(l)}} I\left(y_h^{*,(l)} \notin B(y_i, r)\right)}{|M_{p,i}^{(l)}|} \times T_i^{(l)}, \tag{4}$$

where $I(\cdot)$ is the indicator function. The numerator on the right-hand side of Equation (4) measures the relative isolation of the value for CU $i$, $y_i$, among the values for the synthetic CUs in pattern $p$. The fewer the data values for synthetic data CUs contained in $B(y_i, r)$, the more isolated $y_i$ is and the fewer CUs the intruder has from which to make a random guess, which in turn raises the marginal probability of identification risk. The indicator $T_i^{(l)}$ is set to $1$ if the synthetic family income for CU $i$, $y_i^{*,(l)}$, is among the candidate records for CU $i$, and to $0$ otherwise. It measures the mixing property of the synthesizer. Even if CU $i$ is relatively isolated, if its synthetic data value, $y_i^{*,(l)}$, is far away (i.e., lies outside $B(y_i, r)$) from the true data value, $y_i$, then the intruder will have a probability of $0$ of identifying CU $i$. The degree of mixing is a property of the synthesizer. The denominator on the right-hand side of Equation (4) counts the total number of CUs in the pattern containing CU $i$, to produce a measure $IR_i^{(l)} \in [0, 1]$.

The average of $IR_i^{(l)}$ is then taken across the $L$ synthetic datasets, $IR_i = \frac{1}{L} \sum_{l=1}^{L} IR_i^{(l)}$, and is used as the final record-level marginal probability of identification risk for CU $i$. The choice of $r$ in $B(y_i, r)$ is a policy decision set by the statistical agency, where smaller values of $r$ will increase the numerator of Equation (4), though $T_i^{(l)}$ is more likely to be set to $0$ when $r$ is smaller.
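The risk measure described in this section can be sketched in Python as follows. This is an illustrative sketch only: it assumes that the ball is an absolute-distance interval of half-width `r` around the true value, that patterns are encoded as one label per CU, and that the risk is the share of synthetic values in the pattern falling outside the ball, multiplied by the mixing indicator. The function names and toy data are hypothetical.

```python
import numpy as np

def marginal_identification_risk(y_true, y_syn, pattern, r):
    """Per-record risk for one synthetic dataset (Equation (4)-style)."""
    y_true = np.asarray(y_true, dtype=float)
    y_syn = np.asarray(y_syn, dtype=float)
    pattern = np.asarray(pattern)
    risk = np.zeros(len(y_true))
    for i in range(len(y_true)):
        members = pattern == pattern[i]                    # CUs sharing CU i's pattern
        outside = np.abs(y_syn[members] - y_true[i]) > r   # synthetic values outside the ball
        t_i = float(abs(y_syn[i] - y_true[i]) <= r)        # 1 if CU i's own synthetic value is a candidate
        risk[i] = outside.mean() * t_i
    return risk

def average_risk(y_true, y_syn_list, pattern, r):
    """Average the per-dataset risks across the released synthetic datasets."""
    return np.mean(
        [marginal_identification_risk(y_true, y_s, pattern, r) for y_s in y_syn_list],
        axis=0,
    )

# Toy data (hypothetical): three CUs in pattern 0, one CU in pattern 1.
y_true = [10.0, 11.0, 50.0, 100.0]
y_syn = [10.5, 30.0, 49.0, 200.0]
pattern = [0, 0, 0, 1]
risk = marginal_identification_risk(y_true, y_syn, pattern, r=1.0)
avg = average_risk(y_true, [y_syn, y_syn], pattern, r=1.0)
```

In this toy example the second and fourth records receive zero risk because their synthetic values fall outside the ball around their true values, which illustrates the role of the mixing indicator.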

The remainder of the paper is organized as follows: Section 2 describes the risk-weighted pseudo posterior synthesizers within the marginal and pairwise downweight frameworks. In Section 3, we apply the two proposed risk-weighted pseudo posterior synthesizers to the CE sample to generate synthetic family income values, and compare their identification risk and utility profiles. Section 4 presents practical approaches for scaling and shifting the pairwise risk-based weights that allow statistical agencies to achieve their desired utility-risk trade-off, demonstrated on the CE sample. The paper concludes with a discussion in Section 5.

## 2 Risk-weighted Pseudo Posterior Synthesizers

We next discuss two frameworks we propose for constructing risk-weighted pseudo posterior synthesizers through surgical by-record downweighting: i) the marginal downweight framework; ii) the pairwise downweight framework. Within each framework, we construct identification risk probabilities from the confidential (real) data, formulate by-record probability-based weights, and estimate a risk-weighted pseudo posterior synthesizer from which we draw synthetic data. The resulting risk reduction is measured using the marginal probability of identification risk, outlined in Section 1.3, on the synthetic data.

For illustration purposes, we continue our description with the CE sample. The statistical agency, which is the BLS in the case of the CE sample, intends to evaluate the identification risk probability of each CU using the confidential CE sample. The BLS knows the pattern of categorical variables, $p$, and $y_i$, the true family income value of CU $i$, both of which are assumed known by the intruder through an external file.

### 2.1 Marginal downweight framework

Within the marginal downweight framework, marginal identification risk probabilities are calculated on the confidential data using Equation (4) for each record, $i \in \{1, \ldots, n\}$; here, however, we set all $T_i = 1$, since the calculations are performed on the confidential (real) data, not the resulting synthetic data, so that mixing does not apply. There is no synthetic value to account for when measuring the marginal probability for each record in the confidential data. We denote the resulting marginal identification risk probability of CU $i$ by $IR_i^{c}$.

We proceed to construct a marginal probability-based weight, $\alpha_i^{m}$, that will be utilized to weight the likelihood contribution for CU $i$, where the weight is based on the marginal identification risk probability, $IR_i^{c}$. For CU $i$, when $IR_i^{c}$ is large and close to 1, the likelihood contribution of this CU should be substantially downweighted to induce further privacy protection. On the other hand, when $IR_i^{c}$ is small and close to 0, the likelihood contribution does not need to be downweighted due to its low identification risk. We therefore construct the record-level marginal probability-based weights, $(\alpha_1^{m}, \ldots, \alpha_n^{m})$, to be inversely proportional to the record-level marginal identification risk probabilities, $(IR_1^{c}, \ldots, IR_n^{c})$, as:

$$\alpha_i^{m} = c^{m} \times \left(1 - IR_i^{c}\right), \tag{5}$$

where $c^{m}$ is a constant to scale the amount of downweighting of all CUs. For example, when $c^{m} = 1$, a CU $i$ with marginal identification risk probability $IR_i^{c}$ receives the weight $\alpha_i^{m} = 1 - IR_i^{c}$. If weighting with $c^{m} = 1$ turns out to provide more risk protection than is required, such that the resulting synthetic data expresses a large degree of distortion that overly damages the synthetic data utility, the BLS may then increase $c^{m}$, resulting in a larger $\alpha_i^{m}$ and therefore less downweighting. In short, including the constant $c^{m}$ in the marginal probability-based weights formulation in Equation (5) allows the BLS to tune the amount of downweighting to achieve a desired utility-risk trade-off of the synthetic data. We note that Hu and Savitsky (2019) constructed a record-level $c_i$ instead of a file-level scalar $c^{m}$. We utilize the scalar in our CE application as it works well and avoids any distortion in the risk-based weighting that may be implied by using a vector of constants.
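The effect of the scaling constant can be illustrated with a short Python sketch. It assumes that the marginal weight takes the form of the constant multiplied by one minus the confidential-data risk, and it additionally truncates the result to the unit interval; the truncation, the function name `marginal_weights` and the example risks are assumptions made for illustration.

```python
import numpy as np

def marginal_weights(ir_conf, c=1.0):
    """Weights that decrease with risk; c scales the amount of downweighting."""
    ir_conf = np.asarray(ir_conf, dtype=float)
    # Truncation to [0, 1] is an assumption of this sketch.
    return np.clip(c * (1.0 - ir_conf), 0.0, 1.0)

ir_conf = [0.0, 0.5, 1.0]                    # hypothetical confidential-data risks
w_default = marginal_weights(ir_conf)        # c = 1
w_relaxed = marginal_weights(ir_conf, 1.5)   # larger c, less downweighting
```

With the larger constant, the middle record's weight rises from 0.5 to 0.75, so its likelihood contribution is downweighted less.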

### 2.2 Pairwise downweight framework

Within the pairwise downweight framework, we construct a pairwise identification risk probability for each pair of CUs that are in the same known pattern of un-synthesized categorical variables. Let $M_{p,(i,j)}$ index the collection of CUs in the confidential data sharing the same pattern, $p$, with CUs $i$ and $j$. As in Equation (4), $|M_{p,(i,j)}|$ denotes the number of CUs in the collection. As before, $B(y_i, r)$ casts a ball of radius $r$ around the true family income of CU $i$, $y_i$; similarly with $B(y_j, r)$ for CU $j$. We next measure the probability of the event that the family income in the confidential data for each CU, $h \in M_{p,(i,j)}$, lies in the intersection of the events defined by $B(y_i, r)$ and $B(y_j, r)$, jointly; that is, $y_h$ falls outside both balls. These intersections are used to construct a joint identification risk probability, $IR_{i,j}^{c}$, for the pair of CUs $(i, j)$ as:

$$IR_{i,j}^{c} = \frac{\sum_{h \in M_{p,(i,j)}} I\left(y_h \notin B(y_i, r) \,\cap\, y_h \notin B(y_j, r)\right)}{|M_{p,(i,j)}|}. \tag{6}$$

The joint identification risk probability, $IR_{i,j}^{c} = 0$ for pairs of CUs assigned to *different* known patterns. It bears mention that constructing the joint measure of isolation in the numerator on the right-hand side of Equation (6) does not arise from an assumption about the intruder identification process, as it does in Equation (4). We next use these joint identification risk probabilities to formulate dependent, by-record probability-based weights for the pairwise risk-weighted pseudo posterior synthesizer.

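We proceed to construct a pairwise probability-based weight, $\alpha_i^{pw}$, for CU $i$, using the collection of pairwise identification risk probabilities, $IR_{i,j}^{c}$, for each $j \neq i \in M_{p,i}$, where $M_{p,i}$ indexes the CUs in the confidential data sharing pattern $p$ with CU $i$. Firstly, for each $j \neq i$, define the pairwise weight for CUs $(i, j)$ as:

$$\alpha_{i,j} = 1 - IR_{i,j}^{c}. \tag{7}$$

This definition constructs the pairwise weight, $\alpha_{i,j}$, to be inversely proportional to the pairwise identification risk probability, $IR_{i,j}^{c}$: a higher $IR_{i,j}^{c}$ results in a lower $\alpha_{i,j}$, and vice versa. Furthermore, $\alpha_{i,j} \in [0, 1]$.

Secondly, we construct the normalized weight, $\tilde{\alpha}_i$, for CU $i$, by summing over all $\alpha_{i,j}$ for $j \neq i$, and dividing by $|M_{p,i}| - 1$ to account for the number of times that CU $i$ appears in the combination of pairs in pattern $p$:

$$\tilde{\alpha}_i = \frac{\sum_{j \neq i \in M_{p,i}} \alpha_{i,j}}{|M_{p,i}| - 1}. \tag{8}$$

The normalized weight, $\tilde{\alpha}_i$, reflects the amount of downweighting needed for CU $i$, based on the sum over all pairwise identification risk probabilities associated with CU $i$, $\sum_{j \neq i \in M_{p,i}} IR_{i,j}^{c}$. To see the inverse proportionality between $\tilde{\alpha}_i$ and $IR_{i,j}^{c}$ more clearly, we can rewrite Equation (8) in terms of $IR_{i,j}^{c}$:

$$\tilde{\alpha}_i = \frac{\sum_{j \neq i \in M_{p,i}} \left(1 - IR_{i,j}^{c}\right)}{|M_{p,i}| - 1} = 1 - \frac{\sum_{j \neq i \in M_{p,i}} IR_{i,j}^{c}}{|M_{p,i}| - 1}. \tag{9}$$

Equation (9) shows that when the sum of pairwise identification risk probabilities associated with CU $i$ is high, the normalized weight $\tilde{\alpha}_i$ will be low and closer to 0. Such inverse proportionality is desired, because we want to apply more downweighting to CUs with high identification risks for further privacy protection.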
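The construction of the normalized pairwise weights can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the joint risk of a pair is taken to be the share of confidential records in the pattern that fall outside both balls, the ball is an absolute-distance interval of half-width `r`, the two records of the pair are themselves included in the count, and singleton patterns keep a weight of 1. The function name and toy data are hypothetical.

```python
import numpy as np

def pairwise_normalized_weights(y, pattern, r):
    """Normalized pairwise weights computed on the confidential data."""
    y = np.asarray(y, dtype=float)
    pattern = np.asarray(pattern)
    alpha_tilde = np.ones(len(y))
    for p in np.unique(pattern):
        idx = np.flatnonzero(pattern == p)
        m = len(idx)
        if m < 2:
            continue  # no pairs to form; weight stays at 1 (assumption)
        # outside[a, h] = 1 when record h lies outside the ball around record a
        outside = (np.abs(y[idx][None, :] - y[idx][:, None]) > r).astype(float)
        # joint_ir[a, b] = share of records in the pattern outside both balls
        joint_ir = outside @ outside.T / m
        pair_w = 1.0 - joint_ir            # pairwise weights
        np.fill_diagonal(pair_w, 0.0)      # exclude the j = i terms
        alpha_tilde[idx] = pair_w.sum(axis=1) / (m - 1)
    return alpha_tilde

# Toy confidential data (hypothetical): a single pattern with three CUs.
alpha_tilde = pairwise_normalized_weights([0.0, 1.0, 10.0], [0, 0, 0], r=2.0)
```

The matrix product counts, for every pair at once, the records lying outside both balls, which avoids an explicit loop over pairs within each pattern.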

We recall that the whack-a-mole phenomenon arises as distribution mass is shifted in the synthetic data from the real data, due to shrinking of family income values for isolated high-risk records towards the main modes of the distribution. This shrinking of values for high-risk records may in turn reduce the number of records whose values are close to a moderate-risk record (measured for the confidential data), with the result that measured risk in the synthetic data for a record whose risk is measured as moderate under the unweighted synthesizer, may actually increase under the marginal risk-weighted pseudo posterior synthesizer. It is this increase in risks in the synthetic data for moderate-risk records in the confidential data that induces difficulty to control the overall risk level across records.

The set of $\tilde{\alpha}_i$ within each pattern are now constructed as *dependent*. The probability-based weights within the pairwise downweight framework are therefore constructed to reduce the shrinking of high-risk records, leaving moderate-risk records more covered in the synthetic data, such that the risks of these records increase less than they do in the synthetic data within the marginal downweight framework. We demonstrate in Section 3 that the pairwise downweight framework induces a compression in the distribution of by-record identification risk probabilities (measured as marginal probabilities of identification risk), which is induced by the dependence among the $\tilde{\alpha}_i$. This compression in the distribution of identification risk probabilities helps reduce the whack-a-mole phenomenon.

The final pairwise probability-based weight, $\alpha_i^{pw}$, for CU $i$, is defined as:

$$\alpha_i^{pw} = c^{pw} \times \tilde{\alpha}_i, \tag{10}$$

where, as with marginal probability-based weights, $c^{pw}$ allows the scaling of weights to control the utility-risk trade-off.

The pairwise risk-weighted pseudo posterior synthesizer has the same form as specified in Hu and Savitsky (2019) for the marginal risk-weighted pseudo posterior synthesizer; namely,

$$p_{\boldsymbol{\alpha}^{pw}}(\theta \mid \mathbf{y}, \gamma) \propto \left[\prod_{i=1}^{n} p(y_i \mid \theta)^{\alpha_i^{pw}}\right] p(\theta \mid \gamma), \tag{11}$$

where $\boldsymbol{\alpha}^{pw} = (\alpha_1^{pw}, \ldots, \alpha_n^{pw})$.
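On the log scale, raising each likelihood contribution to a by-record weight amounts to multiplying each record's log-likelihood by that weight. The Python sketch below illustrates this for a plain normal likelihood, which is an assumption made purely for illustration (the synthesizer in this paper is a finite mixture estimated in Stan); the function name `weighted_log_likelihood` and the toy values are hypothetical.

```python
import numpy as np

def weighted_log_likelihood(y, mu, sigma, alpha):
    """Sum over records of alpha_i * log Normal(y_i | mu_i, sigma_i)."""
    y, mu, sigma, alpha = (np.asarray(a, dtype=float) for a in (y, mu, sigma, alpha))
    log_lik = -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2
    return float(np.sum(alpha * log_lik))

# Toy example (hypothetical): the first record is downweighted by one half.
full = weighted_log_likelihood([0.0, 0.0], 0.0, 1.0, [1.0, 1.0])
down = weighted_log_likelihood([0.0, 0.0], 0.0, 1.0, [0.5, 1.0])
```

A weight of 1 leaves a record's contribution unchanged, while a weight of 0 removes it from the likelihood, so that the record's synthetic value is driven by the rest of the data and the prior.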

## 3 Comparing Marginal and Pairwise Risk-weighted Synthesizers on CE Sample

We utilize three synthesizers on the CE sample to synthesize the sensitive family income variable using 10 categorical predictors: i) The finite mixture unweighted synthesizer from Section 1.2; ii) The marginal risk-weighted pseudo posterior synthesizer from Section 2.1; iii) The pairwise risk-weighted pseudo posterior synthesizer from Section 2.2.

For the two risk-weighted pseudo posterior synthesizers, we fix the scale adjustment constants, $c^{m}$ and $c^{pw}$, for comparison. The resulting by-record distributions of identification risks, based on the marginal probability of identification risk in Section 1.3, are evaluated for synthetic data drawn under each of the three synthesizers. We set the same radius, $r$, for the two risk-weighted synthesizers and for evaluating the identification risks of all synthesizers. We assume the intruder has an external file with information on {Gender, Age, Region} for each of the $n$ CUs. The intersecting values of these known-to-intruder predictors produce 40 known patterns, and each pattern has more than 1 CU (i.e., there are no singleton patterns). For all three synthesizers, we use the same maximum number of clusters, $K$. We estimate our synthesizers using Stan (Stan Development Team, 2016) and assess convergence by measuring the effective sample size (ESS). We include our Stan script in the Supplementary Material. We generate $L$ synthetic datasets for each synthesizer. We next evaluate and compare the profiles of identification risks and utility for all synthesizers.

### 3.1 Identification risks

Side-by-side violin plots of the identification risk probability distributions for the three synthesizers are presented in Figure 3. For comparison, we include the risk profile of the confidential CE sample, labeled as “Data”. The unweighted synthesizer is labeled as “Synthesizer”. The two risk-weighted pseudo posterior synthesizers are labeled as “Marginal” and “Pairwise”, corresponding to the marginal and pairwise downweight frameworks, respectively. We will use these labels (Synthesizer, Marginal and Pairwise) in the remainder of the paper to refer to these synthesizers.

Each violin plot shows the distribution of the calculated identification risk probabilities for all CUs in the CE sample. The higher the identification risk probability, the lower the privacy protection, and vice versa. Figure 3 indicates that the Synthesizer provides a huge improvement in privacy protection, compared to the Data. This indicates that the Synthesizer has induced a large amount of mixing in the synthesis process to offer risk reduction, i.e., privacy protection.

Between the two risk-weighted pseudo posterior synthesizers, the Pairwise provides a shorter tail, as well as a more concentrated identification risk distribution, compared to the Marginal. With similar average identification risk probabilities (the horizontal bars), the Pairwise has an interquartile range (IQR) of 0.1385, compared to 0.1534 for the Marginal, highlighting that the Pairwise induces a compression in the by-record identification risks computed on the synthetic data. In short, the Pairwise produces a relatively lower, more compressed identification risk distribution than that of the Marginal, in a fashion that offers more control to the BLS. Constructing the by-record, Pairwise probability-based weights to be dependent ties the shrinking of records together in a fashion that reduces the loss of coverage for moderate-risk records after risk-weighted pseudo posterior synthesis due to the whack-a-mole phenomenon, compared to the Marginal. We also note that the mean identification risk across the records is essentially the same for both the Marginal and Pairwise, as shown by the horizontal line within each violin plot, indicating that both distributions are centered similarly. It is the difference in the relative concentration of their masses and in the maximum identification risk values that differentiates them.

We also note that there is a substantial downward shift in the identification risk distributions for the Marginal and Pairwise, on the one hand, as compared to the Synthesizer, on the other hand. This shift may be seen by focusing on the bottom portion of the distributions, where there is much more distribution mass at low identification risks (measured as marginal probabilities of identification risk).

However, the Pairwise has *not* fully resolved the whack-a-mole phenomenon, though it has notably lessened it. Figure 5 depicts a scatterplot highlighting the whack-a-mole phenomenon in the Pairwise. Compared to that in the Marginal in Figure 4, the whack-a-mole phenomenon in the Pairwise is less severe, as can be seen by the overall smaller values on the y-axis of the highlighted CUs (whose identification risk has increased by 0.25 from the Synthesizer to the Pairwise). We also observe in Figure 5 that there are fewer CUs with higher than 0.5 identification risk in the Pairwise than in the Marginal, a feature that also reduces the tail length in the identification risk distribution violin plot shown in Figure 3.

### 3.2 Utility

To evaluate utility, we report the point estimates and 95% confidence intervals of several key summary statistics: the mean, the median, and the 90% quantile of the family income variable, estimated from the collection of

synthetic datasets drawn from each synthesizer (Synthesizer, Marginal and Pairwise) and presented in Table 2 through Table 4. In addition, in Table 5, we report the regression coefficient of Earner 2, in a regression analysis of family income on three predictors, {Region, Urban, Earner}. In each table, the “Data” row corresponds to the point estimate and the 95% confidence interval based on the confidential CE sample; the “Synthesizer”, “Marginal”, and “Pairwise” rows correspond to the point estimates and the bootstrapped 95% confidence intervals for the three synthesizers.

All four tables show a high level of utility preservation by the Synthesizer. Yet, while Figure 3 reveals that the Synthesizer substantially reduces the risk distribution compared to the Data, the Synthesizer still produces an average marginal probability of identification risk of , which may be deemed too high by the BLS. Additional risk reduction is offered by the two risk-weighted pseudo posterior synthesizers, but at the cost of some loss of utility in their synthetic datasets. The utility results in Table 2 through Table 5 show that, for the Marginal, such privacy protection comes at a probably unacceptable utility reduction: in Table 2 through Table 4, none of its 95% confidence intervals contains the point estimate from the Data, while in Table 5 its interval does contain the Data point estimate but is roughly three times as wide as that of the Data.

The Pairwise, by contrast, shows only a minor utility reduction, much smaller than that of the Marginal and small enough that inference is unchanged. At the same time, it produces a similarly reduced risk distribution, which is a notable reduction from the Synthesizer, as shown in Figure 3. Based on these results, we recommend the Pairwise to the BLS as a solution that offers further privacy protection while maintaining a reasonably high level of utility preservation.

Table 2: Point estimates and 95% confidence intervals of the mean of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 72090.26 | [70127.02, 74053.50] |
| Synthesizer | 72377.12 | [70412.90, 74415.81] |
| Marginal | 76641.95 | [72638.03, 81817.03] |
| Pairwise | 73184.83 | [70887.88, 75626.18] |

Table 3: Point estimates and 95% confidence intervals of the median of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 50225.15 | [48995.01, 52000.00] |
| Synthesizer | 50538.50 | [49043.63, 52115.76] |
| Marginal | 54229.08 | [52877.58, 55537.29] |
| Pairwise | 51692.10 | [50235.51, 53052.61] |

Table 4: Point estimates and 95% confidence intervals of the 90% quantile of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 153916.30 | [147582.40, 159603.80] |
| Synthesizer | 152597.10 | [147647.40, 157953.80] |
| Marginal | 134582.40 | [130716.50, 138516.00] |
| Pairwise | 145968.90 | [141137.70, 150867.30] |

Table 5: Point estimates and 95% confidence intervals of the regression coefficient of Earner 2.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | -45826.20 | [-49816.29, -41836.11] |
| Synthesizer | -46017.29 | [-50239.20, -41795.37] |
| Marginal | -34738.85 | [-46792.90, -22684.81] |
| Pairwise | -44028.77 | [-49340.69, -38716.85] |

In summary, the Pairwise offers further privacy protection, compared to the unweighted Synthesizer. The extra privacy protection comes at the cost of a minor level of utility reduction. Overall, the Pairwise creates a better balance of utility-risk trade-off, compared to the Marginal. It is worth noting that we also examined three-way identification risk probabilities to assess whether further improvement in the whack-a-mole phenomenon was observed, but discovered little improvement at the price of a less scalable computation.
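The interval comparisons in Table 2 through Table 5 can be checked mechanically. The following is a minimal sketch, not part of the original analysis: it copies the reported numbers and tests, for each synthesizer and statistic, whether the 95% interval contains the point estimate from the Data. The names `covers`, `coverage_table`, and the dictionary keys are hypothetical labels introduced here. Coverage of the Data point estimate is only one crude lens on utility and says nothing about interval width.

```python
# Point estimates from the "Data" rows of Tables 2-5 (copied from the tables).
DATA_ESTIMATES = {
    "mean": 72090.26,
    "median": 50225.15,
    "q90": 153916.30,
    "earner2_coef": -45826.20,
}

# 95% intervals reported for each synthesizer in Tables 2-5.
INTERVALS = {
    "Synthesizer": {
        "mean": (70412.90, 74415.81),
        "median": (49043.63, 52115.76),
        "q90": (147647.40, 157953.80),
        "earner2_coef": (-50239.20, -41795.37),
    },
    "Marginal": {
        "mean": (72638.03, 81817.03),
        "median": (52877.58, 55537.29),
        "q90": (130716.50, 138516.00),
        "earner2_coef": (-46792.90, -22684.81),
    },
    "Pairwise": {
        "mean": (70887.88, 75626.18),
        "median": (50235.51, 53052.61),
        "q90": (141137.70, 150867.30),
        "earner2_coef": (-49340.69, -38716.85),
    },
}


def covers(interval, value):
    """Return True if value lies inside the closed interval (lo, hi)."""
    lo, hi = interval
    return lo <= value <= hi


def coverage_table(intervals, estimates):
    """For each synthesizer and statistic, flag whether the interval
    contains the corresponding point estimate from the Data."""
    return {
        synth: {stat: covers(ci, estimates[stat]) for stat, ci in stats.items()}
        for synth, stats in intervals.items()
    }


if __name__ == "__main__":
    for synth, row in coverage_table(INTERVALS, DATA_ESTIMATES).items():
        print(synth, row)
```

A check of this kind makes it easy to see which statistics drive the utility differences between the Marginal and the Pairwise, and it can be extended with additional rows if further synthesizers are considered.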

We now turn to methods of additional weights adjustments to improve the level of utility preservation with acceptable loss of privacy protection. The following proposed strategies could allow the BLS and other statistical agencies to further tune the risk-weighted pseudo posterior synthesizers to achieve their desired utility-risk trade-off.

## 4 Practical Approaches to Local Weights Adjustments

In Section 3, the two risk-weighted pseudo posterior synthesizers, the Marginal and the Pairwise, have been demonstrated to offer higher privacy protection, compared to the unweighted Synthesizer. The Pairwise gives better control of the overall risk profile and the tail of the record-level identification risk distribution by compression, while maintaining a high level of utility preservation. The Marginal, on the other hand, provides slightly lower privacy protection, with a bigger compromise on utility.

In this section, we assume that the BLS is satisfied with the privacy protection levels offered by the two risk-weighted pseudo posterior synthesizers, but not yet satisfied with their levels of utility preservation. We propose methods to increase their utility preservation levels, with acceptable loss of the privacy protection. We now proceed to describe the two strategies for creating such utility-risk trade-off balance. We focus on the Pairwise risk-weighted pseudo posterior synthesizer, due to its superior performance in the trade-off between utility and risk performances, offering notably better risk protection than the Synthesizer for slightly reduced utility. Results of the Marginal risk-weighted pseudo posterior synthesizer are included in the Supplementary Material for brevity, and a short discussion is presented at the end of this section.

### 4.1 Two Local Weights Adjustments Strategies

The first strategy relates to the scaling constant to be applied to the final pairwise probability-based weights, repeated in Equation (12). We earlier noted this scaling constant serves as a tuning parameter for the BLS to control the amount of downweighting of all CUs. For example, for CU with , when increasing to , we have increased the final pairwise weight of this CU, , from to , which translates to a decrease of the amount of downweighting of 0.25. Increasing the pairwise weight will increase the likelihood contribution of CU in the corresponding risk-weighted pseudo posterior synthesizer, and is expected to result in higher level of utility preservation.

(12) |

In the limit of increasing under the setup of Equation (12), each CU receives a weight of 1, which turns the Pairwise risk-weighted pseudo posterior synthesizer to the unweighted Synthesizer. We consider the Synthesizer (i.e. weights = 1) as the best scenario of utility preservation. A risk-weighted pseudo posterior synthesizer induces misspecification in the pseudo likelihood. It surgically distorts the portion of the distribution with high identification risks. The induced misspecification translates to less-than-1 weights for some records. Such reduction in weights in turn produces synthetic datasets with lower level of utility preservation.
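As a reminder of where the weights enter, the risk-weighted pseudo posterior can be sketched as follows. The notation is assumed here for illustration: $\alpha_i \in [0, 1]$ denotes the weight of record $i$, $\theta$ the model parameters, $\pi(\theta)$ the prior, and $n$ the number of CUs.

```latex
p^{\alpha}(\theta \mid y_1, \ldots, y_n)
  \;\propto\;
  \left[ \prod_{i=1}^{n} p(y_i \mid \theta)^{\alpha_i} \right] \pi(\theta),
  \qquad \alpha_i \in [0, 1].
```

Setting every $\alpha_i = 1$ recovers the unweighted Synthesizer, while $\alpha_i < 1$ flattens the likelihood contribution of record $i$, which is the source of both the added privacy protection and the reduced utility.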

However, it is important to note that the scaling constant affects the CUs differently. For example, for another CU with , the increase to is 0.1 when is increased from to . By contrast, an increase of 0.25 occurs for CU with , producing . Tuning affects all CUs, but to different degrees. Figure 7 plots the pairwise weights (y-axis) against family income for all CUs and shows the effects of on the final pairwise weights of all CUs, compared to Figure 6 where . A greater-than-1 value of induces a *stretching* in the final pairwise weights: it affects CUs with higher weights to a greater degree by producing larger increases in their weights, which is an obvious property of scaling (but nevertheless worth noting due to its impact on the identification risk distribution).

Another approach to adjusting the pairwise weights applies a constant with *equal* effect on the final pairwise weights of all CUs. This can be done by adding a constant in the final pairwise weights construction, as in Equation (13). Figure 8 illustrates the case with while keeping . Compared to Figure 6, the pairwise weight of every CU is increased by 0.1 in Figure 8, showing that the effect of is equally applied to the final pairwise weight of each CU, because the additive constant, , *shifts* the distribution of by-record identification risks. A positive increases the likelihood contribution of all CUs by the same amount, and is expected to result in a higher level of utility preservation, as every weight is closer to 1 than before.

(13) |

It is important to note that when setting and in Equation (13), we obtain the Pairwise risk-weighted pseudo posterior synthesizer in the application in Section 3. Figure 6 illustrates this basic setup, and we can observe how utilizing the pairwise identification risk probabilities to construct weights produces a selective downweighting of CUs as compared to the unweighted Synthesizer, where every CU receives a weight of 1. The majority of the CUs under the Pairwise shown in Figure 6 receive weights around 0.25, with just a few CUs having weights of 1 and many CUs with large or extremely large family income values having weights as low as 0.1. We utilize the marginal and pairwise weights that adjust the unweighted Synthesizer in order to achieve a relatively large or *global* effect on the utility-risk trade-off. The further tuning of the marginal or pairwise weights using and is designed to induce a relatively small or *local* effect on the resulting by-record weights, to allow a more precise setting of the utility-risk trade-off balance sought by the statistical agency. A greater-than-1 value of has a stretching effect on the distribution of the (pairwise or marginal) weights, while a positive value of induces an upward shift in the weight distribution. It is possible to tune and at the same time. For simplicity and illustration, we evaluate tuning only one of these two parameters.
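The stretching and shifting effects described above can be illustrated with a small sketch. The functional form used here, scaling by a constant `c`, adding a constant `g`, and truncating the result to the interval [0, 1], is an assumption made for illustration and is not a transcription of Equation (13). The function name `adjust_weights` and the base weights are likewise hypothetical; the base weights 0.2 and 0.5 with `c = 1.5` are chosen only because they reproduce the increases of 0.1 and 0.25 mentioned in the text.

```python
import math


def adjust_weights(base_weights, c=1.0, g=0.0):
    """Apply an assumed local adjustment to record-level weights.

    Each weight w is mapped to min(max(c * w + g, 0), 1): scaling by c
    stretches the weights, adding g shifts them, and the truncation
    keeps every adjusted weight in [0, 1].
    """
    return [min(max(c * w + g, 0.0), 1.0) for w in base_weights]


if __name__ == "__main__":
    base = [0.1, 0.2, 0.5, 1.0]  # hypothetical base pairwise weights

    scaled = adjust_weights(base, c=1.5)   # stretching: larger weights move more
    shifted = adjust_weights(base, g=0.1)  # shifting: every weight moves by g

    for w, s, h in zip(base, scaled, shifted):
        print(f"base={w:.2f}  scaled={s:.2f} (+{s - w:.2f})  "
              f"shifted={h:.2f} (+{h - w:.2f})")
```

Under this assumed form, a weight already at 1 is unaffected by either adjustment because of the truncation, and letting `c` grow without bound pushes every positive weight to 1, which matches the limiting behavior described for the scaling constant.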

It bears mention that one may set and when the statistical agency desires to locally adjust the risks further downward. We focus on the inverse case, allowing slightly more risk to improve utility, because it is the situation faced by the BLS for the CE sample data.

We now turn to the utility and identification risk profiles of the risk-weighted pseudo posterior synthesizers with these local weights adjustments.

### 4.2 Utility and Risk Profiles under Local Weights Adjustments

For the Pairwise risk-weighted pseudo posterior synthesizers, we consider three variations of the final pairwise weights in Equation (13): i) ; ii) ; and iii) . The choice of the radius , the assumption of intruder’s knowledge, and the configurations of the synthesis, stay the same as in Section 3. As before, we keep the results of the confidential CE sample, and the unweighted Synthesizer, for comparison.

With increased weights through greater-than-1 values of or positive values of , we expect to see increased utility preservation by the Pairwise risk-weighted pseudo posterior synthesizers with these local weights adjustments. We report point estimates and 95% confidence intervals of several key summary statistics and a regression coefficient, from Table 6 to Table 9. These utility results suggest that setting offers improvement in all utility measures, compared to setting . Setting , on the other hand, offers smaller utility improvement.

Table 6: Point estimates and 95% confidence intervals of the mean of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 72090.26 | [70127.02, 74053.50] |
| Synthesizer | 72377.12 | [70412.90, 74415.81] |
| Pairwise, | 73184.83 | [70887.88, 75626.18] |
| Pairwise, | 71695.49 | [69780.83, 73708.90] |
| Pairwise, | 72421.46 | [70393.96, 74544.11] |

Table 7: Point estimates and 95% confidence intervals of the median of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 50225.15 | [48995.01, 52000.00] |
| Synthesizer | 50538.50 | [49043.63, 52115.76] |
| Pairwise, | 51692.10 | [50235.51, 53052.61] |
| Pairwise, | 51791.43 | [50339.41, 53283.75] |
| Pairwise, | 51403.06 | [49975.29, 52869.75] |

Table 8: Point estimates and 95% confidence intervals of the 90% quantile of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 153916.30 | [147582.40, 159603.80] |
| Synthesizer | 152597.10 | [147647.40, 157953.80] |
| Pairwise, | 145968.90 | [141137.70, 150867.30] |
| Pairwise, | 145299.70 | [140864.00, 149948.70] |
| Pairwise, | 149157.80 | [144024.70, 154207.60] |

Table 9: Point estimates and 95% confidence intervals of the regression coefficient of Earner 2.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | -45826.20 | [-49816.29, -41836.11] |
| Synthesizer | -46017.29 | [-50239.20, -41795.37] |
| Pairwise, | -44028.77 | [-49340.69, -38716.85] |
| Pairwise, | -43544.92 | [-47654.14, -39435.71] |
| Pairwise, | -44827.57 | [-49289.40, -40365.73] |

To find out whether such utility improvement comes at the price of reduced privacy protection, we create violin plots of the identification risk probability distributions in Figure 9. The violin plots show different impacts on identification risks when setting or , compared to when setting . Increasing slightly increases the average identification risk (the horizontal bar), while producing a slightly shorter tail, indicating better control of the maximum identification risk. Increasing , on the other hand, keeps a similar average identification risk, while producing a longer tail, indicating worse control of the maximum identification risk. Both and provide higher privacy protection than the unweighted Synthesizer, in terms of both the average and the tail.

In summary, setting or for the Pairwise risk-weighted pseudo posterior synthesizers offers improvement in utility. Tuning offers higher utility improvement, at a higher price in reduced privacy protection. Tuning offers reasonable utility improvement, with a slightly higher average identification risk and a slightly smaller maximum identification risk. Depending on the microdata release policy set by the BLS, the proposed local weights adjustments could help the BLS to achieve their desired utility-risk trade-off balance when disseminating synthetic datasets through the Pairwise risk-weighted pseudo posterior synthesizers.

We also examined the impacts of and on the Marginal risk-weighted pseudo posterior synthesizers. For brevity, the detailed results are in the Supplementary Material. The main findings are: i) Setting or offers improvement in utility, with offering the biggest improvement in every utility measure; ii) Compared to the utility improvement with and on the Pairwise risk-weighted pseudo posterior synthesizers, the improvement offered by and on the Marginal risk-weighted pseudo posterior synthesizers is not as satisfactory, and might be deemed insufficient in some utility measures; iii) Both and create shorter tails of the identification risk distribution, while slightly increases the average and keeps a similar average of the identification risks. Based on these findings, we recommend the proposed local weights adjustments with or on the Pairwise risk-weighted pseudo posterior synthesizers for the BLS to achieve their desired utility-risk trade-off balance.

## 5 Conclusion

We propose a general framework for statistical agencies to achieve a desired utility-risk trade-off balance when disseminating microdata through synthetic data. Starting with a synthesizer with high utility but an unacceptable level of identification risks, statistical agencies can proceed to create risk-weighted pseudo posterior synthesizers to provide higher privacy protection. Our proposed risk-weighted pseudo posterior synthesizers are designed based on a record-indexed weight, , which is inversely proportional to the record-level identification risk probability. The likelihood contribution of each record is exponentiated with the record-indexed weight, and the resulting pseudo posterior creates the risk-weighted synthesizer, which downweights the likelihood contribution of records with high identification risks, providing higher privacy protection.

The agencies can utilize marginal identification risk probabilities, which treat records as independent of each other. When risk-weighted pseudo posterior synthesizers based on the marginal identification risk probabilities do not provide sufficient privacy protection, especially when records with low identification risks are exposed to less privacy protection due to the whack-a-mole phenomenon, the statistical agencies may utilize pairwise identification risk probabilities in designing the by-record weights. These pairwise identification risk probabilities tie pairs of records together and induce dependence among the by-record weights, which offers an overall higher privacy protection and mitigates the whack-a-mole phenomenon to some degree. To offer more flexibility in tuning the risk-utility trade-off, we propose minor, local adjustments to the weights. These adjustments should be relatively small in order not to distort the risk profile achieved using the Pairwise method of downweighting.

Our application to the CE sample shows that the Pairwise downweighting framework creates risk-weighted pseudo posterior synthesizers with better control of the identification risks and little loss of utility. Local weights adjustments are shown to improve the level of utility preservation with little loss of privacy protection. These features provide general guidelines for statistical agencies to design risk-weighted pseudo posterior synthesizers and to work towards disseminating synthetic data with a desired utility-risk trade-off.

## Acknowledgements

This research is supported by the ASA/NSF/BLS Senior Research Fellow Program.

## References

- Multiple imputation: an alternative to top coding for statistical disclosure control. Journal of the Royal Statistical Society, Series A 170, pp. 923–940. Cited by: §1.1.
- The creation and use of the SIPP Synthetic Beta. Available at https://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/SSBdescribe_nontechnical.pdf. Accessed February 2018. Cited by: §1.
- Comparing fully and partially synthetic datasets for statistical disclosure control in the german iab establishment panel. Transactions on Data Privacy, pp. 1002–1050. Cited by: §1.
- A new approach for disclosure control in the iab establishment panel - multiple imputation for a better data access. Advances in Statistical Analysis, pp. 439–458. Cited by: §1.
- Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data. pp. arXiv:1803.05874. Cited by: §1, §1.
- Synthetic datasets for statistical disclosure control. Springer: New York. Cited by: §1.
- Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, pp. 265–284. External Links: ISBN 3-540-32731-2, 978-3-540-32731-8 Cited by: §1.
- Dirichlet process mixture models for modeling and generating synthetic versions of nested categorical data. Bayesian Analysis 13, pp. 183–200. Cited by: §1.
- Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys. ArXiv e-prints. External Links: 1809.10074 Cited by: §1.
- Bayesian pseudo posterior synthesis for data privacy protection.. pp. arXiv:1901.06462. Cited by: §1.1, §1.2, §1.2, §1.3, §1.3, §1, §1, §1, §2.1, §2.2.
- A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician 60, pp. 224–232. Cited by: §1.
- SynLBD 2.0: improving the synthetic longitudinal business database. Statistical Journal of the International Association for Official Statistics 30, pp. 129–135. Cited by: §1.
- Towards unrestricted public use business microdata: the synthetic longitudinal business database. International Statistical Review 79, pp. 363–384. Cited by: §1.
- Statistical analysis of masked data. Journal of Official Statistics 9, pp. 407–426. Cited by: §1.
- Privacy: theory meets practice on the map. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286. Cited by: §1.
- Bayesian non-parametric generation of fully synthetic multivariate categorical data in the presence of structural zeros. Journal of the Royal Statistical Society, Series A 181, pp. 635–647. Cited by: §1.
- Multiple imputation for statistical disclosure limitation. Journal of Official Statistics 19, pp. 1–16. Cited by: §1.
- The multiple adaptations of multiple imputation. Journal of the American Statistical Association 102, pp. 1462–1471. Cited by: §1.
- Discussion statistical disclosure limitation. Journal of Official Statistics 9, pp. 461–468. Cited by: §1.
- General and specific utility measures for synthetic data. Journal of the Royal Statistical Society, Series A 181, pp. 663–688. Cited by: §1.
- RStan: the R interface to Stan. Note: R package version 2.14.1 External Links: Link Cited by: §3.
- Bayesian pairwise estimation under dependent informative sampling. Electronic Journal of Statistics 12, pp. 1631–1661. Cited by: §1.

## 6 Prior specification of the finite mixture synthesizer

We induce sparsity in the number of clusters with,

(14) | ||||

(15) |

We specify multivariate Normal priors for each regression coefficient vector of coefficient locations, ,

(16) |

where the correlation matrix, , receives a uniform prior over the space of correlation matrices, and each component of receives a Student-t prior with degrees of freedom,

(17) |
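One way to write these priors, read off from the Stan script in Section 7 rather than taken from the authors' notation, is sketched below. All symbols are assumptions introduced here: $\pi = (\pi_1, \ldots, \pi_K)$ for the mixture probabilities, $\lambda_k$ for the unnormalized Gamma variables, $\beta_k$ for the coefficient vector of cluster $k$, $\sigma_\beta$ for the vector of $R$ scale parameters, and $\Omega_\beta$ for the correlation matrix.

```latex
\begin{align*}
\pi_k &= \lambda_k \Big/ \sum_{k'=1}^{K} \lambda_{k'},
  \qquad \lambda_k \sim \mathrm{Gamma}(\alpha / K,\, 1), \\
\alpha &\sim \mathrm{Gamma}(1,\, 1), \\
\beta_k &\sim \mathrm{MVN}_R\!\left(0,\;
  \mathrm{diag}(\sigma_\beta)\, \Omega_\beta\, \mathrm{diag}(\sigma_\beta)\right),
  \qquad k = 1, \ldots, K, \\
\Omega_\beta &\sim \mathrm{LKJ}(4), \\
\sigma_{\beta, r} &\sim t^{+}_{3}(0,\, 1),
  \qquad r = 1, \ldots, R.
\end{align*}
```

Normalizing independent $\mathrm{Gamma}(\alpha/K, 1)$ variables yields a $\mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K)$ distribution on $\pi$, which favors sparse cluster usage when $\alpha / K$ is small; the script additionally constrains $\lambda$ to be ordered. Note that the script sets the LKJ shape parameter to 4, whereas a uniform prior over correlation matrices corresponds to a shape parameter of 1.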

## 7 Stan script

    functions{
      real normalmix_lpdf(vector y, vector pi_prob, vector weights, matrix X,
                          matrix beta, vector sigma_y, int N, int K){
        real log_post;
        log_post = 0;
        for( i in 1:N )
        {
          vector[K] ps;
          for( k in 1:K )
          {
            real Bk_xi;
            Bk_xi = dot_product(beta[k], X[i]);
            ps[k] = log(pi_prob[k]) + normal_lpdf(y[i]| Bk_xi, sigma_y[k]);
          } /* end loop k over clusters / mixture components */
          log_post += weights[i] * log_sum_exp(ps);
        } /* end loop i over N observations */
        return log_post;
      } /* end function normalmixture_lpdf() */
    } /* end function{} block */

    data{
      int<lower=1> N;
      int<lower=1> K;
      int<lower=1> R;
      matrix[N,R] X;
      vector[N] y;
      vector[N] weights;
    } /* end data block */

    transformed data{
      vector<lower=0>[K] ones_K;
      vector<lower=0>[R] zeros_beta;
      ones_K = rep_vector(1,K);
      zeros_beta = rep_vector(0,(R));
    } /* end transformed data block */

    parameters{
      real alpha;
      positive_ordered[K] lambda;
      matrix[K, R] beta;
      vector<lower=0>[R] sigma_beta;
      cholesky_factor_corr[R] L_beta;
      positive_ordered[K] sigma_y;
    } /* end parameters block */

    transformed parameters{
      simplex[K] pi_prob = lambda / sum(lambda);
    }

    model{
      alpha ~ gamma( 1.0, 1.0 );
      lambda ~ gamma( alpha/K * ones_K, 1 );
      {
        L_beta ~ lkj_corr_cholesky(4);
        sigma_beta ~ student_t(3,0,1);
        for (k in 1:K)
        {
          beta[k] ~ multi_normal_cholesky( zeros_beta,
                      diag_pre_multiply(sigma_beta,L_beta) );
        }
      }
      sigma_y ~ student_t(3,0,1);
      y ~ normalmix(pi_prob, weights, X, beta, sigma_y, N, K);
    } /* end model{} block */
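To make the role of the weights in `normalmix_lpdf` concrete, the same computation can be mirrored outside Stan. The following pure-Python sketch is an illustration only, not part of the authors' code: it reimplements the weighted log-likelihood of the Normal mixture with the same argument names, using hypothetical helper functions `normal_logpdf` and `log_sum_exp`, and plain lists in place of Stan vectors and matrices.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def normalmix_lpdf(y, pi_prob, weights, X, beta, sigma_y):
    """Weighted log pseudo-likelihood of a Normal mixture regression.

    y: N responses; pi_prob: K mixture probabilities; weights: N record
    weights; X: N rows of R predictors; beta: K rows of R coefficients;
    sigma_y: K component standard deviations.
    """
    log_post = 0.0
    for y_i, w_i, x_i in zip(y, weights, X):
        ps = []
        for pi_k, beta_k, sigma_k in zip(pi_prob, beta, sigma_y):
            mean_ik = sum(b * x for b, x in zip(beta_k, x_i))
            ps.append(math.log(pi_k) + normal_logpdf(y_i, mean_ik, sigma_k))
        # Each record's mixture log density is multiplied by its weight,
        # i.e. its likelihood contribution is raised to the power w_i.
        log_post += w_i * log_sum_exp(ps)
    return log_post


if __name__ == "__main__":
    # Toy inputs, made up for illustration only.
    y = [1.0, 2.5, -0.3]
    X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
    beta = [[0.5, 0.2], [2.0, -1.0]]
    pi_prob = [0.6, 0.4]
    sigma_y = [1.0, 2.0]
    print(normalmix_lpdf(y, pi_prob, [1.0, 1.0, 1.0], X, beta, sigma_y))
    print(normalmix_lpdf(y, pi_prob, [1.0, 0.25, 0.1], X, beta, sigma_y))
```

With all weights equal to 1 the function returns the ordinary mixture log-likelihood, which corresponds to the unweighted Synthesizer; smaller weights shrink the contribution of the corresponding records.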

## 8 Discussion and results of the Marginal risk-weighted pseudo posterior synthesizers

### 8.1 Strategies of weights adjustments

We propose the following local weights adjustments to the final marginal weights construction in Section 2.1 in the main document.

(18) |

where and are constants to tune the amount of weighting of all CUs.

### 8.2 Marginal weights plots

### 8.3 Utility

Point estimates and 95% confidence intervals of the mean of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 72090.26 | [70127.02, 74053.50] |
| Synthesizer | 72377.12 | [70412.90, 74415.81] |
| Marginal, | 76641.95 | [72638.03, 81817.03] |
| Marginal, | 71527.38 | [69352.11, 73889.61] |
| Marginal, | 72950.40 | [70645.43, 75375.39] |

Point estimates and 95% confidence intervals of the median of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 50225.15 | [48995.01, 52000.00] |
| Synthesizer | 50538.50 | [49043.63, 52115.76] |
| Marginal, | 54229.08 | [52877.58, 55537.29] |
| Marginal, | 54376.32 | [53081.17, 55702.35] |
| Marginal, | 52664.07 | [51309.78, 54073.17] |

Point estimates and 95% confidence intervals of the 90% quantile of family income.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | 153916.30 | [147582.40, 159603.80] |
| Synthesizer | 152597.10 | [147647.40, 157953.80] |
| Marginal, | 134582.40 | [130716.50, 138516.00] |
| Marginal, | 132463.70 | [129035.50, 136299.90] |
| Marginal, | 142526.50 | [138046.80, 147128.40] |

Point estimates and 95% confidence intervals of the regression coefficient of Earner 2.

| | point estimate | 95% C.I. |
|---|---|---|
| Data | -45826.20 | [-49816.29, -41836.11] |
| Synthesizer | -46017.29 | [-50239.20, -41795.37] |
| Marginal, | -34738.85 | [-46792.90, -22684.81] |
| Marginal, | -37687.76 | [-42755.65, -32619.88] |
| Marginal, | -40759.53 | [-46084.41, -35434.64] |
